694738 – Text synthesis of missing appearances in the PDF interpreter does not handle UTF16BE

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Interpreter
Version:	9.07
Hardware:	PC All

Importance:	P4 enhancement
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-10-24 02:15 UTC by andrusha
Modified:	2014-06-07 03:49 UTC
CC List:	0 users

See Also:
Customer:
Word Size:	---

Attachments
Example bogus file. (634.11 KB, application/pdf) 2013-10-24 02:15 UTC, andrusha

Description andrusha 2013-10-24 02:15:19 UTC

Created attachment 10352
Example bogus file.

I have a PDF containing a PDF form.
The form's fields are filled with UTF-16BE strings.

Ghostscript ignores the 0xFEFF prefix and uses the wrong symbols for the strings.

PDFDocEncoding is a superset of ISO Latin 1.
ISO Latin 1 doesn't have Russian symbols.

According to http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf

Section 3.8, Common Data Structures:

Text strings are encoded in either PDFDocEncoding or Unicode character encoding.
For text strings encoded in Unicode, the first two bytes must be 254 followed by 255, representing the Unicode byte order marker, U+FEFF.

The remainder of the string consists of Unicode character codes, according to the UTF-16 encoding specified in the Unicode standard, version 2.0.

That is what I have.


Another part of the standard also says:

The encoding to be used for any FDF field value or option (V or Opt in the field dictionary; see Table 8.72 on page 564) that is a string and does not begin with the Unicode prefix U+FEFF. (See implementation note 92 in Appendix H.) Default value: PDFDocEncoding.
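
The text-string rule quoted from section 3.8 can be sketched roughly as follows. This is only an illustration in Python, not Ghostscript's code: the function name `decode_pdf_text_string` is hypothetical, and `latin-1` is used as a stand-in for PDFDocEncoding, which it only approximates (the two differ for some byte values).

```python
# Byte order marker that introduces a Unicode text string: 254 followed by 255.
BOM_UTF16BE = b"\xfe\xff"


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string (hypothetical helper, for illustration only)."""
    if raw.startswith(BOM_UTF16BE):
        # Drop the two marker bytes, then read the rest as big-endian UTF-16.
        return raw[len(BOM_UTF16BE):].decode("utf-16-be")
    # Assumption: latin-1 approximates PDFDocEncoding; a faithful decoder
    # would need the full PDFDocEncoding table from the PDF Reference.
    return raw.decode("latin-1")


# A Russian word stored the way a form field value might store it.
sample = BOM_UTF16BE + "Привет".encode("utf-16-be")
print(decode_pdf_text_string(sample))
```

Reading the same bytes as single-byte characters without checking for the marker yields a character for every byte, including the marker itself, which matches the kind of wrong symbols described above.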

Comment 1 Ken Sharp 2013-10-24 02:19:05 UTC

The actual problem is that the text synthesis code (Tform) does not handle strings in UTF16BE.

Comment 2 Ken Sharp 2013-11-08 09:56:55 UTC

Commit 1cb2458772321dc86117cb45b5b28a1423ccf9b7 fixes the problem for me, but I'm a little concerned that it is simply masking a deeper problem. If you still get the same result with other files, please reopen the bug and attach a new failing file.

Comment 3 Ken Sharp 2013-11-08 09:58:01 UTC

Oops, sorry, wrong bug :-(

Comment 4 Ken Sharp 2014-06-07 03:49:55 UTC

commit 33fb85045c2590ac58a723ea2abcfbde505e53d1 resolves this bug. We now strip the BOM before printing the text.

The result is not the same as the Acrobat display, but this is because Acrobat
ignores the appearance from the form and creates its own version.

The text is not the same as Acrobat because we do use the DA (Default Appearance) and the font in use is not appropriately encoded for use with UTF16 encoded text. If we instead use our own fallback font, and an Identity-UTF16-H CMap, the text matches the Acrobat display, which to me is pretty conclusive evidence that the font is incorrect.