Bug 691523 - TrueType text in PDF/A files (CID Font) will not translate to Unicode
Summary: TrueType text in PDF/A files (CID Font) will not translate to Unicode
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 8.71
Hardware: PC Windows Vista
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2010-08-03 15:49 UTC by mw
Modified: 2010-08-13 18:53 UTC

Attachments
PostScript source file (29.82 KB, application/postscript)
2010-08-03 15:49 UTC, mw
Output of GS 8.70 (5.91 KB, application/pdf)
2010-08-03 15:51 UTC, mw
Output of GS 8.71 (5.91 KB, application/pdf)
2010-08-03 15:51 UTC, mw

Description mw 2010-08-03 15:49:46 UTC
Created attachment 6600
PostScript source file

I am converting a PS file with embedded TrueType fonts into PDF/A.
The resulting PDF/A shows no obvious errors, but when copying text from the file (e.g. with Adobe Reader) the copied characters do not translate properly to Unicode.

I attach test files which contain the text ABCD-XYZ

Behaviour differs by version:

Up to Ghostscript version 8.70, all characters translate correctly except the hyphen, which translates to (hex) 0500 (in this example 5 is the offset of this glyph in the font subset).

In GhostScript version 8.71 the above text translates to (hex) 0100 0200 0300 0400 0500 0600 0700 0800.
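
The hex values reported above can be read as 16-bit little-endian values. That reading is an assumption about how the copied text was dumped, not something stated in this report. Under it, 0500 is the value 5, which matches the glyph offset given for the hyphen, and the 8.71 output is simply the glyph offsets 1 through 8. A minimal Python sketch (the helper name is hypothetical):

```python
def glyph_offsets(hex_dump):
    """Interpret space-separated 4-digit hex groups as little-endian 16-bit values."""
    offsets = []
    for group in hex_dump.split():
        raw = bytes.fromhex(group)  # "0500" -> b"\x05\x00"
        offsets.append(int.from_bytes(raw, "little"))
    return offsets

print(glyph_offsets("0500"))  # [5]
print(glyph_offsets("0100 0200 0300 0400 0500 0600 0700 0800"))  # [1, 2, 3, 4, 5, 6, 7, 8]
```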

If -dPDFA is not defined, all looks fine.

Looks like there is a major problem in the conversion of TrueType fonts to TrueType CID fonts.

I add a sample PS file as attachment.

Please convert with:

gswin32c.exe -sOutputFile=out.pdf -sDEVICE=pdfwrite -dPDFA -dNOPAUSE -c save pop .setpdfwrite -f in.ps -c quit
Comment 1 mw 2010-08-03 15:51:02 UTC
Created attachment 6601
Output of GS 8.70
Comment 2 mw 2010-08-03 15:51:30 UTC
Created attachment 6602
Output of GS 8.71
Comment 3 Ken Sharp 2010-08-03 16:00:45 UTC
The PostScript file does not contain a GlyphNames2Unicode CMap, which is what pdfwrite uses to create and embed a ToUnicode CMap. In addition, the font is subset, so it uses a non-standard encoding.

In the absence of either a standard Encoding or a GlyphNames2Unicode CMap, there is currently no way for pdfwrite to construct a ToUnicode CMap for the output.
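
As a rough illustration of that dependency (a Python sketch, not pdfwrite's actual code; the function name, the sample encoding and the sample table are all hypothetical), a ToUnicode mapping can only be produced for character codes whose glyph name resolves to a Unicode value:

```python
def build_to_unicode(encoding, glyph_names_to_unicode):
    """Return {char_code: unicode_code_point} for codes that can be resolved.

    encoding: dict mapping character code -> glyph name (hypothetical input).
    glyph_names_to_unicode: dict mapping glyph name -> code point, or None
    when the PostScript file supplies no such table.
    """
    if not glyph_names_to_unicode:
        return {}  # nothing to build a ToUnicode CMap from
    mapping = {}
    for code, name in sorted(encoding.items()):
        code_point = glyph_names_to_unicode.get(name)
        if code_point is not None:  # unresolved names get no entry
            mapping[code] = code_point
    return mapping

# Hypothetical subset encoding: codes are glyph offsets, not ASCII values.
subset_encoding = {1: "A", 2: "B", 3: "C", 4: "D", 5: "hyphen"}
sample_table = {"A": 0x41, "B": 0x42, "C": 0x43, "D": 0x44, "hyphen": 0x2D}

print(build_to_unicode(subset_encoding, sample_table))
print(build_to_unicode(subset_encoding, None))
```

With the table present, every code resolves. With no table, the result is empty, which corresponds to the situation described for this file.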

Adobe Acrobat requires ToUnicode information for copy/paste/search operations; without it, none of these operations work. Adobe re-encodes the font so that a standard encoding will work, but we have no plans to do so at present.

The PDF file is, as you note, correct for printing and viewing.
Comment 4 mw 2010-08-03 20:43:57 UTC
Ken,

thank you for the answer.

Please help me understand.

Please explain why/how GS 8.70 is able to translate the "normal" characters to Unicode (what is different for the hyphen?) and why this works for all the characters if -dPDFA is omitted. Why did this behaviour change in GS 8.71?

As you might have noticed, the PS file has been created with a Windows PS driver. Is there anything which can be done to get the information that GhostScript needs to embed a proper encoding?

Thank you,
Markus
Comment 5 Ken Sharp 2010-08-04 07:17:14 UTC
(In reply to comment #4)

> Please explain, why/how GS 8.70 is able to translate the "normal" characters to
> unicode (what is different for the hyphen?) and why this works for all the
> characters if -dPDFA is omitted.

PDF/A makes certain operations invalid. For instance, it is no longer legal to embed a font subset; we must embed the whole font (even if it is already a subset, which we can't tell).

> Why did this behaviour change in GS 8.71?

I'd have to dig back through the revisions to find out, but my guess would be that it was to fix a bug causing us to produce invalid PDF/A files. There have been a number of changes recently in order to make GS produce PDF/A files which conform to the specification.

 
> As you might have noticed, the PS file has been created with a Windows PS
> driver. Is there anything which can be done to get the information that
> GhostScript needs to embed a proper encoding?

Normally I would expect the Windows driver to include the GlyphNames2Unicode table, but it's possible you are using an older version of the PostScript driver. The one supplied in later versions of Windows is sourced from Adobe and seems to produce this information.
Comment 6 mw 2010-08-04 19:00:49 UTC
(In reply to comment #3)
> The PostScript file does not contain a GlyphNames2Unicode CMap, this is what
> pdfwrite uses to create and embed a ToUnicode CMap. 

I compared the files from GS 8.70 (with proper translation to Unicode for all characters except the hyphen) and from GS 8.71 (no translation to Unicode at all).

They both contain a ToUnicode CMap. In fact, the content of this object is the only difference between the files regarding the embedded font.

Where does this ToUnicode CMap come from in the two cases?
Looks like this is the place where the change occurs.

Does this help locating the change?
Comment 7 Ken Sharp 2010-08-05 11:14:43 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > The PostScript file does not contain a GlyphNames2Unicode CMap, this is what
> > pdfwrite uses to create and embed a ToUnicode CMap. 
> 
> I compared the files from GS 8.70 (with proper translation to Unicode for all
> characters except the hyphen) and from GS 8.71 (no translation to unicode at
> all).
> 
> They both contain a ToUnicode CMap, in fact the content of this object is the
> only difference between the files regarding the embedded font.
> 
> Where does this ToUnicode CMap come from in the two cases?
> Looks like this is the place where the change occurs.
> 
> Does this help locating the change?

No. 

Using the HEAD revision of Ghostscript, the results are essentially identical for me to those from 8.70. There *is* a bug in the 8.70 output: the ToUnicode CMap uses single bytes for the entries, instead of padding with 00 to produce 2-byte entries. All entries in a ToUnicode CMap must be 2 bytes.
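
To make the difference concrete: a ToUnicode CMap lists bfchar entries of the form <source code> <UTF-16BE value>. The Python sketch below (a hypothetical helper, not Ghostscript code) formats the same entry with a 1-byte and with a zero-padded 2-byte source code, assuming for illustration that code 5 maps to U+002D, the hyphen:

```python
def bfchar_entry(code, code_point, code_bytes=2):
    """Format one bfchar line; the source code is zero-padded to code_bytes bytes."""
    src = format(code, "0{}X".format(code_bytes * 2))
    dst = chr(code_point).encode("utf-16-be").hex().upper()
    return "<{}> <{}>".format(src, dst)

print(bfchar_entry(5, 0x2D, code_bytes=1))  # <05> <002D>
print(bfchar_entry(5, 0x2D, code_bytes=2))  # <0005> <002D>
```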

The same is true of the 8.71 and 8.70 PDF files you have supplied. You haven't said what version of Reader you are using; for me, Acrobat (Professional) 9 will not permit copy/paste of the text from a PDF/A file. Acrobat 7 will, but not from the 8.71 file you have supplied (BTW neither of these files is a valid PDF/A file). It will from the file created by the current version of GS.

So it's entirely unclear to me how you are testing this, but as far as I can tell the bug is in 8.70, and 8.71 and the soon-to-be-released 9.0 are as good as ever.

I would suggest you try the current version of GS.
Comment 8 mw 2010-08-06 18:05:26 UTC
(In reply to comment #7)

Ken,

thank you for the answer.

> You haven't
> said what version of Reader you are using, for me Acrobat (Professional) 9 
> will not permit copy/paste of the text from a PDF/A file. 

Strange. 
I tried Reader 8, Reader 9 and Acrobat Pro 9. All of them will copy the text.
The copied text is ABCD(0x0500)XYZ for the PDF/A files from GS 8.63 (as an example for earlier versions) and GS 8.70.
For the GS 8.71 file it is (0x0100)(0x0200)(0x0300)(0x0400)(0x0500)(0x0600)(0x0700).

> (BTW neither of these files are PDF/A valid files)

I know that, but I wanted to keep things simple regarding the command line used to create the files.

> It will from the file created by the current version of GS.
> So its entirely unclear to me how you are testing this, but as far as I can
> tell the bug is in 8.70, 

This is definitely not the case (at least it is not *only* in 8.70) because GS 8.63 behaves like GS 8.70.

> I would suggest you try the current version of GS.

I will and I will post my results.

Thank you,
Markus
Comment 9 mw 2010-08-13 18:53:05 UTC
(In reply to comment #7)

OK, I checked.

Here are my results:

The problem in 8.70 is that font->procs.decode_glyph() returns GS_NO_CHAR for /hyphen. Strangely enough, the function is able to map all other glyphs to Unicode but not the hyphen. Renaming /hyphen to /minus in the PS file fixes the problem, but the whole thing seems a bit weird to me.

Is there an explanation why decode_glyph() fails only for /hyphen?

The problem in 8.71 is that, regardless of what Technical Note 5411 says, Adobe Reader 9 does not seem to like 2-byte keys in ToUnicode CMaps. As soon as the keys are 2 bytes long, translation of copied text to Unicode fails. Reverting the changes in pdf_add_ToUnicode fixes that problem.
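
One way to check which key width a given file uses is to extract the ToUnicode CMap text from the PDF and inspect the source codes in its bfchar sections. The following Python sketch assumes the CMap stream has already been decompressed into a string. The function name and the sample CMap fragments are hypothetical and are not taken from the attached files. It ignores bfrange sections.

```python
import re

def bfchar_key_widths(cmap_text):
    """Return the set of source-code widths (in bytes) used in bfchar sections."""
    widths = set()
    sections = re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S)
    for section in sections:
        # Each entry is "<src> <dst>"; only the first hex string is the key.
        for src, _dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            widths.add(len(src) // 2)
    return widths

one_byte = "1 beginbfchar\n<05> <002D>\nendbfchar"
two_byte = "1 beginbfchar\n<0005> <002D>\nendbfchar"
print(bfchar_key_widths(one_byte))  # {1}
print(bfchar_key_widths(two_byte))  # {2}
```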