Bug 691862 - Unable to copy text from the converted PDF
Summary: Unable to copy text from the converted PDF
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: master
Hardware: PC Windows 7
Importance: P4 normal
Assignee: Ken Sharp
URL:
Keywords:
Duplicates: 692073
Depends on:
Blocks:
 
Reported: 2011-01-03 21:20 UTC by Mike
Modified: 2011-10-02 02:35 UTC

See Also:
Customer: 631
Word Size: ---


Attachments
PostScript file (18.43 KB, application/postscript)
2011-01-03 21:21 UTC, Mike
PDF file produced with GS Head (4.56 KB, application/pdf)
2011-01-03 21:24 UTC, Mike
PDF file produced with GSv8.72 (4.32 KB, application/pdf)
2011-01-03 21:26 UTC, Mike

Description Mike 2011-01-03 21:20:36 UTC
When converting the attached PostScript file into PDF with GS Head, text from the resulting PDF file cannot be copied: I get __ instead of the actual text.
It works fine with gs8.72 and an older gs9.0 build.
Comment 1 Mike 2011-01-03 21:21:30 UTC
Created attachment 7081
PostScript file
Comment 2 Mike 2011-01-03 21:24:32 UTC
Created attachment 7082
PDF file produced with GS Head
Comment 3 Mike 2011-01-03 21:26:37 UTC
Created attachment 7083
PDF file produced with GSv8.72
Comment 4 Ken Sharp 2011-01-04 09:39:49 UTC
Hmm.....

It seems the Adobe documentation lies (or, more generously, is inconsistent). The CMap tech note (5014) says that entries are not zero-padded, so values less than 256 are emitted as single bytes, values from 256 to 65535 as two bytes, and so on. However, the ToUnicode CMap tech note (5411) says:

"Because a “ToUnicode” mapping file is used to convert from CIDs (which begin at decimal 0, which is expressed as 0x0000 in hexadecimal notation) to Unicode code points, the following “codespacerange” definition, without exception, shall always be used: 1 begincodespacerange <0000> <FFFF> endcodespacerange"

(This is somewhat restrictive: CIDs can exceed 2 bytes even though UTF-16 can't, and I could foresee a need to map high CIDs to lower UTF-16 values.)

Finally, the PDF Reference (1.7) says:

"The CMap file must contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace must be one byte long."

So the PDF Reference conflicts with the tech note which it references!
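
To make the conflict concrete, here is a minimal Python sketch of what each of the three documents, as read above, would prescribe. The function names are hypothetical and exist only for illustration, and the two-byte result for non-simple fonts in the PDF Reference case is an assumption, since the quoted text only requires consistency with the font's encoding.

```python
def minimal_hex(value):
    """Tech note 5014 as read above: no zero padding, fewest whole bytes."""
    if value < 0:
        raise ValueError("codes are non-negative")
    nbytes = max(1, (value.bit_length() + 7) // 8)
    digits = nbytes * 2
    return f"<{value:0{digits}X}>"


def codespace_5411():
    """Tech note 5411 as quoted: always two bytes, without exception."""
    return "<0000> <FFFF>"


def codespace_pdf_reference(is_simple_font):
    """PDF Reference 1.7 as quoted: one byte for a simple font.

    The two-byte answer for other fonts is an assumption of this sketch.
    """
    return "<00> <FF>" if is_simple_font else "<0000> <FFFF>"


for code in (0x41, 0xFF, 0x100, 0xFFFF):
    print(code, minimal_hex(code))
print("5411:", codespace_5411())
print("PDF Reference, simple font:", codespace_pdf_reference(True))
```

For a simple font, the 5411 rule and the PDF Reference rule give different codespace ranges, which is the inconsistency described above.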

In fact none of the above seems to be quite what Acrobat actually does. 

It seems that Acrobat does not care what size (in bytes) the codespacerange is, no matter what kind of font is present. However, it *does* care what size the bfrange entries are: for simple fonts the bfrange entries must be single bytes, and for CIDFonts they must be two bytes. Deviating in either case produces files which Acrobat cannot process properly, resulting in either errors or incorrect text when copying and pasting.
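
The observed rule can be turned into a rough check on an existing ToUnicode CMap. The Python sketch below is not part of Ghostscript or Acrobat; the helper names are hypothetical, and it assumes one bfrange entry per line, which a real CMap is not obliged to follow.

```python
import re


def bfrange_source_widths(cmap_text):
    """Return the set of byte widths used by bfrange source codes."""
    widths = set()
    # Look only inside beginbfrange ... endbfrange sections.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for line in section.strip().splitlines():
            codes = re.findall(r"<([0-9A-Fa-f]+)>", line)
            if len(codes) >= 2:
                # The first two hex strings are the source range; two hex
                # digits make one byte.
                widths.update(len(code) // 2 for code in codes[:2])
    return widths


def matches_observed_acrobat_rule(cmap_text, is_cid_font):
    """True if every bfrange source code has the width described above."""
    expected = 2 if is_cid_font else 1
    return bfrange_source_widths(cmap_text) <= {expected}


sample = "1 beginbfrange\n<0041> <005A> <0041>\nendbfrange\n"
print(matches_observed_acrobat_rule(sample, is_cid_font=False))
print(matches_observed_acrobat_rule(sample, is_cid_font=True))
```

The sample uses two-byte source codes, so it fails the check for a simple font and passes it for a CIDFont.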

A fix which writes the codespacerange and bfrange depending on the type of the font is now in testing.
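
As an illustration of the idea only (this is not the Ghostscript patch, which is in C), a writer that picks the code width from the font type might look like the Python sketch below. The function name and the (first, last, unicode) tuple layout are hypothetical, and the sketch assumes destinations in the Basic Multilingual Plane and ignores the per-section entry limit that CMap files impose.

```python
def to_unicode_sections(ranges, is_cid_font):
    """Build codespacerange and bfrange sections for a ToUnicode CMap.

    ranges is a list of (first_code, last_code, first_unicode) tuples.
    Source codes are one byte for simple fonts, two bytes for CIDFonts.
    """
    nbytes = 2 if is_cid_font else 1
    digits = nbytes * 2
    top = (1 << (8 * nbytes)) - 1
    lines = [
        "1 begincodespacerange",
        f"<{0:0{digits}X}> <{top:0{digits}X}>",
        "endcodespacerange",
        f"{len(ranges)} beginbfrange",
    ]
    for first, last, uni in ranges:
        if not 0 <= first <= last <= top:
            raise ValueError("source range does not fit the code width")
        if not 0 <= uni <= 0xFFFF:
            raise ValueError("this sketch only handles BMP destinations")
        # Destination is written as a single UTF-16 code unit.
        lines.append(f"<{first:0{digits}X}> <{last:0{digits}X}> <{uni:04X}>")
    lines.append("endbfrange")
    return "\n".join(lines)


print(to_unicode_sections([(0x41, 0x5A, 0x0041)], is_cid_font=False))
print(to_unicode_sections([(0x41, 0x5A, 0x0041)], is_cid_font=True))
```

The same mapping is written with one-byte source codes for the simple font and two-byte source codes for the CIDFont, which is the distinction Acrobat appears to require.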
Comment 5 Ken Sharp 2011-01-04 10:54:51 UTC
Fixed in revision 11993, patch here:

http://ghostscript.com/pipermail/gs-cvs/2011-January/012082.html
Comment 6 Ken Sharp 2011-03-17 08:37:44 UTC
*** Bug 692073 has been marked as a duplicate of this bug. ***