Bug 691862

Summary:	Unable to copy text from the converted PDF
Product:	Ghostscript	Reporter:	Mike <mlungu777>
Component:	PDF Writer	Assignee:	Ken Sharp <ken.sharp>
Status:	NOTIFIED FIXED
Severity:	normal	CC:	wilkinsonAU
Priority:	P4
Version:	master
Hardware:	PC
OS:	Windows 7
Customer:	631	Word Size:	---
Attachments:	PostScript file; PDF file produced with GS Head; PDF file produced with GSv8.72

Description Mike 2011-01-03 21:20:36 UTC

When converting the attached PostScript file into PDF with GS Head, text from the resulting PDF file cannot be copied: I get __ instead of the actual text.
It works fine with gs8.72 and an older gs9.0 build.

Comment 1 Mike 2011-01-03 21:21:30 UTC

Created attachment 7081 [details]
PostScript file

Comment 2 Mike 2011-01-03 21:24:32 UTC

Created attachment 7082 [details]
PDF file produced with GS Head

Comment 3 Mike 2011-01-03 21:26:37 UTC

Created attachment 7083 [details]
PDF file produced with GSv8.72

Comment 4 Ken Sharp 2011-01-04 09:39:49 UTC

Hmm.....

It seems the Adobe documentation lies (or, more generously, is inconsistent). The CMap tech note (5014) says that entries are not zero padded, so values less than 256 are emitted as single bytes, values 256->65535 as two bytes, etc. However, the ToUnicode CMap tech note (5411) says:

"Because a “ToUnicode” mapping file is used to covert from CIDs (which begin at decimal 0, which is expressed as 0x0000 in hexadecimal notation) to Unicode code points, the following “codespacerange” definition, without exception, shall always be used: 1 begincodespacerange <0000> <FFFF> endcodespacerange"

(This is somewhat restrictive: CIDs can exceed 2 bytes even though UTF-16 code units can't, so I could foresee a need to map high CIDs to lower UTF-16 values.)

Finally, the PDF Reference (1.7) says:

"The CMap file must contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace must be one byte long."

So the PDF Reference conflicts with the tech note which it references!

In fact none of the above seems to be quite what Acrobat actually does.

It seems that Acrobat does not care what size (in bytes) the codespacerange is, no matter what kind of font is present. However, it *does* care what size the bfrange entries are. For simple fonts the bfrange entries must be single bytes; for CIDFonts they must be two bytes. Deviating in either case produces files which Acrobat cannot process, leading either to errors or to incorrect text when copying and pasting.

A fix which writes the codespacerange and bfrange depending on the type of the
font is now in testing.
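
To illustrate the rule described above, here is a minimal sketch in Python of writing the codespacerange and bfrange with a width chosen by font type. It is not the Ghostscript patch itself (pdfwrite is written in C); the function names `hex_code` and `tounicode_ranges` and the `is_cid_font` flag are hypothetical, and the sketch ignores details such as the limit on entries per bfrange block.

```python
def hex_code(value: int, width_bytes: int) -> str:
    """Format a character code as a zero-padded hex string, e.g. <0041>."""
    if not 0 <= value < (1 << (8 * width_bytes)):
        raise ValueError(f"code {value:#x} does not fit in {width_bytes} byte(s)")
    return f"<{value:0{2 * width_bytes}X}>"


def tounicode_ranges(ranges, is_cid_font: bool) -> str:
    """Emit the codespacerange and bfrange sections of a ToUnicode CMap.

    ranges: list of (first_code, last_code, first_unicode) tuples.
    Source codes are written as one byte for simple fonts and two bytes
    for CIDFonts, matching what Acrobat was observed to accept.
    """
    width = 2 if is_cid_font else 1           # bytes per source code
    max_code = (1 << (8 * width)) - 1         # 0xFF or 0xFFFF
    lines = [
        "1 begincodespacerange",
        f"{hex_code(0, width)} {hex_code(max_code, width)}",
        "endcodespacerange",
        f"{len(ranges)} beginbfrange",
    ]
    for first, last, unicode_start in ranges:
        # The destination is always written as a 2-byte UTF-16 value here.
        lines.append(
            f"{hex_code(first, width)} {hex_code(last, width)} "
            f"{hex_code(unicode_start, 2)}"
        )
    lines.append("endbfrange")
    return "\n".join(lines)


if __name__ == "__main__":
    # Simple font: A-Z mapped to U+0041 onwards, single-byte source codes.
    print(tounicode_ranges([(0x41, 0x5A, 0x0041)], is_cid_font=False))
    # CIDFont: the same mapping, two-byte source codes.
    print(tounicode_ranges([(0x41, 0x5A, 0x0041)], is_cid_font=True))
```

For a simple font the range line comes out as `<41> <5A> <0041>`; for a CIDFont it comes out as `<0041> <005A> <0041>`.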

Comment 5 Ken Sharp 2011-01-04 10:54:51 UTC

Fixed in revision 11993, patch here:

http://ghostscript.com/pipermail/gs-cvs/2011-January/012082.html

Comment 6 Ken Sharp 2011-03-17 08:37:44 UTC

*** Bug 692073 has been marked as a duplicate of this bug. ***