691907 – PDFs with TrueType fonts from Windows PostScript files not searchable

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	master
Hardware:	PC Windows XP

Importance:	P4 normal
Assignee:	Ken Sharp

URL:
Keywords:

Duplicates (1):	691910
Depends on:
Blocks:

Reported:	2011-01-25 08:40 UTC by SaGS
Modified:	2011-02-01 14:02 UTC
CC List:	1 user

See Also:
Customer:
Word Size:	---

Attachments
Suggested patch. (755 bytes, patch) 2011-01-25 08:45 UTC, SaGS

Description SaGS 2011-01-25 08:40:41 UTC

Since revision 11735, PDFs produced by converting a PostScript file generated by the Windows device drivers are no longer searchable. This happens when the PostScript file uses TrueType fonts embedded as Type42, rather than converted to Type1 or substituted with device fonts (like Arial->Helvetica). For a sample file, use attachment #6694. The Ghostscript command line for the conversion is just ‘gswin32c -sDEVICE=pdfwrite -o outfile.pdf infile.ps’.

Comment 1 SaGS 2011-01-25 08:43:30 UTC

Before revision 11735, PDF TrueType fonts had an Encoding entry. Because the Windows PS driver assigns standard glyph names, the Encoding allowed Adobe Reader etc. to infer the Unicode equivalents of the characters, and thus searching/copying the text worked. Now this method is no longer available. In the absence of a ToUnicode CMap, Adobe Reader uses character codes as they are, but these are not meaningful because the Windows driver defines the fonts incrementally, so charcodes are 1, 2, ... in the order the characters are first used in the printed output. A ToUnicode CMap would solve the problem, but unfortunately Ghostscript’s pdfwrite driver does not always output one.
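To illustrate why incrementally assigned charcodes defeat text search, here is a minimal Python sketch. It is not Ghostscript or Windows driver code; the function names `assign_incremental_codes`, `build_to_unicode` and `extract_text` are hypothetical, and the fallback of reading a charcode directly as a character value is a simplification of what a viewer does when it has no Encoding and no ToUnicode CMap.

```python
def assign_incremental_codes(text):
    """Assign charcodes 1, 2, 3, ... in the order characters are first used."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    return codes, code_for


def build_to_unicode(code_for):
    """Invert the assignment: charcode -> Unicode character (a toy ToUnicode map)."""
    return {code: ch for ch, code in code_for.items()}


def extract_text(codes, to_unicode=None):
    """Recover text from charcodes, with or without a ToUnicode map."""
    if to_unicode is None:
        # Assumed fallback: the charcode is taken as the character value itself.
        return "".join(chr(code) for code in codes)
    return "".join(to_unicode[code] for code in codes)


codes, code_for = assign_incremental_codes("hello world")
with_map = extract_text(codes, build_to_unicode(code_for))
without_map = extract_text(codes)
```

With the map, `with_map` is the original string; without it, `without_map` consists of control characters, which is why searching for the visible text finds nothing.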

(Sorry, this should have been part of comment #0.)

Comment 2 SaGS 2011-01-25 08:45:24 UTC

Created attachment 7151
Suggested patch.

pdfwrite: Always output the ToUnicode CMap for TrueType fonts. Since SVN revision 11735 removed the Encoding entry, the ToUnicode CMap is necessary, independent of the set of characters used by the font, to ensure text using these fonts is searchable. This is important when converting PostScript files generated by Windows device drivers, which download TTF fonts incrementally, so charcodes are not meaningful (they are 1, 2, 3, ..., in the order these characters are first used in the print stream).

Additional note: during testing, I found there are some problems with ‘Resource\Decoding\Unicode’. This file is incomplete: it’s missing at least the name ‘hyphen’ for U+002D, and others too, so ‘-’ (ASCII 0x2D) is still not searchable. I’ll open a separate bug report for this when I have a suitable patch.
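The effect of a missing glyph name can be sketched as follows. This is a Python illustration only: `DECODING` is a hypothetical, deliberately tiny table and does not reproduce the contents or format of the real ‘Resource\Decoding\Unicode’ file, and `glyph_names_to_unicode` is an invented helper.

```python
# Hypothetical stand-in for a glyph-name -> Unicode table with 'hyphen' missing.
DECODING = {"A": 0x41, "B": 0x42, "space": 0x20}


def glyph_names_to_unicode(names, decoding):
    """Split glyph names into those that map to Unicode and those that do not."""
    mapped = {}
    missing = []
    for name in names:
        if name in decoding:
            mapped[name] = decoding[name]
        else:
            missing.append(name)
    return mapped, missing


mapped, missing = glyph_names_to_unicode(["A", "hyphen", "B"], DECODING)

# Adding the missing entry makes every glyph resolvable.
fixed = dict(DECODING, hyphen=0x2D)
mapped_fixed, missing_fixed = glyph_names_to_unicode(["A", "hyphen", "B"], fixed)
```

Any glyph that ends up in `missing` gets no ToUnicode entry in this sketch, so text using it stays unsearchable even when a CMap is written.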

Comment 3 Ken Sharp 2011-01-26 07:53:26 UTC

*** Bug 691910 has been marked as a duplicate of this bug. ***

Comment 4 SaGS 2011-01-30 20:29:49 UTC

From comment #2:
> Additional note: during testing, I found there are some problems with 
> ‘Resource\Decoding\Unicode’. ... I’ll open a separate bug report for this, 
> when I’ll have a suitable patch.

Done: Bug #691918 ‘Fixes for the Unicode Decoding resource(s)’. (Not 100% sure I selected the right ‘component’.)

Comment 5 Ken Sharp 2011-02-01 09:39:00 UTC

(In reply to comment #2)
> Created an attachment (id=7151)
> Suggested patch.
> 
> pdfwrite: Always output the ToUnicode CMap for TrueType fonts. 

Looking at this I'm not sure why we ever bother to *not* embed a ToUnicode CMap. It looks like the code is doing much the same test as for a symbolic TrueType font, trying to determine if a font contains only glyphs from the Adobe standard Latin set.

In itself this isn't enough; the glyphs also have to be encoded at the same positions, and that condition is one of the reasons that TrueType fonts are always emitted as symbolic.

It looks like this is some kind of optimisation, only emitting the ToUnicode CMap when we think we are going to need one, because the Encoding isn't standard. I think it might be more useful to remove this function altogether and always emit a ToUnicode CMap. It's true that the CMap may be incorrect, but I don't think it's likely to be worse than no CMap.

Well, we're coming up to a release, so I think it's best to take the less extensive route. I'll look into this some more and do some testing.
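The three options discussed above can be compared in a small Python sketch. This is only a reading of the comments in this bug, not the actual pdfwrite code: the function `should_emit_to_unicode`, the policy names and the `STANDARD_LATIN_SAMPLE` set (a tiny stand-in for the Adobe standard Latin glyph set) are all hypothetical.

```python
# Tiny stand-in for the Adobe standard Latin glyph set (assumption, not the real list).
STANDARD_LATIN_SAMPLE = {"A", "B", "a", "b", "space", "hyphen"}

POLICIES = ("original", "patched", "always")


def should_emit_to_unicode(font_type, glyph_names, policy):
    """Decide whether a ToUnicode CMap is written under a given policy.

    original: skip the CMap when every glyph is in the standard Latin set.
    patched:  as original, but TrueType fonts always get a CMap.
    always:   every font gets a CMap.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "always":
        return True
    if policy == "patched" and font_type == "TrueType":
        return True
    # The optimisation as described: a CMap is only thought necessary
    # when some glyph falls outside the standard set.
    return not set(glyph_names) <= STANDARD_LATIN_SAMPLE


latin_only = ["A", "b", "space"]
results = {p: should_emit_to_unicode("TrueType", latin_only, p) for p in POLICIES}
```

For a TrueType font using only standard Latin glyphs, `results` shows the CMap being skipped under "original" and written under "patched" and "always", which is the case that made Windows-generated files unsearchable.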

Comment 6 Ken Sharp 2011-02-01 14:02:18 UTC

(In reply to comment #2)
> Created an attachment (id=7151)
> Suggested patch.

Patch adopted and committed as revision 12088:

http://ghostscript.com/pipermail/gs-cvs/2011-February/012204.html

Because we are fast approaching a release I've chosen to take the conservative approach and adopt the patch as-is, rather than embedding ToUnicode CMaps for all fonts regardless.

Thanks for the patch!