Bug 690440 - PDF using Edwardian Script ITC font displays garbled text using "text extract"
Summary: PDF using Edwardian Script ITC font displays garbled text using "text extract"
Status: NOTIFIED INVALID
Alias: None
Product: Artifex GSview
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Windows XP
Importance: P4 normal
Assignee: Russell Lang
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-04-21 10:22 UTC by John Beale
Modified: 2009-04-21 12:36 UTC (History)
0 users

See Also:
Customer:
Word Size: ---


Attachments
Postscript file which exhibits bug when converted to PDF (in ZIP file) (22.83 KB, application/octet-stream)
2009-04-21 11:58 UTC, John Beale

Description John Beale 2009-04-21 10:22:31 UTC
This may be a bad font problem, or something in gsview or ghostscript.  A simple
text PDF using the "Edwardian Script ITC" font, generated by Ghostscript 8.63,
displays properly in gsview32.exe (version 4.9 2007-11-18) and also in Adobe
Reader 8.1.4.  The displayed text reads "First sample sentence.  Second attempt."

Bug: using the "text extract" function on this PDF in gsview, we obtain the
following text: "Y|Üáà átÅÑÄx áxÇàxÇvxA fxvÉÇw tààxÅÑàA"

I can't really blame gsview, because I get exactly the same string using
text-copy in Adobe Reader, and I also get that garbled text displayed when I
import the PDF into Inkscape 0.46+devel r21167, built Apr 17 2009.

You can download the specific PDF I'm talking about here:
http://launchpadlibrarian.net/25800877/test3.pdf

This may not be a bug in ghostscript/gsview, but I'd love to know why this
happens and if there is a workaround.
Comment 1 Ken Sharp 2009-04-21 10:50:35 UTC
The font in question is a TrueType font embedded as a subset without a ToUnicode
CMap, and using a custom encoding. For example, /Y (capital Y) is encoded at
position 1. In addition, the glyph names in the encoding are not what one would
expect: I would expect to see /F, /i, /r, /s, /t and so on. Instead I see /Y,
/bar, /Udieresis, /aacute, etc.

So there is no Unicode information, and the encoding is nonstandard. In this
case Acrobat falls back to translating the glyph names into their ASCII
equivalents (when possible). Using the Encoding to map from the character codes
to the glyph names we see that we get /Y /bar /Udieresis /aacute /agrave /space
/aacute /t and so on, which matches what you get when you copy and paste.

It's impossible to tell from the PDF file why the file was created this way; one
would have to guess that the file was created from a PostScript file which had
re-encoded the font like this, so that the PDF file had to be made the same way.

I don't see a bug here. Possibly (given that the PDF file was created by GS
8.63) there is a bug in pdfwrite which caused the encoding oddness, but that
can't be determined without seeing the PostScript file.
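As a possible stopgap for this one file, the displayed sentence and the extracted string quoted in the description line up character for character, so a substitution table can be derived from them. The Python sketch below assumes the mapping is one-to-one and only covers characters that occur in that sample; the function names are made up for the example.

```python
# Displayed text and "text extract" output as quoted in the description
# (the double space after "sentence." is collapsed so the strings align).
displayed = "First sample sentence. Second attempt."
extracted = "Y|Üáà átÅÑÄx áxÇàxÇvxA fxvÉÇw tààxÅÑàA"

def build_table(garbled, plain):
    """Derive a garbled -> plain character table from an aligned sample pair."""
    if len(garbled) != len(plain):
        raise ValueError("sample strings do not align")
    table = {}
    for g, p in zip(garbled, plain):
        # setdefault keeps the first mapping seen; any conflict is an error.
        if table.setdefault(g, p) != p:
            raise ValueError(f"inconsistent mapping for {g!r}")
    return table

def decode(text, table):
    """Translate extracted text; characters not seen in the sample become '?'."""
    return "".join(table.get(ch, "?") for ch in text)

table = build_table(extracted, displayed)
print(decode("àxáà", table))  # prints: test
```

Any letter that does not appear in the sample sentence stays unknown, so a longer reference text in the same font would be needed to recover a full alphabet.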
Comment 2 John Beale 2009-04-21 11:58:50 UTC
Created attachment 4961 [details]
Postscript file which exhibits bug when converted to PDF (in ZIP file)

Attached PS file (in ZIP) displays the bug after conversion to PDF. The file was
generated by MS Office Word 2003 printing to MS Publisher Imagesetter (with
printer>advanced PS option "optimize for portability").
Comment 3 John Beale 2009-04-21 12:36:02 UTC
Have confirmed the behavior is due to inadequate PS file generation. The same
document with the same font, generated in Open Office 3 using "Export to PDF",
works 100% ok.