689014 – Characters missing reading PDF file

Summary: Characters missing reading PDF file

Status:	NOTIFIED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Interpreter
Version:	8.56
Hardware:	All All

Importance:	P2 normal
Assignee:	Alex Cherepanov

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-11-28 09:42 UTC by Marcos H. Woehrmann
Modified:	2011-09-22 16:17 UTC
CC List:	1 user

See Also:
Customer:	580
Word Size:	---

Attachments
PDF that contains ö (Latin small letter O with diaeresis) (238.10 KB, application/pdf) 2007-04-11 06:02 UTC, Mark Warbington
Output missing ö character (second line in top box) (53.38 KB, image/gif) 2007-04-11 06:07 UTC, Mark Warbington
input output comparison (14.66 KB, image/gif) 2007-04-11 06:21 UTC, Mark Warbington
patch (503 bytes, patch) 2007-06-25 18:35 UTC, Alex Cherepanov
experimental patch (1.90 KB, patch) 2007-07-10 20:54 UTC, Alex Cherepanov

Description Marcos H. Woehrmann 2006-11-28 09:42:02 UTC

When converting the attached PDF file with Ghostscript to PPM certain characters are missing from the 
output.  For example, the name 'JOSEPH' is 'OSEPH'.  I've duplicated this problem with gs-8.00, gs-8.54, 
and gs-head and they all fail the same way.  Acrobat 5.0 and Apple Preview open the file correctly.

The command line I used is:  gs -sDEVICE=ppmraw -sOutputFile=test.ppm missing_chars.pdf

Note that this file contains confidential information. Please delete the file when it is no longer
needed and do not add it to the regression test.

Comment 1 Marcos H. Woehrmann 2006-11-28 10:41:46 UTC

Created attachment 2638
missing_chars.pdf

Comment 2 Ray Johnston 2006-12-13 10:57:59 UTC

The problem in the PDF interpreter is that this file has an embedded TTF subset
of the /Helvetica font which is entered into the FontDirectory as Helvetica (the
subset is missing the '@', 'J', 'V' and 'X' among other glyphs).

When a subsequent "Tf" comes along that references the /Helv font, which is
defined as a Type 1 font (not a subset), we pick up the subset that is already
installed instead of loading the complete Type 1 font.

Thus the problem stems from putting subsets in the FontDirectory under the
regular BaseFont name.
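
The collision described above can be modelled in a few lines of Python. This is only an illustration of a name-keyed lookup, not Ghostscript code: `font_directory`, `define_font`, `find_font` and `load_full_helvetica` are hypothetical names, and the glyph sets are invented for the example.

```python
# Illustrative model only: a font registry keyed by BaseFont name.
# Every name here is hypothetical; this is not Ghostscript's code.
font_directory = {}

def define_font(name, glyphs):
    """Register a font under its name, replacing any earlier entry."""
    font_directory[name] = frozenset(glyphs)

def find_font(name, load_from_disk):
    """Return the in-memory font if one exists, else load the full font."""
    if name in font_directory:
        return font_directory[name]
    return frozenset(load_from_disk(name))

def load_full_helvetica(_name):
    # Stand-in for loading the complete Type 1 font from disk.
    return "JOSEPH@VX"

# The embedded subset lacks 'J' but is stored as plain "Helvetica".
define_font("Helvetica", "OSEPH")

# A later request for the full font is satisfied by the subset instead.
font = find_font("Helvetica", load_full_helvetica)
missing = sorted(set("JOSEPH") - font)
print(missing)
```

In this model the request never reaches `load_full_helvetica`, because the subset already occupies the "Helvetica" key.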

Comment 3 Craig Reed 2007-03-09 10:10:10 UTC

What is being done about this bug? When is the anticipated fix going to take 
place?

Thanks....

Comment 4 Mark Warbington 2007-04-11 06:02:05 UTC

Created attachment 2875
PDF that contains ö (Latin small letter O with diaeresis)

This PDF contains ö characters (Latin small letter "O" with diaeresis, or "two
dots" above the letter) that are completely missing, not merely substituted,
when viewing or converting through Ghostscript (all recent
releases tested).  See hungarian.gif for example of Ghostscript output (second
line in top box) and compare.gif for a side by side comparison of the PDF input
versus the Ghostscript output.

Comment 5 Mark Warbington 2007-04-11 06:07:08 UTC

Created attachment 2876
Output missing ö character (second line in top box)

This is the output from Ghostscript (converted to GIF with extra pages removed)
demonstrating that the ö characters are missing from the second line of the top
box.  See compare.gif for a side-by-side comparison of the PDF input and
resulting Ghostscript output.

Comment 6 Mark Warbington 2007-04-11 06:21:07 UTC

Created attachment 2877
input output comparison

This is a side-by-side comparison of the PDF input and Ghostscript output with
the missing characters indicated.

Comment 7 Alex Cherepanov 2007-04-12 12:30:15 UTC

I'm working on this problem.
It is caused by the PDF interpreter's font cache, which looks fonts up
by name. The fix will be ready soon.

Comment 8 Mark Warbington 2007-05-22 13:21:31 UTC

Hello Alex.

Will a fix come in the form of a patch, or will it be incorporated into the
next full release?

Thank you.

Mark Warbington

Comment 9 Alex Cherepanov 2007-06-25 18:35:04 UTC

Created attachment 3078
patch

Undefine the font that may be defined in memory before attempting to
resolve a font name into a font. This guarantees that the font will be resolved
into an external resource.

The patch causes no differences on the Comparefiles test.
However, it prevents in-memory re-definitions of the font
resources, which may be undesirable.
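
A minimal sketch of the idea behind this patch, written in Python rather than PostScript. The names `font_directory`, `load_external` and `resolve_font` are hypothetical, and the real patch works on Ghostscript's font resources; the sketch only models the behaviour described above.

```python
# Illustrative model only; not the actual patch.
font_directory = {"Helvetica": frozenset("OSEPH")}  # subset left in memory

def load_external(name):
    # Stand-in for reading the complete font from an external resource.
    return frozenset("JOSEPH@VX")

def resolve_font(name):
    """Undefine any in-memory font of this name, then load it externally."""
    font_directory.pop(name, None)  # forget the in-memory definition
    return load_external(name)

resolved = resolve_font("Helvetica")
has_j = "J" in resolved
print(has_j)
```

The `pop` call is also where the drawback shows up: any deliberate in-memory re-definition stored under the same name is discarded along with the subset.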

Comment 10 Alex Cherepanov 2007-07-10 08:45:19 UTC

The font file from comment #4 is incorrect.
The page resource dictionary points directly to a font stream. The font resource
is not referenced from anywhere, and its font descriptor doesn't point to the
font stream.

Acrobat Reader 5 and earlier display the file much as Ghostscript does.
Acrobat Reader 8 recovers the intended appearance of the file.

This problem is unrelated to the one demonstrated by the file from
attachment #1 and fixed by the patch from attachment #9.

It would be great to cover yet another case of PDF abuse but the results cannot
be guaranteed. The font resource contains important information about the
encoding and widths of the characters, but there's no link to the font resource
from any object that belongs to the page.

Comment 11 Alex Cherepanov 2007-07-10 09:43:57 UTC

Please disregard the comment #10. I misunderstood the file structure.

Comment 12 Alex Cherepanov 2007-07-10 20:54:01 UTC

Created attachment 3176
experimental patch

The font file from the comment #4 is correct, but it uses new glyph names that
we don't yet have in our fonts. The same glyphs are available under different
names.

Ghostscript is not alone. Acrobat Reader 5 and earlier display the file much as
Ghostscript does, but Acrobat Reader 8 shows the file correctly.

This patch tries to load the glyph using the backward-compatible name when
the primary search fails.

The patch is not ready for production use. Glyph aliases should probably
be created when the font is loaded, to avoid any problems in PDF generation. I'm
posting the patch for code review.
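
The fallback lookup that the experimental patch attempts can be sketched roughly as follows, in Python rather than PostScript. The table `GLYPH_FALLBACK` and the function `lookup_glyph` are hypothetical, and the two glyph names are used only as an example of a pair that names the same shape. The table is symmetric because the sketch does not assume which name a given font defines.

```python
# Illustrative sketch only; table contents and names are assumptions.
GLYPH_FALLBACK = {
    "ohungarumlaut": "odblacute",
    "odblacute": "ohungarumlaut",
}

def lookup_glyph(char_strings, name):
    """Return the glyph for `name`, trying an equivalent name on a miss."""
    if name in char_strings:
        return char_strings[name]
    alt = GLYPH_FALLBACK.get(name)
    if alt is not None and alt in char_strings:
        return char_strings[alt]
    return None  # a real interpreter would fall back to .notdef here

# A font that only knows one of the two names.
font = {"odblacute": "<outline of o with double acute>"}
glyph = lookup_glyph(font, "ohungarumlaut")
unknown = lookup_glyph(font, "nosuchglyph")
print(glyph)
```

Because the fallback happens at lookup time, the font itself is never changed, which is why a later stage that copies the font would not see the alias.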

Comment 13 leonardo 2007-08-06 14:51:00 UTC

Alex, .type1build is executed with pdfwrite, so I guess the patch will
associate aliased glyphs with the original glyph names. Not sure though. Please
test with pdfwrite.

BTW, a better way would be to fix the Encoding when writing a PDF, making the
result more portable and independent of such tricks.

Comment 14 leonardo 2007-08-06 23:47:47 UTC

Comment #13 is partially incorrect. pdfwrite will copy fonts and encodings,
so the result will have the same problem as the input. Alex, please test to be sure.
I think it's acceptable for now since Adobe can handle such documents.

Comment 15 Alex Cherepanov 2008-06-28 08:04:34 UTC

The bug that caused missing characters in sample #1 was
fixed some time ago.

This patch makes the /?dblacute and /?hungarumlaut glyph names equivalent in Type 1
fonts. When a font is loaded and only one of the two names is defined, the
patch adds the missing name as an alias for the existing glyph.
See: http://ghostscript.com/pipermail/gs-cvs/2008-June/008374.html

This fixes the file from attachment #4.
Regression testing shows no differences.
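
As a rough model of aliasing at load time, here is a Python sketch. The pair list `EQUIVALENT` and the function `add_equivalent_glyphs` are hypothetical stand-ins, and the pairs shown are just one expansion of the `?` wildcard above; the committed change itself is PostScript.

```python
# Illustrative sketch only; the pair list is an assumed expansion of the
# /?dblacute and /?hungarumlaut pattern, not copied from the patch.
EQUIVALENT = [
    ("odblacute", "ohungarumlaut"),
    ("udblacute", "uhungarumlaut"),
]

def add_equivalent_glyphs(char_strings, pairs=EQUIVALENT):
    """Add the missing name of each pair as an alias of the defined one."""
    for a, b in pairs:
        if a in char_strings and b not in char_strings:
            char_strings[b] = char_strings[a]
        elif b in char_strings and a not in char_strings:
            char_strings[a] = char_strings[b]
    return char_strings

font = add_equivalent_glyphs({"o": "<o>", "odblacute": "<o double acute>"})
names = sorted(font)
print(names)
```

A font that defines neither name of a pair, or both, is left untouched, so the aliasing only fills gaps.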

Comment 16 Piotr Strzelczyk 2009-03-13 05:53:41 UTC

The patch included in gs_type1.ps, which "doubles" some chars, is hard to
disable. It breaks the output of the pf2afm GS script (the AFM has more glyphs than
the PFB). To solve this problem, /t1_glyph_equivalence should be a global, writable
array, or a parameter (e.g. .add_equivalent_glyphs) could be added.

Comment 17 Alex Cherepanov 2009-06-13 07:35:38 UTC

Export the t1_glyph_equivalence table, which provides alternative glyph names.
Modify pf2afm.ps to disable glyph aliasing and generate AFM files that
match the font.

The following patch has been committed as rev. 9792.
http://ghostscript.com/pipermail/gs-cvs/2009-June/009423.html
Regression testing shows no differences.
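
Why an exported table helps a metrics tool can be seen in a small Python model. Everything here (`glyph_equivalence`, `load_font`, the sample glyph table) is hypothetical and only mirrors the behaviour described in the last two comments; it is not how pf2afm.ps is written.

```python
# Illustrative sketch; names and table contents are assumptions.
glyph_equivalence = [("odblacute", "ohungarumlaut")]  # visible to callers

def load_font(char_strings, equivalence):
    """Copy the glyph table, adding an alias for each one-sided pair."""
    loaded = dict(char_strings)
    for a, b in equivalence:
        if a in loaded and b not in loaded:
            loaded[b] = loaded[a]
        elif b in loaded and a not in loaded:
            loaded[a] = loaded[b]
    return loaded

pfb_glyphs = {"o": "...", "odblacute": "..."}

aliased = load_font(pfb_glyphs, glyph_equivalence)  # default behaviour
plain = load_font(pfb_glyphs, [])                   # aliasing disabled

counts = (len(pfb_glyphs), len(aliased), len(plain))
print(counts)
```

With aliasing on, the loaded font reports one glyph more than the font file holds, which is the mismatch reported for the AFM output. Passing an empty table restores a one-to-one match.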

Comment 18 Marcos H. Woehrmann 2011-09-18 21:45:46 UTC

Changing customer bugs that have been resolved more than a year ago to closed.

Comment 19 Marcos H. Woehrmann 2011-09-22 16:17:49 UTC

The content of attachment 2638 has been deleted by
    Marcos H. Woehrmann <marcos.woehrmann@artifex.com>
who provided the following reason:

Customer requested the file be deleted when no longer needed.

The token used to delete this attachment was generated at 2011-09-22 09:17:29 PDT.