Bug 695897 - Text missing reading PDF file
Summary: Text missing reading PDF file
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: master
Hardware: PC All
Importance: P1 normal
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-01 14:18 UTC by Marcos H. Woehrmann
Modified: 2015-04-23 20:47 UTC
CC List: 0 users

See Also:
Customer: 780
Word Size: ---


Description Marcos H. Woehrmann 2015-04-01 14:18:34 UTC
The customer reports and I've verified that some of the text in the attached PDF file is not rendered when the file is opened by Ghostscript.  Specifically the barcode below the address and the column headings are missing.  Other software that I've tested, including muPDF and Acrobat, read the file without problem.

The command line I'm using for testing:

  bin/gs -sDEVICE=ppmraw -o test.ppm ./RSP-6718_Original_Invoice.pdf

My attempts to simplify this file have been unsuccessful; any change I make to the file causes the missing text to appear when rendered by Ghostscript.
Comment 2 Ken Sharp 2015-04-02 03:35:58 UTC
(In reply to Marcos H. Woehrmann from comment #0)
> The customer reports and I've verified that some of the text in the attached
> PDF file is not rendered when the file is opened by Ghostscript. 
> Specifically the barcode below the address and the column headings are
> missing.  Other software that I've tested, including muPDF and Acrobat, read
> the file without problem.

But that doesn't get away from the fact that the PDF file is invalid. The file contains nested BT operators (i.e. there is a BT inside a BT/ET pair). This is specifically forbidden in the PDF reference manual.

We do actually have code in place to detect this already, but it's defeated by the fact that the extra 'BT' is wrapped up in a gsave/grestore pair.
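
For illustration only, here is a minimal Python sketch of a nested-BT check that a save/restore pair cannot hide. It is not the Ghostscript code. It assumes the content stream has already been reduced to a flat list of operator names, and the function name `find_nested_bt` is hypothetical. The text-object flag is tracked on its own, so q/Q (the PDF operators corresponding to gsave/grestore) have no effect on it.

```python
def find_nested_bt(ops):
    """Return indices of BT operators found inside an already open BT/ET pair.

    `ops` is a flat list of operator names (assumed, simplified tokenisation).
    """
    in_text = False
    nested = []
    for i, op in enumerate(ops):
        if op == "BT":
            if in_text:
                # A BT while a text object is still open: invalid nesting.
                nested.append(i)
            in_text = True
        elif op == "ET":
            in_text = False
        # q and Q are deliberately ignored: the flag is not part of any
        # saved state, so a q ... Q wrapper cannot mask the nested BT.
    return nested


# Hypothetical operator sequence with the extra BT wrapped in q/Q.
stream = ["BT", "Tf", "q", "BT", "Tj", "ET", "Q", "ET"]
print(find_nested_bt(stream))
```

Keeping the flag outside any saved state is the only design point here; how the real interpreter stores it is not shown in this report.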


> My attempts to simplify this file have been unsuccessful; any change I make
> to the file causes the missing text to appear when rendered by Ghostscript.

That's because when Acrobat rewrites the modified file, it silently cleans up the nested BT operators.

We may have reached the limits of what it is possible to fix with this particular file.
Comment 3 Ken Sharp 2015-04-02 04:00:37 UTC
The file has a second glaring error. Towards the end of the content stream it executes a Do (image) operator inside a BT/ET pair. This is also illegal and causes us to emit the image at the current text location instead of the 'correct' image location.

Again, this may not be possible to fix (without breaking other, also invalid, files).
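
As a rough sketch of how the second problem could be spotted when inspecting a file, the Python below flags any Do operator that appears while a text object is open. It is an illustration under assumptions, not Ghostscript's implementation: the content stream is taken to be a flat list of operator names, and `find_do_in_text` is a hypothetical helper.

```python
def find_do_in_text(ops):
    """Return indices of Do operators that occur between BT and ET.

    `ops` is a flat list of operator names (assumed, simplified tokenisation).
    """
    in_text = False
    hits = []
    for i, op in enumerate(ops):
        if op == "BT":
            in_text = True
        elif op == "ET":
            in_text = False
        elif op == "Do" and in_text:
            # XObject painted inside a text object: the case described above.
            hits.append(i)
    return hits


# Hypothetical operator sequence: one Do inside BT/ET, one outside.
stream = ["BT", "Tj", "Do", "ET", "Do"]
print(find_do_in_text(stream))
```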
Comment 4 Ken Sharp 2015-04-06 04:03:25 UTC
The first error is 'fixed' in commit 76c20780b2148e56ffcb6944d910d5a04f4f96a9
The second error is 'fixed' in commit 54f502f35b12fd889a47e048b15d92bd8ca66d55

*Both* commits are required, and note that the second commit slightly modifies the first, so exercise caution if cherry-picking.

As the commit log says, it should be emphasised that this file badly contravenes the PDF specification. Both the problems exhibited are due to the file being created in ways specifically prohibited in the PDF reference.

If at all possible the creator of the software that produced this PDF file should be notified, and encouraged to fix these problems.