Bug 696017 - Portions of page missing
Summary: Portions of page missing
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: master
Hardware: PC Linux
Importance: P1 normal
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-01 11:15 UTC by Marcos H. Woehrmann
Modified: 2015-06-28 23:19 UTC

See Also:
Customer: 780
Word Size: ---


Attachments
valid PDF file, rectangle is outside text block (2.57 KB, application/pdf)
2015-06-06 12:19 UTC, Ken Sharp
invalid PDF file, rectangle inside text block (2.57 KB, application/pdf)
2015-06-06 12:19 UTC, Ken Sharp

Description Marcos H. Woehrmann 2015-06-01 11:15:27 UTC
The customer reports, and I've verified, that when the attached PDF is rendered by Ghostscript master, portions of the expected output are missing (see attached screenshot.jpg). All versions of Acrobat I've tried (11, 10, and 8) render the complete page, though other software, including MuPDF and Apple Preview, is also missing portions of the output (but not the same portions that Ghostscript misses).

The command line I'm using for testing:

  bin/gs -sDEVICE=ppmraw -o test.ppm ./RSP-6718_Original_Invoice.pdf
Comment 3 Marcos H. Woehrmann 2015-06-01 11:23:51 UTC
I realize that this file is very broken (see bug 695897); however, it was suggested that it may be possible to read the file as expected. The exact quote was: "Again, this may not be possible to fix (without breaking other, also invalid, files)."
Comment 4 Ken Sharp 2015-06-06 12:19:05 UTC
Created attachment 11726
valid PDF file, rectangle is outside text block
Comment 5 Ken Sharp 2015-06-06 12:19:40 UTC
Created attachment 11727
invalid PDF file, rectangle inside text block
Comment 6 Ken Sharp 2015-06-06 12:30:06 UTC
The file is, as noted previously, very invalid. We had previously addressed the problems with text blocks inside text blocks, and with images being drawn inside text blocks; however, the missing lines are caused by linework being drawn inside a text block (all of these cases contravene the PDF specification).

We can (and a future commit will) fix the missing lines.
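
As a rough illustration of what "linework inside a text block" means, the sketch below flags operators that the PDF specification does not permit between BT and ET. The function name find_violations is hypothetical and this is not Ghostscript code; it assumes whitespace-separated tokens and no string operands containing spaces, which a real parser cannot assume.

```python
# Operators that may not appear inside a BT ... ET text object:
# path construction, path painting, clipping, XObjects, inline images, shading.
PATH_CONSTRUCTION = {"m", "l", "c", "v", "y", "h", "re"}
PATH_PAINTING = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"}
OTHER_DISALLOWED = {"W", "W*", "Do", "BI", "sh"}
DISALLOWED_IN_TEXT = PATH_CONSTRUCTION | PATH_PAINTING | OTHER_DISALLOWED


def find_violations(content: str) -> list[tuple[int, str]]:
    """Return (token_index, operator) for each disallowed operator inside BT/ET.

    Assumption: tokens are whitespace-separated. A real parser must also
    handle strings, comments, arrays and inline image data.
    """
    violations = []
    in_text = False
    for index, token in enumerate(content.split()):
        if token == "BT":
            in_text = True
        elif token == "ET":
            in_text = False
        elif in_text and token in DISALLOWED_IN_TEXT:
            violations.append((index, token))
    return violations
```

Run against a stream such as "BT /F1 12 Tf 72 700 Td (Hi) Tj 0 0 m 100 0 l S ET", this reports the m, l and S operators; the same line drawn after the ET reports nothing.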

The final problem, the missing blocks of text at the bottom of the page, is due to the way that Acrobat apparently works internally. PDF content streams are processed sequentially: each object is drawn as it is defined. In this case we have a situation where text is drawn, and in the same text block an image is drawn which lies on top of the text.

What we would expect to happen (and what does happen with Ghostscript, Apple Preview, MuPDF, poppler, pdf.js and every other PDF consumer we've tried) is that the text is obscured by the image.

Acrobat, however, seems to draw the image *before* drawing the text, even though the image comes after the text in the PDF file. From this it would seem that Acrobat is delaying emitting the text until it reaches the 'ET', but draws other objects (which should, after all, not be present) immediately.

The attached files valid.pdf and invalid.pdf demonstrate this (though using a black rectangle instead of an image). The only difference between the two is whether the rectangle is drawn inside or outside the text block.
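
The difference can be sketched with a toy model of paint order. The two content streams below are hypothetical stand-ins, not the actual contents of the attachments, and the "deferred" mode is only a guess at the behaviour observed in Acrobat, not a description of its internals. The function name paint_order is likewise made up for this example.

```python
def paint_order(content: str, defer_text: bool) -> list[str]:
    """Return the painted objects from bottom to top.

    Assumptions: whitespace-separated tokens; only 'Tj' (show text) and
    'f' (fill path) are modelled as painting operators.
    """
    painted = []
    pending_text = []
    in_text = False
    for token in content.split():
        if token == "BT":
            in_text = True
        elif token == "ET":
            in_text = False
            painted.extend(pending_text)  # flush text held back until ET
            pending_text.clear()
        elif token == "Tj":
            if defer_text and in_text:
                pending_text.append("text")
            else:
                painted.append("text")
        elif token == "f":
            painted.append("rectangle")
    painted.extend(pending_text)  # unterminated text block: flush at the end
    return painted


# Hypothetical streams: the rectangle is filled after ET (valid)
# or before ET, inside the text block (invalid).
VALID = "BT /F1 24 Tf 100 700 Td (Hello) Tj ET 0 g 90 690 120 40 re f"
INVALID = "BT /F1 24 Tf 100 700 Td (Hello) Tj 0 g 90 690 120 40 re f ET"
```

With defer_text=False both streams paint the rectangle last, so it hides the text. With defer_text=True the valid stream is unchanged, but the invalid one paints the text last, which matches what Acrobat shows.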

Attempting to emulate this behaviour would involve some considerable rewriting of the PDF interpreter and would likely have negative performance implications for text rendering in all cases. In other words, we would reduce performance only in order to emulate undocumented behaviour of Acrobat when faced with an invalid PDF file.

I'm reluctant to try and do this, as it will take some time to implement, will likely lead to performance degradation, and will undoubtedly introduce other bugs.
Comment 7 Ken Sharp 2015-06-16 02:20:39 UTC
Commit 72ca9b670f70cfaad1a299f891d03a313143cc3c resolves the missing strokes (and caters for other path constructors like 're').

This has been quite difficult to work around, taking several days to get working without breaking anything else.

We don't really support nested text blocks (because the spec says quite unambiguously that they are illegal) and simply terminate the earlier one if we find a BT inside a BT. Further changes in this area are likely to be nigh impossible without breaking other files. We would have to rewrite the text handling.
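
The "terminate the earlier one" rule can be sketched as a pass over the operator tokens. This is only an illustration of the idea, not the interpreter's code; the function name close_nested_text_blocks is hypothetical, and dropping a stray ET is an assumption made here to keep the output balanced.

```python
def close_nested_text_blocks(tokens: list[str]) -> list[str]:
    """Insert an implicit ET before any BT found inside an open text block."""
    out = []
    in_text = False
    for token in tokens:
        if token == "BT":
            if in_text:
                out.append("ET")  # terminate the earlier block
            in_text = True
        elif token == "ET":
            if not in_text:
                continue  # stray ET with no open block: dropped in this sketch
            in_text = False
        out.append(token)
    return out
```

For the input "BT (a) Tj BT (b) Tj ET ET" this yields two consecutive, properly closed text blocks, and the trailing unmatched ET is discarded.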

I intend to have one attempt at mimicking the bizarre Acrobat behaviour where it images text only at the ET, meaning the z-ordering of objects is wrong. If I can't make that work reasonably easily I'm going to give up on it.
Comment 8 Ken Sharp 2015-06-18 07:13:50 UTC
I've come to the conclusion that the insane Z-ordering performed by Acrobat is impossible for us to reproduce. In order to do so we would somehow have to arrange to execute text blocks which contain non-text operations twice. The first time through we'd draw everything, then we'd go back and only draw the text operations, thus ensuring they come out on top.

This means repositioning the current execution pointer. But we can't do that for a compressed stream (e.g. Flate) because we would have to go right back to the first compressed byte and decompress until we reached the point we want to get to.
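
The cost of "rewinding" a Flate stream can be shown with Python's zlib module. This is a sketch of the general problem rather than anything taken from Ghostscript, and read_from_offset is a hypothetical helper: to reach a given position in the decompressed data, it has to start again from the first compressed byte and throw away everything before that position.

```python
import zlib


def read_from_offset(compressed: bytes, offset: int, length: int) -> bytes:
    """Return `length` decompressed bytes starting at decompressed `offset`.

    There is no index into the compressed data, so the decoder is restarted
    from the first compressed byte and the leading output is discarded.
    """
    decoder = zlib.decompressobj()
    produced = bytearray()
    # Feed the compressed data in small chunks, as a streaming filter would.
    for start in range(0, len(compressed), 64):
        produced += decoder.decompress(compressed[start:start + 64])
        if len(produced) >= offset + length:
            break
    produced += decoder.flush()
    return bytes(produced[offset:offset + length])
```

Every call repeats the decompression work for all bytes before the offset. The alternative is to keep the whole decompressed stream buffered so it can be re-read, which trades that repeated work for memory on every stream.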

The way our PDF interpreter works, that's essentially impossible. We could code it, but it would mean rewriting the entire interpreter so that every stream can be rewound, or so that every stream is a reusable stream. That's probably a couple of years of work, with a mighty long bug tail afterwards. And the easy answer of using a reusable stream throughout would probably have negative performance implications.

The PDF file is invalid, and we now do the best we can with it. That's as good as it's going to get.