The customer reports, and I've verified, that when the attached PDF is rendered by Ghostscript master, portions of the expected output are missing (see attached screenshot.jpg). All versions of Acrobat I've tried (11, 10, and 8) render the complete page, though other software, including MuPDF and Apple Preview, also drops portions of the output (but not the same portions that Ghostscript misses). The command line I'm using for testing: bin/gs -sDEVICE=ppmraw -o test.ppm ./RSP-6718_Original_Invoice.pdf
I realize that this file is very broken (see bug 695897); however, it was suggested that it may be possible to read the file as expected. The exact quote was: "Again, this may not be possible to fix (without breaking other, also invalid, files)."
Created attachment 11726: valid PDF file, rectangle is outside text block
Created attachment 11727: invalid PDF file, rectangle inside text block
The file is, as noted previously, very invalid. We had previously addressed the problems with text blocks inside text blocks, and with images being drawn inside text blocks; however, the missing lines are caused by linework being drawn inside a text block (all of these cases contravene the PDF specification). We can (and a future commit will) fix the missing lines. The final problem, the missing blocks of text at the bottom of the page, is due to the way that Acrobat apparently works internally. PDF content streams are processed sequentially: each object is drawn as it is defined. In this case we have a situation where text is drawn, and in the same text block an image is drawn which lies on top of the text. What we would expect to happen (and what does happen with Ghostscript, Apple Preview, MuPDF, poppler, pdf.js and every other PDF consumer we've tried) is that the text is obscured by the image. Acrobat, however, seems to draw the image *before* drawing the text, even though the image comes after the text in the PDF file. From this it would seem that Acrobat delays emitting the text until it reaches the 'ET', but draws other objects (which should, after all, not be present) immediately. The attached files valid.pdf and invalid.pdf demonstrate this (though using a black rectangle instead of an image). The only difference between the two is whether the rectangle is drawn inside or outside the text block. Attempting to emulate this behaviour would involve considerable rewriting of the PDF interpreter and would likely have negative performance implications for text rendering in all cases. In other words, we would reduce performance solely in order to emulate undocumented behaviour of Acrobat when faced with an invalid PDF file. I'm reluctant to try this, as it will take some time to implement, will likely lead to performance degradation, and will undoubtedly introduce other bugs.
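To make the z-order difference concrete, here is a toy model in Python. It is only a sketch of the two paint orders described above, not Ghostscript or Acrobat code; the operation kinds, the labels and both function names are made up for illustration, and the "deferred" model is simply my guess at what Acrobat appears to do.

```python
from typing import List, Tuple

# Each operation is (kind, label); kind is "BT", "ET", "text" or "other".
# This representation is hypothetical and far simpler than a real content stream.
Op = Tuple[str, str]


def paint_sequential(ops: List[Op]) -> List[str]:
    """Paint every marking operation in stream order (later entries end up on top)."""
    return [label for kind, label in ops if kind in ("text", "other")]


def paint_text_deferred(ops: List[Op]) -> List[str]:
    """Hold text back until ET, but paint non-text objects immediately."""
    painted: List[str] = []
    pending: List[str] = []
    in_text = False
    for kind, label in ops:
        if kind == "BT":
            in_text = True
        elif kind == "ET":
            painted.extend(pending)  # the held-back text lands on top here
            pending = []
            in_text = False
        elif kind == "text" and in_text:
            pending.append(label)
        elif kind in ("text", "other"):
            painted.append(label)
    painted.extend(pending)  # unterminated text block: flush whatever is left
    return painted


# Same objects, differing only in whether the rectangle sits inside BT/ET.
invalid = [("BT", ""), ("text", "invoice text"), ("other", "black rectangle"), ("ET", "")]
valid = [("BT", ""), ("text", "invoice text"), ("ET", ""), ("other", "black rectangle")]

if __name__ == "__main__":
    for name, ops in (("valid", valid), ("invalid", invalid)):
        print(name, "sequential:", paint_sequential(ops))
        print(name, "deferred:  ", paint_text_deferred(ops))
```

For the valid ordering both models agree; for the invalid one the deferred model puts the text on top of the rectangle, which matches what we see from Acrobat.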
Commit 72ca9b670f70cfaad1a299f891d03a313143cc3c resolves the missing strokes (and caters for other path constructors like 're'). This has been quite difficult to work around, taking several days to get working without breaking anything else. We don't really support nested text blocks (because the spec says quite unambiguously that they are illegal) and simply terminate the earlier one if we find a BT inside a BT. Further changes in this area are likely to be nigh impossible without breaking other files. We would have to rewrite the text handling. I intend to have one attempt at mimicking the bizarre Acrobat behaviour where it images text only at the ET, meaning the z-ordering of objects is wrong. If I can't make that work reasonably easily I'm going to give up on it.
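For anyone following along, the "terminate the earlier one" rule for nested text blocks can be pictured roughly like this. This is an illustrative Python sketch over a flat list of operator names, not the code from the commit; the function name and the list representation are assumptions.

```python
from typing import List


def close_nested_text_blocks(operators: List[str]) -> List[str]:
    """Insert an implicit ET whenever a BT appears inside an open text block."""
    out: List[str] = []
    in_text = False
    for op in operators:
        if op == "BT":
            if in_text:
                out.append("ET")  # end the earlier block before starting the new one
            in_text = True
        elif op == "ET":
            in_text = False
        out.append(op)
    return out


if __name__ == "__main__":
    print(close_nested_text_blocks(["BT", "Tj", "BT", "Tj", "ET"]))
```

Note that with this rule the final ET of a nested pair arrives when no block is open, which is one reason further changes here are so easy to get wrong.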
I've come to the conclusion that the insane Z-ordering performed by Acrobat is impossible for us to reproduce. In order to do so we would somehow have to arrange to execute text blocks which contain non-text operations twice. The first time through we'd draw everything, then we'd go back and draw only the text operations, thus ensuring they come out on top. This means repositioning the current execution pointer. But we can't do that for a compressed stream (e.g. Flate) because we would have to go right back to the first compressed byte and decompress until we reached the point we want to get to. The way our PDF interpreter works, that's essentially impossible. We could code it, but it would mean rewriting the entire interpreter so that every stream can be rewound, or so that every stream is a reusable stream. That's probably a couple of years of work, with a mighty long bug tail afterwards. And the easy answer of using a reusable stream throughout would probably have negative performance implications. The PDF file is invalid, and we now do the best we can with it. That's as good as it's going to get.
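To illustrate why "rewinding" a Flate stream is costly, here is a small sketch using Python's zlib module. It is not how our interpreter handles streams; the helper name, the sample data and the 64-byte chunk size are assumptions chosen for the example. The point is that getting back to a given position in the decompressed data means starting again from the first compressed byte and throwing away everything before that position.

```python
import zlib


def read_from_offset(compressed: bytes, offset: int, length: int) -> bytes:
    """Return `length` decompressed bytes starting at decompressed `offset`.

    There is no way to jump into the middle of the compressed data, so this
    always restarts at the first compressed byte and discards the prefix.
    """
    decomp = zlib.decompressobj()
    out = bytearray()
    for i in range(0, len(compressed), 64):  # feed the input in small chunks
        out += decomp.decompress(compressed[i:i + 64])
        if len(out) >= offset + length:
            break  # stop once the wanted range has been produced
    out += decomp.flush()
    return bytes(out[offset:offset + length])


# A made-up content stream: a text block with a rectangle drawn inside it.
content = b"BT (some text) Tj 0 0 100 100 re f ET\n" * 200
compressed = zlib.compress(content)

if __name__ == "__main__":
    print(read_from_offset(compressed, 1000, 20))
```

Every "go back and redraw the text" pass would pay this cost again, or else the whole decompressed stream would have to be kept around, which is the reusable stream option with its own performance price.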