696886 – make the pdf interpreter to be less restrictive

Summary: make the pdf interpreter to be less restrictive

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Interpreter
Version:	unspecified
Hardware:	PC Linux

Importance:	P4 enhancement
Assignee:	Ken Sharp

Reported:	2016-06-29 17:12 UTC by Nonsmoker
Modified:	2016-07-04 02:32 UTC
CC List:	1 user

Attachments
file with broken bookmarks (29.18 MB, application/pdf) 2016-06-30 11:47 UTC, Nonsmoker

Description Nonsmoker 2016-06-29 17:12:07 UTC

Hi there,


I'm using a program called djvudigital to convert PDF files to .djvu files. Djvudigital in turn uses Ghostscript. Apparently, Ghostscript is a little restrictive when it comes to recovering from errors, which makes some PDF files less useful to me.

As my technical expertise is very limited, I'll point you to the ticket that I created for the djvudigital devs - I hope that explains it:

https://sourceforge.net/p/djvu/feature-requests/91/

Let me know if you need the original pdf file.


cheers

Comment 1 Ken Sharp 2016-06-30 00:21:58 UTC

(In reply to Nonsmoker from comment #0)
> Hi there,
> 
> 
> I'm using a program called djvudigital to convert PDF files to .djvu files.
> Djvudigital in turn uses Ghostscript. Apparently, Ghostscript is a little
> restrictive when it comes to recovering from errors, which makes some PDF
> files less useful to me.
> 
> As my technical expertise is very limited, I'll point you to the ticket that
> I created for the djvudigital devs - I hope that explains it:
> 
> https://sourceforge.net/p/djvu/feature-requests/91/
> 
> Let me know if you need the original pdf file.

We *always* need the original file. Note that Ghostscript's PDF interpreter is already very liberal in its interpretation of damaged/broken/invalid PDF files, so the summary "make the pdf interpreter to be less restrictive" doesn't really make any sense; it does not describe any kind of bug at all.

I cannot see from the DjVu thread what the problem is or might be.

If you want this looked at you will have to supply an example file and a Ghostscript command line to reproduce the problem. I'd also recommend trying current code from our Git repository.

Comment 2 Nonsmoker 2016-06-30 02:42:17 UTC

> We *always* need the original file.

I can't share it openly - you'll have a mail shortly.

> it does not describe any kind of bug at all.

That's why it is marked as "enhancement" :-)

> I cannot see from the DejaVu thread what the problem is or might be.
> ...and a Ghostscript command line to reproduce the problem.

There's little I can do besides reiterating what has already been written on SourceForge. But your wish is my command:


Originally, I wanted to convert a PDF file that has some broken bookmarks into .djvu. I can use the bookmarks (in the PDF file) in both Acrobat and Foxit Reader without problems. In Evince, the problematic bookmarks are shown, but the jump to the destination fails. After conversion to .djvu, those bookmarks are left out entirely. At first I thought this was a problem with djvudigital and complained about it to the DjVuLibre devs.

I thought that the link targets that djvudigital chokes on look like this: /Dest[8438 0 R/FitH]
But as per §8.2.1 of PDF Reference 1.7, a /FitH requires a parameter, which is missing.

So this PDF file appears to be broken. But arguably djvudigital could do a better job at error recovery - as I thought.

The DjVuLibre dev, however, suggested that this behaviour has its roots in Ghostscript. This is what he did:

The PDF interpreter in Ghostscript converts the outline (and lots of PDF info) into pdfmarks.
Here is how you can see the pdfmarks in the pdf file ‘stanley.pdf’:
$ gs -q -dNODISPLAY -dDOPDFMARKS -c '/pdfmark { ] ([) print { ( ) print ===only } forall ( pdfmark\n) print } bind def' -dNOPAUSE stanley.pdf -c quit
And it starts like this:
**** Warning: Outline has invalid link that was discarded.
….
**** Warning: Outline has invalid link that was discarded.
**** Warning: Outline has invalid link that was discarded.
[ /Page 1 /View [/FitH 782.0] /Title (Cover) /OUT pdfmark
[ /Page 2 /View [/FitH 817.0] /Title (Gunstream\220s Anatomy & Physiology) /OUT pdfmark
[ /Page 4 /View [/FitH 819.0] /Title (ABOUT THE AUTHORS) /OUT pdfmark
[ /Page 5 /View [/FitH 818.0] /Title (CONTENTS) /OUT pdfmark
[ /Page 8 /View [/FitH 819.0] /Title (PREFACE) /OUT pdfmark
[ /Keywords () /DOCINFO pdfmark
[ /CropBox [0.0 0.0 609.84 780.96] /PAGE pdfmark
[ /CropBox [0.0 0.0 612.0 783.0] /PAGE pdfmark
[ /BleedBox [0.0 0.0 612.0 783.0] /PAGE pdfmark
[ /TrimBox [0.0 0.0 612.0 783.0] /PAGE pdfmark
[ /CropBox [36.0 32.976 648.0 815.976] /PAGE pdfmark

The warnings pertain to the incorrect outline entries.
Basically Ghostscript eliminates them before the gsdjvu driver gets a chance to see them.
One would have to modify the pdf interpreter to be less restrictive. Maybe changing the ‘exec’ line 1919 in http://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=Resource/Init/pdf_main.ps;h=7d552e7d9f11eb90cb286788a775e4a5ba2ad661;hb=HEAD#l1919 by ‘stopped pop’ would make gs more permissive.
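
For anyone who wants to compare the discarded links against the surviving outline entries, the dump above can be summarized with a short script. This is only a minimal sketch in Python: it assumes the output looks exactly like the lines shown above (the warning text and the `[ /Page ... /View [...] /Title (...) /OUT pdfmark` layout), and the names `summarize`, `OUTLINE_RE` and `WARNING` are made up for this illustration; they are not part of Ghostscript or djvudigital.

```python
import re

# Matches an outline entry as printed by the redefined pdfmark operator, e.g.
# [ /Page 1 /View [/FitH 782.0] /Title (Cover) /OUT pdfmark
# Assumption: key order and spacing are exactly as in the listing above.
OUTLINE_RE = re.compile(
    r"^\[ /Page (\d+) /View \[(.*?)\] /Title \((.*)\) /OUT pdfmark$"
)

# Warning text copied from the listing above.
WARNING = "**** Warning: Outline has invalid link that was discarded."


def summarize(lines):
    """Return (number of discarded links, list of (page, view, title))."""
    discarded = 0
    entries = []
    for line in lines:
        line = line.strip()
        if line == WARNING:
            discarded += 1
            continue
        match = OUTLINE_RE.match(line)
        if match:
            page, view, title = match.groups()
            entries.append((int(page), view, title))
    return discarded, entries


# A few lines taken from the listing above (raw strings keep the backslash).
sample = [
    r"**** Warning: Outline has invalid link that was discarded.",
    r"**** Warning: Outline has invalid link that was discarded.",
    r"[ /Page 1 /View [/FitH 782.0] /Title (Cover) /OUT pdfmark",
    r"[ /Page 2 /View [/FitH 817.0] /Title (Gunstream\220s Anatomy & Physiology) /OUT pdfmark",
    r"[ /Keywords () /DOCINFO pdfmark",
]

discarded, entries = summarize(sample)
print(f"{discarded} discarded, {len(entries)} outline entries kept")
for page, view, title in entries:
    print(page, view, title)
```

Lines that are neither the warning nor an /OUT entry (such as the /DOCINFO and /PAGE pdfmarks) are simply ignored, so the warnings on stderr and the dump on stdout can be fed in together.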

There's nothing that I can add personally, as I have zero knowledge of Ghostscript.

cheers

Comment 3 Ken Sharp 2016-06-30 02:51:48 UTC

(In reply to Nonsmoker from comment #2)

> I can't share it openly - you'll have a mail shortly.

I hope it's nice and small; large files are painful to deal with, especially via email.

We can mark attachments private so that only Artifex staff can see them; it would be better to work that way.


> I thought that the link targets that djvudigital chokes on look like this:
> /Dest[8438 0 R/FitH]
> But as per §8.2.1 of PDF Reference 1.7, a /FitH requires a parameter, which
> is missing.

cf Bug #696838, this seems to be another of the same ilk.


> One would have to modify the pdf interpreter to be less restrictive. Maybe
> changing the ‘exec’ line 1919 in
> http://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=Resource/Init/pdf_main.
> ps;h=7d552e7d9f11eb90cb286788a775e4a5ba2ad661;hb=HEAD#l1919 by ‘stopped pop’
> would make gs more permissive.

No, that's not going to happen. Broken links will be discarded; anything else is a truly horrible idea. In any event I don't believe that would produce a Dest in the output anyway.

Comment 4 Nonsmoker 2016-06-30 11:47:40 UTC

Created attachment 12661
file with broken bookmarks

Comment 5 Nonsmoker 2016-06-30 11:49:11 UTC

I was able to reduce the file size to 30 MB without repairing the file. That's all I can do.

Comment 6 Ken Sharp 2016-07-04 02:32:08 UTC

As expected this is related to Bug #696838.

The problem is that the PDF Reference states that some elements of the array can be 'null' without stating whether this means the null object, or just missing.

Commit 810ce1e302af7d12e08650ccf0d88407b04a0d46 now supports missing elements in FitH, FitV, FitBH and FitBV. The FitR documentation states that null parameters have 'undefined results', so this one remains unchanged.
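
The idea of treating a missing element like a null one can be illustrated roughly as follows. This is a sketch in Python, not the code from the commit (Ghostscript's PDF interpreter is written in PostScript). The parameter counts come from the destination syntax table in the PDF Reference; the names `normalize_dest`, `PARAM_COUNTS` and `PADDABLE` are hypothetical, and discarding every other short destination (XYZ, FitR) is an assumption made for this example only.

```python
# Number of parameters that follow the type name in an explicit destination
# array [page /Type ...], per the PDF Reference destination syntax table.
PARAM_COUNTS = {
    "XYZ": 3, "Fit": 0, "FitH": 1, "FitV": 1,
    "FitR": 4, "FitB": 0, "FitBH": 1, "FitBV": 1,
}

# Types where a missing parameter is treated like an explicit null.
PADDABLE = {"FitH", "FitV", "FitBH", "FitBV"}


def normalize_dest(dest):
    """Return dest with missing parameters filled in as None (null),
    or None if the destination should be discarded.

    Assumption: anything short that is not in PADDABLE is discarded here;
    the bug report does not say how every other case is handled.
    """
    if len(dest) < 2:
        return None
    page, kind, *params = dest
    expected = PARAM_COUNTS.get(kind)
    if expected is None or len(params) > expected:
        return None
    if len(params) < expected:
        if kind not in PADDABLE:
            return None
        params = params + [None] * (expected - len(params))
    return [page, kind, *params]


# The broken destination from the report, /Dest[8438 0 R/FitH]:
print(normalize_dest(["8438 0 R", "FitH"]))
# A complete destination passes through unchanged:
print(normalize_dest(["8438 0 R", "FitH", 782.0]))
# An incomplete FitR is not padded in this sketch:
print(normalize_dest(["8438 0 R", "FitR", 0, 0]))
```

Here the page reference is kept as an opaque string and null is represented by `None`; a consumer would still have to decide what a null coordinate means when the link is followed.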

At 577 pages the specimen file is much too large to check comprehensively; a sample of a couple of links looks correct though. Note that, contrary to the specification, Acrobat does not appear to use the 'current value' when a parameter is null; however, the original file behaves in exactly the same way.