Bug 703448 - Ghostscript can't read files that poppler, mupdf and Firefox and others can read
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: 9.53.3
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2021-01-31 17:59 UTC by Rogério Theodoro de Brito
Modified: 2021-02-01 16:31 UTC

Attachments
output from ghostscript (39.44 KB, text/plain)
2021-01-31 17:59 UTC, Rogério Theodoro de Brito

Description Rogério Theodoro de Brito 2021-01-31 17:59:16 UTC
Created attachment 20531
output from ghostscript

Dear people,

I have numerous files that I got from a major scientific publisher and that I can't run OCRmyPDF on.

OCRmyPDF offloads some tasks to ghostscript to help in the process of OCR'ing, but, with the files that I have, all that I get are stack traces from ghostscript, like the following:

$ gv foo.pdf 
   **** Error: Something went wrong while checking for recursion in the Page tree. Giving up checking.
               This PDF file may not terminate, if there is a loop in the Pages tree.
Error: /typecheck in --gt--
Operand stack:
   --nostringval--   24   10869687   20   24   1511   --dict:8/15(L)--   1109   --nostringval--   1120   --nostringval--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:10/10(L)--   --dict:10/10(L)--   --dict:10/10(L)--   --dict:10/10(L)--   --dict:6/6(L)--   --dict:10/10(L)--   --dict:6/6(L)--   --dict:10/10(L)--
(...)
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   1833   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--
Dictionary stack:
   --dict:734/1123(ro)(G)--   --dict:1/20(G)--   --dict:85/200(L)--   --dict:85/200(L)--   --dict:133/256(ro)(G)--   --dict:317/325(ro)(G)--   --dict:25/32(L)--
Current allocation mode is local
Current file position is 2010
$


It doesn't matter whether I invoke ghostscript via gv or with a simple command like:

gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -sOutputFile=bar.pdf foo.pdf

Since the files in question are copyrighted, I can provide them privately. Just let me know and I will do as instructed.

I'm using Debian's testing packaged version of ghostscript (currently, 9.53.3~dfsg-6).


Thanks for any help,

Rogério Brito.
Comment 1 Peter Cherepanov 2021-01-31 18:31:56 UTC
Please attach your PDF file. Little can be done about the problem without reproducing it on the developer's side.
Comment 2 Ken Sharp 2021-01-31 19:32:58 UTC
(In reply to Rogério Theodoro de Brito from comment #0)

> It doesn't matter if I invoke ghostscript via gv or via, say, as a simple
> command like:
> 
> gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -sOutputFile=bar.pdf foo.pdf
> 
> Since the files in question are copyrighted, I can provide them privately.
> Just let me know and I will do as instructed.

It would be preferable if you could source a PDF file which you can share. Perhaps you can ask the publisher if they can find one for you.

If you absolutely can't (or won't) do that, then you can email the file to me and I will attach it here marked as private, which prevents non-Artifex staff from viewing it. Please select the simplest file possible.
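
For anyone with a large batch of affected files, one way to pick a candidate to send is to try them in order of size and stop at the first failure. The following is only a sketch: it reuses the gs switches quoted in the report, assumes gs is on the PATH, and assumes that a failure shows up either as a non-zero exit status or as "Error:" in the output. The function names are made up for illustration, and file size is only a rough proxy for "simplest".

```python
import subprocess
from pathlib import Path


def build_gs_command(src, dst):
    # Same switches as the gs command quoted in the report.
    return ["gs", "-dBATCH", "-dNOPAUSE", "-dSAFER", "-sDEVICE=pdfwrite",
            f"-sOutputFile={dst}", str(src)]


def looks_failed(returncode, output):
    # Assumption: a failure is a non-zero exit status or "Error:" in the output.
    return returncode != 0 or "Error:" in output


def smallest_failing(paths, out_dir):
    # Try the files from smallest to largest and return the first that fails.
    for src in sorted(map(Path, paths), key=lambda p: p.stat().st_size):
        dst = Path(out_dir) / (src.stem + ".out.pdf")
        proc = subprocess.run(build_gs_command(src, dst), capture_output=True,
                              text=True, errors="replace")
        if looks_failed(proc.returncode, proc.stdout + proc.stderr):
            return src
    return None
```

Calling smallest_failing(Path(".").glob("*.pdf"), "/tmp") would return the path of the smallest file that trips the check, or None if every file converts cleanly.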
Comment 3 Rogério Theodoro de Brito 2021-01-31 21:10:07 UTC
(In reply to Ken Sharp from comment #2)
> (In reply to Rogério Theodoro de Brito from comment #0)
> > gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -sOutputFile=bar.pdf foo.pdf
> 
> If you absolutely can't (or won't) do that, then you can email the file to
> me and I will attach it here marked as private, which prevents non-Artifex
> staff from viewing it. Please select the simplest file possible.

Just sent you (and Peter) a link with the file. If more information is needed, please let me know.


Thanks,

Rogério Brito.
Comment 5 Ken Sharp 2021-02-01 16:31:33 UTC
The problem is, as might be expected, that the PDF file is invalid. It has an ObjStm, a compressed set of object definitions, where one of the objects returns a mark object.

This isn't legal in PDF or in an ObjStm, and it seriously confuses our PDF interpreter when it counts the objects that it got back from the ObjStm. We need to count them because we have found other broken PDF files where the number of objects contained in the ObjStm is not the same as the number of objects that the stream declares that it contains.
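
For readers unfamiliar with object streams, here is a rough sketch of the kind of count mismatch described above. It is not the Ghostscript code and does not reflect the commit; it only relies on the layout defined by the PDF specification, where the decoded stream starts with pairs of integers (object number, offset relative to /First) and the stream dictionary declares the count in /N. The function name is hypothetical, and the sketch assumes the stream has already been decompressed and that the offsets are in ascending order.

```python
def split_objstm(decoded, declared_n, first):
    """Split a decoded ObjStm into objects and check the count against /N.

    decoded    -- the already-decompressed stream bytes (assumption)
    declared_n -- the /N value from the stream dictionary
    first      -- the /First value from the stream dictionary
    """
    # The header is a run of integers: object number, offset, repeated.
    tokens = decoded[:first].split()
    if len(tokens) % 2:
        raise ValueError("ObjStm header is not a list of (number, offset) pairs")
    pairs = [(int(tokens[i]), int(tokens[i + 1]))
             for i in range(0, len(tokens), 2)]

    # Offsets are relative to /First; each object runs up to the next offset.
    body = decoded[first:]
    objects = {}
    for idx, (num, off) in enumerate(pairs):
        end = pairs[idx + 1][1] if idx + 1 < len(pairs) else len(body)
        objects[num] = body[off:end].strip()

    # A consumer can compare what it actually found with what /N declares.
    return objects, len(pairs) == declared_n
```

With the bytes `11 0 12 9 <</A 1>> [1 2 3]` and /First 10, the header holds two pairs, so the check passes for /N 2 and fails for /N 3.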

Anyway, I've made a commit here: 41130dd35b2dc43b07600b51d7c9fab466e8bf6c
which works around the broken file without failing on any of the other myriad invalid files we've seen.

I am puzzled why you are running OCRmyPDF on this file, since it has already been OCR'ed.