Summary: | cannot read >2GB pdf file | |
---|---|---|---
Product: | MuPDF | Reporter: | Hin-Tak Leung <htl10>
Component: | mupdf | Assignee: | MuPDF bugs <mupdf-bugs>
Status: | RESOLVED FIXED | |
Severity: | enhancement | CC: | bruce.edge, robin.watts, tor.andersson
Priority: | P4 | |
Version: | unspecified | |
Hardware: | PC | |
OS: | Linux | |
Customer: | | Word Size: | ---
Description
Hin-Tak Leung 2011-12-23 01:00:09 UTC
I am sure somebody at Artifex has some big PDFs over 2GB, but here is a recipe for creating one on a typical Linux/Unix system: use the media-embedding feature of PDF to embed some big movies and push the file over 2GB: https://bugs.freedesktop.org/show_bug.cgi?id=44085#c6

For the record, I spent some time looking into this recently. The 'easy' route is to move from 32-bit to 64-bit offset values within the code. This would allow us to access (effectively) unlimited-size documents. The downsides are that the standard file-access functions can no longer be used (they operate on ints/longs), and that memory usage grows because every object carries larger offsets. The 'hard' route would be to change to unsigned offsets within the code; this would only get us from 2GB to 4GB, and would cause significant pain in certain functions. I suspect that if we do this, we'll pick the 'easy' route. But I can't see us doing it until we actually see such a file (or have a report of a customer/potential customer using such a file). Downgrading to enhancement.

(In reply to comment #2)
> But I can't see us doing this until we actually see such a file

Granted it is rare, but I have such a file: an encyclopedia-like document with figures, etc. I cannot share it (and it would also be technically painful to do so, bandwidth- and size-wise), hence I looked into the LaTeX-based recipe for making one on a typical Linux box, or wherever LaTeX runs.

Note that Ghostscript has 'gp_*64' functions that use the platform's 64-bit file functions when it supports them. These work in 32-bit builds on Linux, Mac OS X and Windows. Doing something similar in MuPDF builds would probably be fairly easy, given that the platform-specific functions are "known".
I suggest an approach similar to gs: the code always calls the 64-bit functions, but these may be hooked to 32-bit wrappers that return errors for unsupportable values if 64-bit is not supported on that platform.

Any update on this bug? I'm running into a similar problem with gs, in that it also fails on >2GB files:

```
%> ls -l /tmp/giant.pdf
-rw-r--r-- 1 qa staff 2328769430 2012-09-11 17:27 /tmp/giant.pdf
%> gs -dNOPAUSE -sDEVICE=jpeg -r144 -sOutputFile=giant-p%03d.jpg /tmp/giant.pdf
GPL Ghostscript 9.06 (2012-08-08)
Copyright (C) 2012 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Error: Cannot find a 'startxref' anywhere in the file.
**** Warning: An error occurred while reading an XREF table.
**** The file has been damaged.  This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
Error: /rangecheck in --run--
Operand stack:
   post_eof_count  -1966197866
Execution stack:
   %interp_exit  .runexec2  --nostringval--  --nostringval--  --nostringval--
   2  %stopped_push  --nostringval--  --nostringval--  --nostringval--
   false  1  %stopped_push  1910  1  3  %oparray_pop  1909  1  3
   %oparray_pop  1893  1  3  %oparray_pop  --nostringval--  --nostringval--
   --nostringval--  --nostringval--  --nostringval--  --nostringval--
   --nostringval--
Dictionary stack:
   --dict:1169/1684(ro)(G)--  --dict:1/20(G)--  --dict:82/200(L)--
   --dict:82/200(L)--  --dict:109/127(ro)(G)--  --dict:293/300(ro)(G)--
   --dict:20/31(L)--
Current allocation mode is local
GPL Ghostscript 9.06: Unrecoverable error, exit code 1
```

2GB isn't as big as it used to be.
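As an aside, the negative `post_eof_count` operand in the trace is exactly what one gets by stuffing the 2,328,769,430-byte file size into a signed 32-bit integer (2328769430 - 2^32 = -1966197866). A minimal sketch of that truncation (`as_int32` is a hypothetical helper, not gs code):

```c
#include <stdint.h>

/* Truncate a 64-bit byte count to a signed 32-bit int, as code that
 * stores file offsets in a plain 32-bit int effectively does.
 * (Pre-C23 the out-of-range conversion is implementation-defined,
 * but it wraps modulo 2^32 on common compilers.) */
static int32_t as_int32(int64_t n)
{
    return (int32_t)n;
}

/* as_int32(2328769430LL) == -1966197866: the size of /tmp/giant.pdf
 * above, wrapped past INT32_MAX into a negative value. */
```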