Bug 688283

Summary: 'trailer' without 'startxref'
Product: Ghostscript Reporter: SaGS <sags5495>
Component: PDF InterpreterAssignee: Alex Cherepanov <alex>
Status: NOTIFIED FIXED    
Severity: enhancement    
Priority: P3    
Version: master   
Hardware: PC   
OS: All   
Customer: Word Size: ---
Attachments: Suggested patch.
Additional test case.
Updated patch (against -r8312).
Modified patch

Description SaGS 2005-08-24 13:50:18 UTC
Sample file:
    I cannot attach the sample file here (not being mine), but it can be 
    obtained from Adobe's site.
(1) Get "Creating a Custom Dynamic Stamp" (lc_custom_stamp_tip.pdf) from 
    http://partners.adobe.com/public/developer/tips/index.html .
(2) This file contains another PDF as a file attachement 
    (content_feedback_form.pdf, 155248 bytes, producer: Adobe Designer 7.0). 
    This is the PDF that shows the problem mentioned below.

The sample file mentioned above contains a trailer dictionary that is NOT 
followed by the token "startxref". On one hand, it is true that only the 
last "startxref" and "%%EOF" are usefull. On the other hand, the 
PDF Reference describes the "File trailer" as a block consisting of token 
"trailer" + trailer dictionary + token "startxref" + number (pointing to 
tha start of the "classic" xref table) + "%%EOF". So, I consider the 
sample PDF as being wrong.

However, Ghostscript's behaviour in this case can be improved. Current 
HEAD ends reading the xref+trailer only when encountering "startxref", so 
in this case it considers the xref as invalid and tries to rebuild it.

After applying the proposed patch, reading the xref+trailer stops at the 
closing ">>". There is still a check for "startxref" and a warning when 
it is missing, but the rebuild logic is not triggered (and not needed).

The change is important because the rebuild logic currently does NOT
handle compressed object streams. The sample file is a hybrid-reference 
PDF. The defective trailer exists only because /XRefStm has to be in an 
update section, not because the file was really updated. If the rebuild 
logic is triggered, it misses objects accessible only to PDF1.5+ viewers.
Comment 1 SaGS 2005-08-24 13:55:16 UTC
Created attachment 1651 [details]
Suggested patch.

The attached patch fixes only the problem mentione din this report. For 
the sample file to be processed 100% correctly, the following patches need 
to be applied too:
  #688150 "ReusableStreamDecode available but failing at languagelevel 2".
  #688152 "'Undefined in get' and extra trailer keys with pdfwrite.ps 
	  and PDF1.5+"
  #688282 "Xref-streams ignored in hybrid-ref PDFs"
Comment 2 SaGS 2006-10-22 02:58:59 UTC
Because of a patch commited to tolerate "startxref" misspelled as 
"startref" (without the "x"), GNU patch partly rejects the diff 
in comment #1.

Fix: the "/startref ..." line introduced by the mentioned commit 
may go away, because now scanning ends at the closing ">>", not 
at "startxref". Patched GS gives a warning for any misspelling or 
absence of the token "startxref".

An updated version is file Bug687796-r7114IJ-to-r7114IJH.diff.txt 
in attachment #2548 [details] (from bug #687796 comment #25, "pdfinflt.ps 
creates an incomplete PDF"). That particular diff needs some other 
patches from there to be applied before it. Not that the fixes are 
interdependent, but the modified lines are physically too close 
for GNU patch to find the context it needs and apply the diffs.
Comment 3 SaGS 2007-10-25 22:42:40 UTC
Created attachment 3500 [details]
Additional test case.

The PDF 1.7 Reference, section F.2.3 "First-Page Cross-Reference 
Table and Trailer (Part 3)", 2nd paragraph on page 1031 states:

   "The first-page trailer may optionally end with startxref, an 
    integer, and %%EOF, just as in an ordinary trailer. This 
    information is ignored."

(note the word "optionally"), so in such a case a PDF interpreter 
cannot rely on anything particular (like "startxref"/ "startref"/ 
"xref") to follow the trailer dictionary. Now consider a linearized 
PDF being incrementally updated, and you get that in the general 
case a "previous trailer" may lack the "startxref". It's possible 
Acrobat does a similar thing when creating hybrid-ref PDFs.

Attached PDF is such a file that "legally" lacks a "startxref". 
(Although originally generated by Distiller, it was binary-edited 
to overwrite the "startxref + number + %%EOF" with spaces.)

Current GS (SVN rev 8312) fails to display this file because it scans 
beyond the end of the trailer dictionary. Reader displays it fine.

  - The 1st time the trailer dict is red, the error triggers an 
    unnecessary rebuild. This is described by the current report 
    and addressed by the suggested patch.

  - During the rebuild, it fails again with an "Error: /undefined in 
    /BXlevel" or an "Error: /undefined in --run--", depending on 
    how you run GS (*). Please see bug #687796 comment #25 point (I).

(*) Don't know why, from some time most (all?) errors that appear 
    while executing GS from the command line (platform: WIN32) are 
    signaled as "... in --run--" or "... in --execute--". When run 
    by GSView, the errors are more informative. Is this OK, or a bug?
Comment 4 SaGS 2007-10-25 22:43:43 UTC
Created attachment 3501 [details]
Updated patch (against -r8312).

- The warning about missing "startxref" removed, since this keyword 
  is not always required. This also avoids getting the "This file had 
  errors that were repaired or ignored. ..." message for this case.

- Diff regenerated to be applicable to current TRUNK.

It undoes the change in rev 8312 and the addition of "/startref ..." 
from rev 6979 too, as these are not usefull anymore.
Comment 5 Alex Cherepanov 2007-10-27 22:06:00 UTC
Created attachment 3505 [details]
Modified patch

Thanks for the patch, SaGS.

I think the patch can be committed but I'd like to do 2 minor changes.
1. Use 'def' instead of 'store'. There's no other dictionaries on the stack
   but 'store' suggests the opposite.
2. Move the initialization of dictlevelcount to pdfopenfile where many other
   variables get initialized.

Do you agree ?
Regression testing shows no differences.
Comment 6 SaGS 2007-10-27 23:03:40 UTC
If you prefer these modifications, and no problem appeared with the modified 
code, it's OK with me.

I'll just explain here why I coded like this:

For (1): In general I prefer "store" for MODIFYING values, being more robust - 
it does not depend [too much] on the dict stack contents, and will immediately 
catch the error if the correct dict is missing from the dict stack.

For (2): I think it's better to INITIALISE (not necessarily 
CREATE) "dictlevelcount" just before scanning for the trailer. What if there's 
an error in the trailer and "dictlevecount" remains > 0? The rebuild is 
triggered, but for the rebuild it will start with the wrong value. However, 
this won't matter too much because it's expected to fail anyway (nobody fixes 
the trailer in the meantime!).
Comment 7 Alex Cherepanov 2007-10-28 07:32:28 UTC
The patch from the comment #4 with "store" changed to "def" is committed
as a rev. 8327.

Although "store" may be more robust, it greatly extends the scope that the
programmer has to consider reading the code. This is similar to use of globals
in C instead of static variables. The rest of Ghostscript code uses "store"
only when it's needed.