708234 – `ps2pdf` swallows outline entries

Status:	RESOLVED INVALID

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	master
Hardware:	PC Linux

Importance:	P2 normal
Assignee:	Default assignee

Duplicates (1):	708244

Reported:	2025-01-04 05:46 UTC by Werner Lemberg
Modified:	2025-01-11 11:58 UTC
CC List:	1 user

Attachments
input PDF file (20.70 MB, application/x-xz) 2025-01-04 05:46 UTC, Werner Lemberg
original outline (72.48 KB, image/png) 2025-01-04 05:47 UTC, Werner Lemberg
outline after ps2pdf (67.90 KB, image/png) 2025-01-04 05:47 UTC, Werner Lemberg

Description Werner Lemberg 2025-01-04 05:46:46 UTC

Created attachment 26339
input PDF file

[commit da32171815f62653d90da4e9e0302f2e57ab3bd3 from 2025-Jan-02]

If I run

```
ps2pdf notation.pdf
```

on the attached PDF (sorry for the big input file), entries in the outline are missing, as shown in the two images.

Comment 1 Werner Lemberg 2025-01-04 05:47:32 UTC

Created attachment 26340
original outline

Comment 2 Werner Lemberg 2025-01-04 05:47:53 UTC

Created attachment 26341
outline after ps2pdf

Comment 3 Werner Lemberg 2025-01-04 05:50:59 UTC

The problem does not happen with gs 9.52, so it seems this is a regression in the new PDF engine.

Comment 4 Ken Sharp 2025-01-04 11:10:52 UTC

Using Ghostscript (current HEAD of Git) produces this:

The following warnings were encountered at least once while processing this file:
        A problem was encountered trying to preserve the Outlines

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> LuaTeX-1.18.0 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

So clearly there was a problem, and Ghostscript told you so. A debug build gives some additional information, and a debug build run with -dPDFDEBUG and -dPDFSTOPONERROR gives more, though I wouldn't expect anyone outside the Ghostscript team to do that or, perhaps, to understand the reams of output from such a large input file.

However, the last output before going into an error is:

 <<
 /S /GoTo /D (Instrument-specific markup)
 >>
Graphics library error -21 (undefined) in function 'pdfi_doc_trailer'd:\ghostpdl\pdf\pdf_doc.c(1877)'.
        setting pdfi warning 55 - A problem was encountered trying to preserve the Outlines.

So we can see there's a problem with the named destination (Instrument-specific markup) and the problem is that the named destination isn't defined.

Decompressing the file (to get a 75MB output file) we track through the named Destinations tree starting from the Catalog dictionary (object 45221) which contains the Names tree:

  /Names 45220 0 R

Object 45220 :

45220 0 obj
<<
  /Dests 45219 0 R
>>
endobj

So that's the Dests Key of the Names tree, the named destinations, and it is object 45219:

45219 0 obj
<<
  /Kids [ 45217 0 R 45218 0 R ]
  /Limits [ (-1) (paper variables for widths and margins) ]
>>
endobj

So that node has two child nodes, no leaf entries and has two strings as the limiting entries.

I won't track through all the nodes here; we eventually end up at object 45201:

45201 0 obj
<<
  /Names [ (Gregorian accidentals and key signatures) 1731 0 R 
    (Gregorian articulation signs) 1733 0 R (Gregorian chant contexts) 
    1729 0 R (Gregorian clefs) 1730 0 R (Gregorian square neume ligatures) 
    1840 0 R (Grid lines) 1231 0 R (Grouping staves) 1206 0 R 
    (Guile predicates) 2649 0 R (Guitar) 1542 0 R (Harmonics) 
    1530 0 R (Harp) 1523 0 R (Harp pedals) 1525 0 R (Hidden notes) 
    1223 0 R (Hiding staves) 1212 0 R (Horizontal spacing) 2190 0 R 
    (Horizontal spacing overview) 2191 0 R (Horizontal spacing paper variables) 
    2158 0 R (How to prevent sharing of music expressions) 2012 0 R 
    (Hufnagel glyphs) 2509 0 R (Improvisation) 905 0 R (Incipits) 
    1849 0 R (Including LilyPond files) 2004 0 R (Indicating harmonics and dampened notes) 
    1544 0 R (Indicating position and barring) 1543 0 R (Indicating power chords) 
    1545 0 R (Input modes) 1875 0 R (Input structure) 1876 0 R 
    (Inside the staff) 1219 0 R (Instantiating new staves) 1205 0 R 
    (Instrument-specific markup) 2623 0 R (Instrument-specific scripts) 
    2633 0 R (Instrument names) 1214 0 R ]
  /Limits [ (Gregorian accidentals and key signatures) (Instrument names) ]
>>
endobj

The key we are looking for is 'Instrument-specific markup' and we can see that it is defined in the array of names.

However, the key point here is the Limits array (which is a required entry). Note that the upper limit is the string 'Instrument names'. According to the spec the strings are compared 'lexically', which simply means that the byte values of each string element are compared.

The upper limit has a *space* after 'Instrument', which is byte value 0x20, but the string we are looking for has a '-' after 'Instrument', and that is byte value 0x2D.

0x2D is greater than 0x20, so the string we are searching for is 'greater' than the top limit, and therefore we can skip checking the array contents because the Limits array in the dictionary, in effect, tells us that the string *can't* be in this array. So we move on to the next node in the tree. Eventually we have checked the entire tree and not found the named destination, so we raise an undefined error.
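The comparison described above can be checked directly. This is only a minimal Python sketch, assuming plain byte-wise ordering; `leaf_may_contain` is a hypothetical helper written for illustration, not Ghostscript code:

```python
def leaf_may_contain(key: bytes, limits: tuple[bytes, bytes]) -> bool:
    """Return True if key lies within the node's [lower, upper] Limits."""
    lower, upper = limits
    # Python compares bytes objects element by element, i.e. by byte value.
    return lower <= key <= upper


# Limits of object 45201 exactly as they appear in the file.
limits = (b"Gregorian accidentals and key signatures", b"Instrument names")
key = b"Instrument-specific markup"

print(hex(ord("-")), hex(ord(" ")))   # 0x2d 0x20
print(key > limits[1])                # True: '-' sorts after ' '
print(leaf_may_contain(key, limits))  # False: the leaf gets skipped
```

A key such as `b"Harp"` passes the same check, which matches the observation that only some destinations go missing.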

It is quite likely that this is new behaviour; the old code was written in PostScript and the new code is written, completely from scratch, in C. It is possible that the old code did not check the Limits array in named Destinations.

If I disable the limit checking (by hacking the code, not a user option) then the file runs to completion and the output file (using Acrobat to check) appears to contain the entire Outlines tree. Obviously with such a complex Outlines entry I could be missing something, but previously the Index was missing, and disabling the Limits check results in the Index being present.

Similarly, altering the object in the decompressed file to:

45201 0 obj
<<
  /Names [ (Gregorian accidentals and key signatures) 1731 0 R 
...
...
    (Instrument-specific markup) 2623 0 R (Instrument-specific scripts) 
    2633 0 R (Instrument names) 1214 0 R ]
  /Limits [ (Gregorian accidentals and key signatures) (Instrument-specific scripts) ]
>>
endobj

allows that particular destination to be found; however, there appear to be other instances of the problem with other nodes of the named Destination tree.
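One way to look for other affected leaf nodes is to compare each node's Limits against the smallest and largest key it actually holds. The following Python sketch assumes the keys have already been extracted from the decompressed file by some other means; `limits_problems` is a hypothetical helper, and the key list is only a subset of object 45201:

```python
def limits_problems(keys: list[bytes], limits: tuple[bytes, bytes]) -> list[str]:
    """Report where a leaf's Limits disagree with its keys (byte order)."""
    lower, upper = limits
    problems = []
    if min(keys) != lower:
        problems.append(f"lower limit is {lower!r}, smallest key is {min(keys)!r}")
    if max(keys) != upper:
        problems.append(f"upper limit is {upper!r}, largest key is {max(keys)!r}")
    return problems


# Subset of the keys in object 45201, in the order they appear in the file.
keys = [
    b"Gregorian accidentals and key signatures",
    b"Inside the staff",
    b"Instantiating new staves",
    b"Instrument-specific markup",
    b"Instrument-specific scripts",
    b"Instrument names",
]
original = (b"Gregorian accidentals and key signatures", b"Instrument names")
edited = (b"Gregorian accidentals and key signatures", b"Instrument-specific scripts")

print(limits_problems(keys, original))  # one entry, about the upper limit
print(limits_problems(keys, edited))    # []
```

With the edited Limits the check passes, which is consistent with that particular destination becoming reachable.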

So the problem here seems to be in the original file. It is unfortunate that Acrobat doesn't raise an error with the original file but then, it so rarely does.

Comment 5 Ken Sharp 2025-01-04 11:47:26 UTC

It occurred to me, after my earlier comment, that of course Acrobat doesn't complain about the Named destination; it would only complain when it needed to process the named destination tree to find the destination.

So I opened the original file in Acrobat, went to page 'xii' and then A.12.6 'Instrument-specific markup' and clicked the link.

Nothing happens.

Try 'conditional markup' or 'Accordion registers' and clicking the link takes you to the relevant page.

The same applies to 'Instrument-specific scripts' in Section A.15.

So obviously the link is broken in the original file: the named destination can't be found in the Dests name tree, because the Limits are incorrect.

Comment 6 Werner Lemberg 2025-01-04 12:49:52 UTC

Thanks a lot for the very detailed analysis!  How do you uncompress the original file?  I tried `pdftk ... uncompress`, and I don't get the same object IDs, which looks strange to me.

Similarly, I can't repeat your problem with the A.12.6 link: both okular and evince take me to the right page.

BTW, `ps2pdf` from current git did *not* produce any warning, so ghostscript told me nothing :-)  Maybe this could be improved somehow?

Comment 7 Ken Sharp 2025-01-04 13:54:30 UTC

(In reply to Werner Lemberg from comment #6)
> Thanks a lot for the very detailed analysis!  How do you uncompress the
> original file?

I used MuPDF; there are other tools which will do the same job, I'm sure.


> Similarly, I can't repeat your problem with the A.12.6 link: both okular and
> evince take me to the right page.

I specifically used Acrobat as the 'de facto' standard.

Clearly any consumer can simply ignore the Limits array, at the cost of processing every string in the Names array in (potentially) every branch and leaf in the tree whenever a named destination needs to be dereferenced. The Limits array is intended to reduce that overhead by allowing consumers to skip nodes and leaves which don't contain the target.
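The trade-off can be illustrated with a toy name tree. This is a Python sketch under stated assumptions, not how Ghostscript or any viewer is implemented: nodes are plain dicts, the `id` field and `lookup` function are hypothetical, and the second leaf is made up purely so there is something to skip.

```python
def lookup(node, key, use_limits=True, visited=None):
    """Depth-first search of a name tree, optionally pruning on Limits."""
    if visited is None:
        visited = []
    limits = node.get("Limits")
    if use_limits and limits is not None:
        lower, upper = limits
        if not (lower <= key <= upper):
            return None  # Limits say the key cannot be below this node
    visited.append(node["id"])
    names = node.get("Names", [])
    # Names is a flat array of alternating key / target entries.
    for name, target in zip(names[0::2], names[1::2]):
        if name == key:
            return target
    for kid in node.get("Kids", []):
        found = lookup(kid, key, use_limits, visited)
        if found is not None:
            return found
    return None


leaf_45201 = {
    "id": "leaf 45201",
    "Names": [b"Harp", "1523 0 R",
              b"Instrument-specific markup", "2623 0 R",
              b"Instrument names", "1214 0 R"],
    "Limits": (b"Harp", b"Instrument names"),  # upper limit as in the file
}
other_leaf = {  # hypothetical leaf, not taken from the file
    "id": "other leaf",
    "Names": [b"Keyboard example (made up)", "0 0 R"],
    "Limits": (b"Keyboard example (made up)", b"Keyboard example (made up)"),
}
root = {"id": "root", "Kids": [leaf_45201, other_leaf]}

key = b"Instrument-specific markup"
pruned, exhaustive = [], []
print(lookup(root, key, True, pruned), pruned)          # None ['root']
print(lookup(root, key, False, exhaustive), exhaustive)  # 2623 0 R ['root', 'leaf 45201']
```

A consumer that ignores Limits finds the destination but has to scan every leaf it reaches; one that honours them skips both leaves and reports the destination as undefined.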


> BTW, `ps2pdf` from current git did *not* produce any warning, so ghostscript
> told me nothing :-)  Maybe this could be improved somehow?

Yes, by not using the ps2pdf shell script.

For starters I'm on Windows, not Linux, which uses a different script (though also one which pipes the back channel to null). In addition I believe the existing ps2pdf shell script limits the PDF output version to 1.4, which will mean the pdfwrite device will be unable to use certain features. In particular it won't use XRef streams or ObjStms which will cause the output file to be larger.

There really is nothing to be gained by using the shell script over simply using Ghostscript directly, and plenty to be lost. Unless, of course, you actually want not to be told when errors occur.
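For reference, a direct invocation might look like the following. This is a hedged Python sketch that only builds the argument list: `gs_pdfwrite_command` is a hypothetical helper, the output name `out.pdf` is an assumption, and the executable is `gs` on Linux (the Windows console binary is normally `gswin64c`).

```python
def gs_pdfwrite_command(src: str, dst: str, gs: str = "gs") -> list[str]:
    """Build a Ghostscript command line that writes a PDF via pdfwrite."""
    # -o sets the output file and also implies -dBATCH -dNOPAUSE.
    return [gs, "-sDEVICE=pdfwrite", "-o", dst, src]


cmd = gs_pdfwrite_command("notation.pdf", "out.pdf")
print(" ".join(cmd))  # gs -sDEVICE=pdfwrite -o out.pdf notation.pdf
```

The list can be passed to `subprocess.run(cmd)`; since nothing redirects the output, any warnings Ghostscript emits remain visible.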

Changing the way that the script works would obviously be a breaking change for anyone who does want that behaviour, so I won't be altering it. Frankly I wish people would stop using it.

Comment 8 Ken Sharp 2025-01-11 11:58:26 UTC

*** Bug 708244 has been marked as a duplicate of this bug. ***