Bug 702556 - pdfwrite produces an invalid pdf file
Summary: pdfwrite produces an invalid pdf file
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.52
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-11 05:49 UTC by William Bader
Modified: 2020-07-15 16:54 UTC

See Also:
Customer:
Word Size: ---


Attachments
the original PDF 4244201-1.pdf (12.88 MB, application/pdf)
2020-07-11 05:49 UTC, William Bader
pdftops -level3 output x3.ps (19.56 MB, application/postscript)
2020-07-11 05:51 UTC, William Bader
bad x3.pdf from /u/gnu/gs9.52/gs -sDEVICE=pdfwrite -o x3.pdf x3.ps (2.25 MB, application/pdf)
2020-07-11 05:52 UTC, William Bader
viewing x3.pdf in gv shows the first screen, then pauses, then the second (113.88 KB, image/gif)
2020-07-11 05:55 UTC, William Bader
minimised test file (one glyph) (22.16 KB, application/postscript)
2020-07-15 15:26 UTC, Ken Sharp

Description William Bader 2020-07-11 05:49:48 UTC
Created attachment 19435
the original PDF 4244201-1.pdf

I have a PDF that when I convert it to ps with poppler pdftops and then convert the resulting ps back to PDF with gs, gs produces an invalid PDF that displays incorrectly in gs and that crashes atril.
It happens with a range of poppler versions and a range of gs versions.
I think that the poppler output is good because it views ok and because gs shows no errors viewing it or converting it to PDF.
I have some example command lines below.
I am going to attach the original PDF and then (if they fit) the files from the first example.

/usr/local/bin/pdftops -level3 4244201-1.pdf x3.ps ; /u/gnu/gs9.52/gs -sDEVICE=pdfwrite -o x3.pdf x3.ps

/usr/bin/pdftops -level3 4244201-1.pdf x3.ps ; /u/gnu/gs9.52/gs -sDEVICE=pdfwrite -o x3.pdf x3.ps

/usr/bin/pdftops -level3 4244201-1.pdf x3.ps ; /usr/bin/gs -sDEVICE=pdfwrite -o x3.pdf x3.ps

/usr/bin/gs is ghostscript-9.27-7.fc31.x86_64
/u/gnu/gs9.52/gs is gs 9.52 with the txtwrite patch
/usr/bin/pdftops is poppler-0.73.0-16.fc31.x86_64
/usr/local/bin/pdftops is 0.90.0
Comment 1 William Bader 2020-07-11 05:51:03 UTC
Created attachment 19436
pdftops -level3 output x3.ps
Comment 2 William Bader 2020-07-11 05:52:05 UTC
Created attachment 19437
bad x3.pdf from /u/gnu/gs9.52/gs -sDEVICE=pdfwrite -o x3.pdf x3.ps
Comment 3 William Bader 2020-07-11 05:55:08 UTC
Created attachment 19438
viewing x3.pdf in gv shows the first screen, then pauses, then the second
Comment 4 William Bader 2020-07-12 03:10:50 UTC
Comparing the ps produced by pdftops -level2 (which gs handles correctly) and -level3:
images: L2 uses the /LZWDecode filter, L3 uses the /FlateDecode filter
fonts:
    L2
    /pdfMakeFont {
      4 3 roll findfont
      4 2 roll matrix scale makefont
      dup length dict begin 
        { 1 index /FID ne { def } { pop pop } ifelse } forall
        /Encoding exch def
        currentdict
      end
      definefont pop   
    } def
    L3 
    /pdfMakeFont16L3 {
      1 index /CIDFont resourcestatus {
        pop pop 1 index /CIDFont findresource /CIDFontType known
      } {
        false
      }  ifelse
      {
        0 eq { /Identity-H } { /Identity-V } ifelse
        exch 1 array astore composefont pop
      } {
        pdfMakeFont16
      } ifelse
    } def 

The L3 ps produced for 4244201-1.pdf has an image with /FlateDecode in /DeviceCMYK.
I have another test PDF where pdftops -level3 produces an image with  /FlateDecode in /DeviceRGB, and that works OK.
I suspect that fonts are not the problem because other people would have noticed by now, and a font problem would probably cause a font error instead of most of the image being overwritten with black.
My guess is that it has something to do with a /FlateDecode in /DeviceCMYK.
Comment 5 William Bader 2020-07-13 16:55:03 UTC
I built poppler-0.90.1 with cmake -DENABLE_ZLIB=0 (to eliminate the use of FlateDecode). Even though the ps from poppler pdftops -level3 became 27% larger, the pdf from gs pdfwrite remained the same size (with 74 bytes different) and still displayed incorrectly, so the problem is something other than using FlateDecode on images.
Another difference is that the pdftops -level3 output adds lines like
false opm
where its prolog sets
/opm { dup /pdfOPM exch def /setoverprintmode where{pop setoverprintmode}{pop}ifelse  } def
but replacing it with
/opm { pop } def
doesn't fix the problem with gs pdfwrite.
Comment 6 Ken Sharp 2020-07-15 14:45:04 UTC
(In reply to William Bader from comment #0)

> I have a PDF that when I convert it to ps with poppler pdftops and then
> convert the resulting ps back to PDF with gs, gs produces an invalid PDF
> that displays incorrectly in gs and that crashes atril.

The PDF is not invalid. It's not correct, but it's perfectly valid. I can't comment on atril; presumably it has a bug.


(In reply to William Bader from comment #4)

> The L3 ps produced for 4244201-1.pdf has an image with /FlateDecode in
> /DeviceCMYK.
> I have another test PDF where pdftops -level3 produces an image with 
> /FlateDecode in /DeviceRGB, and that works OK.

Nothing to do with the problem.

> I suspect that fonts are not the problem because other people would have
> noticed by now, and a font problem would probably cause a font error instead
> of most of the image being overwritten with black.
> My guess is that it has something to do with a /FlateDecode in /DeviceCMYK.

No, it's very clearly the fonts. If you set level 2 output (and if that means baseline level 2 output) then CIDFonts are not supported; these were added in, I think, version 2016 of the Adobe interpreter (2000 indicates level 2, 3000 indicates level 3). So I imagine that's why it works if you output level 2 PostScript instead of level 3: there will be no CIDFonts.

There are up to 5 FontMatrix entries possible here: the type 0 CID-keyed instance of the font; the CIDFont which is used by the type 0 font; each of the descendant fonts of the CIDFont; and then, because these are CFF CIDFonts, the CFF font and each of the descendant fonts in the CFF FDArray. These matrices may or may not be present, are substituted with a default [0.001 0 0 0.001 0 0] matrix if omitted in most places, and must all be multiplied together in order to achieve the correct size output.
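
The multiplication described above can be sketched in Python. This is only an illustration of the arithmetic, not Ghostscript's code: the helper names (`concat`, `effective_matrix`), the example chain and the order in which the levels are multiplied are all assumptions made for this sketch. Each matrix is the usual six-element PostScript form [a b c d tx ty].

```python
from functools import reduce

# PostScript matrices are six numbers [a b c d tx ty].
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
DEFAULT_FONT_MATRIX = (0.001, 0.0, 0.0, 0.001, 0.0, 0.0)


def concat(m1, m2):
    """Return m1 x m2 using PostScript's row-vector convention."""
    a1, b1, c1, d1, tx1, ty1 = m1
    a2, b2, c2, d2, tx2, ty2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        tx1 * a2 + ty1 * c2 + tx2,
        tx1 * b2 + ty1 * d2 + ty2,
    )


def effective_matrix(chain):
    """Multiply every FontMatrix in the chain, in list order."""
    return reduce(concat, chain, IDENTITY)


# Hypothetical chain (assumed values, not taken from the test file):
# FDArray entry, CFF font, descendant CIDFont, type 0 font.
chain = [IDENTITY, DEFAULT_FONT_MATRIX, IDENTITY, IDENTITY]
print(effective_matrix(chain))
```

With only one non-identity entry in the chain, the result is just that entry; the interesting cases are the ones where more than one level carries a real matrix, or where a level is missing altogether.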

Of course, in general most of these matrices are defined as the default or the identity matrix; it's not common to see these defined any other way. Where they are defined differently, the differences are generally in the FDArray entries. In large part that's what the entries in the FDArray are for.

That's PostScript, of course; in PDF there is no type 0 font, and the CIDFont may not have a FontMatrix.

For reasons best known to itself, the Poppler PostScript output moves the FontMatrix from the CFF font to the CIDFont, and replaces the FontMatrix of the CFF font with the identity matrix. So this is where the code is unusual: we'd normally expect to see the modified array in the FDArray, not the CIDFont.

When writing a PDF file the pdfwrite device cannot write a CIDFont that way, so it's forced to move the FontMatrix back where it originally was. Unfortunately there was one case where we did not write out the FontMatrix, and should have. This resulted in a missing default matrix; it appears Acrobat replaces that with the standard default matrix (which is why the output from pdfwrite displays correctly in Acrobat). It's not at all clear that this is correct, and there are comments in our code noting that the documentation is itself unclear on this point with PDF files.
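
To see why a dropped FontMatrix matters, the following Python sketch compares two hypothetical consumers reading a font whose FontMatrix entry is missing. One substitutes the default matrix, as Acrobat appears to do according to the description above. The other falls back to the identity matrix, which is purely an assumption made here for contrast and not a statement about any particular viewer. The function names and the advance value are invented for the illustration.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
DEFAULT_FONT_MATRIX = (0.001, 0.0, 0.0, 0.001, 0.0, 0.0)


def resolve(font_matrix, fallback):
    """Return the font's matrix, or the consumer's fallback if it is missing."""
    return fallback if font_matrix is None else font_matrix


def advance_x(advance, matrix):
    """x component of a horizontal advance (glyph units) after the matrix."""
    return advance * matrix[0]


missing = None   # the FontMatrix was not written out
advance = 500    # hypothetical glyph advance in a 1000-unit glyph space

# Consumer A (assumed): substitutes the default matrix.
with_default = advance_x(advance, resolve(missing, DEFAULT_FONT_MATRIX))
# Consumer B (assumed): substitutes the identity matrix.
with_identity = advance_x(advance, resolve(missing, IDENTITY))

print(with_default)
print(with_identity)
```

Under these assumptions the two consumers disagree on the glyph size by a factor of about 1000, which is the kind of discrepancy that lets the same file look right in one viewer and wrong in another.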

commit 3786f7cb0c4ccf3442beafdf186dbc6835da8ae3 fixes this without altering any of the existing test files which use non-standard FontMatrix entries to achieve effects such as artificially oblique fonts.

As I've noted before, PostScript and PDF are not the same and I strongly advise *NOT* converting PDF files to PostScript and back to PDF. If you have a PDF as input, and want a PDF as output, there's usually no reason to create PostScript in the middle.
Comment 7 Ken Sharp 2020-07-15 15:26:17 UTC
Created attachment 19456
minimised test file (one glyph)

Minimised test file as added to test repository
Comment 8 William Bader 2020-07-15 15:38:13 UTC
Thanks!

I applied the patch to gdevpsf2.c and can confirm that it fixes the problem for me.

At least for now, I have a PS-based workflow. It was only a coincidence in this example that a PDF input came from an external source, and I tried converting the output to PDF for email.

> For reasons best known to itself, the Poppler PostScript output moves the FontMatrix from the CFF font to the CIDFont, and replaces the FontMatrix of the CFF font with the identity matrix.

Is that worth changing in poppler?
grepping for FontMatrix in poppler, it has a number of places where it writes "/FontMatrix [1 0 0 1 0 0] def\n".
Comment 9 Ken Sharp 2020-07-15 15:53:23 UTC
(In reply to William Bader from comment #8)

> Is that worth changing in poppler?

It's not incorrect. It's unusual but it's perfectly valid. Given how complicated the inheritance of Font Matrices is with CIDFonts and CFF CIDFonts in PostScript, I would be very wary of attempting to change it.

> grepping for FontMatrix in poppler, it has a number of places where it
> writes "/FontMatrix [1 0 0 1 0 0] def\n".

Which is also valid. As long as the matrix algebra all works out there's nothing intrinsically wrong with writing the identity matrix at any point. Changing any of those would mean following the whole code path and as I tried to say in my comment, this is a very complicated area. Not helped by having PostScript and PDF differ :-(
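
The point about the matrix algebra can be checked with a few lines of Python. The sketch below uses a made-up oblique FontMatrix and two hypothetical placements of it: on the CFF font, or moved up to the CIDFont with the identity matrix left behind. It only demonstrates that the product is unchanged; none of the names or values come from Poppler or Ghostscript.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
# Made-up matrix: 0.001 scale plus a slant (an artificially oblique font).
OBLIQUE = (0.001, 0.0, 0.000176, 0.001, 0.0, 0.0)


def concat(m1, m2):
    """Return m1 x m2 using PostScript's row-vector convention."""
    a1, b1, c1, d1, tx1, ty1 = m1
    a2, b2, c2, d2, tx2, ty2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        tx1 * a2 + ty1 * c2 + tx2,
        tx1 * b2 + ty1 * d2 + ty2,
    )


def product(chain):
    """Multiply the matrices in list order, starting from the identity."""
    result = IDENTITY
    for m in chain:
        result = concat(result, m)
    return result


# Placement 1: matrix on the CFF font, identity on the CIDFont.
on_cff_font = product([OBLIQUE, IDENTITY])
# Placement 2: matrix moved to the CIDFont, identity left on the CFF font.
on_cid_font = product([IDENTITY, OBLIQUE])

print(on_cff_font == on_cid_font)
```

This works because one of the two factors is the identity. Moving a non-identity matrix past another non-identity matrix would not be safe in general, since matrix multiplication is not commutative.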

I'd be very wary of trying to change anything, especially since there's nothing actually wrong with the output as it stands.
Comment 10 William Bader 2020-07-15 16:54:44 UTC
Thanks for the reply. I won't touch poppler.