Bug 703486

Summary: PDF containing just a JPEG converted to PDF/A-1b fails rule 6.7.3-1
Product: Ghostscript
Component: PDF Writer
Status: RESOLVED FIXED
Severity: normal
Priority: P4
Version: master
Hardware: PC
OS: Linux
Customer:
Word Size: ---
Reporter: Eric Companie <eric.companie>
Assignee: Ken Sharp <ken.sharp>
CC: sphinx.pinastri
Attachments: PDF made by ImageMagick 7 which fails to be converted to a PDF/A
PDF made by ImageMagick 6 which is fine
Simple JPEG put in a PDF with convert then put in a PDF/A with gs
Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp)

Description Eric Companie 2021-02-06 00:03:34 UTC
Created attachment 20570 [details]
PDF made by ImageMagick 7 which fails to be converted to a PDF/A

A JPEG put in a PDF with convert from ImageMagick 7 converted to a PDF/A by Ghostscript fails rule 6.7.3-1 according to veraPDF. Interestingly, the same procedure but using convert from ImageMagick 6 produces a PDF/A which is fine. Something is different between the 2 PDFs but what? Could Ghostscript get rid of this error? Is there an additional option to pass to convert? I can always stick to ImageMagick 6...

Ubuntu 20.04
Ghostscript 9.54.0 compiled from ghostpdl.git
ImageMagick 7.0.10-61 Q16 x86_64 2021-01-30 compiled from the master branch
veraPDF 1.16.1

With ImageMagick 7.0.10-61 Q16 x86_64 2021-01-30:

$ convert fox.jpg badfox.pdf # ImageMagick 7
$ gs -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dPDFA=1 -dPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa.ps badfox.pdf
$ verapdf -v --format text -f 1b pdfa.pdf
  FAIL 6.7.3-1

With ImageMagick 6.9.10-23 Q16 x86_64 20190101:

$ convert fox.jpg goodfox.pdf # ImageMagick 6
$ gs -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dPDFA=1 -dPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa.ps goodfox.pdf
$ verapdf -v --format text -f 1b pdfa.pdf
PASS

Attachments: fox.jpg badfox.pdf goodfox.pdf
Comment 1 Eric Companie 2021-02-06 00:04:29 UTC
Created attachment 20571 [details]
PDF made by ImageMagick 6 which is fine
Comment 2 Eric Companie 2021-02-06 00:05:23 UTC
Created attachment 20572 [details]
Simple JPEG put in a PDF with convert then put in a PDF/A with gs
Comment 3 Eric Companie 2021-02-06 00:08:34 UTC
Created attachment 20573 [details]
Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp)
Comment 4 Peter Cherepanov 2021-02-06 03:23:45 UTC
The good file has a /Title attribute in UTF-16 format. Ghostscript detects it and issues a warning:
UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
The Title attribute that was set in pdfa.ps remains unchanged.

The bad file has a /Title attribute in UTF-16 format that lacks the BOM. Ghostscript does not recognize it as UTF-16 and passes it intact to the output file, where non-printable characters are not expected.
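
For illustration, the difference between the two /Title strings can be sketched in Python. This is only a sketch, not Ghostscript's actual detection logic: the helper names are hypothetical, "fox" is a stand-in title (the real strings in the attachments may differ), and the BOM-less heuristic assumes every character of the title fits in a single byte.

```python
# UTF-16BE byte order mark that a PDF text string is expected to start with.
BOM_UTF16BE = b"\xfe\xff"


def has_utf16be_bom(raw: bytes) -> bool:
    """Return True if the raw string bytes start with the UTF-16BE BOM."""
    return raw.startswith(BOM_UTF16BE)


def looks_like_bomless_utf16be(raw: bytes) -> bool:
    """Heuristic (assumption): even length and every high byte is NUL.

    This only catches titles whose characters are all below U+0100.
    """
    if not raw or len(raw) % 2 != 0 or has_utf16be_bom(raw):
        return False
    # In big-endian UTF-16 the high byte of each code unit comes first.
    return all(b == 0 for b in raw[0::2])


# Stand-in titles: one with a BOM (like the good file), one without.
with_bom = BOM_UTF16BE + "fox".encode("utf-16-be")
without_bom = "fox".encode("utf-16-be")

print(has_utf16be_bom(with_bom), has_utf16be_bom(without_bom))
print(looks_like_bomless_utf16be(without_bom))
```

The BOM-less string contains NUL bytes between the letters, which is where the unexpected non-printable characters in the output come from.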

While Ken decides what to do about the problem: recognize UTF16 without BOM marker, or convert simple UTF16 to ASCII, or both, you can override the /Title attribute as follows:

gs ... pdfa.ps badfox.pdf -c "[/Title(New title)/DOCINFO pdfmark"
Comment 5 Ken Sharp 2021-02-08 08:12:05 UTC
(In reply to Peter Cherepanov from comment #4)

> While Ken decides what to do about the problem: recognize UTF16 without BOM
> marker, or convert simple UTF16 to ASCII, or both

It is not, unfortunately, as simple as that. The documented method for comparing the Document Information dictionary /Title with the XMP title tag is very limited and fails for quite a range of values.

The only solution is to drop the Title for a wider range of values, which is done in commit 4ab5dd6c004a252e64f26d6238799004f70d4a35
Comment 6 Ken Sharp 2021-02-08 16:23:02 UTC
I've been looking at the ImageMagick code, and it looks to me like their code for creating a /Title in a PDF file is wrong. I could be mistaken because it's not code I'm familiar with, and I don't have a copy of the current source here to debug through.

However, from perusing the code it looks like they decide whether they are producing a PDF/A file; if they are then they write the Title as UTF-16BE. This is entirely valid, but it's guaranteed to fail PDF/A validation, as a UTF-16BE string cannot possibly match a UTF-8 string in a byte-for-byte comparison.
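
A minimal Python sketch of why such a comparison fails, assuming a stand-in title "fox" and a plain byte equality check (not the validator's or Ghostscript's actual code):

```python
title = "fox"  # stand-in title, chosen only for illustration

# Document Information dictionary /Title written as UTF-16BE with a BOM.
docinfo_bytes = b"\xfe\xff" + title.encode("utf-16-be")

# The same title as it would appear in the XMP metadata, which is UTF-8.
xmp_bytes = title.encode("utf-8")

# Decoded, the two strings carry the same text...
same_text = docinfo_bytes[2:].decode("utf-16-be") == xmp_bytes.decode("utf-8")

# ...but their byte sequences can never be equal.
same_bytes = docinfo_bytes == xmp_bytes

print(same_text, same_bytes)
```

Even for a pure ASCII title the UTF-16BE form carries a BOM and a NUL byte before each letter, so the byte sequences differ although the text is the same.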

If they are not producing a PDF/A file then they (seem to) write the /Title as a UTF-8 string, which is also incorrect, because the /Title must be either UTF-16BE or PDFDocEncoding; UTF-8 isn't valid.
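
As a rough Python illustration of what goes wrong with UTF-8 here: Python has no PDFDocEncoding codec, so this sketch uses Latin-1 as an approximation (an assumption that holds for the characters shown), and both titles are made-up examples.

```python
def read_as_pdfdoc(raw: bytes) -> str:
    """Decode bytes the way a reader expecting PDFDocEncoding might.

    Latin-1 is used as a stand-in for PDFDocEncoding (assumption).
    """
    return raw.decode("latin-1")


ascii_title = "fox"       # made-up ASCII title
accented_title = "café"   # made-up title with a non-ASCII character

# ASCII survives, because UTF-8 and the single-byte encoding agree there.
ascii_seen = read_as_pdfdoc(ascii_title.encode("utf-8"))

# Non-ASCII does not: each UTF-8 multi-byte sequence becomes several characters.
accented_seen = read_as_pdfdoc(accented_title.encode("utf-8"))

print(ascii_seen, accented_seen)
```

This suggests a UTF-8 /Title only appears to work while the title is plain ASCII.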

It would be nice to raise this issue with the ImageMagick developers if you have the ability to do that.