Created attachment 20570 [details]
PDF made by ImageMagick 7 which fails to be converted to a PDF/A

A JPEG put in a PDF with convert from ImageMagick 7 converted to a PDF/A by Ghostscript fails rule 6.7.3-1 according to veraPDF.

Interestingly, the same procedure but using convert from ImageMagick 6 produces a PDF/A which is fine.

Something is different between the 2 PDFs but what? Could Ghostscript get rid of this error? Is there an additional option to pass to convert? I can always stick to ImageMagick 6...

Ubuntu 20.04
Ghostscript 9.54.0 compiled from ghostpdl.git
ImageMagick 7.0.10-61 Q16 x86_64 2021-01-30 compiled from the master branch
veraPDF 1.16.1

With ImageMagick 7.0.10-61 Q16 x86_64 2021-01-30:

$ convert fox.jpg badfox.pdf # ImageMagick 7
$ gs -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dPDFA=1 -dPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa.ps badfox.pdf
$ verapdf -v --format text -f 1b pdfa.pdf
FAIL 6.7.3-1

With ImageMagick 6.9.10-23 Q16 x86_64 20190101:

$ convert fox.jpg goodfox.pdf # ImageMagick 6
$ gs -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dPDFA=1 -dPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa.ps goodfox.pdf
$ verapdf -v --format text -f 1b pdfa.pdf
PASS

Attachments:
fox.jpg
badfox.pdf
goodfox.pdf
Created attachment 20571 [details] PDF made by ImageMagick 6 which is fine
Created attachment 20572 [details] Simple JPEG put in a PDF with convert then put in a PDF/A with gs
Created attachment 20573 [details] Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp)
The good file has a /Title attribute in UTF16 format. Ghostscript detects it and issues a warning:

UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO

The Title attribute that was set in pdfa.ps remains unchanged.

The bad file has a /Title attribute in UTF16 format that lacks a BOM marker. Ghostscript does not recognize it as UTF16 and passes it intact to the output file, where non-printable characters are not expected.

While Ken decides what to do about the problem: recognize UTF16 without BOM marker, or convert simple UTF16 to ASCII, or both, you can override the /Title attribute as follows:

gs ... pdfa.ps badfox.pdf -c "[/Title(New title)/DOCINFO pdfmark"
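To make the difference between the two /Title strings concrete, here is a rough Python sketch of how a BOM-based check would see them. The function name classify_title, the sample title "fox", and the zero-high-byte heuristic are all assumptions made for illustration; this is not Ghostscript's detection code, and the actual titles stored in goodfox.pdf and badfox.pdf are not reproduced here.

```python
import codecs


def classify_title(raw: bytes) -> str:
    """Roughly classify the raw bytes of a PDF text string (hypothetical helper)."""
    # A UTF-16BE text string in a PDF is expected to start with the BOM FE FF.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf16be-with-bom"
    # Heuristic, assumed for this sketch only: an even number of bytes where
    # every high byte is zero looks like BOM-less UTF-16BE holding ASCII text.
    if len(raw) >= 2 and len(raw) % 2 == 0 and all(b == 0 for b in raw[0::2]):
        return "utf16be-without-bom"
    return "other"


# Illustrative stand-ins for the two /Title values.
good_title = codecs.BOM_UTF16_BE + "fox".encode("utf-16-be")
bad_title = "fox".encode("utf-16-be")

print(classify_title(good_title))
print(classify_title(bad_title))
print(classify_title(b"fox"))
```

A check that only looks for the BOM puts the second string in the same bucket as plain single-byte text, which would be consistent with the NUL bytes ending up in the output file untouched.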
(In reply to Peter Cherepanov from comment #4)
> While Ken decides what to do about the problem: recognize UTF16 without BOM
> marker, or convert simple UTF16 to ASCII, or both

It is not, unfortunately, as simple as that. The documented method for comparing the Document Information dictionary /Title with the XMP title tag is very limited and fails for quite a range of values.

The only solution is to drop the Title in a wider range of values which is done in commit 4ab5dd6c004a252e64f26d6238799004f70d4a35
I've been looking at the ImageMagick code, and it looks to me like their code for creating a /Title in a PDF file is wrong. I could be mistaken because it's not code I'm familiar with, and I don't have a copy of the current source here to debug through.

However, from perusing the code it looks like they decide whether they are producing a PDF/A file; if they are then they write the Title as UTF-16BE. This is entirely valid, but it's guaranteed to fail PDF/A validation, as a UTF-16BE string cannot possibly match a UTF-8 string in a byte-for-byte comparison.

If they are not producing a PDF/A file then they (seem to) write the /Title as a UTF-8 string. That is also incorrect, because the /Title must be either UTF-16BE or PDFDocEncoding; UTF-8 isn't valid.

It would be nice to raise this issue with the ImageMagick developers if you have the ability to do that.
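The byte-for-byte mismatch can be seen with a small Python sketch. The title "fox" and the variable names are assumptions chosen for illustration; the sketch only compares the two encodings of the same text and does not model how any particular validator performs the check.

```python
import codecs

title = "fox"  # illustrative title, not taken from the attached files

# Document Information dictionary form: UTF-16BE with a leading BOM.
info_title = codecs.BOM_UTF16_BE + title.encode("utf-16-be")
# XMP metadata form: UTF-8.
xmp_title = title.encode("utf-8")

print(info_title.hex())         # feff0066006f0078
print(xmp_title.hex())          # 666f78
print(info_title == xmp_title)  # False

# Only after decoding both sides do the titles compare equal.
decoded_info = info_title[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
print(decoded_info == xmp_title.decode("utf-8"))  # True
```

Even for a pure ASCII title the raw bytes differ, because the UTF-16BE form carries the BOM and a zero byte before every character.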