Summary: | PDF containing just a JPEG converted to PDF/A-1b fails rule 6.7.3-1 | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | Eric Companie <eric.companie> |
Component: | PDF Writer | Assignee: | Ken Sharp <ken.sharp> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | sphinx.pinastri |
Priority: | P4 | ||
Version: | master | ||
Hardware: | PC | ||
OS: | Linux | ||
Customer: | Word Size: | --- | |
Attachments: |
PDF made by ImageMagick 7 which fails to be converted to a PDF/A
PDF made by ImageMagick 6 which is fine
Simple JPEG put in a PDF with convert then put in a PDF/A with gs
Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp) |
Description
Eric Companie
2021-02-06 00:03:34 UTC
Created attachment 20571
PDF made by ImageMagick 6 which is fine
Created attachment 20572
Simple JPEG put in a PDF with convert then put in a PDF/A with gs
Created attachment 20573
Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp)
The good file has a /Title attribute in UTF16 format. Ghostscript detects it and issues a warning:

UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO

The Title attribute that was set in pdfa.ps remains unchanged.

The bad file has a /Title attribute in UTF16 format that lacks the BOM marker. Ghostscript does not recognize it as UTF16 and passes it intact to the output file, where non-printable characters are not expected.

While Ken decides what to do about the problem: recognize UTF16 without BOM marker, or convert simple UTF16 to ASCII, or both, you can override the /Title attribute as follows:

gs ... pdfa.ps badfox.pdf -c "[/Title(New title)/DOCINFO pdfmark"

(In reply to Peter Cherepanov from comment #4)
> While Ken decides what to do about the problem: recognize UTF16 without BOM
> marker, or convert simple UTF16 to ASCII, or both

It is not, unfortunately, as simple as that. The documented method for comparing the Document Information dictionary /Title with the XMP title tag is very limited and fails for quite a range of values. The only solution is to drop the Title for a wider range of values, which is done in commit 4ab5dd6c004a252e64f26d6238799004f70d4a35.

I've been looking at the ImageMagick code, and it looks to me like their code for creating a /Title in a PDF file is wrong. I could be mistaken because it's not code I'm familiar with, and I don't have a copy of the current source here to debug through.

However, from perusing the code it looks like they decide whether they are producing a PDF/A file; if they are, then they write the Title as UTF-16BE. This is entirely valid, but it's guaranteed to fail PDF/A validation, as a UTF-16BE string cannot possibly match a UTF-8 string in a byte-for-byte comparison.

If they are not producing a PDF/A file then they (seem to) write the /Title as a UTF-8 string, which is also incorrect, because the /Title must be either UTF-16BE or PDFDocEncoding; UTF-8 isn't valid.
It would be nice to raise this issue with the ImageMagick developers if you have the ability to do that. |
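
For anyone following along, the encoding cases discussed above can be reproduced with a small Python sketch. It is only a simplified model of the behaviour described in this thread, not Ghostscript's, ImageMagick's, or any validator's actual code. All function names are hypothetical, and it assumes the PDF 1.4 rule (the version PDF/A-1 is based on) that a text string counts as UTF-16BE only when it starts with the bytes FE FF.

```python
import codecs

# Hypothetical helpers for illustration only; none of these names come
# from Ghostscript, ImageMagick, or a PDF/A validator.


def classify_text_string(raw: bytes) -> str:
    """Classify a PDF text string by its leading bytes.

    Assumes the PDF 1.4 rule: the string is UTF-16BE only if it starts
    with the byte order mark FE FF; otherwise it is PDFDocEncoding.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    return "PDFDocEncoding"


def has_nul_bytes(raw: bytes) -> bool:
    """Return True if the string contains NUL bytes.

    ASCII-range text stored as UTF-16BE has a NUL before every
    character, which is the non-printable data mentioned above.
    """
    return b"\x00" in raw


def matches_xmp_bytewise(docinfo_title: bytes, xmp_title: str) -> bool:
    """Simplified model of the check: compare the raw DOCINFO bytes
    with the UTF-8 bytes of the XMP title."""
    return docinfo_title == xmp_title.encode("utf-8")


title = "New title"
with_bom = codecs.BOM_UTF16_BE + title.encode("utf-16-be")  # like the "good" file
without_bom = title.encode("utf-16-be")                     # like the "bad" file
plain = title.encode("ascii")                               # what the override writes

for label, raw in [("with BOM", with_bom),
                   ("without BOM", without_bom),
                   ("plain ASCII", plain)]:
    print(label,
          classify_text_string(raw),
          has_nul_bytes(raw),
          matches_xmp_bytewise(raw, title))
```

In this model, only the plain ASCII title compares equal to the UTF-8 XMP value. The BOM-less string is classified as PDFDocEncoding while still carrying NUL bytes, which matches the symptom reported for the ImageMagick 7 file.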