703486 – PDF containing just a JPEG converted to PDF/A-1b fails rule 6.7.3-1

Summary: PDF containing just a JPEG converted to PDF/A-1b fails rule 6.7.3-1

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	master
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-02-06 00:03 UTC by Eric Companie
Modified:	2021-02-08 16:23 UTC
CC List:	1 user

See Also:
Customer:
Word Size:	---

Attachments
PDF made by ImageMagick 7 which fails to be converted to a PDF/A (32.12 KB, application/pdf) 2021-02-06 00:03 UTC, Eric Companie
PDF made by ImageMagick 6 which is fine (34.48 KB, application/pdf) 2021-02-06 00:04 UTC, Eric Companie
Simple JPEG put in a PDF with convert then put in a PDF/A with gs (30.19 KB, image/jpeg) 2021-02-06 00:05 UTC, Eric Companie
Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp) (2.60 KB, application/postscript) 2021-02-06 00:08 UTC, Eric Companie

Description Eric Companie 2021-02-06 00:03:34 UTC

Created attachment 20570
PDF made by ImageMagick 7 which fails to be converted to a PDF/A

A JPEG put in a PDF with convert from ImageMagick 7 converted to a PDF/A by Ghostscript fails rule 6.7.3-1 according to veraPDF. Interestingly, the same procedure but using convert from ImageMagick 6 produces a PDF/A which is fine. Something is different between the 2 PDFs but what? Could Ghostscript get rid of this error? Is there an additional option to pass to convert? I can always stick to ImageMagick 6...

Ubuntu 20.04
Ghostscript 9.54.0 compiled from ghostpdl.git
ImageMagick 7.0.10-61 Q16 x86_64 2021-01-30 compiled from the master branch
veraPDF 1.16.1

With ImageMagick 7.0.10-61 Q16 x86_64 2021-01-30:

$ convert fox.jpg badfox.pdf # ImageMagick 7
$ gs -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dPDFA=1 -dPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa.ps badfox.pdf
$ verapdf -v --format text -f 1b pdfa.pdf
  FAIL 6.7.3-1

With ImageMagick 6.9.10-23 Q16 x86_64 20190101:

$ convert fox.jpg goodfox.pdf # ImageMagick 6
$ gs -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dPDFA=1 -dPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa.ps goodfox.pdf
$ verapdf -v --format text -f 1b pdfa.pdf
PASS

Attachments: fox.jpg badfox.pdf goodfox.pdf

Comment 1 Eric Companie 2021-02-06 00:04:29 UTC

Created attachment 20571
PDF made by ImageMagick 6 which is fine

Comment 2 Eric Companie 2021-02-06 00:05:23 UTC

Created attachment 20572
Simple JPEG put in a PDF with convert then put in a PDF/A with gs

Comment 3 Eric Companie 2021-02-06 00:08:34 UTC

Created attachment 20573
Auxiliary pdfa.ps used to build the PDF/A (provided by Ken Sharp)

Comment 4 Peter Cherepanov 2021-02-06 03:23:45 UTC

The good file has a /Title attribute in UTF16 format. Ghostscript detects it and issues a warning:
UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
The Title attribute that was set in pdfa.ps remains unchanged.

The bad file has a /Title attribute in UTF16 format that lacks the BOM marker. Ghostscript does not recognize it as UTF16 and passes it intact to the output file, where non-printable characters are not expected.
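
The distinction above can be sketched in a few lines of Python. This is only an illustration, not Ghostscript's actual C code: it assumes a PDF text string is treated as UTF-16BE only when it starts with the byte order mark FE FF. The function name and the sample title are hypothetical and are not the bytes found in the attachments.

```python
UTF16BE_BOM = b"\xfe\xff"


def has_utf16be_bom(pdf_string: bytes) -> bool:
    """Return True if the PDF text string starts with the UTF-16BE byte order mark."""
    return pdf_string.startswith(UTF16BE_BOM)


# Hypothetical titles used for illustration only.
with_bom = UTF16BE_BOM + "fox".encode("utf-16-be")
without_bom = "fox".encode("utf-16-be")  # the codec itself adds no BOM

print(has_utf16be_bom(with_bom))     # True
print(has_utf16be_bom(without_bom))  # False
print(without_bom)                   # b'\x00f\x00o\x00x'
```

Without the BOM the string is just ASCII letters interleaved with NUL bytes, which is why it ends up looking like a title containing non-printable characters.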

While Ken decides what to do about the problem: recognize UTF16 without BOM marker, or convert simple UTF16 to ASCII, or both, you can override the /Title attribute as follows:

gs ... pdfa.ps badfox.pdf -c "[/Title(New title)/DOCINFO pdfmark"

Comment 5 Ken Sharp 2021-02-08 08:12:05 UTC

(In reply to Peter Cherepanov from comment #4)

> While Ken decides what to do about the problem: recognize UTF16 without BOM
> marker, or convert simple UTF16 to ASCII, or both

It is not, unfortunately, as simple as that. The documented method for comparing the Document Information dictionary /Title with the XMP title tag is very limited and fails for quite a range of values.

The only solution is to drop the Title for a wider range of values, which is done in commit 4ab5dd6c004a252e64f26d6238799004f70d4a35

Comment 6 Ken Sharp 2021-02-08 16:23:02 UTC

I've been looking at the ImageMagick code, and it looks to me like their code for creating a /Title in a PDF file is wrong. I could be mistaken because it's not code I'm familiar with, and I don't have a copy of the current source here to debug through.

However, from perusing the code it looks like they decide whether they are producing a PDF/A file; if they are then they write the Title as UTF-16BE. This is entirely valid, but it's guaranteed to fail PDF/A validation, as a UTF-16BE string cannot possibly match a UTF-8 string in a byte-for-byte comparison.
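
To illustrate the byte-for-byte mismatch, here is a minimal Python sketch. It assumes the Document Information /Title is stored as BOM plus UTF-16BE and the XMP title as UTF-8; the title "fox" is hypothetical, and this is not the comparison code of any particular validator.

```python
title = "fox"  # hypothetical title, for illustration only

# /Title as a UTF-16BE PDF text string (BOM followed by big-endian code units).
docinfo_bytes = b"\xfe\xff" + title.encode("utf-16-be")
# The same title as it would appear in UTF-8 encoded XMP metadata.
xmp_bytes = title.encode("utf-8")

# A raw byte comparison can never succeed, even for a pure ASCII title.
print(docinfo_bytes == xmp_bytes)  # False

# Only after decoding both sides do the two titles compare equal.
decoded_docinfo = docinfo_bytes[2:].decode("utf-16-be")
print(decoded_docinfo == xmp_bytes.decode("utf-8"))  # True
```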

If they are not producing a PDF/A file then they (seem to) write the /Title as a UTF-8 string, which is also incorrect, because the /Title must be either UTF-16BE or PDFDocEncoding; UTF-8 isn't valid.
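
As a rough sketch of what a PDF producer could do instead (this is not ImageMagick's code), the hypothetical helper below emits a title as printable ASCII when possible and otherwise as UTF-16BE with a BOM. It relies on the assumption that bytes 0x20 to 0x7E mean the same thing in ASCII and PDFDocEncoding, and it deliberately omits the escaping of parentheses and backslashes that a real PDF writer also needs.

```python
def encode_pdf_text_string(title: str) -> bytes:
    """Encode a title as a PDF text string (hypothetical helper, no escaping)."""
    # Printable ASCII is assumed to be valid PDFDocEncoding as-is.
    if all(0x20 <= ord(ch) <= 0x7E for ch in title):
        return title.encode("ascii")
    # Anything else is written as UTF-16BE, prefixed with the FE FF byte order mark.
    return b"\xfe\xff" + title.encode("utf-16-be")


print(encode_pdf_text_string("fox"))     # b'fox'
print(encode_pdf_text_string("\u00e9"))  # b'\xfe\xff\x00\xe9'
```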

It would be nice to raise this issue with the ImageMagick developers if you have the ability to do that.