Bug 706781 - German Umlauts in Info Dict not possible when generating PDF/A
Summary: German Umlauts in Info Dict not possible when generating PDF/A
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 10.01.1
Hardware: All
OS: All
Importance: P4 enhancement
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-14 10:51 UTC by stefan
Modified: 2023-06-28 14:02 UTC
CC List: 1 user

See Also:
Customer:
Word Size: ---


Attachments
Umlauts PDF/A Test (22.89 KB, application/octet-stream)
2023-06-14 11:21 UTC, stefan
Details
Test PDF with Umlauts passes PDF/A-3b validation (21.79 KB, application/octet-stream)
2023-06-14 11:48 UTC, stefan
Details
Test PDF/A-1b and validation report (21.87 KB, application/octet-stream)
2023-06-14 12:27 UTC, stefan
Details
PDFA-1b Chinese (13.35 KB, application/pdf)
2023-06-14 13:07 UTC, stefan
Details
Patch that fixes the issue (8.49 KB, patch)
2023-06-15 12:56 UTC, stefan
Details | Diff
Second version of the patch (8.50 KB, patch)
2023-06-27 10:27 UTC, stefan
Details | Diff
Patch v3 that fixes the issue (9.16 KB, text/plain)
2023-06-27 11:03 UTC, stefan
Details

Description stefan 2023-06-14 10:51:43 UTC
In gdevpdfm.c, there is the following code:

if (p->size > 9 && memcmp(p->data, "(\\376\\377", 9) == 0)
	abort = true;
else {
	int j;
	for (j = 0;j < p->size;j++)
	{
		if (p->data[j] == '\\' || p->data[j] > 0x7F || p->data[j] < 0x20)
		{
			abort = true;
			break;
		}
	}
}


The first check, memcmp(p->data, "(\\376\\377", 9), disables Unicode entirely. In the second block, you abort whenever there is a '\' or a character > 0x7F or < 0x20.

If we specify a title via /DOCINFO pdfmark on the command line, like /Title <f6e4fc>, which represents öäü encoded with PDFDocEncoding as a hex string, then this is internally translated to (\366\344\374). This is then passed to the code above in p->data. The check above then sets the abort flag because of the p->data[j] == '\\' test.

If the abort flag is set, we then see messages like this on command line:
"Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO"

So basically, only ASCII can be used here effectively.

Is this really the case and is this intentional or is it a bug?
How can we pass German umlauts for use in DOCINFO and XMP?
Comment 1 stefan 2023-06-14 11:21:08 UTC
Created attachment 24407 [details]
Umlauts PDF/A Test

The files in the zip container show the problem. Run test.bat and check the output and the resulting out.pdf file. The title is not applied, only the Author, which is plain ASCII.
Comment 2 Ken Sharp 2023-06-14 11:24:55 UTC
(In reply to stefan from comment #0)
 
> If the abort flag is set, we then see messages like this on command line:
> "Text string detected in DOCINFO cannot be represented in XMP for PDF/A1,
> discarding DOCINFO"
> 
> So basically, only ASCII can be used here effectively.
> 
> Is this really the case and is this intentional or is it a bug?

It is deliberate, per the PDF/A specification, as the message tells you.
Comment 3 stefan 2023-06-14 11:40:13 UTC
Hi Ken, I do not think this is true. I did the following test: I removed the entire code block which sets abort = true and compiled Ghostscript. Then I used umlauts so that they are embedded in the PDF file. OK so far. I checked the PDF file; the umlauts are there. Then I used a PDF validator to validate the file and, voilà, the file with the umlauts passes the test. So I think you are wrong.
Comment 4 stefan 2023-06-14 11:48:20 UTC
Created attachment 24408 [details]
Test PDF with Umlauts passes PDF/A-3b validation

See the attached file: the test PDF passes PDF/A-3b validation. The report is also available in the zip container. The test.pdf file contains umlauts in the Info dict and in the XMP.
Comment 5 Ken Sharp 2023-06-14 12:17:53 UTC
(In reply to stefan from comment #3)

> Ghostscript. Then I used umlauts so that they are embedded in the PDF file.
> OK so far. I checked the PDF file; the umlauts are there. Then I used a PDF
> validator to validate the file and, voilà, the file with the umlauts passes
> the test. So I think you are wrong.

No. Your validation report is against PDF/A-3B; try it again with PDF/A-1B or PDF/A-2B.

At no point have you given me a command line, or even mentioned a variant of PDF/A; in fact the only time you even mention PDF/A is in the subject line. So there is no reason for me to anticipate that you are producing an A-3 variant.

We don't support PDF/A-3 beyond permitting you to set the value so that ZUGFeRD files can be produced. Since I don't have a copy of the PDF/A-3 specification I can't check to see what else may be different.

Note that a PDF/A-2 conforming document is also a PDF/A-3 conforming document.
Comment 6 Ken Sharp 2023-06-14 12:26:11 UTC
(In reply to Ken Sharp from comment #5)

> At no point have you given me a command line

Actually I meant that, at the time I originally responded, there was no command line. You did (30 minutes later) attach a zip archive, but by then I had already closed the bug and replied. My comment is dated later because there was a mid-air collision with your attachment and I had to redo it.
Comment 7 stefan 2023-06-14 12:27:40 UTC
Created attachment 24409 [details]
Test PDF/A-1b and validation report

Hi Ken, thanks for responding. The same holds for the PDF/A-1b test PDF file attached to this comment. The new file is a PDF/A-1b file, and the validation report is also available in the zip container. Validation passes, and the test PDF contains umlauts in DOCINFO and XMP.

The first zip container, attached in comment 1, contains a PDF file, the command line I use and all other required files, so that you can run it.
Comment 8 Ken Sharp 2023-06-14 12:32:02 UTC
It is possible that VeraPDF have relaxed their PDF/A requirement. They insisted in the past that the XMP and PDF strings be byte-for-byte compatible, which I argued was not what the spec says; it says 'equivalent'.

But the whole reason for the code is that VeraPDF was refusing to validate PDF/A files which contain UTF-16BE PDF text strings with the corresponding XMP strings being UTF-8 (even though the actual content was the same when decoded).
Comment 9 stefan 2023-06-14 12:58:07 UTC
This is the rule in VeraPDF:

<rule object="CosInfo">
	<id specification="ISO_19005_1" clause="6.7.3" testNumber="2"/>
	<description>The value of Title entry from the document information dictionary, if present, and its
		analogous XMP property dc:title['x-default'] shall be equivalent.</description>
	<test>Title == null || Title == XMPTitle</test>
	<error>
		<message>The value of Title entry from the document Info dictionary and its matching XMP property
			dc:title['x-default'] are not equivalent (Info /Title = %1, XMP dc:title['x-default'] = %2)</message>
		<arguments>
			<argument>Title</argument>
			<argument>XMPTitle</argument>
		</arguments>
	</error>
	<references>
		<reference specification="ISO 19005-1:2005/Cor.1:2007/Cor.2:2011" clause="6.7.3"/>
	</references>
</rule>


The test is:
<test>Title == null || Title == XMPTitle</test>

So the DOCINFO Title can be null (not present) or the same value as the corresponding XMP value. I would say the encoding doesn't matter; the decoded values should be the same.

I think that's what the specification wants as well: if there is a value in the DOCINFO and in the XMP, then they should have the same value, so that two different strings are not used.
Comment 10 stefan 2023-06-14 13:07:43 UTC
Created attachment 24410 [details]
PDFA-1b Chinese

Here is a test file with Chinese characters to force Unicode. The texts are correctly embedded in the Info dict of the PDF, but the XMP does not have the right content. I would assume there is an encoding issue when generating the XMP. If the strings were encoded as UTF-8 and added to the XMP, the test should pass.
Comment 11 stefan 2023-06-14 15:06:19 UTC
I think the problem is in gs_ConvertUTF16. This function can only convert characters whose UTF-8 encoding is 1 or 2 bytes. If a character needs 3 bytes, the following code runs:

bytes = 3;
U16 = 0xFFFD;

This substitutes the replacement character (U+FFFD), which is why there is now a mismatch between DOCINFO and the XMP.

If we extended gs_ConvertUTF16 so that it can convert every UTF-16 string to UTF-8, we could solve this issue.
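
For illustration, a minimal standalone sketch of the kind of conversion proposed here (the name utf16be_to_utf8 and its buffer-based interface are assumptions, not Ghostscript's actual gs_ConvertUTF16): decode UTF-16BE, including surrogate pairs, and emit 1- to 4-byte UTF-8 sequences rather than substituting U+FFFD:

#include <stddef.h>

/* Convert a UTF-16BE byte string to UTF-8. Returns the number of bytes
 * written, or -1 on malformed input or output overflow. */
static int utf16be_to_utf8(const unsigned char *in, size_t in_len,
                           unsigned char *out, size_t out_size)
{
    size_t i = 0, o = 0;

    if (in_len % 2 != 0)
        return -1;                              /* odd length is malformed */

    while (i < in_len) {
        unsigned long cp = ((unsigned long)in[i] << 8) | in[i + 1];
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {     /* high surrogate */
            unsigned long lo;
            if (i >= in_len)
                return -1;                      /* missing low surrogate */
            lo = ((unsigned long)in[i] << 8) | in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                          /* stray low surrogate */
        }

        if (cp < 0x80) {                        /* 1-byte sequence */
            if (o + 1 > out_size) return -1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {                /* 2-byte sequence */
            if (o + 2 > out_size) return -1;
            out[o++] = 0xC0 | (unsigned char)(cp >> 6);
            out[o++] = 0x80 | (unsigned char)(cp & 0x3F);
        } else if (cp < 0x10000) {              /* 3-byte sequence */
            if (o + 3 > out_size) return -1;
            out[o++] = 0xE0 | (unsigned char)(cp >> 12);
            out[o++] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[o++] = 0x80 | (unsigned char)(cp & 0x3F);
        } else {                                /* 4-byte sequence */
            if (o + 4 > out_size) return -1;
            out[o++] = 0xF0 | (unsigned char)(cp >> 18);
            out[o++] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
            out[o++] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[o++] = 0x80 | (unsigned char)(cp & 0x3F);
        }
    }
    return (int)o;
}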
Comment 12 Ken Sharp 2023-06-14 15:21:11 UTC
(In reply to stefan from comment #11)
 
> If we extended gs_ConvertUTF16 so that it can convert every UTF-16 string
> to UTF-8, we could solve this issue.

Well, we're always open to submissions. Don't forget to fill out and return an Artifex Contributor Licence Agreement though. You may need to get permission to do so from your own employer.
Comment 13 stefan 2023-06-15 09:21:00 UTC
Ok, I'll make the changes and then post a patch.
Comment 14 Ken Sharp 2023-06-15 10:36:17 UTC
(In reply to stefan from comment #13)
> Ok, I'll make the changes and then post a patch.

Don't forget the Licence Agreement, we can't accept code without that.

Also bear in mind that we need to be able to sanitise text strings in PDFDocEncoding as well as UTF-16BE.
Comment 15 stefan 2023-06-15 11:22:42 UTC
Right. I think I have a good solution. After I have sent the Licence Agreement, can I submit the patch here?
Comment 16 Ken Sharp 2023-06-15 12:03:11 UTC
(In reply to stefan from comment #15)
> Right. I think I have a good solution. After I have sent the Licence
> Agreement, can I submit the patch here?

Certainly, yes.
Comment 17 stefan 2023-06-15 12:56:30 UTC
Created attachment 24417 [details]
Patch that fixes the issue

Attached is the patch so that it can be reviewed. I have run some tests with Unicode characters, even those that have 4-byte UTF-8 encodings. The VeraPDF tests with the PDF files created this way have been fine so far.

The License Agreement is on the way; it is still being read. I'll submit it later if there is nothing objectionable in it.

The solution accepts DOCINFO values that can be safely decoded and re-encoded as UTF-8, because then the DOCINFO values and the XMP values are the same and the PDF/A tests pass.

Please tell me your thoughts.
Comment 18 Ken Sharp 2023-06-19 14:05:25 UTC
Stefan, I just read through the CLA and it has an incorrect address in it; the office has moved since it was written and nobody thought to change it.

Can you email it to me please when you are ready and I'll send it on to our lawyer for retention.
Comment 19 stefan 2023-06-19 15:34:56 UTC
Is there an updated document? What I can download seems to be the same.

Your address on the signed document is:

Artifex Software, Inc.
1305 Grant Avenue, Suite 200
Novato, CA 94945

This seems to be correct.
Comment 20 Ken Sharp 2023-06-19 15:43:34 UTC
(In reply to stefan from comment #19)
> Is there an updated document? What I can download seems to be the same.
> 
> Your address on the signed document is:
> 
> Artifex Software, Inc.
> 1305 Grant Avenue, Suite 200
> Novato, CA 94945
> 
> This seems to be correct.

No, it really isn't. I'm not even sure we still have mail forwarding from that address.

Our current office address is:

Artifex Software, Inc.,
39 Mesa Street,
Suite 108A,
San Francisco,
CA 94129, USA

But the relevant person doesn't work out of that office; if you send it there (which you can if you like) it will have to be forwarded on to the right person, which will take time. Or email me a scanned/photographed copy and I'll forward it.
Comment 21 Ken Sharp 2023-06-19 15:45:23 UTC
If you go to www.artifex.com and scroll down to the bottom, then look at the small print above the big blue subscribe button, our office address is there.
Comment 22 Ken Sharp 2023-06-19 15:47:24 UTC
(In reply to stefan from comment #19)
> Is there an updated document? What I can download seems to be the same.

Not yet, because I only just noticed it :-(

I did email the legal people and pointed out the problem, but it'll probably be days before that gets fixed.
Comment 23 stefan 2023-06-19 16:30:00 UTC
I have updated the address in the current document and filled out the form again. A new email is on the way, sent to your info email address.
Comment 24 Ken Sharp 2023-06-27 09:22:24 UTC
Hi Stefan. I finally managed to track down the CLA and so I've started looking at the code you have supplied.

I'm afraid I cannot cleanly apply the patch as provided, because you seem to have based it on the 10.01.0 release, not the current code. The commit adding surrogate pairs for bug #706551 is missing, which means the patch won't apply.

Now your code does include the surrogate pairs, but you've recast the code in a way which I can't easily follow or transfer to the current code, because it is difficult to pick out the changes.

Can I ask you to redo the patch against the current code, please?

Please also do not use C++ comments ("//"), and I'd be much happier if you did not rename the function gs_ConvertUTF16 and its parameters.
Comment 25 stefan 2023-06-27 09:48:39 UTC
Sure, give me a few minutes.
Comment 26 stefan 2023-06-27 10:27:07 UTC
Created attachment 24442 [details]
Second version of the patch

Here is the new patch. It uses C comments and keeps the function name gs_ConvertUTF16.
Comment 27 Ken Sharp 2023-06-27 10:44:27 UTC
I'm sorry Stefan, but that still doesn't seem to be a diff against the current (HEAD) code in our Git repository; it looks like it's a diff against the 10.01.0 release.

For example:

-        if (U16 >= 0xD800 && U16 <= 0xDBFF) {
-            return gs_note_error(gs_error_rangecheck);
+        /* Decode surrogates */
+        if (i >= 0xD800 && i <= 0xDBFF) {
+            /* High surrogate. Must be followed by a low surrogate, or this is a failure. */
+            if (u16_len == 0) {

So that's saying the old code was:

        if (U16 >= 0xD800 && U16 <= 0xDBFF) {
            return gs_note_error(gs_error_rangecheck);

Whereas the current code in our Git repository has this:

        if (U16 >= 0xD800 && U16 <= 0xDBFF) {
            /* Ensure at least two bytes of input left */
            if (i == (UTF16Len / sizeof(short)) - 1)
                return gs_note_error(gs_error_rangecheck);

            U32 += (U16 & 0x3FF) << 10;
            U16 = (*(UTF16++) << 8);
            U16 += *(UTF16++);
            i++;

The differences between the 10.01.0 release and the current code mean that 'patch' can't apply the diff, and I can't blame it because I'm not at all confident of doing the job manually either, without making a mistake.
Comment 28 Ken Sharp 2023-06-27 10:45:58 UTC
If you can't diff against HEAD, can I suggest sending me a complete replacement for the gs_ConvertUTF16() function? That should avoid any problems.
Comment 29 stefan 2023-06-27 10:56:23 UTC
OK, I see, you are right. Let me create a new patch against HEAD.
Comment 30 stefan 2023-06-27 11:03:53 UTC
Created attachment 24443 [details]
Patch v3 that fixes the issue

New patch against HEAD that fixes the issue
Comment 31 Ken Sharp 2023-06-27 11:08:14 UTC
That seems to have applied cleanly, thanks! I'll go and start testing it now.
Comment 32 Ken Sharp 2023-06-27 15:07:56 UTC
Removing this test:

-            if (p->size > 9 && memcmp(p->data, "(\\376\\377", 9) == 0)
-                abort = true;
-            else {
-                int j;
-                for (j = 0;j < p->size;j++)
-                {
-                    if (p->data[j] == '\\' || p->data[j] > 0x7F || p->data[j] < 0x20)
-                    {
-                        abort = true;
-                        break;
-                    }
-                }

prevents the checks for sensible characters in PDFDocEncoding, which re-introduces bug #703486. I'm going to add back the PDFDocEncoding check.
Comment 33 stefan 2023-06-27 15:43:14 UTC
The following code aborts if Unicode is used, and this should definitely be allowed:
p->size > 9 && memcmp(p->data, "(\\376\\377", 9)


The following code aborts if characters outside the ASCII range, or backslash-escaped characters, are used; this is definitely wrong and needs to be changed or removed:
p->data[j] == '\\' || p->data[j] > 0x7F || p->data[j] < 0x20
Comment 34 Ken Sharp 2023-06-27 16:01:43 UTC
(In reply to stefan from comment #33)
> The following code aborts if Unicode is used, and this should definitely be
> allowed:
> p->size > 9 && memcmp(p->data, "(\\376\\377", 9)
> 
> 
> The following code aborts if characters outside the ASCII range, or
> backslash-escaped characters, are used; this is definitely wrong and needs
> to be changed or removed:
> p->data[j] == '\\' || p->data[j] > 0x7F || p->data[j] < 0x20

Then you will re-introduce bug #703486, as I mentioned in comment #32. I won't accept an enhancement which introduces a regression.

This half of the test **only** applies to PDFDocEncoding, because UTF-16BE is dealt with in the first part of the test. As I said, I plan to add back the PDFDocEncoding check.

If you don't find that acceptable, then you'll need to also provide a function to convert strings in PDFDocEncoding reliably to UTF-8. I highlighted this in comment #14.
Comment 35 stefan 2023-06-27 16:09:55 UTC
I think we can extend the ASCII range. Actually we only need to check whether the characters in p->data have a value in PDFDocEncoding and, if they do, then we can also cleanly convert to UTF-8. Correct? Then we could allow all PDFDocEncoding chars.
Comment 36 Ken Sharp 2023-06-27 16:44:03 UTC
(In reply to stefan from comment #35)
> I think we can extend the ASCII range. Actually we only need to check
> whether the characters in p->data have a value in PDFDocEncoding and, if
> they do, then we can also cleanly convert to UTF-8. Correct? Then we could
> allow all PDFDocEncoding chars.

I haven't checked what the full range of glyphs in PDFDocEncoding is, and I'm not convinced that, outside the ASCII range, they map directly to UTF-8. Certainly the bug I've highlighted uses 0x00, which obviously isn't going to work.

I also have not checked that PDF validators are able to deal sensibly with PDFDocEncoding and compare it to UTF-8 outside the 7-bit ASCII range, which is why we use the restricted range; I'm reasonably certain that the PDFDocEncoding and UTF-8 bytes are the same for that range, and that PDF/A validators will accept them as the same.

It 'looks like' much of PDFDocEncoding maps directly to Unicode code points (U+00xx, where xx is the hex value of the PDFDocEncoding character) but I wouldn't want to rely on that without checking all 255 characters (excluding 0x00, obviously).

Hmm, well, PDFDocEncoding has character code 0x18 (octal 030) listed as breve, while Unicode code point U+0018 is listed as 'Cancel', so not the same at all. All of the low values (below 0x20) look dodgy in fact, though values > 0x7F look compatible on a quick scan.
Comment 37 stefan 2023-06-27 16:56:41 UTC
In fact, the data in p->data does not have to be byte-compatible with UTF-8. That is not required. The following is what matters:

utf8Data = convertToUtf8(p->data, PDFDocEncoding);

utf8Data should match the UTF8 data in XMP.

So, if we can convert every character in p->data from PDFDocEncoding to UTF-8, then we are safe and the VeraPDF validator reports OK.
Comment 38 Ken Sharp 2023-06-27 18:33:14 UTC
(In reply to stefan from comment #37)
> In fact, the data in p->data does not have to be byte-compatible with UTF-8.
> That is not required. The following is what matters:
> 
> utf8Data = convertToUtf8(p->data, PDFDocEncoding);
> 
> utf8Data should match the UTF8 data in XMP.

Yes, but that will not (or at least may not) match the bytes in the PDFDocEncoding string.

 
> So, if we can convert every character in p->data from PDFDocEncoding to
> UTF-8, then we are safe and the VeraPDF validator reports OK.

I am seriously doubtful about that.

I just found Appendix D.2 in the PDF 1.7 Reference and 'most' cases are OK, but there are definite exceptions. For example, small tilde is 0x1F in PDFDocEncoding, but Unicode 0x1F is 'information separator 1'.

The Unicode equivalent of small tilde is U+02DC. Converting that to UTF-8 will not produce a single 0x1F byte.

Values from 0x18 to 0x1E in PDFDocEncoding are all accents, and are completely different in UTF-8.

So I'm still inclined to disallow values below 0x20 as I said.

PDFDocEncoding values above 0x7F (which is itself listed as undefined) up to 0xA0 do not map simply to U+00xx, and 0xAD is undefined.

So I do not believe that simply converting these values to UTF-8 will yield correct results, in the sense that a string drawn using PDFDocEncoding and one using the same bytes converted to UTF-8 will give the same set of glyphs. That's what I would define as being 'equivalent'.

I'm certainly prepared to believe that PDF validators don't properly check PDFDocEncoding; I'm quite prepared to believe that all they do is check that the PDFDocEncoding bytes are identical to the bytes in the UTF-8 representation, which matches previously observed behaviour of PDF validators.

So either we need to create a table like the one in Appendix D.1, which maps the PDFDocEncoding character codes into UTF-16, and then pass that through the conversion to UTF-8, or we need to disallow character codes which result in the bytes of the PDFDocEncoding string not matching the bytes of the UTF-8 string (if we assume that the PDF validators just look for PDFDocEncoding == UTF-8).

It should be possible to construct a pdfmark which writes DocInfo to prove what's happening.
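
As a rough illustration of the table option, a hedged sketch (hypothetical helper name, not the lookup Ghostscript ended up using): the 0x18-0x1F entries below are taken from Appendix D of the PDF 1.7 Reference, the 0x80-0xA0 block is elided, and undefined codes are rejected:

/* Map one PDFDocEncoding byte to a Unicode code point, or -1 if the code
 * is undefined in PDFDocEncoding. */
static int pdfdoc_to_unicode(unsigned char c)
{
    /* 0x18..0x1F are accents, nothing like Unicode U+0018..U+001F */
    static const int low_divergent[8] = {
        0x02D8, 0x02C7, 0x02C6, 0x02D9,   /* breve, caron, circumflex, dotaccent */
        0x02DD, 0x02DB, 0x02DA, 0x02DC    /* hungarumlaut, ogonek, ring, tilde   */
    };

    if (c == 0x00 || c == 0x7F || c == 0xAD)
        return -1;                        /* undefined in PDFDocEncoding */
    if (c >= 0x18 && c <= 0x1F)
        return low_divergent[c - 0x18];
    if (c < 0x18)
        return -1;                        /* other control codes rejected here for simplicity */
    if (c < 0x80)
        return c;                         /* ASCII maps straight through */
    if (c <= 0xA0) {
        /* Bullets, dashes, quotes, ligatures, the Euro sign, etc. each need
         * their own table entry; elided in this sketch. */
        return -1;
    }
    return c;                             /* 0xA1..0xFF match Latin-1 */
}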
Comment 39 stefan 2023-06-27 20:00:58 UTC
In my example above, the function

utf8Data = convertToUtf8(p->data, PDFDocEncoding);

would internally translate the bytes in p->data to UTF-8. This means that we have to decode with PDFDocEncoding. This gives us Unicode code points, if the conversion is possible. These Unicode code points can then be converted to UTF-8. If this conversion can be done then everything is OK, because the value in p->data would then be the same as in the XMP.

This is what I mean by "value": the values have to be the same, not the encoded bytes.

I mean, hex_encode('hello') and base64_encode('hello') have different results, but the actual value which has been encoded is the same, and if we undo the corresponding encoding via hex_decode and base64_decode, we get the same value back.

> For example small tilde is 0x1F in PDFDocEncoding, but
> UTF-8 0x1F is 'information separator 1'.

Right, the bytes are different because they are encoded. The decoded results have to be the same.

> PDFDocncoding values above 0x7F (which is listed as undefined itself)
> to 0xA0 do not map simply to U+00xx.

Right, same as above: an encoding is in place, so we have to decode first and then compare the Unicode code points.

> So either we need to create a table like that in Appendix D.1
> which maps the PDFDocEncoding character codes into UTF16,
> and then pass that through the conversion to UTF-8..

Yes, decode the PDFDocEncoding-encoded bytes to get Unicode values, then convert these values to UTF-8. That's what pdf_xmp_write_translated does internally.
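
For illustration only, a self-contained sketch of that two-step decode-then-encode pipeline, applied to the öäü example from the description (hypothetical names; only the Latin-1-compatible range of PDFDocEncoding is handled here, which is enough for these three bytes):

#include <stdio.h>

/* Encode one Unicode code point (< U+10000) as UTF-8; returns bytes written. */
static int put_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80)  { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) { out[0] = 0xC0 | (unsigned char)(cp >> 6);
                      out[1] = 0x80 | (unsigned char)(cp & 0x3F); return 2; }
    out[0] = 0xE0 | (unsigned char)(cp >> 12);
    out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
    out[2] = 0x80 | (unsigned char)(cp & 0x3F);
    return 3;
}

int main(void)
{
    /* "öäü" in PDFDocEncoding, i.e. the /Title <f6e4fc> from the description */
    const unsigned char docinfo[] = { 0xF6, 0xE4, 0xFC };
    unsigned char xmp[16];
    int i, n = 0;

    for (i = 0; i < 3; i++) {
        /* 0xA1..0xFF in PDFDocEncoding coincide with U+00A1..U+00FF
         * (0xAD excepted), so the decode step is the identity here. */
        unsigned long cp = docinfo[i];
        n += put_utf8(cp, xmp + n);
    }
    fwrite(xmp, 1, (size_t)n, stdout);  /* the UTF-8 bytes for the XMP: "öäü" */
    putchar('\n');
    return 0;
}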
Comment 40 Ken Sharp 2023-06-28 12:57:38 UTC
OK I've made a commit here: 28ed63e2c5a55ac320fefb7c3034c7242187bbc3

This diverges significantly from the code in your patches. In particular I've chosen not to alter gs_ConvertUTF16(), because I don't see a need to do so with the current code. If you come up with a UTF-16BE string which it doesn't convert correctly to UTF-8 then we can revisit that.

I have added extra validation to the PDFDocEncoding to UTF-16 lookup so that we now validate values below 0x20 and translate the few legal character codes in that range into UTF-16BE, in addition to the codes in the 0x80 to 0xAD range (and the prohibited 0x7F code).

I've tested this with a bunch of files that proved problematic in the past, as well as some invalid character codes in the PDFDocEncoding range and a few values (0x1F, for example) which I thought might be problematic. Finally I tested with character codes which must be escaped in PDFDocEncoding (e.g. the '(' and ')' codes).

Checking the results manually looked correct to me.

In all cases VeraPDF validated the PDF/A file without complaint. So it appears to be doing a good job of determining 'equivalence' between PDFDocEncoding, UTF-16BE and UTF-8 now.
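
For reference, a hypothetical sketch (not the code from the commit above) of resolving the literal-string escapes mentioned here, so that \( \) \\ and \366-style octal escapes are undone before the PDFDocEncoding conversion; this follows the literal-string rules in the PDF specification, with the backslash-newline line-continuation case omitted for brevity:

/* Undo PDF literal-string escapes; returns the number of bytes written to out. */
static int pdf_unescape_string(const unsigned char *in, int in_len,
                               unsigned char *out)
{
    int i = 0, o = 0;

    while (i < in_len) {
        if (in[i] != '\\') {
            out[o++] = in[i++];
            continue;
        }
        if (++i >= in_len)
            break;                          /* trailing backslash: drop it */
        switch (in[i]) {
        case 'n': out[o++] = '\n'; i++; break;
        case 'r': out[o++] = '\r'; i++; break;
        case 't': out[o++] = '\t'; i++; break;
        case 'b': out[o++] = '\b'; i++; break;
        case 'f': out[o++] = '\f'; i++; break;
        default:
            if (in[i] >= '0' && in[i] <= '7') {
                /* up to three octal digits, e.g. \366 for o-umlaut */
                int v = 0, d = 0;
                while (d < 3 && i < in_len && in[i] >= '0' && in[i] <= '7') {
                    v = v * 8 + (in[i] - '0');
                    i++; d++;
                }
                out[o++] = (unsigned char)(v & 0xFF);
            } else {
                out[o++] = in[i++];         /* \( \) \\ and anything else */
            }
            break;
        }
    }
    return o;
}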
Comment 41 stefan 2023-06-28 14:02:28 UTC
Cool, I'll have a look later.