Bug 703428 - UTF-8 in metadata title shouldn't get mangled
Summary: UTF-8 in metadata title shouldn't get mangled
Status: RESOLVED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.27
Hardware: PC Linux
Importance: P4 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-28 10:19 UTC by ghostscript
Modified: 2021-01-29 15:46 UTC
CC List: 1 user

See Also:
Customer:
Word Size: ---


Attachments
Original PDF (16.29 KB, application/pdf)
2021-01-28 10:19 UTC, ghostscript
gs-processed PDF (17.69 KB, application/pdf)
2021-01-28 10:20 UTC, ghostscript

Description ghostscript 2021-01-28 10:19:13 UTC
Created attachment 20516 [details]
Original PDF

I have a PDF file (generated by Prince) with non-latin characters (e.g. ☙ – ❧) in the metadata title that I want to run through gs pdfwrite for optimization/linearization - but the non-latin characters end up mangled.

I originally thought this might be a bug in Prince (used to generate the original test.pdf from HTML), but it looks like it might be a gs bug: https://www.princexml.com/forum/topic/4519/utf-8-html-title-gives-broken-pdf-title

Files from a reduced test case attached, using this command:

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=optimised.pdf test.pdf
Comment 1 ghostscript 2021-01-28 10:20:11 UTC
Created attachment 20517 [details]
gs-processed PDF
Comment 2 ghostscript 2021-01-28 11:31:48 UTC
The original PDF has:

/Title <FEFF005400690074006C00650020007400610067002000770069007400680020005500540046002D0038003A0020261920132767>>>


...and gs converts it to:

/Title(\376\377\000T\000i\000t\000l\000e\000 \000t\000a\000g\000 \000w\000i\000t\000h\000 \000U\000T\000F\000-\0008\000:\000 &\031 \023'g)

...which isn't correct.
Comment 3 Peter Cherepanov 2021-01-28 20:12:18 UTC
Why do you think that Ghostscript is incorrect? It just converts your string to a different format. The equality of these strings can be checked as follows:
gs -q -dNOPAUSE -dBATCH -dNODISPLAY -c "<FEFF005400690074006C00650020007400610067002000770069007400680020005500540046002D0038003A0020261920132767> (\376\377\000T\000i\000t\000l\000e\000 \000t\000a\000g\000 \000w\000i\000t\000h\000 \000U\000T\000F\000-\0008\000:\000 &\031 \023'g) eq =="
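
If you'd rather not take Ghostscript's word for it, the same comparison can be done outside Ghostscript; here is a small Python 3 sketch (just an illustration, with the byte values copied from the two strings above):

hex_form = bytes.fromhex(
    "FEFF005400690074006C00650020007400610067002000770069007400680020"
    "005500540046002D0038003A0020261920132767")

# Python bytes literals happen to use the same \ooo octal escapes as PDF
# literal strings, so the escaped form can be pasted almost verbatim.
literal_form = (b"\376\377\000T\000i\000t\000l\000e\000 \000t\000a\000g\000 "
                b"\000w\000i\000t\000h\000 \000U\000T\000F\000-\0008\000:\000 "
                b"&\031 \023'g")

print(hex_form == literal_form)    # True: same bytes, just written differently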
Comment 4 ghostscript 2021-01-29 11:00:01 UTC
Thanks for the quick response.

Sorry, I should have said "which doesn't look correct - but I'm not sure how to determine that for sure".

Unfortunately, I don't know much about low-level text encoding, so I'm struggling to prove to my satisfaction whether gs is or isn't mangling the characters.

Mike at Prince/YesLogic seems to think that "the special characters at the end of the string don't look like they are encoded correctly".


> Why do you think that Ghostscript is incorrect?

Superficially, in the gs output, I can see the BOM, then the latin characters "Title tag with UTF-8:", but the last few (" ☙–❧") are "\000 &\031 \023'g"

What does that break down into?

"\000 "   [space character]
"&"       [some Unicode flag?]
"\031 "   [???] 
"\023'g"  [???]

What is that format?
The group of '&\031 \023'g' doesn't seem big enough for three characters with multiple-byte encoding (U+2619 U+2013 U+2767).


If I understand the equality check in your comment, you're asking *Ghostscript* whether it thinks the strings are equivalent - which doesn't seem like it would necessarily be reliable if gs got it wrong in the first place.


It might be that gs is correct (I read about similar problems that suggest maybe Prince is not using the right encoding - although it looks like UTF-16 to me) - I just can't figure out how to prove it one way or the other!
Comment 5 Ken Sharp 2021-01-29 12:25:13 UTC
(In reply to ghostscript from comment #4)

> Unfortunately, I don't know much about low-level text encoding, so I'm
> stuggling to prove to my satisfaction whether gs is or isn't mangling the
> characters.

There is no text encoding here. You need to think of strings in PDF files as byte arrays rather than having specific textual meanings. The context is supplied externally by the Font Encoding when the string data is used for a text operation, or implicitly as in this case.

Strings in PDF are discussed in section 3.2.3 (page 53) of the 1.7 PDF Reference Manual. I would point particularly to table 3.2 on page 54 which covers escape sequences in literal strings, for reference later in this answer.


In the case of DocInfo strings they are *either* encoded as PDFDocEncoding, *or* UTF-16BE. Note that there is no means to encode them as UTF-8, so you can't have UTF-8 in the Title member of the Document Info dictionary. If the string begins with a UTF-16BE BOM then it's UTF-16BE, otherwise it is PDFDocEncoding.
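
Expressed as a rough sketch (this is an illustration, not Ghostscript's actual code; Latin-1 is only an approximate stand-in for PDFDocEncoding):

def decode_info_string(raw: bytes) -> str:
    # Document Information strings: UTF-16BE if they begin with the byte
    # order mark FE FF, otherwise PDFDocEncoding.
    if raw.startswith(b"\xFE\xFF"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")   # crude stand-in for PDFDocEncoding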

*If* the PDF file has XML metadata (it is optional), then the content of the XML 'title' tag must be the same as the Document Information dictionary /Title value, and must be encoded as UTF-8 (because this is XML).

Your original PDF file doesn't have any XML Metadata (and note what you seem to be referring to as Metadata is in fact entries in the document information dictionary, not metadata in the sense of the PDF specification). Ghostscript adds one for you and the title member there ends with (hex values):

20 E2 98 99 E2 80 93 E2 9D A7

Using an online UTF-8 to UTF-16 converter I get:

 \u2619\u2013\u2767

I did go through the exercise of converting the UTF-8 by hand as well and got the same answer. The process is somewhat lengthy so I'd suggest you use an online tool if you want to check this. Presumably you know what the original UTF-8 was, so this is a useful check for you to see that the result is correct.
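
If you want to repeat that conversion yourself without an online tool, it only takes a couple of lines; a minimal Python sketch, using the hex bytes quoted above:

tail_utf8 = bytes.fromhex("20 E2 98 99 E2 80 93 E2 9D A7")
text = tail_utf8.decode("utf-8")
print(text)                          # ' ☙–❧'
print([hex(ord(c)) for c in text])   # ['0x20', '0x2619', '0x2013', '0x2767']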


> Mike at Prince/YesLogic seems to think that "the special characters at the
> end of the string don't look like they are encoded correctly".

Like Peter, I would disagree with him. I've worked through the example below; you may like to refer the Prince developers to this answer. All of this is, of course, covered in the PDF Reference Manual.


> Superficially, in the gs output, I can see the BOM, then the latin
> characters "Title tag with UTF-8:", but the last few (" ☙–❧") are "\000
> &\031 \023'g"

You've failed to notice that the 'latin' characters are all preceded by \000, the octal value for 0 (see below). This is important because these are UTF-16BE, and therefore 2-byte values.

 
> What does that break down into?
> 
> "\000 "   [space character]

No. That is a single byte, encoded as octal. In PDF strings the '\' character introduces an escape; if the following characters are three numeric digits then it's an octal value. Octal 000 has the value 0.
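
For example (any language will do, this just happens to be Python):

print(int("000", 8))                  # 0
print(int("376", 8), int("377", 8))   # 254 255, i.e. 0xFE 0xFF (the BOM bytes)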

> "&"       [some Unicode flag?]

No, it's a byte value. I imagine your editor is treating it as ASCII and displaying it as a '&'; it is simply the byte value 0x26, decimal 38, octal 046.

> "\031 "   [???] 

Again, that's octal: 031 = hex 19, decimal 25.

> "\023'g"  [???]
> 
> What is that format?

It's an array of bytes (see above).

> The group of '&\031 \023'g' doesn't seem big enough for three characters
> with multiple-byte encoding (U+2619 U+2013 U+2767).

6 bytes; why would that be insufficient to hold 3 UTF-16BE code points? Each code point is 16 bits = 2 bytes, and 3 characters at 2 bytes per character = 6 bytes.
Of course it is insufficient to hold these 3 characters as UTF-8, but we aren't using UTF-8.


> It might be that gs is correct (I read about similar problems that suggest
> maybe Prince is not using the right encoding - although it looks like UTF-16
> to me) - I just can't figure out how to prove it one way or the other!

It is UTF-16BE. To be a valid string in the document information dictionary it must be either UTF-16BE or PDFDocEncoding (the 'external context' here is that this is specified as being the case for these strings). PDFDocEncoding only contains 255 characters and so cannot cover even a reasonable subset of the Unicode range. Also, of course, the string begins with the UTF-16BE BOM:

\376\377

That's two octal bytes, converting to hex we get:  0xFE 0xFF

To turn the bytes you are concerned with into Unicode code points, you need to start by turning all the bytes in the string into hex.

\000 = 0x00
" "  = 0x20
"&"  = 0x26
\031 = 0x19
" "  = 0x20
\023 = 0x13
"'"  = 0x27
"g"  = 0x67

So that's 

0x0020
0x2619
0x2013
0x2767
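
Done mechanically, the pairing looks like this (a sketch; the tail of the string is written here with Python's octal escapes, which match the PDF ones):

tail = b"\000 &\031 \023'g"
pairs = [int.from_bytes(tail[i:i+2], "big") for i in range(0, len(tail), 2)]
print([hex(cp) for cp in pairs])   # ['0x20', '0x2619', '0x2013', '0x2767']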

If we look at the Title hex string from your original file it ends with:

0020261920132767

Which is 0x0020, 0x2619, 0x2013, 0x2767

So the output string is the same as the input string (just not UTF-8) and indeed matches the UTF-8 value encoded in the XML Metadata.

QED.

On a final note: if you open the original and Ghostscript-output PDF files in Adobe Acrobat and then open the 'Document Properties' dialog (File->Properties in my version of Acrobat), you will see that the 'Title' is displayed the same for both files, and contains the characters you noted in comment #0.

That in itself should be sufficient to show that the Ghostscript-produced file is correct, but since your question discussed Metadata it was not initially obvious that you were talking about the Document Information dictionary entry rather than the XML Metadata contents. Acrobat (or at least my version of it) displays the value from the document information dictionary, not the XML.
Comment 6 Ken Sharp 2021-01-29 12:37:15 UTC
(In reply to ghostscript from comment #0)

> I have a PDF file (generated by Prince) with non-latin characters (e.g. ☙ –
> ❧) in the metadata title that I want to run through gs pdfwrite for
> optimization/linearization - but the non-latin characters end up mangled.

The following is my standard response to anyone who mentions 'optimising' PDF files using Ghostscript.

Ghostscript's pdfwrite device does not 'optimise' PDF files. It produces new PDF files whose visual appearance should be equivalent to the appearance of the original input (unless quality is specifically degraded by use of the various controls such as image downsampling).

The process is described here:

https://www.ghostscript.com/doc/9.53.3/VectorDevices.htm#Overview

When the input is PDF it is important to notice that the internal (PDF) representation, and format, of the marking operations from the original input PDF file will not be the same in the produced file. In certain workflows this may cause problems.

In addition, non-marking objects or metadata may be dropped from the output PDF file.

Personally I would not linearise PDF files. Very few (really, very few) PDF consumers bother with special implementations for linearised PDF files. Also the linearised files only ever speed up the loading of the first page, at the cost of increasing the file size.

It is also impossible to linearise PDF files produced for later versions of the PDF specification (I think PDF 1.5 files using compressed objects and xref streams).

In short it's a rarely implemented, somewhat obsolete feature with very limited value even when it is supported.
Comment 7 ghostscript 2021-01-29 13:53:07 UTC
Wow, thanks Ken, that explains it really clearly and has helped me immensely - maybe even reduced my ignorance a bit :-)


Re optimising: Yes, I see your point. I was mainly using/abusing it for reduced file size, and included the linearize because it was there (-dFastWebView=true -dPDFSETTINGS=/printer -dPrinted=false).

I guess I should review the results and check if it's really worth it.


Sorry for wasting your and Peter's time, and thanks for your work!
Comment 8 Ken Sharp 2021-01-29 14:06:10 UTC
(In reply to ghostscript from comment #7)
> Wow, thanks Ken, that explains it really clearly and has helped me immensely
> - maybe even reduced my ignorance a bit :-)

Well these replies turn up on Google searches, so investing the time on an explanation may mean someone else can find what they need too.

A couple of things I should have said: first, you are using 9.27, which is now a little old and has known security vulnerabilities (which almost certainly won't affect you, but still), so I'd recommend upgrading. The current version is 9.53.3, and a new version, 9.54, should be released in March.

Just because I can't see a problem doesn't mean there isn't one! If you still think there's a bug and something is being corrupted then I'm happy to look into it further. But I do need to know what I'm looking for, so if you could explain what leads you to think there's a problem that would be very helpful.


> Re optimising: Yes, I see your point. I was mainly using/abusing it for
> reduced file size, and included the linearize because it was there
> (-dFastWebView=true -dPDFSETTINGS=/printer -dPrinted=false).
> 
> I guess I should review the results and check if it's really worth it.

The output of pdfwrite does sometimes prove to be (usually slightly) more efficient than that produced by some PDF producers. YMMV. But it is important to realise that stuff might go missing too. In particular we do not currently preserve marked content, and some kinds of annotations (movies and sounds for example). We do make improvements from time to time, especially when customers request features, so this is something of an ever-improving story.


> Sorry for wasting your and Peter's time, and thanks for your work!

You're very welcome.
Comment 9 ghostscript 2021-01-29 15:46:31 UTC
> 9.27 which is now a little old

I'm using Debian Buster, and I think they've backported at least some CVE fixes. Will get 9.53.3 in Bullseye in a few months.


> If you still think there's a bug

Nope. I tried the Adobe PDF reader which - as you said - showed the characters correctly. It must be that the Buster PDF reader I used earlier (Evince) doesn't interpret the 'Title' correctly.

Thanks again!