Summary: | ps2pdf corrupts Unicode title in PDF 1.4 XML metadata | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | John Millikin <john> |
Component: | PDF Writer | Assignee: | Default assignee <ghostpdl-bugs> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P4 | ||
Version: | 10.01.1 | ||
Hardware: | PC | ||
OS: | Linux | ||
Customer: | Word Size: | --- | |
Attachments: |
A minimal .ps file with a Unicode /Title
Output of ps2pdf13 v10.01.1
Output of ps2pdf14 v10.01.1
gdevpdfe.c patch to support UTF-16 surrogates |
Description
John Millikin
2023-04-06 12:48:17 UTC
Created attachment 23978 [details]
Output of ps2pdf13 v10.01.1
Created attachment 23979 [details]
Output of ps2pdf14 v10.01.1
It isn't exactly 'corrupted'; it's simply using the Unicode replacement glyph. There's no reason for that really. Fixed in commit dd3a13d7a1f5d22df2ceb958d262393965a99a7e by promoting UTF16 values > 0x800 to 3 bytes.

Thank you for looking into this. From looking at the gdevpdfe.c file in that diff, I noticed that it seems to reject codepoints >= U+FFFF. The following test cases (music-related) might be useful to verify that the UTF-16 -> UTF-8 conversion covers Unicode outside the BMP:

% 𝄞
/Title (\376\377\330\064\335\036)

% RUSH 🅱
/Title (\376\377\000\122\000\125\000\123\000\110\000\040\330\074\335\161)

Looking at the code, I was also surprised that Ghostscript implements its own UTF-16 decoder. Depending on the motivation for this, you might instead consider:

1. Using the existing conversion routines in a library such as libicu (permissively licensed, from unicode.org) or libiconv (which Ghostscript already has an optional dependency on).

2. If you prefer to avoid a dependency, then inlining a permissively-licensed implementation such as https://dev.w3.org/XML/encoding.c would also be an improvement over the current state.

3. If you prefer to roll your own UTF-16 -> UTF-8 conversion, then I recommend using plain codepoints (uint32_t, 32-bit unsigned int) as an intermediate stage. This will let you reference existing code and documentation for both the UTF-16 decoding and UTF-8 encoding steps.

Created attachment 23986 [details]
gdevpdfe.c patch to support UTF-16 surrogates
Attached is a patch that adds support for UTF-16 surrogates in gs_ConvertUTF16(). I tried to match the style of the rest of the file to the extent practical.
I was not able to locate any unit tests for this code, so I tested it manually with files having /Title values containing various Unicode codepoints.
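For readers following along, the "codepoint as intermediate stage" approach suggested in comment #4 can be sketched roughly as below. This is a minimal illustration only: it is neither the attached patch nor Ghostscript's gs_ConvertUTF16(), and the function name utf16be_to_utf8 and its signature are hypothetical. It assumes big-endian input with an optional leading \376\377 byte-order mark, as in the /Title test cases.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Ghostscript code): convert UTF-16BE bytes to UTF-8
 * by first decoding each unit or surrogate pair into a 32-bit codepoint.
 * Returns the number of bytes written, or -1 on malformed input or if the
 * output buffer is too small. */
static int utf16be_to_utf8(const unsigned char *in, size_t in_len,
                           unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    if (in_len % 2 != 0)
        return -1;                      /* UTF-16 needs whole 16-bit units */
    if (in_len >= 2 && in[0] == 0xFE && in[1] == 0xFF)
        i = 2;                          /* skip the big-endian BOM */

    while (i < in_len) {
        /* Stage 1: UTF-16 -> codepoint */
        uint32_t cp = ((uint32_t)in[i] << 8) | in[i + 1];
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            uint32_t lo;
            if (i + 2 > in_len)
                return -1;                           /* truncated pair */
            lo = ((uint32_t)in[i] << 8) | in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;                           /* missing low surrogate */
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                               /* unpaired low surrogate */
        }

        /* Stage 2: codepoint -> UTF-8 (1 to 4 bytes) */
        if (cp < 0x80) {
            if (o + 1 > out_cap) return -1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            if (o + 2 > out_cap) return -1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (o + 3 > out_cap) return -1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            if (o + 4 > out_cap) return -1;
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return (int)o;
}
```

With the first test case from comment #4, the surrogate pair D834 DD1E decodes to U+1D11E, which this sketch encodes as the four bytes F0 9D 84 9E. Rejecting unpaired surrogates outright is a choice made for this sketch; substituting U+FFFD would be an equally reasonable policy.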
(In reply to John Millikin from comment #4)
> Looking at the code, I was also surprised that Ghostscript implements its
> own UTF-16 decoder. Depending on the motivation for this, you might instead
> consider:
>
> 1. Using the existing conversion routines in a library such as libicu
> (permissively licensed, from unicode.org) or libiconv (which Ghostscript
> already has an optional dependency on).

Not on Windows it doesn't, for example. We need to make sure we have the code available on *all* platforms, or that it isn't vital on a platform and can be elided.

> 2. If you prefer to avoid a dependency, then inlining a
> permissively-licensed
> implementation such as https://dev.w3.org/XML/encoding.c would also be an
> improvement over current state.

Still too much code, almost all of which we don't have any use for. This is, after all, a simple task. Or it would be if ISO specs weren't such a nightmare to read...

I've replaced the code I ripped out back in 2016 while we were getting Coverity issues down to 0, and finished it off here:

91943811904f562b101b0ac410da60974b4186f2

That works with all the examples I've tried (the two you included, the example UTF16 string from the Unicode spec and a few others I concocted). I can't say it's exactly thoroughly tested but it seems to work.