Created attachment 23977 [details]
A minimal .ps file with a Unicode /Title

I am trying to use Lilypond to typeset sheet music with a non-ASCII title. The resulting PDF has an incorrect title in Evince, and after some experimentation I think the bug is located within Ghostscript.

A .ly file containing this metadata:

\header {
  title = "文字化け"
}

will generate a PostScript file with a /Title containing the equivalent raw UTF-16 bytes.

$ python3
>>> doc_ps = open("doc.ps", "rb").read()
>>> title_idx = doc_ps.find(b"/Title")
>>> doc_ps[title_idx:title_idx+20]
b'/Title (\xfe\xffe\x87[WS\x160Q)\n'
>>> b"\xfe\xffe\x87[WS\x160Q".decode("utf16")
'文字化け'

Using the minimal PS at <https://bugs.ghostscript.com/show_bug.cgi?id=693477> as a template, I created the attached minimal.ps file containing the following three lines. I believe the /Title line is equivalent to the Lilypond output:

showpage
[ /Title (\376\377\145\207\133\127\123\026\060\121)
/DOCINFO pdfmark

When converted with ps2pdf14, the XML metadata appears to be incorrectly encoded:

$ gs -version
GPL Ghostscript 10.01.1 (2023-03-27)
Copyright (C) 2023 Artifex Software, Inc. All rights reserved.
$ ps2pdf14 minimal.ps
$ pdfinfo minimal.pdf | grep Title
Title: 文字化け
$ pdfinfo -meta minimal.pdf | grep dc:title
<rdf:Description rdf:about="" xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>����</rdf:li></rdf:Alt></dc:title></rdf:Description>

Note how the <dc:title> text content is corrupted. This value is what Evince shows to the user.
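For anyone who wants to double-check that the octal-escaped /Title in minimal.ps carries the same bytes as the Lilypond output, here is a minimal sketch. The helper name ps_string_to_bytes is made up for illustration, and it assumes the string body contains only plain ASCII and \ddd octal escapes (no nested parentheses or other PostScript escapes).

```python
import re


def ps_string_to_bytes(ps_literal: str) -> bytes:
    """Decode a PostScript string body made of ASCII and \\ddd octal escapes.

    Hypothetical helper for illustration only; it does not handle the other
    PostScript escapes (\\n, \\(, \\), line continuations, and so on).
    """
    out = bytearray()
    pos = 0
    for m in re.finditer(r"\\([0-7]{1,3})", ps_literal):
        # Copy any literal ASCII that precedes this escape.
        out += ps_literal[pos:m.start()].encode("ascii")
        # PostScript keeps only the low 8 bits of an octal escape.
        out.append(int(m.group(1), 8) & 0xFF)
        pos = m.end()
    out += ps_literal[pos:].encode("ascii")
    return bytes(out)


# The string body from the /Title line of minimal.ps.
title_octal = r"\376\377\145\207\133\127\123\026\060\121"
raw = ps_string_to_bytes(title_octal)

# Same bytes as the slice of doc.ps shown in the python3 session above.
assert raw == b"\xfe\xffe\x87[WS\x160Q"
print(raw.decode("utf-16"))
```

The "utf-16" codec consumes the leading FE FF byte order mark, so the decoded text does not include it.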
Created attachment 23978 [details] Output of ps2pdf13 v10.01.1
Created attachment 23979 [details] Output of ps2pdf14 v10.01.1
It isn't exactly 'corrupted', it's simply using the Unicode replacement glyph. There's no reason for that really, fixed in commit dd3a13d7a1f5d22df2ceb958d262393965a99a7e by promoting UTF16 values > 0x800 to 3 bytes.
Thank you for looking into this.

From looking at the gdevpdfe.c file in that diff, I noticed that it seems to reject codepoints >= U+FFFF. The following (music-related) test cases might be useful to verify that the UTF-16 -> UTF-8 conversion covers Unicode outside the BMP:

% 𝄞
/Title (\376\377\330\064\335\036)

% RUSH 🅱
/Title (\376\377\000\122\000\125\000\123\000\110\000\040\330\074\335\161)

Looking at the code, I was also surprised that Ghostscript implements its own UTF-16 decoder. Depending on the motivation for this, you might instead consider:

1. Using the existing conversion routines in a library such as libicu (permissively licensed, from unicode.org) or libiconv (which Ghostscript already has an optional dependency on).

2. If you prefer to avoid a dependency, then inlining a permissively-licensed implementation such as https://dev.w3.org/XML/encoding.c would also be an improvement over the current state.

3. If you prefer to roll your own UTF-16 -> UTF-8 conversion, then I recommend using plain codepoints (uint32_t, 32-bit unsigned int) as an intermediate stage. This will let you reference existing code and documentation for both the UTF-16 decoding and UTF-8 encoding steps.
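To illustrate the third option, here is a rough Python sketch of the two-stage approach: UTF-16BE code units are first decoded to plain codepoints (combining surrogate pairs), and each codepoint is then encoded to UTF-8. This is not the Ghostscript code and not the attached patch; the function names are invented for this example, and it only shows the shape such a conversion could take.

```python
from typing import List


def utf16be_to_codepoints(data: bytes) -> List[int]:
    """Stage 1: big-endian UTF-16 bytes -> list of Unicode codepoints."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    units = [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]
    if units and units[0] == 0xFEFF:
        units = units[1:]  # drop the byte order mark
    cps = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: needs a low surrogate next
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cps.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cps.append(u)
            i += 1
    return cps


def codepoint_to_utf8(cp: int) -> bytes:
    """Stage 2: one codepoint -> 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("codepoint out of range")


def utf16be_to_utf8(data: bytes) -> bytes:
    return b"".join(codepoint_to_utf8(cp) for cp in utf16be_to_codepoints(data))


# The two /Title test cases above, written as raw bytes.
clef = b"\xfe\xff\xd8\x34\xdd\x1e"
rush = b"\xfe\xff\x00R\x00U\x00S\x00H\x00 \xd8\x3c\xdd\x71"
assert utf16be_to_utf8(clef) == "𝄞".encode("utf-8")
assert utf16be_to_utf8(rush) == "RUSH 🅱".encode("utf-8")
```

Keeping the codepoint as an explicit intermediate value means each stage can be checked separately against the Unicode documentation for UTF-16 and UTF-8.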
Created attachment 23986 [details]
gdevpdfe.c patch to support UTF-16 surrogates

Attached is a patch that adds support for UTF-16 surrogates in gs_ConvertUTF16(). I tried to match the style of the rest of the file to the extent practical.

I was not able to locate any unit tests for this code, so I tested it manually with files having /Title values containing various Unicode codepoints.
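In case it helps anyone repeat that kind of manual testing, the following sketch generates minimal.ps-style content for an arbitrary title. It is not the script used for the testing described above; the helper names are hypothetical, and the three-line layout is simply copied from the minimal.ps described in the original report.

```python
def ps_title_literal(title: str) -> str:
    """Return the title as a PostScript string body of \\ddd octal escapes.

    The bytes are a FE FF byte order mark followed by big-endian UTF-16,
    matching the /Title strings quoted in this bug.
    """
    raw = b"\xfe\xff" + title.encode("utf-16-be")
    return "".join(f"\\{b:03o}" for b in raw)


def minimal_ps(title: str) -> str:
    """Build a three-line test file shaped like the attached minimal.ps."""
    return (
        "showpage\n"
        "[ /Title (" + ps_title_literal(title) + ")\n"
        "/DOCINFO pdfmark\n"
    )


# Reproduces the /Title strings quoted earlier in this bug.
assert ps_title_literal("文字化け") == r"\376\377\145\207\133\127\123\026\060\121"
assert ps_title_literal("𝄞") == r"\376\377\330\064\335\036"

print(minimal_ps("RUSH 🅱"), end="")
```

Each generated file can then be run through ps2pdf14 and inspected with pdfinfo -meta, as in the original report.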
(In reply to John Millikin from comment #4)
> Looking at the code, I was also surprised that Ghostscript implements its
> own UTF-16 decoder. Depending on the motivation for this, you might instead
> consider:
>
> 1. Using the existing conversion routines in a library such as libicu
> (permissively licensed, from unicode.org) or libiconv (which Ghostscript
> already has an optional dependency on).

Not on Windows it doesn't, for example. We need to make sure we have the code available on *all* platforms, or that it isn't vital on a platform and can be elided.

> 2. If you prefer to avoid a dependency, then inlining a
> permissively-licensed
> implementation such as https://dev.w3.org/XML/encoding.c would also be an
> improvement over current state.

Still too much code, almost all of which we don't have any use for. This is, after all, a simple task. Or it would be if ISO specs weren't such a nightmare to read.....

I've replaced the code I ripped out back in 2016 while we were getting Coverity issues down to 0, and finished it off here:

91943811904f562b101b0ac410da60974b4186f2

That works with all the examples I've tried (the two you included, the example UTF16 string from the Unicode spec and a few others I concocted). I can't say it's exactly thoroughly tested, but it seems to work.