Bug 706551

Summary: ps2pdf corrupts Unicode title in PDF 1.4 XML metadata
Product: Ghostscript
Reporter: John Millikin <john>
Component: PDF Writer
Assignee: Default assignee <ghostpdl-bugs>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P4    
Version: 10.01.1   
Hardware: PC   
OS: Linux   
Customer:
Word Size: ---
Attachments: A minimal .ps file with a Unicode /Title
Output of ps2pdf13 v10.01.1
Output of ps2pdf14 v10.01.1
gdevpdfe.c patch to support UTF-16 surrogates

Description John Millikin 2023-04-06 12:48:17 UTC
Created attachment 23977
A minimal .ps file with a Unicode /Title

I am trying to use Lilypond to typeset sheet music with a non-ASCII title. The resulting PDF has an incorrect title in Evince, and after some experimentation I think the bug is located within Ghostscript.

A .ly file containing this metadata:

  \header {
    title = "文字化け"
  }

will generate a PostScript file with a /Title containing the equivalent raw UTF-16 bytes.

  $ python3
  >>> doc_ps = open("doc.ps", "rb").read()
  >>> title_idx = doc_ps.find(b"/Title")
  >>> doc_ps[title_idx:title_idx+20]
  b'/Title (\xfe\xffe\x87[WS\x160Q)\n'
  >>> b"\xfe\xffe\x87[WS\x160Q".decode("utf16")
  '文字化け'

Using the minimal PS at <https://bugs.ghostscript.com/show_bug.cgi?id=693477> as a template, I created the attached minimal.ps file containing the following three lines. I believe the /Title line is equivalent to the Lilypond output:

  showpage [
  /Title (\376\377\145\207\133\127\123\026\060\121)
  /DOCINFO pdfmark
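
For reference, an escaped string of this form can be produced from the title text with a few lines of Python. This is a minimal sketch, assuming the title is stored as UTF-16BE behind a byte order mark, as in the Lilypond dump above; `pdfmark_title` is a hypothetical helper name, not something provided by Ghostscript or Lilypond:

```python
def pdfmark_title(title):
    """Return a PostScript string literal holding `title` as BOM-prefixed UTF-16BE.

    Hypothetical helper for illustration only.
    """
    raw = b"\xfe\xff" + title.encode("utf-16-be")
    # Escape every byte as three-digit octal so the .ps file stays pure ASCII.
    return "(" + "".join(f"\\{byte:03o}" for byte in raw) + ")"


print("/Title", pdfmark_title("文字化け"))
```

Escaping every byte, including printable ones, avoids having to special-case parentheses and backslashes inside the PostScript string.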

When converted with ps2pdf14, the XML metadata appears to be incorrectly encoded:

  $ gs -version
  GPL Ghostscript 10.01.1 (2023-03-27)
  Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.
  $ ps2pdf14 minimal.ps
  $ pdfinfo minimal.pdf | grep Title
  Title:           文字化け
  $ pdfinfo -meta minimal.pdf | grep dc:title
  <rdf:Description rdf:about="" xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>����</rdf:li></rdf:Alt></dc:title></rdf:Description>

Note how the <dc:title> text content is corrupted. This value is what Evince shows to the user.
Comment 1 John Millikin 2023-04-06 12:49:24 UTC
Created attachment 23978
Output of ps2pdf13 v10.01.1
Comment 2 John Millikin 2023-04-06 12:49:58 UTC
Created attachment 23979
Output of ps2pdf14 v10.01.1
Comment 3 Ken Sharp 2023-04-06 16:17:58 UTC
It isn't exactly 'corrupted'; it's simply using the Unicode replacement glyph.

There's no reason for that really; fixed in commit dd3a13d7a1f5d22df2ceb958d262393965a99a7e by promoting UTF-16 values > 0x800 to 3-byte UTF-8 sequences.
Comment 4 John Millikin 2023-04-07 01:45:17 UTC
Thank you for looking into this. From looking at the gdevpdfe.c file in that diff, I noticed that it seems to reject codepoints >= U+FFFF.

The following test cases (music-related) might be useful to verify the UTF-16 -> UTF-8 conversion covers Unicode outside the BMP:

  % 𝄞
  /Title (\376\377\330\064\335\036)

  % RUSH 🅱
  /Title (\376\377\000\122\000\125\000\123\000\110\000\040\330\074\335\161)
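
These strings can be checked independently of Ghostscript. The following is a rough sketch, assuming the literal consists only of three-digit octal escapes as in the test cases above; `decode_ps_title` is a hypothetical helper name, and the decoding itself is delegated to Python's built-in UTF-16 codec:

```python
import re


def decode_ps_title(literal):
    """Decode a PostScript string literal holding BOM-prefixed UTF-16BE text."""
    # Only three-digit octal escapes are handled; other PostScript escapes
    # and unescaped bytes are out of scope for this sketch.
    raw = bytes(int(octal, 8) for octal in re.findall(r"\\([0-7]{3})", literal))
    if not raw.startswith(b"\xfe\xff"):
        raise ValueError("expected a UTF-16BE byte order mark")
    return raw[2:].decode("utf-16-be")


for literal in (
    r"(\376\377\330\064\335\036)",
    r"(\376\377\000\122\000\125\000\123\000\110\000\040\330\074\335\161)",
):
    text = decode_ps_title(literal)
    # The hex dump is the UTF-8 that a correct conversion should produce.
    print(text, text.encode("utf-8").hex(" "))
```

Both test cases end in a surrogate pair, so a converter that handles only single 16-bit units will not reproduce the four-byte UTF-8 sequences printed here.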

Looking at the code, I was also surprised that Ghostscript implements its own UTF-16 decoder. Depending on the motivation for this, you might instead consider:

 1. Using the existing conversion routines in a library such as libicu
    (permissively licensed, from unicode.org) or libiconv (which Ghostscript
    already has an optional dependency on).

 2. If you prefer to avoid a dependency, then inlining a permissively-licensed
    implementation such as https://dev.w3.org/XML/encoding.c would also be an
    improvement over current state.

 3. If you prefer to roll your own UTF-16 -> UTF-8 conversion, then I recommend
    using plain codepoints (uint32_t, 32-bit unsigned int) as an intermediate
    stage. This will let you reference existing code and documentation for both
    the UTF-16 decoding and UTF-8 encoding steps.
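
The two-stage approach in option 3 could look roughly like the following. It is written in Python for brevity rather than in the C of gdevpdfe.c, all function names are hypothetical, and it is not the code from the commit or from any attached patch. Raising an error on unpaired surrogates is an assumption; a real converter might prefer to emit U+FFFD instead.

```python
def utf16be_to_codepoints(data):
    """Stage 1: decode UTF-16BE bytes (no BOM) into integer code points."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    codepoints = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            codepoints.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            codepoints.append(unit)
            i += 1
    return codepoints


def codepoint_to_utf8(cp):
    """Stage 2: encode one code point as 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def utf16be_to_utf8(data):
    """Chain both stages: UTF-16BE bytes in, UTF-8 bytes out."""
    return b"".join(codepoint_to_utf8(cp) for cp in utf16be_to_codepoints(data))


# Compare against Python's own codecs for the titles used in this report.
for title in ("文字化け", "𝄞", "RUSH 🅱"):
    assert utf16be_to_utf8(title.encode("utf-16-be")) == title.encode("utf-8")
```

Keeping code points as the intermediate form means each stage can be checked separately against the published UTF-16 and UTF-8 definitions.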
Comment 5 John Millikin 2023-04-07 03:55:08 UTC
Created attachment 23986
gdevpdfe.c patch to support UTF-16 surrogates

Attached is a patch that adds support for UTF-16 surrogates in gs_ConvertUTF16(). I tried to match the style of the rest of the file to the extent practical.

I was not able to locate any unit tests for this code, so I tested it manually with files having /Title values containing various Unicode codepoints.
Comment 6 Ken Sharp 2023-04-10 14:56:30 UTC
(In reply to John Millikin from comment #4)

> Looking at the code, I was also surprised that Ghostscript implements its
> own UTF-16 decoder. Depending on the motivation for this, you might instead
> consider:
> 
>  1. Using the existing conversion routines in a library such as libicu
>     (permissively licensed, from unicode.org) or libiconv (which Ghostscript
>     already has an optional dependency on).

Not on Windows it doesn't, for example.

We need to make sure we have the code available on *all* platforms, or that it isn't vital on a platform and can be elided.

 
>  2. If you prefer to avoid a dependency, then inlining a permissively-licensed
>     implementation such as https://dev.w3.org/XML/encoding.c would also be an
>     improvement over current state.

Still too much code, almost all of which we don't have any use for. This is, after all, a simple task. Or it would be if ISO specs weren't such a nightmare to read...

I've replaced the code I ripped out back in 2016 while we were getting Coverity issues down to 0, and finished it off here:

91943811904f562b101b0ac410da60974b4186f2

That works with all the examples I've tried (the two you included, the example UTF16 string from the Unicode spec and a few others I concocted).

I can't say it's exactly thoroughly tested but it seems to work.