Bug 693477 - Encoding of pdf metadata do not comply with pdf standard
Summary: Encoding of pdf metadata do not comply with pdf standard
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.06
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2012-11-29 13:42 UTC by Frédéric Bron
Modified: 2012-12-01 11:17 UTC


Attachments
result of ps2pdf (2.41 KB, application/pdf)
2012-11-29 13:42 UTC, Frédéric Bron
input ps file (76 bytes, application/postscript)
2012-11-29 17:50 UTC, Frédéric Bron

Description Frédéric Bron 2012-11-29 13:42:02 UTC
Created attachment 9110
result of ps2pdf

This is a follow-up to the bugs reported here:
1. lilypond: http://code.google.com/p/lilypond/issues/detail?id=2985
2. evince: https://mail.gnome.org/archives/evince-list/2012-November/msg00018.html
After some discussion, the problem appears to be that ps2pdf, which lilypond uses, encodes the PDF metadata incorrectly.

Here is a very short example to reproduce this:

Let sss.ps be an ASCII file containing:
showpage
[ /Title (Document title)
  /Author (\241 \242)
  /DOCINFO pdfmark

Notice that the author field contains two non-ASCII characters, 0xA1 and 0xA2 (written as the octal escapes \241 and \242).

If you transform this sss.ps file to sss.pdf with the following command (equivalent to ps2pdf):
$ ./gs-906-linux_x86_64 -P- -dSAFER -dCompatibilityLevel=1.4 -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile=sss.pdf -P- -dSAFER -dCompatibilityLevel=1.4 -c .setpdfwrite -f sss.ps

You get an error when the file is opened in evince (version 3.4.0 with poppler/cairo 0.18.4 and libxml 2.7.8 on Fedora 17):
Entity: line 10: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA1 0x20 0xA2 0x3C
fault'>Document title</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>

The Unicode code points of the two characters are U+00A1 and U+00A2, and they are written to the PDF document as the single bytes 0xA1 and 0xA2. Their UTF-8 representations, however, are 0xC2 0xA1 and 0xC2 0xA2, so the leading 0xC2 byte is missing from each character.
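
For reference, a minimal Python sketch of the point above. It uses only the standard library, shows the UTF-8 bytes expected for the two characters, and confirms that the raw byte sequence 0xA1 0x20 0xA2 reported by the parser is not valid UTF-8. The variable names are illustrative only.

```python
# Code points U+00A1 and U+00A2 (the octal escapes \241 and \242 in sss.ps).
for cp in (0xA1, 0xA2):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ').upper()}")

# The raw bytes found in the XML part of the PDF: 0xA1, space, 0xA2.
raw = bytes([0xA1, 0x20, 0xA2])
try:
    raw.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    # 0xA1 is a continuation byte and cannot start a UTF-8 sequence.
    raw_is_valid_utf8 = False
print("raw bytes are valid UTF-8:", raw_is_valid_utf8)
```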
Comment 1 Ken Sharp 2012-11-29 16:24:29 UTC
(In reply to comment #0)
> Created an attachment (id=9110)
> result of ps2pdf

I'd much prefer that you post the file before conversion; I'm quite capable of running Ghostscript to see what the output looks like.
Comment 2 Frédéric Bron 2012-11-29 17:50:08 UTC
Created attachment 9111
input ps file
Comment 3 Ken Sharp 2012-11-30 16:07:32 UTC
Technically the 'correct' approach is to define a PDFDSCEncoding which maps
the non-ASCII values. However, this is non-trivial and counter-intuitive.

I've made changes so that, in the absence of a PDFDSCEncoding, we will assume that
any non-UTF-16BE string is using PDFDocEncoding. We then convert that to UTF-16BE and on to UTF-8.
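
As an illustration only (this is not the code from the commit), the fallback described above can be sketched in Python. The function names are hypothetical, and the lookup is deliberately partial: it covers only the byte ranges where PDFDocEncoding coincides with Latin-1 (0x20-0x7E, and 0xA1-0xFF except the undefined 0xAD) and rejects everything else, so a real implementation would need the full table from the PDF specification.

```python
UTF16BE_BOM = b"\xfe\xff"  # marks a UTF-16BE text string in PDF

def pdfdoc_byte_to_codepoint(b: int) -> int:
    """Partial PDFDocEncoding lookup (assumption: Latin-1-compatible ranges only)."""
    if 0x20 <= b <= 0x7E or (0xA1 <= b <= 0xFF and b != 0xAD):
        return b
    raise ValueError(f"byte 0x{b:02X} needs the full PDFDocEncoding table")

def docinfo_to_utf8(raw: bytes) -> bytes:
    """Convert a DOCINFO string to UTF-8 via UTF-16BE."""
    if raw.startswith(UTF16BE_BOM):
        # Already UTF-16BE: drop the byte order mark.
        utf16be = raw[2:]
    else:
        # Assume PDFDocEncoding and widen each byte to one UTF-16BE code unit.
        utf16be = b"".join(
            pdfdoc_byte_to_codepoint(b).to_bytes(2, "big") for b in raw
        )
    return utf16be.decode("utf-16-be").encode("utf-8")

print(docinfo_to_utf8(b"\xa1 \xa2").hex(" "))
```

With the /Author string from the report, each of the two non-ASCII bytes gains the 0xC2 lead byte that was missing from the original output.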

This should resolve the problem. See commit:

a3d00daf5f9abb1209cb750a95e23bc6951c1c63
Comment 4 Frédéric Bron 2012-11-30 20:20:03 UTC
Thanks for the quick fix.
Comment 5 Frédéric Bron 2012-11-30 21:01:51 UTC
I have built the modified gs, and I no longer get the error in evince. I still have a question: why are 0xA1 and 0xA2 in the .ps file encoded as 0xC2 0xA3 and 0xC2 0xA4 in the XML part of the .pdf, and not as 0xC2 0xA1 and 0xC2 0xA2? For a reason I do not understand, pdfinfo interprets it the same, but can you explain?
Comment 6 Ken Sharp 2012-12-01 09:12:47 UTC
(In reply to comment #5)
> I have built the modified gs. I do not have the error in evince anymore. I
> still have a question: why 0xA1 and 0xA2 in .ps are encoded 0xC2 0xA3 and 0xC2
> 0xA4 in the xml part of the .pdf and not 0xC2 0xA1 and 0xC2 0xA2? For a reason I
> do not understand pdfinfo interprets it the same but can you explain?

Hmm, I'd have to check, but that would suggest that I messed up the lookup table which converts PDFDocEncoding into XML. I'll look at it again.
Comment 7 Ken Sharp 2012-12-01 09:46:06 UTC
Yes, you were quite correct; I'd missed an entry in the lookup table, quite near the beginning. There's a fix here:

3a4439baee68c440da7164daf55de04a4d48609a

I believe that fixes it, but it's unfortunately easy to miss entries when creating these kinds of tables.
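
One possible guard against this kind of slip, sketched in Python as an assumption rather than anything taken from the Ghostscript sources: check that a 256-entry byte-to-code-point table has the right length and is the identity wherever PDFDocEncoding matches Latin-1. The helper `check_table` and both sample tables are hypothetical; `good` is a plain identity table used as a stand-in, not the real PDFDocEncoding table.

```python
def check_table(table):
    """Return a list of problems found in a byte -> code point lookup table."""
    problems = []
    if len(table) != 256:
        problems.append(f"expected 256 entries, found {len(table)}")
    # Ranges where PDFDocEncoding equals Latin-1 (0xAD is undefined, so skip it).
    identity_bytes = list(range(0x20, 0x7F)) + [
        b for b in range(0xA1, 0x100) if b != 0xAD
    ]
    for b in identity_bytes:
        if b < len(table) and table[b] != b:
            problems.append(f"entry 0x{b:02X} maps to 0x{table[b]:04X}")
    return problems

good = list(range(256))             # stand-in table, identity everywhere
broken = good[:0x30] + good[0x31:]  # one entry dropped, later entries shift

print(check_table(good))
print(check_table(broken)[:3])
```

A dropped entry shifts every later entry, so the check reports both the wrong length and the first mismatching bytes.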
Comment 8 Frédéric Bron 2012-12-01 11:17:16 UTC
That works now, thanks.