693477 – Encoding of pdf metadata do not comply with pdf standard

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	9.06
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Ken Sharp

Reported:	2012-11-29 13:42 UTC by Frédéric Bron
Modified:	2012-12-01 11:17 UTC
CC List:	0 users

Attachments
result of ps2pdf (2.41 KB, application/pdf) 2012-11-29 13:42 UTC, Frédéric Bron
input ps file (76 bytes, application/postscript) 2012-11-29 17:50 UTC, Frédéric Bron

Description Frédéric Bron 2012-11-29 13:42:02 UTC

Created attachment 9110
result of ps2pdf

This is a follow-up to bugs reported here:
1. lilypond: http://code.google.com/p/lilypond/issues/detail?id=2985
2. evince: https://mail.gnome.org/archives/evince-list/2012-November/msg00018.html
After some discussion, it seems that the underlying issue is that ps2pdf, which lilypond uses, encodes the PDF metadata incorrectly.

Here is a very short example to reproduce this:

Let sss.ps be an ASCII file containing:
showpage
[ /Title (Document title)
  /Author (\241 \242)
  /DOCINFO pdfmark

Notice that the Author field contains two non-ASCII bytes, 0xA1 and 0xA2 (written above as the octal escapes \241 and \242).

If you transform this sss.ps file to sss.pdf with the following command (equivalent to ps2pdf):
$ ./gs-906-linux_x86_64 -P- -dSAFER -dCompatibilityLevel=1.4 -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile=sss.pdf -P- -dSAFER -dCompatibilityLevel=1.4 -c .setpdfwrite -f sss.ps

You get an error when the file is opened in evince (version 3.4.0 with poppler/cairo 0.18.4 and libxml 2.7.8 on Fedora 17):
Entity: line 10: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA1 0x20 0xA2 0x3C
fault'>Document title</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>

The Unicode code points of the two characters are U+00A1 and U+00A2, and they are written into the XML part of the PDF document as the raw bytes 0xA1 and 0xA2. Their UTF-8 representations are 0xC2 0xA1 and 0xC2 0xA2, so the leading 0xC2 byte is missing from each character.
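
The missing byte can be checked with a short sketch. Python is used here purely for illustration and is not part of Ghostscript; the only assumption is the one stated above, that the two characters are U+00A1 and U+00A2:

```python
# Code points of the two Author characters from sss.ps (\241 and \242 in octal).
code_points = [0o241, 0o242]  # 0xA1, 0xA2

for cp in code_points:
    utf8 = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> " + " ".join(f"0x{b:02X}" for b in utf8))

# The raw bytes reported by the parser (0xA1 0x20 0xA2) are not valid UTF-8.
raw = bytes([0xA1, 0x20, 0xA2])
try:
    raw.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False
print("raw bytes are valid UTF-8:", raw_is_valid_utf8)
```

A byte in the range 0x80 to 0xBF can only appear in UTF-8 as a continuation byte, which is why libxml rejects the string at its first byte.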

Comment 1 Ken Sharp 2012-11-29 16:24:29 UTC

(In reply to comment #0)
> Created an attachment (id=9110)
> result of ps2pdf

I'd much prefer that you post the file before conversion, I'm quite capable of running Ghostscript to see what the output looks like.

Comment 2 Frédéric Bron 2012-11-29 17:50:08 UTC

Created attachment 9111
input ps file

Comment 3 Ken Sharp 2012-11-30 16:07:32 UTC

Technically the 'correct' approach is to define a PDFDSCEncoding which maps
the non-ASCII values. However, this is non-trivial, and counter-intuitive.

I've made changes so that in the absence of a PDFDSCEncoding we will assume that
any non-UTF-16BE string is using PDFDocEncoding. We then convert that to UTF-16BE and on to UTF-8.

This should resolve the problem. See commit:

a3d00daf5f9abb1209cb750a95e23bc6951c1c63
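
The fallback described in this comment can be sketched roughly as follows. This is an illustration in Python, not the code from the commit. The function names are hypothetical, UTF-16BE strings are assumed to be recognised by a leading 0xFE 0xFF byte order mark, and the PDFDocEncoding lookup is deliberately partial: it covers only printable ASCII and the range 0xA1 to 0xFF (excluding the undefined 0xAD), where PDFDocEncoding coincides with the Unicode code point.

```python
UTF16BE_BOM = b"\xfe\xff"  # assumed marker for UTF-16BE text strings


def pdfdoc_byte_to_unicode(b: int) -> str:
    """Partial, illustrative PDFDocEncoding lookup (hypothetical helper)."""
    if 0x20 <= b <= 0x7E:                # printable ASCII
        return chr(b)
    if 0xA1 <= b <= 0xFF and b != 0xAD:  # same as the Unicode code point
        return chr(b)
    raise ValueError(f"byte 0x{b:02X} is not covered by this sketch")


def info_string_to_utf8(raw: bytes) -> bytes:
    """Convert a DOCINFO string to UTF-8 for the XML metadata."""
    if raw.startswith(UTF16BE_BOM):
        # Already UTF-16BE: drop the byte order mark and keep the rest.
        utf16 = raw[len(UTF16BE_BOM):]
    else:
        # Otherwise assume PDFDocEncoding and convert it to UTF-16BE first.
        text = "".join(pdfdoc_byte_to_unicode(b) for b in raw)
        utf16 = text.encode("utf-16-be")
    # Final step: UTF-16BE on to UTF-8.
    return utf16.decode("utf-16-be").encode("utf-8")


print(info_string_to_utf8(b"\xa1 \xa2").hex(" "))
```

With the Author string from sss.ps as input, the sketch yields the bytes 0xC2 0xA1 0x20 0xC2 0xA2, which is the UTF-8 the XML parser expects.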

Comment 4 Frédéric Bron 2012-11-30 20:20:03 UTC

Thanks for the quick fix.

Comment 5 Frédéric Bron 2012-11-30 21:01:51 UTC

I have built the modified gs, and I no longer get the error in evince. I still have a question: why are 0xA1 and 0xA2 in the .ps file encoded as 0xC2 0xA3 and 0xC2 0xA4 in the XML part of the .pdf, and not as 0xC2 0xA1 and 0xC2 0xA2? For some reason I do not understand, pdfinfo interprets it the same way, but can you explain?

Comment 6 Ken Sharp 2012-12-01 09:12:47 UTC

(In reply to comment #5)
> I have built the modified gs. I do not have the error in evince anymore. I
> still have a question: why 0xA1 and 0xA2 in .ps are encoded 0xC2 0xA3 and 0xC2
> 0xA4 in the xml part of the .pdf and not 0xC2 0xA1 and 0xC2 0xA2? For a reason I
> do not understand pdfinfo interprets it the same but can you explain?

Hmm, I'd have to check, but that would suggest that I messed up the lookup table which converts PDFDocEncoding into XML. I'll look at it again.

Comment 7 Ken Sharp 2012-12-01 09:46:06 UTC

Yes, you were quite correct, I'd missed an entry in the lookup table quite near the beginning. There's a fix here:

3a4439baee68c440da7164daf55de04a4d48609a

I believe that fixes it, but it's unfortunately easy to miss entries when creating these kinds of tables.
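
One way to catch a dropped entry is a small self-check on the table, sketched below in Python. It assumes a flat table indexed by byte value. The layout of the actual Ghostscript table is not shown in this report, and both tables here are hypothetical stand-ins rather than a reproduction of the real bug. The check relies on the fact that PDFDocEncoding bytes 0xA1 to 0xFF (except the undefined 0xAD) map to the Unicode code point of the same value.

```python
def check_table(table):
    """Return a list of problems found in a byte -> code point table."""
    problems = []
    if len(table) != 256:
        problems.append(f"expected 256 entries, found {len(table)}")
    # Spot-check the range where the expected values are known.
    for b in range(0xA1, min(len(table), 0x100)):
        if b != 0xAD and table[b] != b:
            problems.append(f"0x{b:02X} maps to U+{table[b]:04X}")
            break  # the first mismatch is enough to flag the table
    return problems


# Hypothetical stand-in tables: an identity table, and a copy with one
# entry dropped near the beginning, which shifts every later entry.
good = list(range(256))
broken = good[:0x18] + good[0x19:]

print(check_table(good))
print(check_table(broken))
```

A single omission early in the table leaves every earlier lookup correct and shifts all the later ones, so a spot check in a high range such as 0xA1 is enough to notice it.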

Comment 8 Frédéric Bron 2012-12-01 11:17:16 UTC

That works now, thanks.