693614 – pdfmark: accented character "c" in titles of generated PDF bookmarks is not displayed properly

Status:	NOTIFIED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	Text
Version:	9.06
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Chris Liddell (chrisl)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-02-04 10:56 UTC by tomas.marik
Modified:	2013-02-07 08:23 UTC
CC List:	0 users

See Also:
Customer:
Word Size:	---

Attachments
pdfmark bookmark definition (36 bytes, text/plain) 2013-02-04 10:56 UTC, tomas.marik

Description tomas.marik 2013-02-04 10:56:56 UTC

Created attachment 9261
pdfmark bookmark definition

When writing a new bookmark with an accented "č" (Latin small letter c with caron) in the bookmark title, the resulting PDF bookmark shows a different character, "Ċ" (an uppercase letter with a different accent, not a caron). I checked the resulting PDF in a recent version of Adobe Reader.

According to this:
http://www.fileformat.info/info/unicode/char/10d/index.htm
the Unicode hex value (UTF-16) for the desired character is 010D. This is correctly written to pdfmark.txt (see attachment). According to the pdfmark docs, the title text is also prefixed with the Unicode value FEFF in hex (to inform pdfmark that the title is in Unicode).
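
As a quick check of the byte values described above, the following Python sketch encodes the character as UTF-16 big-endian and prepends the FEFF marker. The function name `utf16_title_bytes` is made up for this illustration; it is not part of Ghostscript or pdfmark.

```python
import codecs


def utf16_title_bytes(title: str) -> bytes:
    """Return the title as UTF-16BE bytes, prefixed with the FEFF byte order mark."""
    return codecs.BOM_UTF16_BE + title.encode("utf-16-be")


# "\u010d" is the c-with-caron character from this report.
print(utf16_title_bytes("\u010d").hex().upper())  # FEFF010D
```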

The Linux command line used to write the bookmark is:
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf final.pdf pdfmark.txt; mv out.pdf final.pdf

Comment 1 Ken Sharp 2013-02-04 11:23:55 UTC

It appears the UTF-16 string is being inappropriately 'escaped'. The 0x0d byte is being turned into '\n'. Shouldn't be hard to figure out.

Comment 2 tomas.marik 2013-02-04 11:38:15 UTC

I don't see any NL character around the title text. The sequence of bytes around the title text is: 28 (left bracket) FE FF (for titles in UTF-16) 01 0D (c caron character) 29 (right bracket). The problem also occurs in more than one bookmark, and it doesn't matter whether the problematic character is at the beginning, somewhere in the middle, or at the end of the bookmark title. Other accented characters used in my language are displayed correctly (lowercase and uppercase).

Comment 3 Ken Sharp 2013-02-04 11:51:56 UTC

(In reply to comment #2)
> I don't see any NL character around the title text, sequence of bytes around
> the title text is: 28 (left bracket) FE FF (for titles in UTF-16) 01 0D (c
> caron character) 29 (right bracket).
I mean in the output PDF file. If you look in the PDF file you will see that the 0x0d has been translated into '\n' which obviously makes little sense as a UTF-16 string, especially since it makes the number of bytes wrong.

> The problem occurs also in more than
> one bookmark, doesn't matter if the problematic character is at the
> beginning, somewhere in the middle or at the end of bookmark title.

Well, no, it won't, all that matters is that the second byte is being 'escaped' into two incorrect bytes.

> Also
> other accented characters used in my language are displayed correctly
> (lowercase and uppercase).

As long as they aren't using one of the bytes which are normally escaped (0x09->\t 0x0a->\n 0x0d->\r etc) then there won't be a problem. This is clearly a bug, just leave it with me.

Comment 4 tomas.marik 2013-02-04 12:03:42 UTC

Oh, I see at last. Anyway thanks for quick response.

Comment 5 Ken Sharp 2013-02-04 18:34:13 UTC

This appears to me to actually be a bug in Acrobat which fails to correctly process the escaped string.

Sadly that never carries any weight, so commit 25291a2f9b01504fbbe70153c07920b016b9f010 alters our emission to use octal escapes instead of specific characters, which seems to make Acrobat happy.

Comment 6 tomas.marik 2013-02-05 16:30:31 UTC

Hello Ken,
I've just tried your fixed version. The string is converted to octal representation, but 01 0D (c caron character) becomes 01 0A (this is our wrong uppercase accented C character - http://www.fileformat.info/info/unicode/char/10a/index.htm). In the resulting PDF I see \376\377\001\012 in the title text, which is FE FF 01 0A in hex.

So the problem still persists somewhere else.
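
To inspect what such an escaped title actually contains, a small Python sketch can decode it back to bytes. It assumes every byte is written as a three-digit octal escape, as in the string quoted above, and the helper name `decode_octal_title` is hypothetical.

```python
import re


def decode_octal_title(escaped: str) -> bytes:
    """Decode a string made up solely of three-digit octal escapes (\\ooo)."""
    return bytes(int(digits, 8) for digits in re.findall(r"\\([0-3][0-7]{2})", escaped))


raw = decode_octal_title(r"\376\377\001\012")
print(raw.hex().upper())  # FEFF010A

# Skip the 2-byte FEFF marker and report the code point that follows.
char = raw[2:].decode("utf-16-be")
print(f"U+{ord(char):04X}")  # U+010A, not the intended U+010D
```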

Comment 7 Ken Sharp 2013-02-05 17:22:32 UTC

(In reply to comment #6)
> Hello Ken,
> I've just tried your fixed version, the string is converted to octal
> representation, but 01 0D (c caron character) becomes 01 0A (this is our
> wrong uppercase C accented character -
> http://www.fileformat.info/info/unicode/char/10a/index.htm). In resulting
> PDF I see in title text \376\377\001\012, which is FE FF 01 0A in hex.
> 
> So the problem still persists somewhere else.

You need to create your PostScript string properly. If you want to use 0x0D then you either need to use a Hex string, or properly escape the binary.

(\376\377\001\015)

<FEFF010D>

Either will work.
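
For titles other than this one, both string forms can be generated mechanically. The Python sketch below is one way to do it; `ps_hex_string` and `ps_octal_string` are hypothetical helper names, and the page number in the sample pdfmark line is an assumption, since the attached definition is not reproduced in this report.

```python
def _utf16_title(title: str) -> bytes:
    """UTF-16BE bytes of the title, prefixed with the FEFF marker."""
    return b"\xfe\xff" + title.encode("utf-16-be")


def ps_hex_string(title: str) -> str:
    """PostScript hex string form, e.g. <FEFF010D>."""
    return "<" + _utf16_title(title).hex().upper() + ">"


def ps_octal_string(title: str) -> str:
    """PostScript literal string with every byte written as a \\ooo escape."""
    return "(" + "".join(f"\\{byte:03o}" for byte in _utf16_title(title)) + ")"


title = "\u010d"  # c with caron
print(ps_octal_string(title))  # (\376\377\001\015)
print(ps_hex_string(title))    # <FEFF010D>

# Hypothetical bookmark entry; "/Page 1" is an assumed destination.
print(f"[ /Title {ps_hex_string(title)} /Page 1 /OUT pdfmark")
```

Because neither form contains a raw control byte, the file can be edited or transferred as plain text without the title changing.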

Comment 8 tomas.marik 2013-02-06 07:24:20 UTC

(In reply to comment #7)
> You need to create your PostScript string properly. If you want to use 0x0D
> then you either need to use a Hex string, or properly escape the binary.
> 
> (\376\377\001\015)
> 
> <FEFF010D>
> 
> Either will work.

So it will not work with the attached pdfmark definition file? Could you please attach a working version of the definition?

Comment 9 Ken Sharp 2013-02-06 08:22:44 UTC

(In reply to comment #8)

> > (\376\377\001\015)
> > 
> > <FEFF010D>
> > 
> > Either will work.
> 
> So it will not work with attached pdfmark definition file? 

No, apparently because you haven't appropriately escaped the binary.

> Could you please
> attach working version of the definition?

See either of the two lines above.
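
The reason the raw byte does not survive is probably the PostScript rule that an unescaped end-of-line inside a literal (...) string, whether CR, LF, or CR LF, is read as a single LF. The Python sketch below is a simplified model of that rule only (it ignores backslash handling) and is not Ghostscript's scanner; the function name is hypothetical.

```python
def normalize_eol_in_literal_string(raw: bytes) -> bytes:
    """Model of end-of-line handling in a PostScript literal string:
    CR LF pairs and lone CR bytes are both read as a single LF."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


raw_title = bytes.fromhex("FEFF010D")  # bytes as stored in the pdfmark file
seen_by_interpreter = normalize_eol_in_literal_string(raw_title)
print(seen_by_interpreter.hex().upper())  # FEFF010A
```

Under this model the 0D in 01 0D is read as a line ending and becomes 0A, which matches the FE FF 01 0A seen in the output PDF.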

Comment 10 tomas.marik 2013-02-07 07:37:35 UTC

I tried a string encoded in <FEFF010D> format, which also works with the older (original) version of GS. This is sufficient for me, so thanks a lot.
Just to note, in this format the bookmark title can be only 126 characters long.
So obviously the pdfmark definition I attached at the beginning is either encoded the wrong way or, if it is fine, the problem still persists.

Comment 11 Ken Sharp 2013-02-07 08:00:19 UTC

(In reply to comment #10)
> I tried string encoded in <FEFF010D> format, which works also with older
> (original) version of GS. 

Interesting, it certainly didn't for me.

> This is sufficient to me so thanks a lot.
> Just to note, in this format the bookmark title could be only 126 characters
> long.

Why do you say this? PostScript strings are not Pascal strings; they are not limited to 256 bytes, but to 64 KB.

Comment 12 tomas.marik 2013-02-07 08:06:05 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > I tried string encoded in <FEFF010D> format, which works also with older
> > (original) version of GS. 
> 
> Interesting, it certainly didn't for me.
Weird... I was comparing 9.06 to the updated 9.08 (prerelease).
> 
> > This is sufficient to me so thanks a lot.
> > Just to note, in this format the bookmark title could be only 126 characters
> > long.
> 
> Why do you say this ? PostScript strings are not Pascal strings, they are
> not limited to 256 bytes, but to 64kb.

This is according to the documentation (but there is also a note that a maximum of 32 characters is advised anyway):
"...Title has a maximum length of 255 PDFDocEncoding characters or
126 Unicode values, although a practical limit of 32 characters is
advised so that it can be read easily in the Acrobat viewer..."
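
The two figures in that passage are consistent with a 255-byte budget for the title string: a Unicode title spends 2 bytes on the FEFF marker and 2 bytes per UTF-16 value, leaving room for 126 values. That 255-byte reading is an assumption, not something the quoted text states. A short Python check under that assumption, with hypothetical names throughout:

```python
MAX_TITLE_BYTES = 255  # assumed byte budget behind the quoted limits
BOM_BYTES = 2          # the FEFF marker
BYTES_PER_VALUE = 2    # one UTF-16 value


def max_unicode_values(budget: int = MAX_TITLE_BYTES) -> int:
    """Number of UTF-16 values that fit after the FEFF marker."""
    return (budget - BOM_BYTES) // BYTES_PER_VALUE


def fits_documented_limit(title: str) -> bool:
    """True if the UTF-16BE title, including the marker, fits the assumed budget."""
    encoded = b"\xfe\xff" + title.encode("utf-16-be")
    return len(encoded) <= MAX_TITLE_BYTES


print(max_unicode_values())  # 126
```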

Comment 13 Ken Sharp 2013-02-07 08:21:02 UTC

(In reply to comment #12)

> > > Just to note, in this format the bookmark title could be only 126 characters
> > > long.
> > 
> > Why do you say this ? PostScript strings are not Pascal strings, they are
> > not limited to 256 bytes, but to 64kb.
> 
> This is according to documentation (but there is also note that maximum of
> 32 characters is advised anyway):
> "...Title has a maximum length of 255 PDFDocEncoding characters or
> 126 Unicode values, although a practical limit of 32 characters is
> advised so that it can be read easily in the Acrobat viewer..."

OK, but that has nothing to do with using a hex string to contain the data, which is what I thought you were referring to.

Comment 14 tomas.marik 2013-02-07 08:23:42 UTC

(In reply to comment #13)
> 
> OK but that's nothing to do with using Hex string to contain the data, which
> is what I thought you were referring to.

Anyway, now it works for me and I'm grateful for your help. Thanks a lot.