693945 – Incorrect Unicode Map Generated by gxps/pdfwrite

Status:	RESOLVED DUPLICATE of bug 692395

Alias:	None

Product:	GhostXPS
Classification:	Unclassified
Component:	General
Version:	9.07
Hardware:	PC Windows 7

Importance:	P4 normal
Assignee:	Tor Andersson

Reported:	2013-05-02 12:32 UTC by Phil McSharry
Modified:	2015-11-13 01:12 UTC
CC List:	2 users

Attachments
xps file from printing text_graphics_image.pdf to an MXDW printer. (44.45 KB, application/vnd.ms-xpsdocument) 2013-05-02 12:32 UTC, Phil McSharry
possible patch (707 bytes, text/plain) 2013-05-03 15:59 UTC, Ken Sharp

Description Phil McSharry 2013-05-02 12:32:07 UTC

Created attachment 9597
xps file from printing text_graphics_image.pdf to an MXDW printer.

Some users reported that PDF files produced by gxps yield incorrect text strings in the clipboard when text is copied from Acrobat.
Similar issues have been raised on the gs forums over the last few years and marked resolved or invalid. It may well be a gs issue, but I am reporting it here to put the ball in play.
It was verified using gxps 9.07 compiled with VS2008 on Win7, as follows:
Starting with the PDF example file distributed with gs in the tools folder (text_graphics_image.pdf), an XPS file was created with the MXDW printer on Win7: text_graphics_image.xps (attached).
Converting this with gxps created a PDF file which had the problem.
Dumping the Unicode maps from both files showed the discrepancy.

Original:
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
19 beginbfchar
<20> <0020>
<41> <0041>
<42> <0042>
<43> <0043>
<45> <0045>
<47> <0047>
<49> <0049>
<4C> <004C>
<4E> <004E>
<52> <0052>
<61> <0061>
<63> <0063>
<65> <0065>
<6B> <006B>
<6C> <006C>
<72> <0072>
<74> <0074>
<75> <0075>
<79> <0079>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

From gxps:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R16 def
1 begincodespacerange
<00><ff>
endcodespacerange
19 beginbfrange
<20><20><0020>
<41><41><0042>
<42><42><0020>
<43><43><0042>
<45><45><0042>
<47><47><0020>
<49><49><0042>
<4c><4c><0042>
<4e><4e><0020>
<52><52><0020>
<61><61><0020>
<63><63><0020>
<65><65><0020>
<6b><6b><0020>
<6c><6c><0020>
<72><72><0020>
<74><74><0020>
<75><75><0020>
<79><79><0020>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

A recompile was done without /DWINDOWS_NO_UNICODE but as expected that had no effect on the result.

The following patch was applied at line 168+ in gdevpsfm.c to verify that this was the root cause of the copy bug:
<snip>
  case CODE_VALUE_CHARS:
  stream_putc(s, '<');
  value = *lenum.entry.key[0]<<8; pput_hex(s, &value, value_size);  // PJM test for corrupt Unicode map, reconstruct the unicode from the char key
  //pput_hex(s, lenum.entry.value.data, value_size);
  stream_putc(s, '>');
</snip>

This did produce an output file which copied correctly from Acrobat.
The map entry.value.data is wrong, perhaps from the input parsing.

Comment 1 Ken Sharp 2013-05-03 14:50:54 UTC

This isn't strictly a pdfwrite bug, and there isn't a way to deal with it in pdfwrite alone.

In order to generate a ToUnicode CMap (not the same thing as a regular CMap, which is what is quoted in comment #0 as being in the original PDF) we need the Unicode code point relevant to a given glyph or CID. This is normally provided by a callback to the interpreter; the callback is stored in the font structure built by the interpreter.

The callback in question is the 'decode_glyph' proc. In the case of this font it ends up in xps_true_callback_decode_glyph, which *should* return the Unicode value. However, it does not do so, and simply returns 'xps_last_char'. I'm not sure what that is, but it's not the Unicode code point for the glyph.

The routine has this comment:

    /* We should do a reverse cmap lookup here to match PS/PDF.
     * However, a complete rearchitecture of our text and font processing
     * would be necessary to match XPS unicode mapping with the
     * cluster maps. Alas, we cheat similarly to PCL. */

While it's understandable that PCL is unable to return this information, since it isn't present in a PCL file, it seems that it should be possible to return it from an XPS file if it has a UnicodeString attribute.

So I'm tossing this one back to Tor, as it's really an XPS interpreter problem.

Comment 2 Ken Sharp 2013-05-03 15:59:50 UTC

Created attachment 9598
possible patch

The problem occurs because we call xps_true_callback_encode_char once for each glyph in a string. We then call xps_true_callback_decode_glyph for each glyph in turn. Because xps_last_char is always the last glyph in the buffer, we get the same values.

While not correct, the simple change attached resolves the problem.

Comment 3 Adrian Buciuman 2013-06-13 17:22:03 UTC

Is this related to
http://bugs.ghostscript.com/show_bug.cgi?id=692395
or
http://bugs.ghostscript.com/show_bug.cgi?id=693031
?

I've also noticed that running the latest gxps on Windows with txtwrite crashes gxps.

Comment 5 Ken Sharp 2015-11-13 01:12:35 UTC


*** This bug has been marked as a duplicate of bug 692395 ***