690661 – CID-Cmap font characters

Status:	NOTIFIED WORKSFORME

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	General
Version:	8.63
Hardware:	PC Windows XP

Importance:	P1 normal
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-07-28 05:39 UTC by Tony Teveris
Modified:	2011-09-18 21:47 UTC
CC List:	0 users

See Also:
Customer:	400
Word Size:	---

Attachments
PDF (193.20 KB, application/pdf) 2009-07-28 05:39 UTC, Tony Teveris

Description Tony Teveris 2009-07-28 05:39:09 UTC

In my continuing process to figure out how to process text back into real text 
I’ve found something I need some help on.

If I put a breakpoint in gxchar.c in the following location (around line 1210)
	}
	SET_CURRENT_CHAR(penum, chr);
	if (glyph == gs_no_glyph) {
	    glyph = (*penum->encode_char)(pfont, chr, GLYPH_SPACE_NAME);
	}
        SET_CURRENT_GLYPH(penum, glyph);
	cc = 0;
and I look at the variable “chr” I can for the most part see the “real” 
character. In the attached file when processing the word “Education” I see the 
following sequence:
E
d
u
then a seq of 6 bytes 0x1,0x10,0x1,0x2,0x1,0x9f
o
n
At the breakpoint the 6-byte sequence appears as 0x110, 0x102 and 0x19f.
From what I can glean, the process is mapping the characters from a special 
font type “ft_CID_TrueType”.
So is there a way I can find the “real” character code at that point in time? 
I’m really close to converting the driver-level text back to Gerber high-level 
text and any help is appreciated.

Comment 1 Tony Teveris 2009-07-28 05:39:56 UTC

Created attachment 5247
PDF

Comment 2 Ken Sharp 2009-07-28 06:40:20 UTC

This is quite complicated. The font is a CIDFont with TrueType outlines, though
you don't have to worry about the outline type, just the fact that it's a CIDFont.

So firstly you need to use a different font method, returning glyphs instead of
character codes:

	    code = font->procs.next_char_glyph(&scan, &chr, &glyph);

scan is a pointer to the text enumerator, chr is a gs_char and glyph is a gs_glyph.

CIDFonts use a different kind of encoding to regular type 1 fonts, and as you've
realised this may mean using more than a single byte for the CID. In your case
some of the glyphs have 2-byte CIDs.

Now the 'real' character code really is the 2-byte number, 0x110, 0x102 and
0x19f for the CIDs in your case. As you'll immediately realise these cannot
correspond to ASCII character values. 
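
To make the 2-byte grouping concrete, here is a minimal standalone sketch in C. It assumes every code in the string is exactly two bytes, high byte first, which fits the six bytes quoted in the description but is not guaranteed for CIDFonts in general. The function name bytes_to_cids is hypothetical and is not part of Ghostscript.

```c
#include <assert.h>
#include <stddef.h>

/* Combine a byte string into 2-byte codes, high byte first.
 * Returns the number of codes written, or 0 if the input length is odd
 * or the output buffer is too small.
 * Hypothetical helper for illustration only, not a Ghostscript API. */
size_t bytes_to_cids(const unsigned char *bytes, size_t nbytes,
                     unsigned int *cids, size_t max_cids)
{
    size_t i;
    size_t n = nbytes / 2;

    assert(bytes != NULL && cids != NULL);
    if (nbytes % 2 != 0 || n > max_cids)
        return 0;
    for (i = 0; i < n; i++)
        cids[i] = ((unsigned int)bytes[2 * i] << 8) | bytes[2 * i + 1];
    return n;
}
```

Fed the bytes 0x1,0x10,0x1,0x2,0x1,0x9f, this yields the three codes 0x110, 0x102 and 0x19f seen at the breakpoint.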

In fact even type 1 or type 3 fonts need not have an Encoding which matches
ASCII, so simply retrieving the character code is not really sufficient. This is
one of the reasons why editing PDF files is erratic at best.

In your case, the font does include Unicode information, in the form of a
ToUnicode CMap. You can use this to return the Unicode code points which each
CID refers to. Here is the decoded CMap:

12 0 obj 
<<
/Length 565
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
16 beginbfchar
<0003> <0020>
<0102> <0061>
<002F> <0049>
<0110> <0063>
<011E> <0065>
<012E> <FB01>
<005A> <0052>
<0147> <FB02>
<015D> <0069>
<016F> <006C>
<0176> <006E>
<0190> <0073>
<0355> <002C>
<019F> <00740069>
<01A9> <00740074>
<01AB> <007400740069>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

endstream 
endobj 

You can see that 0x110 equates to Unicode code point 0x63, 0x102 to 0x61 and
0x19f to the two-code-point sequence 0x0074 0x0069. If you check these code
points, e.g.:

http://www.fileformat.info/info/unicode/char/0063/index.htm
http://www.fileformat.info/info/unicode/char/0061/index.htm

You will see that these correspond to lower case 'c' and lower case 'a'.

It looks like the final glyph is a 'ti' ligature, and therefore has two Unicode
code points, 0074 and 0069, which map to 't' and 'i' respectively. (Note that
there also appear to be 'tt' and 'tti' ligatures defined in this ToUnicode CMap.)
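
As a rough illustration of how the bfchar section can be consulted, the sketch below hard-codes the sixteen entries from the CMap above into a table and looks up a code. This is only a standalone example, not how Ghostscript or pdfwrite stores the data; the type bfchar_entry and the function lookup_unicode are hypothetical names. All destination values in this CMap are in the Basic Multilingual Plane, so each 16-bit unit is a complete code point.

```c
#include <assert.h>
#include <stddef.h>

/* One bfchar entry: a source code and up to three 16-bit Unicode values.
 * Hypothetical type for illustration, not a Ghostscript structure. */
typedef struct {
    unsigned int code;
    unsigned short unicode[3];
    size_t len;
} bfchar_entry;

/* Entries copied from the beginbfchar section of the CMap above. */
static const bfchar_entry to_unicode[] = {
    {0x0003, {0x0020}, 1},
    {0x0102, {0x0061}, 1},
    {0x002F, {0x0049}, 1},
    {0x0110, {0x0063}, 1},
    {0x011E, {0x0065}, 1},
    {0x012E, {0xFB01}, 1},
    {0x005A, {0x0052}, 1},
    {0x0147, {0xFB02}, 1},
    {0x015D, {0x0069}, 1},
    {0x016F, {0x006C}, 1},
    {0x0176, {0x006E}, 1},
    {0x0190, {0x0073}, 1},
    {0x0355, {0x002C}, 1},
    {0x019F, {0x0074, 0x0069}, 2},
    {0x01A9, {0x0074, 0x0074}, 2},
    {0x01AB, {0x0074, 0x0074, 0x0069}, 3}
};

/* Copy the Unicode values for 'code' into 'out'.
 * Returns how many values were written, or 0 if the code has no entry
 * or 'out' is too small to hold them. */
size_t lookup_unicode(unsigned int code, unsigned short *out, size_t max_out)
{
    size_t i, j;

    assert(out != NULL);
    for (i = 0; i < sizeof(to_unicode) / sizeof(to_unicode[0]); i++) {
        if (to_unicode[i].code != code)
            continue;
        if (to_unicode[i].len > max_out)
            return 0;
        for (j = 0; j < to_unicode[i].len; j++)
            out[j] = to_unicode[i].unicode[j];
        return to_unicode[i].len;
    }
    return 0;
}
```

Looking up 0x110, 0x102 and 0x19f in turn gives 'c', 'a' and the pair 't' 'i', which together fill in the middle of "Education".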

If at all possible you should use the ToUnicode CMap to give you the text
definitions. Encodings are not always reliably ASCII encoded, even for Latin
text. I suspect you have simply been lucky not to encounter this so far. If you
don't have ToUnicode information you should check the glyph names to see if they
match an ASCII encoding.
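
A glyph-name check of that kind might look something like the sketch below. It is deliberately partial: it recognises single-letter names and a handful of standard names such as "space" and "comma", and reports everything else as unknown. The function glyph_name_to_ascii and its table are hypothetical and are not taken from Ghostscript; a real implementation would need a much fuller name list.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* A few standard glyph names and their ASCII values.
 * Partial list for illustration only. */
static const struct {
    const char *name;
    int ascii;
} named_glyphs[] = {
    {"space", 0x20}, {"comma", 0x2C}, {"hyphen", 0x2D}, {"period", 0x2E},
    {"zero", '0'}, {"one", '1'}, {"two", '2'}, {"three", '3'},
    {"four", '4'}, {"five", '5'}, {"six", '6'}, {"seven", '7'},
    {"eight", '8'}, {"nine", '9'}
};

/* Return the ASCII value for a glyph name, or -1 if it is not recognised.
 * Hypothetical helper, not a Ghostscript API. */
int glyph_name_to_ascii(const char *name)
{
    size_t i;

    assert(name != NULL);
    /* Names such as "A" or "e" stand for the letter itself. */
    if (name[0] != '\0' && name[1] == '\0' && isalpha((unsigned char)name[0]))
        return name[0];
    for (i = 0; i < sizeof(named_glyphs) / sizeof(named_glyphs[0]); i++) {
        if (strcmp(named_glyphs[i].name, name) == 0)
            return named_glyphs[i].ascii;
    }
    return -1;
}
```

A ligature name such as "fi" returns -1 here, which is a reminder that glyph names alone do not always map to a single ASCII character.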

Note that you will not get ToUnicode CMaps for PostScript, only PDF files. There
is a 'similar', undocumented, table which the Adobe PostScript driver (only!)
produces called GlyphNames2Unicode.

Now, your next problem is that Ghostscript doesn't care about ToUnicode CMaps in
general, and so does not process them. In gs/Resource/Init in pdf_font.ps:

/.processToUnicode   % <font-resource> <font-dict> <encoding|null> .processToUnicode -
{
  % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to 
  % generate a ToUnicode CMaps. So don't bother with other devices.
  currentdevice .devicename /pdfwrite eq {

You will have to change the last line to something like:

  true eq {

or add the name of your device so that GS will process the ToUnicode CMap for you.

You can retrieve the Unicode code point via:

unicode = font->procs.decode_glyph((gs_font *)font, glyph);

As usual I recommend you look at pdfwrite, which is currently the only device
which does anything like what you want. In particular I suggest the routines
pdf_text_process, which (for CIDFonts) calls process_cmap_text, which calls
pdf_add_ToUnicode.

If all this seems unreasonably complicated you can simply decide not to handle
this type of font for the present, of course.

Comment 3 Tony Teveris 2009-07-28 06:55:42 UTC

Oh what a tangled web we weave.

Let me work on this for a while. If I have more questions I'll add them here; 
if not I'll close it out.

Again thanks for the quick reply.

Comment 4 Tony Teveris 2009-07-28 08:17:14 UTC

Ok, I tried a number of things. I found pdf_font.ps in the "lib" folder.

Changing the code to 
% Currently pdfwrite is only device which can handle GlyphNames2Unicoide to 
  % generate a ToUnicode CMaps. So don't bother with other devices.
  % currentdevice .devicename /pdfwrite eq {
  true eq {
    PDFDEBUG {
 
seems to cause an error:

---------------------------
PDF/PS generated processing information
---------------------------
;GetFileInfoFromGS

;-P-

;-dNOPAUSE

;-dBATCH

;-dSAFER

;-IC:/Program Files (x86)/Gerber Scientific Products/OMEGA 3.00/Software/gs/fonts;C:/Program Files (x86)/Gerber Scientific Products/OMEGA 3.00/Software/gs/lib;C:/Program Files (x86)/Gerber Scientific Products/OMEGA 3.00/Software/gs/resource

;-sFONTPATH=C:/Windows/Fonts

;-r72x72

;-sGSP=I

;-dNOTRANSPARENCY

;-sDEVICE=gimprgb

;C:\Users\tony.teveris\Documents\Jim Rand Lettexxxr.pdf

Artifex Ghostscript 8.63 (2008-08-01)

Copyright (C) 2008 Artifex Software, Inc.  All rights reserved.

This software comes with NO WARRANTY: see the file PUBLIC for details.

Processing pages 1 through 1.

Page 1

Error: /undefined in --run--

Operand stack:

   --nostringval--   --dict:11/20(L)--   TT0   1   --dict:9/18(L)--   --dict:9/18(L)--   0.001   --dict:9/18(L)--   FontMatrix

Execution stack:

   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1905   1   3   %oparray_pop   1904   1   3   %oparray_pop   1888   1   3   %oparray_pop   --nostringval--   --nostringval--   2   1   1   --nostringval--   %for_pos_int_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--   false   1   %stopped_push   --nostringval--   %loop_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--

Dictionary stack:

   --dict:1158/1684(ro)(G)--   --dict:1/20(G)--   --dict:75/200(L)--   --dict:75/200(L)--   --dict:106/127(ro)(G)--   --dict:275/300(ro)(G)--   --dict:22/25(L)--   --dict:4/6(L)--   --dict:27/40(L)--   --dict:4/7(L)--

Current allocation mode is local

Last OS error: No such file or directory


---------------------------
OK   
---------------------------


If I change it to the following:

  % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to 
  % generate a ToUnicode CMaps. So don't bother with other devices.
  % currentdevice .devicename /pdfwrite eq {
  currentdevice .devicename /gimprgb eq {

it seems to run, but when I get to the gs_font_map_glyph_to_unicode() 
function the UnicodeDecoding is always returned as NULL. I assume that either 
.processToUnicode is not defined or I'm off base.

I would like the "currentdevice .devicename /gimprgb eq {" line to handle both 
my driver and the /pdfwrite driver. How do I code that?

Thanks

Comment 5 Ken Sharp 2009-07-28 08:32:30 UTC

OK, my mistake; I didn't notice that you were using 8.63. That file (and a number
of others) moved in 8.64.

The original file *should* have read:

  currentdevice .devicename /pdfwrite eq {

But yours seems to read:

  % currentdevice .devicename /pdfwrite eq {

That is, it has been commented out (% introduces a comment), so you (or someone)
have already altered the code to make it process ToUnicode CMaps. Just put it
back as it was and it should be OK.

Comment 6 Tony Teveris 2009-07-28 08:36:16 UTC

I commented out the line and added the true eq {

I thought that would force it to always process.

Right now looking for a way to activate the PDFDEBUG to get debug print out to 
make sure it's getting executed

Comment 7 Ken Sharp 2009-07-28 08:43:43 UTC

You can activate PDFDEBUG from the command line, specify '-dPDFDEBUG'.

Just set the line to:

true {

rather than 

true eq {

Or to answer your other question:

currentdevice .devicename dup
/gimprgb eq 
exch
/pdfwrite eq
or {

That gets the device name, copies it, and checks whether it's equal to the name
/gimprgb. It then swaps the boolean result of the test with the copy of the
device name and checks the device name against /pdfwrite.

We now have two booleans on the stack; 'or' combines them into a single boolean.
That then controls the procedure, which terminates with an 'if'.
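
For anyone more at home in C than in PostScript, the same test can be written as the sketch below. It is only a restatement of the logic for comparison; the function name should_process_to_unicode is hypothetical and nothing like it exists in Ghostscript, where this check lives in pdf_font.ps.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the named device should have ToUnicode CMaps processed.
 * Mirrors the PostScript test: name == gimprgb OR name == pdfwrite.
 * Hypothetical helper for illustration only. */
int should_process_to_unicode(const char *device_name)
{
    assert(device_name != NULL);
    return strcmp(device_name, "gimprgb") == 0 ||
           strcmp(device_name, "pdfwrite") == 0;
}
```

The PostScript version needs the dup and exch because both comparisons consume the device name from the stack, whereas C can simply refer to the same variable twice.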

Comment 8 Tony Teveris 2009-07-28 09:59:54 UTC

Ok, with PDFDEBUG in place I can see the .processToUnicode starting and ending 
but if I add the lines

gs_char chr1;
chr1 = pfont->procs.decode_glyph(pfont, glyph);

in gxchar.c around 1210

Again the UnicodeDecoding = zfont_get_to_unicode_map(font->dir); always 
returns NULL.

Besides changing pdf_font.ps and adding the above lines, that's all I've done.

I'm using the glyph returned from the line around 1179 and the same font.

Comment 9 Marcos H. Woehrmann 2011-09-18 21:47:46 UTC

Changing customer bugs that have been resolved more than a year ago to closed.