In my continuing effort to figure out how to turn processed text back into real text, I’ve found something I need some help with. If I put a breakpoint in gxchar.c at the following location (around line 1210):

        }
        SET_CURRENT_CHAR(penum, chr);
        if (glyph == gs_no_glyph) {
            glyph = (*penum->encode_char)(pfont, chr, GLYPH_SPACE_NAME);
        }
        SET_CURRENT_GLYPH(penum, glyph);
        cc = 0;

and look at the variable “chr”, I can for the most part see the “real” character. In the attached file, when processing the word “Education”, I see the following sequence:

    E
    d
    u
    then a sequence of 6 bytes: 0x1, 0x10, 0x1, 0x2, 0x1, 0x9f
    o
    n

At the breakpoint, the 6-byte sequence appears as 0x110, 0x102 and 0x19f. From what I can glean, the process is mapping the characters from a special font type, “ft_CID_TrueType”. So is there a way I can find the “real” character code at that point in time? I’m really close to converting the driver-level text back to Gerber high-level text, and any help is appreciated.
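For reference, the relationship between the six bytes and the three values seen at the breakpoint can be sketched in C. This is only an illustration, assuming each code in the run is exactly two bytes long with the high byte first; pair_bytes is a hypothetical helper, not a Ghostscript function.

```c
#include <assert.h>
#include <stddef.h>

/* Combine consecutive byte pairs, high byte first, into 16-bit codes.
 * Returns the number of codes written to `out`.
 * Assumption: every code in the run is exactly two bytes long, and
 * `out` has room for nbytes / 2 entries. */
static size_t pair_bytes(const unsigned char *bytes, size_t nbytes,
                         unsigned int *out)
{
    size_t i;

    assert(nbytes % 2 == 0);
    for (i = 0; i + 1 < nbytes; i += 2)
        out[i / 2] = ((unsigned int)bytes[i] << 8) | bytes[i + 1];
    return nbytes / 2;
}
```

Applied to the bytes 0x1, 0x10, 0x1, 0x2, 0x1, 0x9f, this gives 0x110, 0x102 and 0x19f, matching what the debugger shows.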
Created attachment 5247: PDF
This is quite complicated. The font is a CIDFont with TrueType outlines, though you don't have to worry about the outline type, just the fact that it's a CIDFont.

So firstly you need to use a different font method, one that returns glyphs instead of character codes:

    code = font->procs.next_char_glyph(&scan, &chr, &glyph);

&scan is a pointer to the text enumerator, chr is a gs_char and glyph is a gs_glyph.

CIDFonts use a different kind of encoding to regular type 1 fonts, and as you've realised this may mean using more than a single byte for the CID. In your case some of the glyphs have 2-byte CIDs.

Now the 'real' character code really is the 2-byte number: 0x110, 0x102 and 0x19f for the CIDs in your case. As you'll immediately realise, these cannot correspond to ASCII character values. In fact even type 1 or type 3 fonts need not have an Encoding which matches ASCII, so simply retrieving the character code is not really sufficient. This is one of the reasons why editing PDF files is erratic at best.

In your case, the font does include Unicode information, in the form of a ToUnicode CMap. You can use this to return the Unicode code points which each CID refers to. Here is the decoded CMap:

    12 0 obj
    <<
    /Length 565
    >>
    stream
    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo
    << /Registry (Adobe)
    /Ordering (UCS)
    /Supplement 0
    >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <0000> <FFFF>
    endcodespacerange
    16 beginbfchar
    <0003> <0020>
    <0102> <0061>
    <002F> <0049>
    <0110> <0063>
    <011E> <0065>
    <012E> <FB01>
    <005A> <0052>
    <0147> <FB02>
    <015D> <0069>
    <016F> <006C>
    <0176> <006E>
    <0190> <0073>
    <0355> <002C>
    <019F> <00740069>
    <01A9> <00740074>
    <01AB> <007400740069>
    endbfchar
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    endstream
    endobj

You can see that 0x110 equates to Unicode code point 0x63, 0x102 to 0x61 and 0x19f to 0x00740069.
If you check these code points, e.g.:

    http://www.fileformat.info/info/unicode/char/0063/index.htm
    http://www.fileformat.info/info/unicode/char/0061/index.htm

you will see that these correspond to lower case 'c' and lower case 'a'. It looks like the final glyph is a 'ti' ligature, and therefore has two Unicode code points, 0074 and 0069, which map to 't' and 'i' respectively. (Note that there also appear to be 'tt' and 'tti' ligatures defined in this ToUnicode CMap.)

If at all possible you should use the ToUnicode CMap to give you the text definitions; Encodings are not always reliably ASCII encoded, even for Latin text. I suspect you have simply been lucky not to encounter this so far. If you don't have ToUnicode information you should check the glyph names to see if they match an ASCII encoding.

Note that you will not get ToUnicode CMaps for PostScript, only PDF files. There is a 'similar', undocumented, table which the Adobe PostScript driver (only!) produces called GlyphNames2Unicode.

Now, your next problem is that Ghostscript doesn't care about ToUnicode CMaps in general, and so does not process them. In gs/Resource/Init in pdf_font.ps:

    /.processToUnicode % <font-resource> <font-dict> <encoding|null> .processToUnicode -
    {
      % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to
      % generate a ToUnicode CMaps. So don't bother with other devices.
      currentdevice .devicename /pdfwrite eq {

You will have to change the last line to something like:

    true eq {

or add the name of your device so that GS will process the ToUnicode CMap for you.

You can retrieve the Unicode code point via:

    unicode = font->procs.decode_glyph((gs_font *)font, glyph);

As usual I recommend you look at pdfwrite, which is currently the only device which does anything like what you want. In particular I suggest the routines pdf_text_process, which (for CIDFonts) calls process_cmap_text, which calls pdf_add_ToUnicode.
If all this seems unreasonably complicated you can simply decide not to handle this type of font for the present, of course.
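As a rough illustration of the ToUnicode lookup described above, here is a minimal standalone C sketch. The table is copied from the bfchar section of the CMap quoted earlier; bfchar_entry, lookup_cid and cids_to_ascii are hypothetical names for this sketch and are not part of Ghostscript. It only handles code points in the ASCII range and substitutes '?' for anything else, so it is a simplification rather than a real Unicode conversion.

```c
#include <assert.h>
#include <stddef.h>

/* One bfchar entry: a CID and the Unicode code points it maps to. */
typedef struct {
    unsigned int cid;
    size_t count;              /* number of code points; ligatures have > 1 */
    unsigned int unicode[3];
} bfchar_entry;

/* The 16 bfchar entries from the ToUnicode CMap quoted in this thread. */
static const bfchar_entry to_unicode[] = {
    { 0x0003, 1, { 0x0020 } },
    { 0x0102, 1, { 0x0061 } },
    { 0x002F, 1, { 0x0049 } },
    { 0x0110, 1, { 0x0063 } },
    { 0x011E, 1, { 0x0065 } },
    { 0x012E, 1, { 0xFB01 } },
    { 0x005A, 1, { 0x0052 } },
    { 0x0147, 1, { 0xFB02 } },
    { 0x015D, 1, { 0x0069 } },
    { 0x016F, 1, { 0x006C } },
    { 0x0176, 1, { 0x006E } },
    { 0x0190, 1, { 0x0073 } },
    { 0x0355, 1, { 0x002C } },
    { 0x019F, 2, { 0x0074, 0x0069 } },
    { 0x01A9, 2, { 0x0074, 0x0074 } },
    { 0x01AB, 3, { 0x0074, 0x0074, 0x0069 } },
};

/* Linear search; returns NULL when the CID has no ToUnicode entry. */
static const bfchar_entry *lookup_cid(unsigned int cid)
{
    size_t i;

    for (i = 0; i < sizeof to_unicode / sizeof to_unicode[0]; i++)
        if (to_unicode[i].cid == cid)
            return &to_unicode[i];
    return NULL;
}

/* Write the text for a run of CIDs into `out` as a NUL-terminated string.
 * Code points outside ASCII, and CIDs with no entry, become '?'.
 * Output is truncated if it does not fit. Returns the string length. */
static size_t cids_to_ascii(const unsigned int *cids, size_t n,
                            char *out, size_t outsize)
{
    size_t i, j, len = 0;

    assert(outsize > 0);
    for (i = 0; i < n; i++) {
        const bfchar_entry *e = lookup_cid(cids[i]);
        size_t count = e ? e->count : 1;

        for (j = 0; j < count && len + 1 < outsize; j++) {
            unsigned int cp = e ? e->unicode[j] : 0xFFFD;
            out[len++] = cp < 0x80 ? (char)cp : '?';
        }
    }
    out[len] = '\0';
    return len;
}
```

With the three CIDs from the report (0x110, 0x102, 0x19f) this yields the four characters "cati", the middle of "Education". Note that one CID can expand to several characters, so output length cannot be assumed to equal the number of glyphs.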
Oh what a tangled web we weave. Let me work on this for a while; if I have more questions I'll post them here, if not I'll close it out. Again, thanks for the quick reply.
OK, I tried a number of things. I found pdf_font.ps in the "lib" folder. Changing the code to

    % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to
    % generate a ToUnicode CMaps. So don't bother with other devices.
    % currentdevice .devicename /pdfwrite eq {
    true eq {
    PDFDEBUG {

seems to cause an error:

    ---------------------------
    PDF/PS generated processing information
    ---------------------------
    ;GetFileInfoFromGS
    ;-P-
    ;-dNOPAUSE
    ;-dBATCH
    ;-dSAFER
    ;-IC:/Program Files (x86)/Gerber Scientific Products/OMEGA 3.00/Software/gs/fonts;C:/Program Files (x86)/Gerber Scientific Products/OMEGA 3.00/Software/gs/lib;C:/Program Files (x86)/Gerber Scientific Products/OMEGA 3.00/Software/gs/resource
    ;-sFONTPATH=C:/Windows/Fonts
    ;-r72x72
    ;-sGSP=I
    ;-dNOTRANSPARENCY
    ;-sDEVICE=gimprgb
    ;C:\Users\tony.teveris\Documents\Jim Rand Lettexxxr.pdf
    Artifex Ghostscript 8.63 (2008-08-01)
    Copyright (C) 2008 Artifex Software, Inc. All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
    Processing pages 1 through 1.
    Page 1
    Error: /undefined in --run--
    Operand stack:
       --nostringval-- --dict:11/20(L)-- TT0 1 --dict:9/18(L)-- --dict:9/18(L)-- 0.001 --dict:9/18(L)-- FontMatrix
    Execution stack:
       %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1905 1 3 %oparray_pop 1904 1 3 %oparray_pop 1888 1 3 %oparray_pop --nostringval-- --nostringval-- 2 1 1 --nostringval-- %for_pos_int_continue --nostringval-- --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval-- false 1 %stopped_push --nostringval-- %loop_continue --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval--
    Dictionary stack:
       --dict:1158/1684(ro)(G)-- --dict:1/20(G)-- --dict:75/200(L)-- --dict:75/200(L)-- --dict:106/127(ro)(G)-- --dict:275/300(ro)(G)-- --dict:22/25(L)-- --dict:4/6(L)-- --dict:27/40(L)-- --dict:4/7(L)--
    Current allocation mode is local
    Last OS error: No such file or directory
    ---------------------------
    OK
    ---------------------------

If I change it to the following:

    % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to
    % generate a ToUnicode CMaps. So don't bother with other devices.
    % currentdevice .devicename /pdfwrite eq {
    currentdevice .devicename /gimprgb eq {

it seems to run, but when I get to the gs_font_map_glyph_to_unicode() function, UnicodeDecoding is always returned as NULL. I assume either .processToUnicode is not defined or I'm off base.

I would like the "currentdevice .devicename /gimprgb eq {" line to handle both my driver and the /pdfwrite driver. How do I code that? Thanks
OK, my mistake, I didn't notice that you were using 8.63. That file (and a number of others) moved in 8.64. The original file *should* have read:

    currentdevice .devicename /pdfwrite eq {

But yours seems to read:

    % currentdevice .devicename /pdfwrite eq {

That is, it has been commented out (% introduces a comment), so you (or someone) have already altered the code to make it process ToUnicode CMaps. Just put it back as it was and it should be OK.
I commented out the line and added the "true eq {" line; I thought that would force it to always process. Right now I'm looking for a way to activate PDFDEBUG to get debug printout, to make sure it's getting executed.
You can activate PDFDEBUG from the command line by specifying '-dPDFDEBUG'.

Just set the line to:

    true {

rather than

    true eq {

Or, to answer your other question:

    currentdevice .devicename dup /gimprgb eq exch /pdfwrite eq or {

That gets the device name, copies it, and checks whether it's equal to the name /gimprgb. It then swaps the boolean result of the test with the copy of the device name and checks the device name against /pdfwrite. We now have two booleans on the stack; OR them to leave a single boolean. That then controls the procedure which terminates with an 'if'.
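The same test, written out in C purely to show the logic. Ghostscript performs this check in PostScript, as above; wants_to_unicode is a hypothetical name used only for this sketch, and the comments tie each step back to the PostScript operators.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 when ToUnicode processing should run for this device name,
 * 0 otherwise. Mirrors: dup /gimprgb eq exch /pdfwrite eq or */
static int wants_to_unicode(const char *devname)
{
    int is_gimprgb, is_pdfwrite;

    assert(devname != NULL);
    is_gimprgb = strcmp(devname, "gimprgb") == 0;    /* dup /gimprgb eq */
    is_pdfwrite = strcmp(devname, "pdfwrite") == 0;  /* exch /pdfwrite eq */
    return is_gimprgb || is_pdfwrite;                /* or */
}
```

Any further device would be one more comparison OR-ed in, which in the PostScript form means another dup before the first test and another "eq or" at the end.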
OK, with PDFDEBUG in place I can see .processToUnicode starting and ending, but if I add the lines

    gs_char chr1;
    chr1 = pfont->procs.decode_glyph(pfont, glyph);

in gxchar.c around line 1210, again

    UnicodeDecoding = zfont_get_to_unicode_map(font->dir);

always returns NULL. Besides changing pdf_font.ps and adding the above lines, that's all I've done. I'm using the glyph returned from the line around 1179 and the same font.
Changing customer bugs that have been resolved more than a year ago to closed.