Originally reported by: igorm@users.sourceforge.net

The PDF interpreter ignores ToUnicode CMaps; with pdfwrite this breaks the searchability of the output. I suggest converting ToUnicode CMaps into FontInfo.GlyphNames2Unicode while reading a font resource from the PDF file. This is pretty simple in PostScript: ParseCMap_Inverse, defined in lib/gs_ciddc.ps, should help. My recent patches added processing of GlyphNames2Unicode to pdfwrite; see SF bug #684120 about them.
Need Ray's approval for this bug because he handles the PDF interpreter.
Customer #562 likely needs this feature; I think so because of a recent mail from the customer. Should we bump its priority?
*** Bug 687532 has been marked as a duplicate of this bug. ***
Hi, I would appreciate an update on the status of this bug. Thanks, Lakshmi
Closing for lack of engineering resources. ToUnicode will likely get addressed in the long run anyway.
Restoring the open status since it may be important for the supported feature list.
Adding to the bug bounty list. Consensus seems to be that preserving the searchability of PDF (which this affects in the PDF->PDF case) is worthwhile. Therefore we leave this open in the tracker and hope someone will fix it for the bounty.
Just to substantiate one of my earlier comments on a related bug (http://bugs.ghostscript.com/show_bug.cgi?id=687492#c2) - pdftotext (part of the xpdf suite) contains some functionality for extracting non-ASCII text. I have used it in the past to extract Big5-encoded "text", although I have not looked inside xpdf to see how it is implemented. (Sorry, I don't know enough about ToUnicode [yet], so please don't assume that I am going to attempt to fix this...)
Patch http://ghostscript.com/pipermail/gs-cvs/2005-August/005649.html
Recently I discovered a common use case that is broken by this bug: use Mozilla to print a web page to a PostScript file, and convert that using "ps2pdf" for archival purposes. In Adobe Acrobat you cannot copy text from the resulting file even though the text appears correct. Acrobat Distiller handles this correctly. IMHO the displayed text should match the text the tools internally see (find, copy & paste).
Please attach the PostScript file.
Created attachment 2113 [details] Sample Mozilla PostScript print file
I confirm that released versions of Ghostscript generate PDF files that convert to text with the wrong encoding. This problem is fixed in the current development version as of rev. 6178. The development version of Ghostscript can be obtained from the Subversion repository:

    svn checkout http://svn.ghostscript.com:8080/ghostscript/trunk/gs/
Comment on attachment 2113 [details] Sample Mozilla PostScript print file
The PDF interpreter now processes ToUnicode CMaps when the target device is pdfwrite, but not when the target device is jpeg. I need it to do this for jpeg as well, but I do not know how to do it.
ToUnicode CMaps are processed by the PDF interpreter using code in the file /gs/Resource/Init/pdf_font.ps; see the function '.processToUnicode'. There is a specific test against the pdfwrite device:

    { % Currently pdfwrite is only device which can handle GlyphNames2Unicode
      % to generate a ToUnicode CMap. So don't bother with other devices.
      currentdevice .devicename /pdfwrite eq {

Despite the comments, I believe this handles ToUnicode CMaps from PDF files as well as GlyphNames2Unicode from PostScript files. If you remove the test, then the code will run normally for all devices. However, pdfwrite is the only high-level device which can use this information; it's not clear to me what you want the JPEG device to do with it.
Thank you! I want to extract text and make JPEGs from PDF. I want to use the PDF interpreter to parse the PDF file and output the information to an XML file. After I removed "currentdevice .devicename /pdfwrite eq {", I called gs_font_map_glyph_to_unicode to get the text, but it failed. How do I get the Unicode text?
Created attachment 5179 [details] How to get unicode text from this pdf
What do you mean by 'failed'? Did you get a PostScript error, or something else? You shouldn't be calling gs_font_map_glyph_to_unicode directly; you should use the font's decode_glyph method. The JPEG device doesn't handle text, so presumably you are using a custom device? It's pretty difficult to comment on the action of code I haven't seen.

Note that pdfwrite doesn't use the Unicode information very much; it simply uses it to construct a ToUnicode CMap for the output PDF file. I would suggest you start by debugging the code: set a breakpoint in pdf_add_ToUnicode with your test file as input and see what happens. You should also look at scn_cmap_text, especially this code:

    if (pdf_is_CID_font(subfont)) {
        if (subfont->procs.decode_glyph((gs_font *)subfont, glyph) != GS_NO_CHAR) {
            /* Since PScript5.dll creates GlyphNames2Unicode with character
               codes instead of CIDs, and with the WinCharSetFFFF-H2 CMap
               character codes appear different than CIDs (Bug 687954),
               pass the character code instead of the CID. */
            code = pdf_add_ToUnicode(pdev, subfont, pdfont,
                                     chr + GS_MIN_CID_GLYPH, chr, NULL);
        } else {
            /* If we interpret a PDF document, the ToUnicode CMap
               may be attached to the Type 0 font. */
            code = pdf_add_ToUnicode(pdev, pte->orig_font, pdfont,
                                     chr + GS_MIN_CID_GLYPH, chr, NULL);

You might find it easier to use MuPDF to extract the text while using GS to create a JPEG file.
Thank you very much! I could use MuPDF to extract the text while using GS to create a JPEG file, but I want to do both things at the same time with GS, in order to save time and get some other information. In gxchar.c, I added code in show_proceed(gs_show_enum * penum):

    ......
    switch ((code = get_next_char_glyph((gs_text_enum_t *)penum, &chr, &glyph))) {
        default:        /* error */
            return code;
        case 2:         /* done */
            return show_finish(penum);
        case 1:         /* font change */
            pfont = penum->fstack.items[penum->fstack.depth].font;
            penum->current_font = pfont;
            pgs->char_tm_valid = false;
            show_state_setup(penum);
            pair = 0;
            penum->pair = 0;
            /* falls through */
        case 0:         /* plain char */
            /* added: */
            {
                gs_char unicode = pfont->procs.decode_glyph((gs_font *)pfont, glyph);
            }
    ......

When I run

    gswin32.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=out.PDF x.PDF

decode_glyph gets the correct code, but when I run

    gswin32.exe -dProvideUnicodeDecoding -dProvideUnicode -dNOPAUSE -dBATCH -sDEVICE=jpeg -sOutputFile=out.jpg x.PDF

decode_glyph gets an incorrect code. How can I make the JPEG device handle text, so that decode_glyph works like it does with the pdfwrite device?
> In gxchar.c, I add code in show_proceed(gs_show_enum * penum):

You really shouldn't change the core library code; the way to deal with this is to create your own device (pdfwrite is a device, for instance, as is the jpeg output device). Altering the default implementation may have unintended side effects.

If you look at the pdfwrite device, it has pdf_text_begin and pdf_process_text members, which is how it processes text. You will notice that these are complex routines which spend a great deal of effort deciding how to process the text based on the kind of font. I'm not certain that pdfwrite handles ToUnicode CMaps for anything except CIDFonts. In any event, you will need to duplicate, or at least understand, much of what is going on in these routines.

I'm afraid what you are attempting is quite complex and well beyond the scope of any help I can give you in this bug thread. The best thing I can suggest is that you debug your way through the pdfwrite code to see what is happening there.