685335 – PDF interpreter doesn't process ToUnicode

Summary: PDF interpreter doesn't process ToUnicode

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Interpreter
Version:	master
Hardware:	All All

Importance:	P4 normal
Assignee:	Igor Melichev

URL:
Keywords:	bountiable

Duplicates (1):	687532
Depends on:
Blocks:

Reported:	2003-02-12 06:49 UTC by Igor Melichev
Modified:	2009-07-05 02:26 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Attachments
How to get unicode text from this pdf (410.24 KB, application/pdf) 2009-07-02 20:18 UTC, simengman

Description Igor Melichev 2003-02-12 06:49:40 UTC

Originally reported by: igorm@users.sourceforge.net
The PDF interpreter ignores ToUnicode.
With pdfwrite, this breaks the searchability of the output.

I suggest converting ToUnicode CMaps into
FontInfo.GlyphNames2Unicode while reading a font
resource from a PDF file. This is pretty simple in
PostScript. ParseCMap_Inverse, defined in
lib/gs_ciddc.ps, should help. My recent patches added
processing of GlyphNames2Unicode to pdfwrite. See SF
bug #684120 about them.
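
For illustration only (this is not Ghostscript code, and it is not ParseCMap_Inverse): a ToUnicode CMap maps character codes to Unicode values through bfchar and bfrange sections, so the conversion suggested above amounts to reading those entries into a lookup table. A minimal C sketch of that reading step, under stated assumptions: the function name parse_tounicode and the NO_UNICODE sentinel are made up for this example, only destination values that fit in one 16-bit code unit are stored, and the array form of bfrange is skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NO_UNICODE 0xFFFFFFFFu  /* hypothetical "no mapping" sentinel */

/* Read the bfchar and simple bfrange entries of a ToUnicode CMap into
   table[code] = Unicode value.  Returns the number of codes mapped. */
int parse_tounicode(const char *cmap, unsigned table[], size_t table_len)
{
    enum { NONE, BFCHAR, BFRANGE } section = NONE;
    unsigned vals[3];
    int nvals = 0, mapped = 0;
    const char *p = cmap;
    size_t i;

    assert(cmap != NULL && table != NULL);
    for (i = 0; i < table_len; i++)
        table[i] = NO_UNICODE;

    while (*p != '\0') {
        if (strncmp(p, "beginbfchar", 11) == 0) {
            section = BFCHAR; nvals = 0; p += 11;
        } else if (strncmp(p, "beginbfrange", 12) == 0) {
            section = BFRANGE; nvals = 0; p += 12;
        } else if (strncmp(p, "endbf", 5) == 0) {
            section = NONE; p += 5;
        } else if (section == BFRANGE && *p == '[') {
            /* Array form <lo> <hi> [<a> <b> ...]: not handled, skip it. */
            while (*p != '\0' && *p != ']')
                p++;
            nvals = 0;
        } else if (section != NONE && *p == '<') {
            unsigned v;
            int n = 0;

            if (sscanf(p, "<%8x>%n", &v, &n) == 1 && n > 0) {
                vals[nvals++] = v;
                p += n;
                if (section == BFCHAR && nvals == 2) {
                    /* <code> <unicode> */
                    if (vals[0] < table_len && vals[1] <= 0xFFFF) {
                        table[vals[0]] = vals[1];
                        mapped++;
                    }
                    nvals = 0;
                } else if (section == BFRANGE && nvals == 3) {
                    /* <lo> <hi> <unicode of lo>, consecutive values */
                    unsigned c;
                    for (c = vals[0]; c <= vals[1] && c < table_len; c++) {
                        table[c] = vals[2] + (c - vals[0]);
                        mapped++;
                    }
                    nvals = 0;
                }
            } else {
                /* Empty or over-long hex string: drop the current entry. */
                while (*p != '\0' && *p != '>')
                    p++;
                nvals = 0;
            }
        } else {
            p++;
        }
    }
    return mapped;
}
```

GlyphNames2Unicode itself is keyed by glyph name or CID rather than by raw character code, so a real conversion would still need the font's encoding to go from code to glyph; the sketch stops before that step.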

Comment 1 Igor Melichev 2003-12-14 12:16:52 UTC

This bug needs Ray's approval because he handles the PDF interpreter.

Comment 2 Igor Melichev 2003-12-17 09:45:23 UTC

Customer #562 likely needs this feature.
I think so because of a recent mail from the customer.
Should we bump its priority?

Comment 3 Igor Melichev 2004-06-24 12:22:40 UTC

*** Bug 687532 has been marked as a duplicate of this bug. ***

Comment 4 Lakshmi 2004-08-09 06:53:17 UTC

Hi,

I would appreciate an update on the status of this bug.

Thanks
Lakshmi

Comment 5 Ralph Giles 2005-01-19 12:29:44 UTC

Closing for lack of engineering resources. ToUnicode will likely get addressed
in the long run anyway.

Comment 6 Igor Melichev 2005-01-24 12:16:46 UTC

Restoring the open status since it may be important for the supported feature 
list.

Comment 7 Ralph Giles 2005-01-24 19:21:06 UTC

Adding to the bug bounty list. Consensus seems to be that preserving
searchability of PDF (which this affects in the PDF->PDF case) is worthwhile.
Therefore we leave this open in the tracker and hope someone will fix it for the
bounty.

Comment 8 Hin-Tak Leung 2005-06-26 19:54:09 UTC

Just to substantiate one of my earlier comments on a related bug
(http://bugs.ghostscript.com/show_bug.cgi?id=687492#c2):
pdftotext (part of the xpdf suite) contains some
functionality for extracting non-ASCII text. I have used it
in the past to extract Big5-encoded "text", although I
have not looked inside xpdf to see how it is implemented.

(Sorry, I don't know enough about ToUnicode [yet], so please
don't assume that I am going to attempt to fix this...)

Comment 9 Igor Melichev 2005-08-17 15:02:47 UTC

Patch
http://ghostscript.com/pipermail/gs-cvs/2005-August/005649.html

Comment 10 Ulrich Windl 2006-03-17 03:54:34 UTC

Recently I discovered a common use case that is broken by this bug: use Mozilla
to print a web page to a PostScript file, and convert that with "ps2pdf" for
archival purposes. In Adobe Acrobat you cannot copy text from that file even
though the text appears correct. Acrobat Distiller does it correctly. IMHO the
displayed text should match the text the tools see internally (find, copy & paste).

Comment 11 leonardo 2006-03-19 09:45:35 UTC

Please attach the PostScript file.

Comment 12 Ulrich Windl 2006-03-20 00:00:56 UTC

Created attachment 2113
Sample Mozilla PostScript print file

Comment 13 Alex Cherepanov 2006-03-20 04:34:01 UTC

I confirm that released versions of Ghostscript generate PDF files that convert
to text with wrong encoding. This problem is fixed in the current development
version since rev. 6178. The development version of Ghostscript can be
obtained from the Subversion repository as 
svn checkout http://svn.ghostscript.com:8080/ghostscript/trunk/gs/

Comment 14 24067864 2007-09-19 08:14:21 UTC

Comment on attachment 2113
Sample Mozilla PostScript print file

#685335

Comment 15 simengman 2009-07-01 20:37:17 UTC

The PDF interpreter now processes ToUnicode CMaps when the target device is
pdfwrite, but not when the target device is jpeg. I need it to do so for jpeg
as well, but I do not know how to do it.

Comment 16 Ken Sharp 2009-07-02 00:33:07 UTC

ToUnicode CMaps are processed by the PDF interpreter using code in the file
/gs/Resource/Init/pdf_font.ps, see the function '.processToUnicode'. 

There is a specific test against the pdfwrite device:

{
  % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to 
  % generate a ToUnicode CMaps. So don't bother with other devices.
  currentdevice .devicename /pdfwrite eq {

Despite the comments, I believe this handles ToUnicode CMaps from PDF files as
well as GlyphNames2Unicode from PostScript files.

If you remove the test, then the code will run normally for all devices. However,
pdfwrite is the only high-level device which can use this information; it's not
clear to me what you want the JPEG device to do with it.

Comment 17 simengman 2009-07-02 20:02:00 UTC

Thank you! I want to extract text and make a JPEG from a PDF. I want to use the
PDF interpreter to parse the PDF file and output information to an XML file.
After I removed the test "currentdevice .devicename /pdfwrite eq {", I called
gs_font_map_glyph_to_unicode to get the text, but it failed. How can I get the
Unicode text?

Comment 18 simengman 2009-07-02 20:18:20 UTC

Created attachment 5179
How to get unicode text from this pdf

Comment 19 Ken Sharp 2009-07-03 00:54:17 UTC

What do you mean by 'failed'? Did you get a PostScript error, or something else?

You shouldn't be calling gs_font_map_glyph_to_unicode directly; you should use
the font's decode_glyph method.

The JPEG device doesn't handle text, so presumably you are using a custom
device? It's pretty difficult to comment on the action of code I haven't seen.

Note that pdfwrite doesn't use the Unicode information very much; it simply uses
it to construct a ToUnicode CMap for the output PDF file. I would suggest you
start by debugging the code: set a breakpoint in pdf_add_ToUnicode with your
test file as input and see what happens.

You should also look at scn_cmap_text, especially this code:

		    if (pdf_is_CID_font(subfont)) {
			if (subfont->procs.decode_glyph((gs_font *)subfont, glyph) != GS_NO_CHAR) {
			    /* Since PScript5.dll creates GlyphNames2Unicode with character codes
			       instead CIDs, and with the WinCharSetFFFF-H2 CMap
		               character codes appears different than CIDs (Bug 687954),
		               pass the character code intead the CID. */
			    code = pdf_add_ToUnicode(pdev, subfont, pdfont, 
				chr + GS_MIN_CID_GLYPH, chr, NULL);
			} else {
			    /* If we interpret a PDF document, ToUnicode 
			       CMap may be attached to the Type 0 font. */
			    code = pdf_add_ToUnicode(pdev, pte->orig_font, pdfont, 
				chr + GS_MIN_CID_GLYPH, chr, NULL);
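
To make the fallback in the excerpt above easier to follow, here is a standalone C sketch of the same decision: ask the descendant font first and, if it reports no mapping, ask the Type 0 font. Every name in it (mock_font, mock_char, MOCK_NO_CHAR, table_decode, decode_with_fallback) is a hypothetical stand-in and not part of Ghostscript. The real code uses gs_font, gs_char and GS_NO_CHAR, and passes the chosen font to pdf_add_ToUnicode rather than returning a value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for gs_glyph, gs_char and GS_NO_CHAR. */
typedef unsigned long mock_glyph;
typedef unsigned long mock_char;
#define MOCK_NO_CHAR ((mock_char)~0UL)

/* Hypothetical stand-in for a font with a decode_glyph procedure. */
typedef struct mock_font_s mock_font;
struct mock_font_s {
    mock_char (*decode_glyph)(const mock_font *font, mock_glyph glyph);
    const mock_char *table;   /* glyph -> Unicode, may be NULL */
    size_t table_len;
};

/* A decode_glyph implementation backed by a simple lookup table. */
mock_char table_decode(const mock_font *font, mock_glyph glyph)
{
    if (font->table == NULL || glyph >= font->table_len)
        return MOCK_NO_CHAR;
    return font->table[glyph];
}

/* Prefer the descendant font's mapping; fall back to the Type 0 font,
   which is where a ToUnicode CMap read from a PDF may be attached. */
mock_char decode_with_fallback(const mock_font *subfont,
                               const mock_font *orig_font,
                               mock_glyph glyph)
{
    mock_char u;

    assert(subfont != NULL && orig_font != NULL);
    u = subfont->decode_glyph(subfont, glyph);
    if (u == MOCK_NO_CHAR)
        u = orig_font->decode_glyph(orig_font, glyph);
    return u;
}
```

The point of going through the procedure pointer, as the excerpt does, is that the caller does not need to know where a font's Unicode information came from.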


You might find it easier to use MuPDF to extract the text while using GS to
create a JPEG file.

Comment 20 simengman 2009-07-05 00:29:23 UTC

Thank you very much! I can use MuPDF to extract the text while using GS to
create a JPEG file, but I want to do both at the same time with GS, in
order to save time and get some other information. In gxchar.c, I add code
in show_proceed(gs_show_enum * penum):
......
	    switch ((code = get_next_char_glyph((gs_text_enum_t *)penum,
						&chr, &glyph))
		    ) {
		default:	/* error */
			return code;
		case 2:	/* done */
		    return show_finish(penum);
		case 1:	/* font change */
		    pfont = penum->fstack.items[penum->fstack.depth].font;
		    penum->current_font = pfont;
		    pgs->char_tm_valid = false;
		    show_state_setup(penum);
		    pair = 0;
		    penum->pair = 0;
		    /* falls through */
		case 0:	/* plain char */
//add:
			{
gs_char unicode = pfont->procs.decode_glyph((gs_font *)pfont, glyph);
			}
......
When I run
"gswin32.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=out.PDF x.PDF",
decode_glyph returns the correct code, but when I run
"gswin32.exe -dProvideUnicodeDecoding -dProvideUnicode -dNOPAUSE -dBATCH -sDEVICE=jpeg -sOutputFile=out.jpg x.PDF",
decode_glyph returns an incorrect code. How can I make the JPEG device handle
text, or make decode_glyph work as it does with the pdfwrite device?

Comment 21 Ken Sharp 2009-07-05 02:26:54 UTC

> In gxchar.c, I add code in show_proceed(gs_show_enum * penum):

You really shouldn't change the core library code. The way to deal with this is
to create your own device (pdfwrite is a device, for instance, as is the jpeg
output device). Altering the default implementation may have unintended side
effects.

If you look at the pdfwrite device it has pdf_text begin and pdf_process_text
members, which is how it processes text. You will notice that these are complex
routines and spend a great amount of effort to decide how to process the text
based on the kind of font. I'm not certain that pdfwrite handles ToUnicode CMaps
for anything except CIDFonts. In any event you will need to duplicate or at
least understand much of what is going on in this routine.

I'm afraid what you are attempting is quite complex and well beyond the scope of
any help I can give you in this bug thread. The best thing I can suggest is that
you debug your way through the pdfwrite code to see what is happening there.