Bug 687492

Summary: Extend ps2ascii to generate Unicode and use ToUnicode CMap.
Product: Ghostscript Reporter: Alex Cherepanov <alex>
Component: General Assignee: Alex Cherepanov <alex>
Status: RESOLVED DUPLICATE    
Severity: enhancement CC: htl10, ray.johnston
Priority: P2 Keywords: bountiable
Version: master   
Hardware: All   
OS: All   
Customer: Word Size: ---

Description Alex Cherepanov 2004-05-31 18:48:13 UTC
This enhancement request is based on bug 687466.
The current version of ps2ascii doesn't support composite fonts. It can be
extended to convert strings rendered with composite fonts with known CMaps
into Unicode strings. When the source is a PDF file, extra information
can be obtained from the ToUnicode CMap. See bug 685335.
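
As a rough illustration of the conversion described above, the following Python sketch reads the bfchar and bfrange sections of a ToUnicode CMap and uses them to turn character codes into a Unicode string. It is not ps2ascii or Ghostscript code. The names parse_tounicode, decode_string and SAMPLE_CMAP are hypothetical, and the sketch assumes one mapping entry per line, fixed-width 2-byte codes, and well-formed hex strings. The array form of bfrange is skipped.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
BFCHAR_LINE = re.compile(HEX + r"\s*" + HEX)
BFRANGE_LINE = re.compile(HEX + r"\s*" + HEX + r"\s*" + HEX)


def _utf16(hex_digits):
    # ToUnicode destinations are UTF-16BE code units written in hex.
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def _sections(cmap_text, keyword):
    # Yield the stripped lines between "begin<keyword>" and "end<keyword>".
    pattern = r"begin" + keyword + r"(.*?)end" + keyword
    for body in re.findall(pattern, cmap_text, re.S):
        for line in body.splitlines():
            yield line.strip()


def parse_tounicode(cmap_text):
    """Return a dict mapping character codes (int) to Unicode strings."""
    mapping = {}
    for line in _sections(cmap_text, "bfchar"):
        m = BFCHAR_LINE.fullmatch(line)
        if m:
            mapping[int(m.group(1), 16)] = _utf16(m.group(2))
    for line in _sections(cmap_text, "bfrange"):
        m = BFRANGE_LINE.fullmatch(line)
        if m:  # lines using the array form "<lo> <hi> [ ... ]" do not match
            lo, hi = int(m.group(1), 16), int(m.group(2), 16)
            first = _utf16(m.group(3))
            for offset in range(hi - lo + 1):
                # Consecutive codes map to consecutive values of the last character.
                mapping[lo + offset] = first[:-1] + chr(ord(first[-1]) + offset)
    return mapping


def decode_string(data, mapping, code_bytes=2):
    """Decode a byte string of fixed-width codes; unknown codes become U+FFFD."""
    codes = (int.from_bytes(data[i:i + code_bytes], "big")
             for i in range(0, len(data), code_bytes))
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Made-up CMap fragment, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
1 beginbfrange
<0010> <0012> <0041>
endbfrange
"""

if __name__ == "__main__":
    table = parse_tounicode(SAMPLE_CMAP)
    print(decode_string(b"\x00\x01\x00\x02\x00\x11", table))
```

A real implementation would also need to honour the codespace ranges declared in the CMap, since composite fonts can use variable-width codes.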
Comment 1 Igor Melichev 2004-06-09 06:26:53 UTC
The right way is to implement a text extraction device.
The old implementation with NOBIND must die.
Comment 2 Hin-Tak Leung 2005-06-10 10:42:47 UTC
pdftotext (of the xpdf suite) does support functionality 
of this sort.
Comment 3 Hin-Tak Leung 2005-06-26 19:53:27 UTC
Just to substantiate one of my earlier comments 
 - pdftotext (part of the xpdf suite) contains some
functionality for extracting non-ASCII text. I have used it
in the past to extract Big5-encoded "text", although I 
have not looked inside xpdf to see how it is implemented.

(Sorry, I don't know enough about ToUnicode [yet], so please
don't assume that I am going to attempt to fix this...) 
Comment 4 leonardo 2007-05-21 01:30:05 UTC
Passing to alex since it's part of his development project.
Comment 5 Leon Bottou 2007-09-22 02:57:13 UTC
The situation has vastly improved with the ToUnicode 
support for writing pdf files and the driver interface
to access the unicode data.  

It is now possible to write a ghostscript driver that extracts the text. 
In fact the cvs version of the djvusep driver does that.
See the text procedures in 
http://djvu.cvs.sourceforge.net/djvu/gsdjvu/gdevdjvu.c?revision=1.7
I can explain that and/or donate code if necessary.

The only missing piece is support for ToUnicode maps
in the PDF interpreter. The way to go would be to interpret these
maps and generate a GlyphNames2Unicode using convert_ToUnicode-to-g2u.
But I cannot see how to do it.

- Leon Bottou.
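
To make the idea in the comment above concrete, here is a minimal Python sketch of deriving a glyph-name-to-Unicode table from a ToUnicode mapping. It is only a conceptual model: it does not show how convert_ToUnicode-to-g2u works, and it does not reproduce the layout Ghostscript expects for GlyphNames2Unicode. The function name build_glyph_to_unicode and both input dictionaries are hypothetical. It assumes a simple font where each character code has one glyph name in the font's encoding.

```python
def build_glyph_to_unicode(encoding, code_to_unicode):
    """Compose code -> glyph name with code -> Unicode into glyph name -> Unicode.

    encoding:        dict mapping character codes to glyph names (assumed input)
    code_to_unicode: dict mapping character codes to Unicode strings,
                     e.g. the result of interpreting a ToUnicode CMap
    """
    table = {}
    for code, glyph_name in sorted(encoding.items()):
        text = code_to_unicode.get(code)
        if text is not None:
            # If several codes share a glyph name, keep the lowest code's value.
            table.setdefault(glyph_name, text)
    return table


if __name__ == "__main__":
    # Made-up data: code 1 draws a glyph named "g1" that stands for "fi".
    encoding = {1: "g1", 2: "g2", 3: "g3"}
    code_to_unicode = {1: "fi", 2: "A"}
    print(build_glyph_to_unicode(encoding, code_to_unicode))
```

Codes that have no ToUnicode entry are left out of the table rather than guessed.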
Comment 6 Ray Johnston 2009-11-24 10:23:09 UTC
We now have customer interest in text extraction. Raising this to P2 (which
also increases the bounty $$$)
Comment 7 Ray Johnston 2010-04-25 22:33:25 UTC
This is really a duplicate of the 'text extraction' enhancement.

Collecting it there.

*** This bug has been marked as a duplicate of bug 689772 ***