Bug 687492 - Extend ps2ascii to generate Unicode and use ToUnicode CMap.
Status: RESOLVED DUPLICATE of bug 689772
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: General
Version: master
Hardware: All All
Importance: P2 enhancement
Assignee: Alex Cherepanov
URL:
Keywords: bountiable
Depends on:
Blocks:
 
Reported: 2004-05-31 18:48 UTC by Alex Cherepanov
Modified: 2010-04-25 22:33 UTC

See Also:
Customer:
Word Size: ---


Description Alex Cherepanov 2004-05-31 18:48:13 UTC
This enhancement request is based on bug 687466.
The current version of ps2ascii doesn't support composite fonts. It can be
extended to convert strings rendered with composite fonts with known CMaps
into Unicode strings. When the source is a PDF file, extra information
can be obtained from the ToUnicode CMap. See bug 685335.
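
A minimal sketch of the conversion proposed above, written in Python for
illustration rather than in the PostScript that ps2ascii uses. It assumes a
fixed two-byte character code width and a hand-written code-to-Unicode table.
The function name decode_show_string and the sample table are hypothetical;
in practice the table would come from a known CMap or a ToUnicode CMap.

```python
def decode_show_string(data, to_unicode, code_width=2):
    """Map fixed-width character codes in `data` to Unicode text.

    `to_unicode` is a dict from integer character code to a Unicode string.
    Codes missing from the table become U+FFFD (replacement character).
    """
    out = []
    for i in range(0, len(data), code_width):
        # Character codes in composite-font strings are big-endian.
        code = int.from_bytes(data[i:i + code_width], "big")
        out.append(to_unicode.get(code, "\ufffd"))
    return "".join(out)


# Hypothetical table, standing in for data read from a CMap.
SAMPLE_TABLE = {0x0001: "H", 0x0002: "i", 0x0003: "\u4e2d"}

# Four two-byte codes; the last one (0x0009) is not in the table.
text = decode_show_string(b"\x00\x01\x00\x02\x00\x03\x00\x09", SAMPLE_TABLE)
```

Real CMaps can mix code widths through codespace ranges, so the fixed width
here is a simplification.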
Comment 1 Igor Melichev 2004-06-09 06:26:53 UTC
The right way is to implement a text extraction device.
The old implementation with NOBIND must die.
Comment 2 Hin-Tak Leung 2005-06-10 10:42:47 UTC
pdftotext (of the xpdf suite) does support functionality 
of this sort.
Comment 3 Hin-Tak Leung 2005-06-26 19:53:27 UTC
Just to substantiate one of my earlier comments:
pdftotext (part of the xpdf suite) contains some
functionality for extracting non-ASCII text. I have used it
in the past to extract Big5-encoded "text", although I
have not looked inside xpdf to see how it is implemented.

(Sorry, I don't know enough about ToUnicode [yet], so please
don't assume that I am going to attempt to fix this...) 
Comment 4 leonardo 2007-05-21 01:30:05 UTC
Passing to Alex since it's a part of his development project.
Comment 5 Leon Bottou 2007-09-22 02:57:13 UTC
The situation has vastly improved with the ToUnicode
support for writing PDF files and the driver interface
to access the Unicode data.

It is now possible to write a Ghostscript driver that extracts the text.
In fact the CVS version of the djvusep driver does that.
See the text procedures in 
http://djvu.cvs.sourceforge.net/djvu/gsdjvu/gdevdjvu.c?revision=1.7
I can explain that and/or donate code if necessary.

The only missing piece is support for ToUnicode maps
in the PDF interpreter. The way to go would be to interpret these
maps and generate a GlyphNames2Unicode using convert_ToUnicode-to-g2u.
But I cannot see how to do it.

- Leon Bottou.
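
As an illustration of the "interpret these maps" step discussed in comment 5,
here is a rough Python sketch that reads the bfchar and bfrange sections of a
ToUnicode CMap into a code-to-Unicode table. It is not Ghostscript code and
does not build a GlyphNames2Unicode structure. The function name
parse_tounicode and the sample CMap fragment are hypothetical, and the parser
assumes well-formed input with even-length hex strings.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def _utf16(hexstr):
    # ToUnicode destination values are UTF-16BE hex strings.
    return bytes.fromhex(hexstr).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar/bfrange sections."""
    mapping = {}
    # bfchar entries: <srcCode> <dstString>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = _utf16(dst)
    # bfrange entries: <lo> <hi> <dstStart>  or  <lo> <hi> [<dst0> <dst1> ...]
    entry = HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])"
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, arr in re.findall(entry, section, re.S):
            lo, hi = int(lo, 16), int(hi, 16)
            if dst:
                base = _utf16(dst)
                # Simplification: bump the last code point for each code.
                for offset in range(hi - lo + 1):
                    mapping[lo + offset] = base[:-1] + chr(ord(base[-1]) + offset)
            else:
                # Array form: one destination string per code, in order.
                for offset, item in enumerate(re.findall(HEX, arr)):
                    mapping[lo + offset] = _utf16(item)
    return mapping


# Hypothetical CMap fragment for demonstration only.
SAMPLE = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
2 beginbfrange
<0010> <0012> <0041>
<0020> <0021> [<4E2D> <6587>]
endbfrange
"""

table = parse_tounicode(SAMPLE)
```

A real interpreter would also have to honor the codespace ranges declared in
the CMap and cope with malformed files, which this sketch ignores.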
Comment 6 Ray Johnston 2009-11-24 10:23:09 UTC
We now have customer interest in text extraction. Raising this to P2 (which
also increases the bounty $$$).
Comment 7 Ray Johnston 2010-04-25 22:33:25 UTC
This is really a duplicate of the 'text extraction' enhancement.

Collecting it there.

*** This bug has been marked as a duplicate of bug 689772 ***