687492 – Extend ps2ascii to generate Unicode and use ToUnicode CMap.

Summary: Extend ps2ascii to generate Unicode and use ToUnicode CMap.

Status:	RESOLVED DUPLICATE of bug 689772

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	General
Version:	master
Hardware:	All All

Importance:	P2 enhancement
Assignee:	Alex Cherepanov

URL:
Keywords:	bountiable

Depends on:
Blocks:

Reported:	2004-05-31 18:48 UTC by Alex Cherepanov
Modified:	2010-04-25 22:33 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Description Alex Cherepanov 2004-05-31 18:48:13 UTC

This enhancement request is based on bug 687466.
The current version of ps2ascii doesn't support composite fonts. It can be
extended to convert strings rendered with composite fonts with known CMaps
into Unicode strings. When the source is a PDF file, extra information
can be obtained from the ToUnicode CMap. See bug 685335.
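
For illustration only, here is a minimal Python sketch of the conversion requested above. It is not ps2ascii or Ghostscript code. It assumes fixed-width two-byte character codes and a hypothetical `code_to_unicode` table of the kind a known CMap or a ToUnicode CMap could supply; the function name and the sample table are made up for the example.

```python
from typing import Dict


def decode_composite_string(data: bytes,
                            code_to_unicode: Dict[int, str],
                            code_len: int = 2) -> str:
    """Map fixed-width character codes to Unicode text.

    Assumption: every code is exactly code_len bytes, big-endian.
    Codes missing from the table become U+FFFD; trailing bytes that
    do not form a whole code are ignored.
    """
    out = []
    for i in range(0, len(data) - code_len + 1, code_len):
        code = int.from_bytes(data[i:i + code_len], "big")
        out.append(code_to_unicode.get(code, "\ufffd"))
    return "".join(out)


# Hypothetical table: code 1 -> "A", code 2 -> U+4E2D; code 3 is unmapped.
table = {0x0001: "A", 0x0002: "\u4e2d"}
text = decode_composite_string(b"\x00\x01\x00\x02\x00\x03", table)
```

Real CMaps can define variable-length code space ranges, so a complete implementation would have to split the string according to those ranges rather than a fixed width.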

Comment 1 Igor Melichev 2004-06-09 06:26:53 UTC

The right way is to implement a text extraction device.
The old implementation with NOBIND must die.

Comment 2 Hin-Tak Leung 2005-06-10 10:42:47 UTC

pdftotext (of the xpdf suite) does support functionality 
of this sort.

Comment 3 Hin-Tak Leung 2005-06-26 19:53:27 UTC

Just to substantiate one of my earlier comments
 - pdftotext (part of the xpdf suite) contains some
functionality for extracting non-ASCII text. I have used it
in the past to extract Big5-encoded "text", although I
have not looked inside xpdf to see how it is implemented.

(Sorry, I don't know enough about ToUnicode [yet], so please
don't assume that I am going to attempt to fix this...)

Comment 4 leonardo 2007-05-21 01:30:05 UTC

Passing to Alex since it's a part of his development project.

Comment 5 Leon Bottou 2007-09-22 02:57:13 UTC

The situation has vastly improved with the ToUnicode
support for writing PDF files and the driver interface
to access the Unicode data.

It is now possible to write a Ghostscript driver that extracts the text.
In fact the CVS version of the djvusep driver does that.
See the text procedures in 
http://djvu.cvs.sourceforge.net/djvu/gsdjvu/gdevdjvu.c?revision=1.7
I can explain that and/or donate code if necessary.

The only missing piece is support for ToUnicode maps
in the PDF interpreter. The way to go would be to interpret these
maps and generate a GlyphNames2Unicode using convert_ToUnicode-to-g2u.
But I cannot see how to do it.

- Leon Bottou.
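
To make "interpret these maps" concrete, here is a rough Python sketch that reads the `bfchar` and `bfrange` sections of a ToUnicode CMap into a code-to-Unicode table. It is an illustration, not the PDF interpreter's code, and it does not produce a GlyphNames2Unicode dictionary. The function name `parse_tounicode` and the sample CMap are assumptions made for the example, and it assumes hex strings with an even number of digits.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def _to_text(hex_digits: str) -> str:
    # Destination values are UTF-16BE code units written in hex.
    return bytes.fromhex(hex_digits).decode("utf-16-be", errors="replace")


def parse_tounicode(cmap_text: str) -> dict:
    """Return {character code: Unicode string} from bfchar/bfrange sections."""
    mapping = {}

    # bfchar: one "<src> <dst>" pair per entry.
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = _to_text(dst)

    # bfrange: "<lo> <hi> <dst>" or "<lo> <hi> [<dst0> <dst1> ...]".
    entry = HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])"
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, array in re.findall(entry, body, re.S):
            first, last = int(lo, 16), int(hi, 16)
            if array:
                # Array form: one destination per code in the range.
                dsts = re.findall(HEX, array)
                for offset, d in enumerate(dsts[:last - first + 1]):
                    mapping[first + offset] = _to_text(d)
            else:
                # Single destination: increment it for each successive code.
                width = len(dst) // 2
                base = int(dst, 16)
                for offset in range(last - first + 1):
                    raw = (base + offset).to_bytes(width, "big")
                    mapping[first + offset] = raw.decode("utf-16-be",
                                                         errors="replace")
    return mapping


# Made-up sample CMap fragment for the example.
sample = """
2 beginbfchar
<0001> <0041>
<0002> <4E2D>
endbfchar
2 beginbfrange
<0010> <0012> <0061>
<0020> <0021> [<0078> <0079>]
endbfrange
"""
mapping = parse_tounicode(sample)
```

The sketch uses regular expressions for brevity; a real interpreter would tokenize the CMap as PostScript and would also need the code space ranges to know how many bytes each code occupies.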

Comment 6 Ray Johnston 2009-11-24 10:23:09 UTC

We now have customer interest in text extraction. Raising this to P2 (which
also increases the bounty $$$).

Comment 7 Ray Johnston 2010-04-25 22:33:25 UTC

This is really a duplicate of the 'text extraction' enhancement.

Collecting it there.

*** This bug has been marked as a duplicate of bug 689772 ***