This enhancement request is based on bug 687466. The current version of ps2ascii doesn't support composite fonts. It could be extended to convert strings rendered with composite fonts that use known CMaps into Unicode strings. When the source is a PDF file, extra information can be obtained from the ToUnicode CMap. See bug 685335.
The right way is to implement a text extraction device. The old NOBIND-based implementation must die.
pdftotext (of the xpdf suite) does support functionality of this sort.
Just to substantiate one of my earlier comments: pdftotext (part of the xpdf suite) contains some functionality for extracting non-ASCII text. I have used it in the past to extract Big5-encoded "text", although I have not looked inside xpdf to see how it is implemented. (Sorry, I don't know enough about ToUnicode [yet], so please don't assume that I am going to attempt to fix this...)
Passing to alex since it's part of his development project.
The situation has vastly improved with the ToUnicode support for writing PDF files and the driver interface for accessing the Unicode data. It is now possible to write a Ghostscript driver that extracts the text; in fact the CVS version of the djvusep driver does exactly that. See the text procedures in http://djvu.cvs.sourceforge.net/djvu/gsdjvu/gdevdjvu.c?revision=1.7 — I can explain that and/or donate code if necessary. The only missing piece is the lack of support for ToUnicode maps in the PDF interpreter. The way to go would be to interpret these maps and generate a GlyphNames2Unicode table using convert_ToUnicode-to-g2u, but I cannot see how to do it myself. - Leon Bottou.
We now have customer interest in text extraction. Raising this to P2 (which also increases the bounty $$$)
This is really a duplicate of the 'text extraction' enhancement. Collecting it there. *** This bug has been marked as a duplicate of bug 689772 ***