I'm trying to convert a French PDF into plain text, but accented characters are not converted correctly. For example, occurrences of é come out as e'. The command I use is:

gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps my.pdf -c quit >file.txt

Am I doing something wrong? Is there a way to convert the text to UTF-8 rather than to ASCII? Thanks for answering.
(In reply to comment #0)
> For example, occurrences of é are translated to e'.

Often a character with an accent is actually described as the base character plus an accent character. Without a sample file it's not possible to say any more.

> Am I doing something wrong?
> Is there a way to convert text to utf-8 rather than to ascii?

Currently no. There is a long-term project for better text extraction, but it has a very low priority.
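To illustrate the "base character plus accent character" idea outside Ghostscript, here is a minimal Python sketch. It uses Unicode NFD normalization to split an accented letter into its base letter and a combining accent, then replaces the combining accent with an ASCII stand-in. The `ASCII_ACCENTS` table and the function name `to_ascii_digraphs` are assumptions made for this example; they are not taken from ps2ascii.ps.

```python
import unicodedata

# Assumed ASCII stand-ins for combining accents. Illustrative only:
# this table is not copied from ps2ascii.ps.
ASCII_ACCENTS = {
    "\u0301": "'",   # combining acute accent
    "\u0300": "`",   # combining grave accent
    "\u0302": "^",   # combining circumflex accent
    "\u0308": '"',   # combining diaeresis
    "\u0303": "~",   # combining tilde
    "\u0327": ",",   # combining cedilla
}


def to_ascii_digraphs(text: str) -> str:
    """Rewrite each accented letter as its base letter plus an ASCII accent."""
    decomposed = unicodedata.normalize("NFD", text)  # e.g. U+00E9 -> "e" + U+0301
    return "".join(ASCII_ACCENTS.get(ch, ch) for ch in decomposed)


if __name__ == "__main__":
    # "\u00e9" is e with acute accent
    print(to_ascii_digraphs("pr\u00e9sidera"))  # prints: pre'sidera
```

Going the other way (turning e' back into é after extraction) is ambiguous in French, because an apostrophe can legitimately follow a letter, so this sketch only shows the forward direction.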
Hi Ken,

Thanks for the quick answer.
I'm uploading one particular PDF, although any French PDF will do.

The text on the first page:
Jacques Chirac présidera une journée
is extracted as:
Jacques Chirac pre'sidera une journe'

Is there no hope of extracting it correctly?
Created attachment 6569: french pdf
(In reply to comment #2)
> Hi Ken,
>
> Thanks for the quick answer.
> I'm uploading one particular pdf, although any french pdf will do.
>
> The text on the first page:
> Jacques Chirac présidera une journée
> is extracted as:
> Jacques Chirac pre'sidera une journe'

The ps2ascii.ps script deliberately outputs each accented glyph as the regular glyph followed by an 'accent'; the accent characters are defined in the file in a particular way. So eacute is e', ecircumflex would be e^, adieresis would be a", and so on. This appears to be done in order to stay within plain old ASCII (i.e. 7-bit, *not* the extended ASCII range): all the accented characters are in the extended range, above 127.

So this is working as designed. You can (of course!) change the way it works; you'll need to edit ps2ascii.ps. Look for the section commented '% Encode the ISO accented characters.'. That currently breaks the ISO Latin named accented glyphs into their two components, then glues the resulting characters back together to make a two-character string. You'll need to change all of that.

A quick and dirty solution would be to add something like:

mark /eacute <XX> /egrave <XX> .chars.def

and so on, where XX is the hexadecimal value of the character you want to use. Define this *after* the loop defining the ISO Latin characters.

Closing as 'worksforme'.
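For anyone filling in the XX values by hand, the following Python sketch computes the hexadecimal string for each glyph and prints a fragment in the shape suggested above. The `GLYPHS` table is a small hypothetical sample, and the helper names are invented for this example. Whether `.chars.def` in ps2ascii.ps behaves well with multi-byte UTF-8 strings is an untested assumption; the single-byte Latin-1 values are the safer starting point.

```python
# Hypothetical sample of ISO Latin-1 glyph names and the characters they
# name. Extend it to cover the glyphs you need.
GLYPHS = {
    "eacute": "\u00e9",       # e with acute
    "egrave": "\u00e8",       # e with grave
    "ecircumflex": "\u00ea",  # e with circumflex
    "agrave": "\u00e0",       # a with grave
    "ccedilla": "\u00e7",     # c with cedilla
}


def ps_hex_string(char: str, encoding: str) -> str:
    """Return a PostScript hex string literal such as <E9> for one character."""
    return "<" + char.encode(encoding).hex().upper() + ">"


def chars_def_fragment(encoding: str = "latin-1") -> str:
    """Build a 'mark /name <hex> ... .chars.def' line for the sample glyphs."""
    pairs = " ".join(
        f"/{name} {ps_hex_string(ch, encoding)}" for name, ch in GLYPHS.items()
    )
    return f"mark {pairs} .chars.def"


if __name__ == "__main__":
    print(chars_def_fragment("latin-1"))  # one byte per character
    print(chars_def_fragment("utf-8"))    # two bytes per character here
```

With Latin-1 the value for eacute is E9; with UTF-8 the same character becomes the two bytes C3 A9, which is what a UTF-8 aware viewer expects.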
Ken,

Thanks for the info. Is there any documentation on the format of ps2ascii.ps?

Thanks
(In reply to comment #5)
> Can I find somewhere some information about the format of ps2ascii.ps?

It's a PostScript program; the only documentation is the comments within the file itself.
Created attachment 6570: ps2utf8.ps
Created attachment 6571: DjVuLibre ps2utf8.ps
Ken,

Per your instructions, I have created the file ps2utf8.ps based upon ps2ascii.ps. The character mappings were adapted from the DjVuLibre open-source project.

The result seems to work and does convert French PDFs into UTF-8 text, at least on the (few) files that I have tested. I have not tested other languages, although all European languages are supposedly supported.

However, I don't have the knowledge to verify whether what I did is entirely correct. I have also not programmed the addition of a BOM (byte order mark) under Windows, since I don't know how. Ideally this should be optional.

The file ps2utf8.ps was uploaded as an attachment. I have also uploaded the original DjVuLibre file, for reference, as it takes some research to find.

In the hope that you will find this useful, and might even be inspired to add a ps2utf8 encoding to GS ...
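One way to get an optional BOM without touching the PostScript is to add it as a post-processing step on the extracted text file. The Python sketch below is only an illustration of that idea: the function name `add_utf8_bom` and its `enabled` switch are invented for this example, and it assumes the file already contains valid UTF-8.

```python
import codecs
from pathlib import Path


def add_utf8_bom(path: str, enabled: bool = True) -> bool:
    """Prepend a UTF-8 BOM to the file unless it already starts with one.

    Returns True if the file was modified, False otherwise.
    """
    if not enabled:
        return False
    target = Path(path)
    data = target.read_bytes()
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return False
    target.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

For example, calling `add_utf8_bom("file.txt")` after the extraction command has written file.txt would mark the file as UTF-8 for Windows tools that rely on the BOM; passing `enabled=False` leaves the file untouched.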
(In reply to comment #9)
> Ken,
>
> Per your instructions, I have created the file ps2utf8.ps based upon
> ps2ascii.ps. The character-mappings were adapted from the DjVuLibre open-source
> project.

Wow, you've gone a lot further than anything I was suggesting...

> The file ps2utf8.ps was uploaded as attachment. I have also uploaded the
> original DjVuLibre file, for reference, as it takes some research to find.
>
> In the hope that you will find this useful, and might even be inspired to add a
> ps2utf8 encoding to GS ...

Thanks for that; if anyone else requests it I can point them to this. One day we hope to have a more functional text extraction device, written in C and able to deal with things like ToUnicode CMaps, which will allow processing of some CIDFonts.