Bug 691506 - converting pdf with accented characters to text
Summary: converting pdf with accented characters to text
Status: RESOLVED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: Text
Version: 8.71
Hardware: PC Windows Vista
Importance: P4 normal
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-28 11:54 UTC by Harry McKame
Modified: 2010-07-28 15:58 UTC
0 users

See Also:
Customer:
Word Size: ---


Attachments
french pdf (382.97 KB, application/pdf)
2010-07-28 13:06 UTC, Harry McKame
ps2utf8.ps (61.62 KB, application/postscript)
2010-07-28 15:15 UTC, Harry McKame
DjVuLibre ps2utf8.ps (40.52 KB, application/postscript)
2010-07-28 15:28 UTC, Harry McKame

Description Harry McKame 2010-07-28 11:54:47 UTC
I'm trying to convert a French PDF into plain text, but accented characters don't seem to be converted correctly.

For example, occurrences of é are translated to e'.

The command I use is:

gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps my.pdf -c quit >file.txt

Am I doing something wrong?
Is there a way to convert text to utf-8 rather than to ascii?

Thanks for answering.
Comment 1 Ken Sharp 2010-07-28 12:17:33 UTC
(In reply to comment #0)

> For example, occurrences of é are translated to e'.

Often a character with an accent is actually described as the base character + an accent character. Without a sample file it's not possible to say any more.


> Am I doing something wrong?
> Is there a way to convert text to utf-8 rather than to ascii?

Currently no. There is a long-term project for better text extraction, but it has a very low priority.
Comment 2 Harry McKame 2010-07-28 13:06:22 UTC
Hi Ken,

Thanks for the quick answer.
I'm uploading one particular pdf, although any french pdf will do.

The text on the first page:
Jacques Chirac présidera une journée
is extracted as:
Jacques Chirac pre'sidera une journe'

Is there no hope of extracting it correctly?
Comment 3 Harry McKame 2010-07-28 13:06:56 UTC
Created attachment 6569
french pdf
Comment 4 Ken Sharp 2010-07-28 13:52:30 UTC
(In reply to comment #2)
> Hi Ken,
> 
> Thanks for the quick answer.
> I'm uploading one particular pdf, although any french pdf will do.
> 
> The text on the first page:
> Jacques Chirac présidera une journée
> is extracted as:
> Jacques Chirac pre'sidera une journe'

The ps2ascii.ps script deliberately outputs each accented glyph as the regular glyph plus an 'accent' character; the accent characters are defined in the file in a particular way. So eacute is e', ecaron would be e^, adieresis would be a", and so on. This appears to be done so that the output stays within plain old ASCII (i.e. 7-bit, *not* the extended ASCII range), since all the accented characters are in the extended range, > 127.
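
The base-plus-accent behaviour described above can be imitated outside Ghostscript. The following is only a rough Python sketch of the idea, not the code in ps2ascii.ps: the function name decompose_accents and the mapping table are hypothetical, the acute and dieresis stand-ins follow the examples in this comment, and the circumflex entry is an assumption.

```python
import unicodedata

# Combining mark -> ASCII stand-in.
# Acute and dieresis follow the examples given in this thread;
# the circumflex entry is an assumption made for illustration.
ACCENT_TO_ASCII = {
    "\u0301": "'",   # combining acute:    eacute    -> e'
    "\u0308": '"',   # combining dieresis: adieresis -> a"
    "\u0302": "^",   # combining circumflex (assumed mapping)
}

def decompose_accents(text: str) -> str:
    """Split accented letters into base letter + ASCII accent character.

    Combining marks that are not in the table are passed through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ACCENT_TO_ASCII.get(ch, ch) for ch in decomposed)

if __name__ == "__main__":
    print(decompose_accents("Jacques Chirac pr\u00e9sidera une journ\u00e9e"))
    # prints: Jacques Chirac pre'sidera une journe'e
```

This reproduces the kind of output reported in comment #2 and makes it clear why the result is "working as designed" rather than a decoding failure.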

So this is working as designed. 

You can (of course!) change the way it works, but you'll need to edit ps2ascii.ps. Look for the section commented '% Encode the ISO accented characters.'. That currently breaks the ISO Latin named accented glyphs into their two components, then glues the resulting characters back together to make a two-character string. You'll need to change all of that.

A quick and dirty solution would be to add something like:

mark 
/eacute <XX> 
/egrave <XX>
.chars.def

and so on, where XX is the hexadecimal value of the character you want to use. Define this *after* the loop defining the ISO Latin characters.
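
Working out the XX values by hand is error-prone, so here is a small Python sketch that generates the override lines for a chosen output encoding. It is an illustration under assumptions: the helper name chars_def_lines and the glyph selection are hypothetical, and whether ps2ascii.ps writes multi-byte UTF-8 strings through unchanged is not confirmed in this thread (it does already build two-character strings for accents).

```python
# Glyph name -> character. The names are the usual Adobe glyph names;
# this particular selection is only an example.
GLYPHS = {
    "eacute": "\u00e9",    # e with acute
    "egrave": "\u00e8",    # e with grave
    "agrave": "\u00e0",    # a with grave
    "ccedilla": "\u00e7",  # c with cedilla
}

def chars_def_lines(glyphs, encoding="utf-8"):
    """Return the lines of a 'mark ... .chars.def' block as a list of strings.

    Each character is encoded with `encoding` and written as a
    PostScript hex string, e.g. /eacute <C3A9> for UTF-8.
    """
    lines = ["mark"]
    for name, ch in glyphs.items():
        hex_bytes = ch.encode(encoding).hex().upper()
        lines.append(f"/{name} <{hex_bytes}>")
    lines.append(".chars.def")
    return lines

if __name__ == "__main__":
    print("\n".join(chars_def_lines(GLYPHS)))
```

Passing encoding="latin-1" instead gives single-byte values (for example <E9> for eacute), which matches the "extended ASCII" range mentioned earlier.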

Closing as 'worksforme'.
Comment 5 Harry McKame 2010-07-28 14:30:46 UTC
Ken,

Thanks for the info.

Can I find somewhere some information about the format of ps2ascii.ps?

Thanks
Comment 6 Ken Sharp 2010-07-28 14:39:05 UTC
(In reply to comment #5)

> Can I find somewhere some information about the format of ps2ascii.ps?

It's a PostScript program; the only documentation is the comments within the file itself.
Comment 7 Harry McKame 2010-07-28 15:15:47 UTC
Created attachment 6570
ps2utf8.ps
Comment 8 Harry McKame 2010-07-28 15:28:32 UTC
Created attachment 6571
DjVuLibre ps2utf8.ps
Comment 9 Harry McKame 2010-07-28 15:33:35 UTC
Ken,

Per your instructions, I have created the file ps2utf8.ps based upon ps2ascii.ps. The character-mappings were adapted from the DjVuLibre open-source project.

The result seems to work and does convert French PDFs into UTF-8 text, at least on the (few) files that I have tested. I have not tested other languages, although all European languages are supposedly supported. However, I don't have the knowledge to verify whether what I did was entirely correct.

I have also not programmed the addition of a BOM (byte order mark) under Windows, since I don't know how. This ideally should be optional.
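
One way to make the BOM optional without touching the PostScript is to add it in a separate step after extraction. The Python sketch below is only an illustration of that idea and is not part of ps2utf8.ps; the function name add_utf8_bom and its enabled flag are hypothetical.

```python
import codecs

def add_utf8_bom(data: bytes, enabled: bool = True) -> bytes:
    """Prepend the UTF-8 byte order mark (EF BB BF) to already-encoded text.

    Returns the data unchanged when the BOM is disabled or already present.
    """
    if not enabled or data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

if __name__ == "__main__":
    extracted = "pr\u00e9sidera".encode("utf-8")
    print(add_utf8_bom(extracted).hex())
    # prints: efbbbf7072c3a97369646572 61 without the space, i.e. BOM then text
```

In practice the bytes would be read from the extracted text file in binary mode, passed through the function, and written back.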

The file ps2utf8.ps was uploaded as attachment. I have also uploaded the original DjVuLibre file, for reference, as it takes some research to find.

In the hope that you will find this useful, and might even be inspired to add a ps2utf8 encoding to GS ...
Comment 10 Ken Sharp 2010-07-28 15:58:05 UTC
(In reply to comment #9)
> Ken,
> 
> Per your instructions, I have created the file ps2utf8.ps based upon
> ps2ascii.ps. The character-mappings were adapted from the DjVuLibre open-source
> project.

Wow, you've gone a lot further than anything I was suggesting...

> The file ps2utf8.ps was uploaded as attachment. I have also uploaded the
> original DjVuLibre file, for reference, as it takes some research to find.
> 
> In the hope that you will find this useful, and might even be inspired to add a
> ps2utf8 encoding to GS ...

Thanks for that. If anyone else requests it, I can point them to this. One day we hope to have a more functional text extraction device, written in C and able to deal with things like ToUnicode CMaps, which will allow processing of some CIDFonts.