Bug 696874 - Rotated OCR text mangled after passing through pdfwrite
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.18
Hardware: All All
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2016-06-28 12:28 UTC by James R Barlow
Modified: 2016-07-04 23:59 UTC (History)


Attachments
in1.pdf input PDF produced by Tesseract 3.05, out1.pdf after ghostscript (403.31 KB, application/zip)
2016-06-28 12:28 UTC, James R Barlow

Description James R Barlow 2016-06-28 12:28:41 UTC
Created attachment 12642 [details]
in1.pdf input PDF produced by Tesseract 3.05, out1.pdf after ghostscript

OCR text generated by Tesseract 3.04 or 3.05 seems to be mangled by Ghostscript.

The OCR text of in1.pdf begins as follows – no problems here:

    JP. Morgan (Suisse) SA

    Account n“ 7973101
    Geneva, 3rd June 2016

The PDF is searchable and text can be selected in Acrobat without issue.

The problem manifests after passing the file through Ghostscript 9.18 to refry the PDF:

    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o out1.pdf in1.pdf

The OCR text is mangled in three ways: a space is inserted after each recognized letter, line breaks appear after certain words (as reported by pdftotext), and the spaces between words are lost, so that word boundaries disappear (for example "3rdJune2016"). Normally one would pass other parameters to Ghostscript, such as those for PDF/A conversion, but the OCR text is mangled regardless of the parameters. Acrobat's text highlighting shows that it can't find word boundaries, and text search is broken.

J P .

M o r g a n

A c c o u n t n

( S u i s s e ) S A

7 9 7 3 1 0 1

G e n e v a , 3 r d J u n e 2 0 1 6


This is a sample of text in the uncompressed PDF before Ghostscript:

BT
3 Tr 1 0 0 1 82.2 512.8 Tm /f-0-0 10 Tf 140.64 Tz [ <0045><0058><0050><0045><0052><0049><0045><004E><0043><0045> ] TJ 77.16 0 Td 156.802 Tz [ <0041><004E><0044> ] TJ 30.6 0 Td 129.334 Tz [ <0052><0045><004C><0041><0054><0049><0056><0049><0054> ] TJ 58.8 0 Td 141.6 Tz [ <0059> ] TJ 30.36 0 Td 94.8 Tz [ <0035><0039> ] TJ 
ET

and after Ghostscript:

BT
/R10 10 Tf
1.4064 0 0 1 82.2 512.8 Tm
3 Tr
[(�E)-500(�X)-500(�P)-500(�E)-500(�R)-500(�I)-500(�E)-500(�N)-500(�C)-500(�E)-500]TJ
1.56802 0 0 1 159.36 512.8 Tm
[(�A)-500(�N)-500(�D)-500]TJ
1.29334 0 0 1 189.96 512.8 Tm
[(�R)-500(�E)-500(�L)-500(�A)-500(�T)-500(�I)-500(�V)-500(�I)-500(�T)-500]TJ
1.416 0 0 1 248.76 512.8 Tm
[(�Y)-500]TJ
0.948 0 0 1 279.12 512.8 Tm
[(�5)-500(�9)-500]TJ

I tried deskewing the image and running OCR again. The resulting PDF makes it through Ghostscript/pdfwrite unscathed, so the problem has something to do with how Tesseract's PDF output describes rotated OCR text, or with how pdfwrite replicates it.

See also https://github.com/tesseract-ocr/tesseract/issues/357
Comment 1 James R Barlow 2016-06-28 12:30:13 UTC
To be clear, by rotated text I mean skewed to small angles, not cardinal angles.
Comment 2 James R Barlow 2016-07-01 13:56:57 UTC
I tried checking how Acrobat XI's Optimize PDF feature affects the issue.

Tests:

1. Tesseract 3.05 -> Acrobat Optimize PDF -> content stream altered but OCR still usable [PASS]

2. Tesseract -> Acrobat -> Ghostscript -> content stream altered, OCR not usable [FAIL]

In both cases the optimize settings were "convert to PDF 1.7" and "compress JBIG2 lossless", just to force it to do something. 

These results indicate to me that there is something wrong with how pdfwrite deals with OCR text, rather than with the output from Tesseract.
Comment 3 Marcos H. Woehrmann 2016-07-03 21:41:11 UTC
Commit f25436308ce95a714319c572b7fa2f571ef5e84b changed the behaviour of pdfwrite: it no longer adds extra spaces between characters, but the text still isn't searchable, and cutting and pasting results in non-ASCII characters.
Comment 4 Ken Sharp 2016-07-04 00:37:36 UTC
(In reply to James R Barlow from comment #0)

My standard disclaimer on all 'refrying' of PDFs applies here. You should read the 'Overview' in VectorDevices.htm in our documentation. A copy can be found online at:

http://www.ghostscript.com/doc/9.19/VectorDevices.htm

Now, the visual appearance of the output file is undisturbed after processing, no matter what version of Ghostscript is used. That's because the 'text' in question here is actually invisible; it is rendered in Text Rendering Mode 3.
If the text were rendered, it would actually be rendered correctly though.

Now, copy/paste/search in Acrobat relies upon non-marking content of the PDF
file, specifically the ToUnicode CMap. This is optional, and if not present
Acrobat has various fallback strategies with descending likelihood of success.
In this case the invisible text uses a CIDFont which is extremely unlikely to be searchable without a ToUnicode CMap.

(In reply to Marcos H. Woehrmann from comment #3)
> Commit f25436308ce95a714319c572b7fa2f571ef5e84b changed the behaviour of
> pdfwrite, now it no longer adds extra spaces between characters, but the
> text still isn't searchable and cutting and pasting results in non-ascii
> characters.

The ToUnicode CMap generation had several problems. The bfrange code generated a .CodeMapData in which only the top 255 entries of the range were actually created but, due to the bug fixed by the noted commit, the high byte was ignored. This combination of bugs led to a situation where the generated ToUnicode CMap on the output *appears* correct, but only for Latin characters (0->255). Any characters outside that range (and the supplied file contains 2 such characters) were not carried forward into the re-generated ToUnicode CMap.
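
To illustrate why a dropped high byte goes unnoticed for Latin text, here is a minimal Python sketch. It is not Ghostscript code: the helper `expand_bfrange`, the range 0x0000..0x01FF and the code 0x0152 are all hypothetical, chosen only to show the effect. A bfrange entry maps consecutive character codes to consecutive Unicode code points, so codes in 0->255 look identical with or without the high byte, while anything above that silently maps to the wrong character.

```python
def expand_bfrange(src_lo, src_hi, dst_lo):
    """Expand one bfrange entry: consecutive codes map to consecutive code points."""
    return {code: chr(dst_lo + (code - src_lo)) for code in range(src_lo, src_hi + 1)}

# Hypothetical range of two-byte codes, mapped to the same Unicode code points.
to_unicode = expand_bfrange(0x0000, 0x01FF, 0x0000)

latin_code = 0x0045   # 'E', inside 0..255
other_code = 0x0152   # hypothetical code outside 0..255

# Lookup using both bytes of the code.
full_latin = to_unicode[latin_code]
full_other = to_unicode[other_code]

# Lookup with the high byte ignored (assumed model of the faulty behaviour).
truncated_latin = to_unicode[latin_code & 0xFF]
truncated_other = to_unicode[other_code & 0xFF]

print(full_latin == truncated_latin)   # Latin codes are unaffected
print(full_other == truncated_other)   # codes above 255 no longer match
```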

Commit 389657cf69b2a9c612442c8222236e3ce6f869ca addresses the problem with the bfrange limitation in the CMap decoding. In combination with commit f25436308ce95a714319c572b7fa2f571ef5e84b this now generates an adequate ToUnicode CMap for the output.

However commits:
9dba57f0f9a53c130ec2771c0ed1d7bd6bbef6ab, 0124e1a5e635abc4cb65e53ca13930e3b95499fe, d5f17bbf37e4090952080423e3aa35c29c2751f5, 5a7a7f4ed4effb576368c7b923c4aa95aef0f5b4

are also relevant and should be applied as well.




> The OCR text is mangled by the insertion of spaces after each recognized
> letter

Not so. The text is unaffected; the ToUnicode CMap is regenerated incorrectly, and so Acrobat decodes the 2-byte character codes as two single-byte characters. The 'space after each recognised letter' is in fact a NULL before each recognised letter.
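
A small Python sketch of that effect, using the first three character codes from the input content stream quoted in the description (<0045><0058><0050>). It assumes, for illustration only, that each code maps to the Unicode code point of the same value, which holds for the Latin letters shown here; it does not model what Acrobat does internally.

```python
# Two-byte character codes for "EXP", taken from the input content stream.
raw = bytes.fromhex("004500580050")

# Intended interpretation: each character code is two bytes wide (big-endian).
codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
as_two_byte_text = "".join(chr(code) for code in codes)

# Faulty interpretation: every byte is treated as its own character code,
# so a NUL (0x00) appears in front of each letter.
as_single_byte_text = "".join(chr(byte) for byte in raw)

print(repr(as_two_byte_text))
print(repr(as_single_byte_text))
```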

> and line breaks after certain words (from pdftotext), and loss of
> spaces so that the word boundaries are gone "3rdJune2016".

PDF files do not contain any information about word or paragraph boundaries, line breaks or any other such metadata. Text in a PDF file is simply font selection and sizing, character codes and positions of said character codes on the page.

Word delimiting, paragraph and line breaks etc are all derived heuristically by the Acrobat engine.

Note that neither the original file nor the output file contains any space character codes; the spaces are achieved by drawing the text at locations further along the page in the x direction.

As noted in our documentation, the way that Ghostscript and the pdfwrite device work mean that the output file content may not bear any relation to the input content. There are many ways to draw text on the page which result in the same final appearance, but differ significantly in which operators are used and what arguments are passed to those operators.
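
The two content streams quoted in the description are an instance of this. The following Python sketch, using only the numbers from those excerpts, accumulates the input's relative `Td` offsets from the initial `Tm` position and folds each `Tz` percentage into a horizontal scale factor. The results match the absolute `Tm` matrices that pdfwrite emitted. The variable names are illustrative, and the rounding is there only to tidy floating-point noise.

```python
# Input stream: "1 0 0 1 82.2 512.8 Tm" sets the start of the text line.
start_x, start_y = 82.2, 512.8

# tx operands of the successive "tx 0 Td" operators in the input.
td_offsets = [77.16, 30.6, 58.8, 30.36]

# Operands of the successive Tz (horizontal scaling, in percent) operators.
tz_percent = [140.64, 156.802, 129.334, 141.6, 94.8]

# Each Td moves the start of the line relative to the previous line start.
xs = [start_x]
for tx in td_offsets:
    xs.append(round(xs[-1] + tx, 2))

# Equivalent absolute matrices: the Tz percentage becomes the x scale.
matrices = [(round(tz / 100, 5), 0, 0, 1, x, start_y)
            for tz, x in zip(tz_percent, xs)]

for matrix in matrices:
    print(" ".join(f"{value:g}" for value in matrix), "Tm")
```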

The PDF file output by pdfwrite, using current code, is perfectly valid, and has a valid and correct ToUnicode CMap. Acrobat can now copy and paste the text, as expected. It does *not* honour the spaces between words for the simple reason that there are no spaces between the words, and the way that we construct our output appears to defeat the Acrobat heuristic.

There is nothing we can do about this.

NB: there is nothing in our output or the Tesseract output that relates to the text being other than horizontal; the problems are all related to the ToUnicode generation. So it seems unlikely that the slight page rotation has any bearing on the matter, although since we didn't get the non-rotated example I can't comment on it.
Comment 5 James R Barlow 2016-07-04 16:26:55 UTC
Ken, thanks for the detailed explanation. I am curious - does Acrobat text search work when the latest ghostscript refries the PDF?

I double-checked the deskewed file. Acrobat can highlight words in the deskewed example, where in the skewed one it cannot even join contiguous letters. However, the copy-paste text is still unusable, so it is essentially the same issue after all, it just gets further along in Acrobat's heuristic.
Comment 6 Ken Sharp 2016-07-04 23:59:56 UTC
(In reply to James R Barlow from comment #5)
> Ken, thanks for the detailed explanation. I am curious - does Acrobat text
> search work when the latest ghostscript refries the PDF?

At least partially. I didn't try an exhaustive comparison because there's no real point (there's nothing we can really do to improve someone else's heuristic detection), but I did check that a few individual characters were detected.

Since the ToUnicode CMap is now correct, it should at least find all the individual characters.

 
> I double-checked the deskewed file. Acrobat can highlight words in the
> deskewed example, where in the skewed one it cannot even join contiguous
> letters. However, the copy-paste text is still unusable, so it is
> essentially the same issue after all, it just gets further along in
> Acrobat's heuristic.

I imagine the technique Acrobat is using to detect that words are continuous is to determine the start/end of each glyph and look for differences in the x and y co-ordinates. The problem with that is that if the baseline isn't horizontal then there will be unexpected differences in y.

A smarter approach would examine the start and end of each glyph to establish a baseline then extend the vector and check all the other characters (with fuzzy matching).

But realistically I have no idea what Acrobat actually does here.
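
For illustration only, the 'smarter approach' described above could look something like the Python sketch below. It says nothing about what Acrobat does. The function name `on_same_baseline`, the tolerance of 0.5 units, the 2 degree skew and the sample glyph positions are all assumptions made up for the example. The baseline is taken from the start and end of the first glyph, extended as a line, and each later glyph start is accepted if its perpendicular distance from that line is within the tolerance.

```python
import math

def on_same_baseline(first_start, first_end, points, tolerance=0.5):
    """Return, for each point, whether it lies near the extended baseline."""
    dx = first_end[0] - first_start[0]
    dy = first_end[1] - first_start[1]
    length = math.hypot(dx, dy)
    results = []
    for px, py in points:
        # Perpendicular distance from the point to the line through the
        # first glyph's start and end (2D cross product over line length).
        cross = dx * (py - first_start[1]) - dy * (px - first_start[0])
        results.append(abs(cross) / length <= tolerance)
    return results

# Hypothetical page skewed by 2 degrees: the first glyph is 6 units wide.
angle = math.radians(2.0)
first_start = (0.0, 0.0)
first_end = (6.0 * math.cos(angle), 6.0 * math.sin(angle))

# One glyph further along the same skewed line, and one on another line.
same_line_glyph = (100.0, 100.0 * math.tan(angle))
other_line_glyph = (100.0, 10.0)

fuzzy = on_same_baseline(first_start, first_end,
                         [same_line_glyph, other_line_glyph])

# Naive check that only compares y co-ordinates with the same tolerance.
naive = [abs(y - first_start[1]) <= 0.5
         for _, y in (same_line_glyph, other_line_glyph)]

print(fuzzy, naive)
```

With these made-up numbers the naive comparison rejects the glyph that sits on the skewed line, because its y co-ordinate has drifted by about 3.5 units, while the baseline test accepts it.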