Bug 708548 - PDFwrite OCR returns exotic characters on standard Latin script PDF file
Summary: PDFwrite OCR returns exotic characters on standard Latin script PDF file
Status: RESOLVED INVALID
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 10.05.1
Hardware: PC Windows 10
Importance: P2 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-05-18 22:55 UTC by Pavel Hanak
Modified: 2025-06-03 10:11 UTC

See Also:
Customer:
Word Size: ---


Attachments
Samples with OCR exotic output (221.32 KB, application/x-zip-compressed)
2025-05-18 22:55 UTC, Pavel Hanak

Description Pavel Hanak 2025-05-18 22:55:28 UTC
Created attachment 26807
Samples with OCR exotic output

This should be regarded as a follow-up to bug #708547, because it occurs in the same PDF documents. It happens with AGPL Ghostscript 10.05.1 and Tesseract 5.4.0.

I know this may be a Tesseract bug, but PDFwrite OCR returned quite exotic characters when I tried to use it to fix broken text encoding ("mojibake") in standard Latin script PDFs. I'm attaching two sample PDF files; they have the same base as the samples in bug #708547, but the first is reduced to a single paragraph and the second is further reduced to a single line. Hopefully, this will help you find the problem faster. I used this command:

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage="ces" -sOutputFile=Sample_paragraph_out.pdf Sample_paragraph.pdf -c quit

Specifically, these characters are consistently recognized incorrectly:

1. Glyph G132, Double Low-9 Quotation Mark (U+201E), was recognized as Ol Chiki Letter Al (U+1C5E).
2. Glyph G147, Left Double Quotation Mark (U+201C), was recognized as Hangul Choseong Nieun-Cieuc (U+115C).

Note that Tesseract knows the sample language via the -sOCRLanguage switch. The U+1C5E character in particular looks completely different from the U+201E quotation mark, and this wrong recognition happened repeatedly and consistently throughout the entire text.

Interestingly, both quotation marks get OCRed correctly when I flatten the samples to a bitmap with the pdfocr8 device:

gswin64c -dNOPAUSE -sDEVICE=pdfocr8 -sUseOCR=Always -sOCRLanguage="ces" -sOutputFile=Sample_paragraph_bitmap.pdf Sample_paragraph.pdf -c quit 

This suggests there may be some deeper problem with how Ghostscript interfaces with Tesseract in the vector OCR mode.
Comment 1 Ken Sharp 2025-06-03 10:11:41 UTC
(In reply to Pavel Hanak from comment #0)

> I know this may be a Tesseract bug, but PDFwrite OCR returned quite exotic
> characters when I tried to use it to fix broken text encoding ("mojibake")
> in standard Latin script PDFs.

I'm not convinced this is a bug in either Tesseract or Ghostscript; I believe it is a consequence of the way we have to use Tesseract when writing PDF files and retaining them as PDF, rather than simply rendering to a bitmap and turning that into a PDF.

So, a couple of points:

Firstly, the Unicode code points returned to us by Tesseract are the ones you see in the ToUnicode CMap, so it isn't the pdfwrite device making any kind of error.
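
For illustration, the relevant part of a ToUnicode CMap in the output would look something like the following sketch (the character codes <02> and <03> are invented for the example; the Unicode values are simply whatever Tesseract handed back):

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<02> <1C5E>   % the reporter expected this code to map to <201E>
<03> <115C>   % the reporter expected this code to map to <201C>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end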

Secondly, I note that using the standard English training data instead of the Czech gives a different answer. In fact, removing some of the text also gives a different answer.
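
For reference, that test is just the reporter's command with the English traineddata selected (the output filename here is illustrative):

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage="eng" -sOutputFile=Sample_paragraph_eng.pdf Sample_paragraph.pdf -c quit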


> Note that Tesseract knows the sample language via the -sOCRLanguage switch.
> The U+1C5E character in particular looks completely different from the
> U+201E quotation mark, and this wrong recognition happened repeatedly and
> consistently throughout the entire text.
> 
> Interestingly, both quotation marks get OCRed correctly when I flatten the
> samples to a bitmap with the pdfocr8 device:
> 
> gswin64c -dNOPAUSE -sDEVICE=pdfocr8 -sUseOCR=Always -sOCRLanguage="ces"
> -sOutputFile=Sample_paragraph_bitmap.pdf Sample_paragraph.pdf -c quit 
> 
> This suggests there may be some deeper problem with how Ghostscript
> interfaces with Tesseract in the vector OCR mode.

No. The pdfocr* devices and the pdfwrite device work entirely differently with Tesseract; they are forced to by the difference in behaviour of the devices.

For the pdfocr devices we render to an image, pass that image to Tesseract in its entirety, and then write the resulting 'text' back on top of the image using text rendering mode 3 (neither stroke nor fill nor clip) and a minimalist font which makes no marks anyway, while assigning Unicode code points to the character codes.
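
Schematically, a page produced this way contains content along these lines (a hand-written sketch, not actual device output; the /Im0 and /F0 resource names are invented):

q
612 0 0 792 0 0 cm
/Im0 Do                  % the full-page rendered bitmap
Q
BT
3 Tr                     % text rendering mode 3: invisible text
/F0 10 Tf                % the minimalist font that makes no marks
1 0 0 1 72 720 Tm
<0041> Tj                % codes carry meaning only via the ToUnicode CMap
ET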

For pdfwrite we do not want to create a bitmap; we want to retain, as far as possible, the PDF content. The way pdfwrite works is outlined here: https://ghostscript.readthedocs.io/en/latest/VectorDevices.html

We cannot create an image from the entire page, and even if we did we wouldn't be able to match the returned data from Tesseract to the fonts and character codes that we are embedding into the output PDF.


Instead we 'cheat'. We can, for various reasons, render the characters into the glyph cache, even though we are not in general doing any rendering. So for every glyph in the output, we render it to a bitmap and store it in the glyph cache. When we 'draw' the text we only get a portion of the entire text at once. For example, the top line in sample_line.pdf looks like this:

0.34 Tc 10.195 Tw 219.499 599.59 Td
(\002\034\t\022\b\002\031%78\r\t\(\n\025\002\022\016\(\b)Tj

So we take the cached glyph bitmaps and assemble them, in order, into a 'strip' of image data. We then pass that to Tesseract and get back the Unicode values. We then pick that result apart to assign a Unicode code point to each glyph, from there to each character code, and from that create a ToUnicode CMap for the font currently being used (note that PDF cannot change font in the middle of an argument to a text operator).
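
For the curious, a standalone sketch of that 'strip' idea using Tesseract's C API looks roughly like this. It is not Ghostscript's actual code (the real glue is internal and considerably richer); the equal-height 8-bit greyscale glyph bitmaps and the single-line page segmentation mode are simplifying assumptions:

#include <stdlib.h>
#include <string.h>
#include <tesseract/capi.h>

/* OCR a run of cached glyph bitmaps as one strip.
   Assumes each glyph is 8-bit greyscale, dark on light, of height
   'height' and width widths[i], with rows packed and no padding.
   Returns UTF-8 text to be freed with TessDeleteText(), or NULL. */
char *ocr_glyph_strip(const unsigned char **glyphs, const int *widths,
                      int nglyphs, int height, const char *lang)
{
    int strip_w = 0, i, x = 0, y;
    unsigned char *strip;
    TessBaseAPI *api;
    char *utf8 = NULL;

    for (i = 0; i < nglyphs; i++)
        strip_w += widths[i];
    strip = malloc((size_t)strip_w * height);
    if (strip == NULL)
        return NULL;

    /* Paste the glyphs side by side, preserving the order of the
       character codes in the original Tj string. */
    for (i = 0; i < nglyphs; i++) {
        for (y = 0; y < height; y++)
            memcpy(strip + (size_t)y * strip_w + x,
                   glyphs[i] + (size_t)y * widths[i], widths[i]);
        x += widths[i];
    }

    api = TessBaseAPICreate();
    if (TessBaseAPIInit3(api, NULL, lang) == 0) {
        /* Tell Tesseract the strip is a single line of text. */
        TessBaseAPISetPageSegMode(api, PSM_SINGLE_LINE);
        TessBaseAPISetImage(api, strip, strip_w, height, 1, strip_w);
        utf8 = TessBaseAPIGetUTF8Text(api);
    }
    TessBaseAPIDelete(api);
    free(strip);
    return utf8;
}

Ghostscript then has to walk the returned UTF-8 buffer and line the code points up with the glyphs, which is where a misrecognition becomes a wrong ToUnicode entry.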


From what little I understand of the way Tesseract works, it uses predictive text algorithms to try to improve its detection rate. So the more data you give it, the more likely it is to get the correct result. Obviously, passing a page at a time to Tesseract gives it **much** more to work with than a few characters at a time.

In our initial investigation we tried passing each individual glyph bitmap to Tesseract and getting back a Unicode value. This was much simpler to implement and we thought it would give a good result, since the bitmaps are (obviously) very clean. The error rate was about 25%. That was clearly unacceptable, which is why we moved to the more complicated method described above, which gives a higher success rate.
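
(In terms of the sketch above, that early approach amounts to calling the same routine with nglyphs = 1 and the PSM_SINGLE_CHAR segmentation mode, which deprives Tesseract of any surrounding context.)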

There is no prospect of us being able to pass more data to Tesseract from pdfwrite, which means there is no possibility of getting a better success rate from it.

Note that I do not believe there is a bug as such in Tesseract either. This is simply a consequence of the way that pdfwrite and Tesseract work together.