Bug 708547 - Crash when performing OCR on certain PDF pages
Summary: Crash when performing OCR on certain PDF pages
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 10.05.1
Hardware: PC Windows 10
Importance: P2 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-05-18 11:24 UTC by Pavel Hanak
Modified: 2025-06-03 15:29 UTC
CC List: 0 users

See Also:
Customer:
Word Size: ---


Attachments
OCR crash samples (6.36 MB, application/x-zip-compressed)
2025-05-18 11:24 UTC, Pavel Hanak
reduced file (7.57 KB, application/pdf)
2025-06-03 15:22 UTC, Ken Sharp

Description Pavel Hanak 2025-05-18 11:24:06 UTC
Created attachment 26806
OCR crash samples

The pdfwrite device can perform OCR while preserving the vector contents of the source file:

https://ghostscript.readthedocs.io/en/latest/Devices.html#vector-pdf-output-with-ocr-unicode-cmaps

This didn't work at all in 10.03.1 due to a bug, but it was fixed in the current AGPL release, 10.05.1. I have now tried to use OCR to fix PDF files with garbled text encoding, but GS crashes when processing certain pages. I've encountered this crash on more than a dozen files so far. There is no error message in the console, and the output PDF is corrupted. I'm using Tesseract 5.4.0 because that's the last version available as a Windows installer. Note that I'm using the Czech language pack because the input PDFs are in that language. The exact command is:

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage="ces" -sOutputFile=Sample_full_out.pdf Sample_full.pdf -c quit

I'm attaching two sample PDF files: one with the full 46 pages, where GS always crashes on page 17, and a second one shortened to pages 16 to 18, where GS again crashes on page 17. I'm also attaching the GS console output and the output PDF files, but both output PDFs are corrupted and unreadable.
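
For anyone trying to narrow down a similar crash, the following is a minimal Python sketch of running the command above one page at a time and recording which pages fail. It is not part of the original report. The helper names (build_gs_command, run_command, find_failing_pages) and the per-page output file names are hypothetical. It assumes that a crash shows up as a non-zero exit code, and it uses Ghostscript's -dFirstPage and -dLastPage switches to restrict each run to a single page.

```python
import subprocess
from typing import Callable, List


def build_gs_command(input_pdf: str, output_pdf: str, page: int,
                     language: str = "ces",
                     gs_exe: str = "gswin64c") -> List[str]:
    """Build the OCR command from the report, limited to a single page."""
    return [
        gs_exe,
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",
        f"-sOCRLanguage={language}",
        f"-dFirstPage={page}",   # process only this page
        f"-dLastPage={page}",
        f"-sOutputFile={output_pdf}",
        input_pdf,
        "-c",
        "quit",
    ]


def run_command(cmd: List[str]) -> int:
    """Run the command and return its exit code (assumption: non-zero means failure)."""
    return subprocess.run(cmd, capture_output=True).returncode


def find_failing_pages(input_pdf: str, page_count: int,
                       run: Callable[[List[str]], int] = run_command) -> List[int]:
    """Return the 1-based page numbers for which the command exits with an error."""
    failing = []
    for page in range(1, page_count + 1):
        # Hypothetical output naming: one output file per page.
        cmd = build_gs_command(input_pdf, f"page_{page:03d}_out.pdf", page)
        if run(cmd) != 0:
            failing.append(page)
    return failing
```

Calling find_failing_pages("Sample_full.pdf", 46) would run Ghostscript 46 times. The run parameter exists so that the loop can be exercised without Ghostscript installed.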
Comment 1 Ken Sharp 2025-06-03 15:22:17 UTC
Created attachment 26838
reduced file

Added a much smaller sample file.
Comment 2 Ken Sharp 2025-06-03 15:29:11 UTC
Fixed in commit ebe87bf7b7971e6ec636216ec9bce9168ee83f40

The problem was confusion over what state the device was in when rendering a Type 3 font for OCR.

Note that we do not cache glyphs of this type (see bug #708548 for information on why this is relevant), so these glyphs are not subject to OCR.