Bug 708547

Summary: Crash when performing OCR on certain PDF pages
Product: Ghostscript
Reporter: Pavel Hanak <hanakp>
Component: PDF Writer
Assignee: Default assignee <ghostpdl-bugs>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 10.05.1   
Hardware: PC   
OS: Windows 10   
Customer:
Word Size: ---
Attachments: OCR crash samples
reduced file

Description Pavel Hanak 2025-05-18 11:24:06 UTC
Created attachment 26806
OCR crash samples

The pdfwrite device can perform OCR while preserving the vector content of the source file:

https://ghostscript.readthedocs.io/en/latest/Devices.html#vector-pdf-output-with-ocr-unicode-cmaps

This didn't work at all in 10.03.1 due to a bug, but it was fixed in the current AGPL release, 10.05.1. I have now tried to use OCR to fix PDF files with garbled text encoding, but GS crashes when processing certain pages. I've encountered this crash on more than a dozen files so far. There is no error message in the console and the output PDF is corrupted. I'm using Tesseract 5.4.0, because that's the last version available as a Windows installer. Note that I'm using the Czech language pack, because the input PDFs are in that language. The exact command is:

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage="ces" -sOutputFile=Sample_full_out.pdf Sample_full.pdf -c quit

I'm attaching two sample PDF files: the first is the full 46-page file, on which GS always crashes on page 17; the second is shortened to pages 16 to 18, and GS again crashes on page 17. I'm also attaching the GS console output and the output PDF files, but both are corrupted and unreadable.
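
For anyone trying to narrow down which pages trigger the crash, the following is a minimal sketch (not part of the original report) that reruns the command above one page at a time. It assumes that a crash shows up as a non-zero exit status and that the standard Ghostscript switches -dFirstPage and -dLastPage are used to restrict processing to a single page. The function names, the `page_N_out.pdf` output naming, and the injectable `run` parameter are illustrative choices only.

```python
import subprocess


def build_ocr_command(src, dst, page=None, language="ces", gs="gswin64c"):
    """Build the pdfwrite OCR command from the report, optionally for one page."""
    cmd = [
        gs,
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",
        f"-sOCRLanguage={language}",
    ]
    if page is not None:
        # Restrict processing to a single page of the input PDF.
        cmd += [f"-dFirstPage={page}", f"-dLastPage={page}"]
    cmd += [f"-sOutputFile={dst}", src, "-c", "quit"]
    return cmd


def find_failing_pages(src, pages, run=subprocess.run):
    """Return the pages for which Ghostscript exits with a non-zero status.

    Assumption: a crash surfaces as a non-zero return code. `run` is
    injectable so the loop can be exercised without Ghostscript installed.
    """
    failing = []
    for page in pages:
        dst = f"page_{page}_out.pdf"  # hypothetical per-page output name
        result = run(build_ocr_command(src, dst, page=page))
        if result.returncode != 0:
            failing.append(page)
    return failing
```

With Ghostscript and Tesseract set up as described above, calling `find_failing_pages("Sample_full.pdf", range(1, 47))` would process each of the 46 pages separately and collect the ones that fail.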
Comment 1 Ken Sharp 2025-06-03 15:22:17 UTC
Created attachment 26838
reduced file

Added a much smaller sample file.
Comment 2 Ken Sharp 2025-06-03 15:29:11 UTC
Fixed in commit ebe87bf7b7971e6ec636216ec9bce9168ee83f40

The problem was confusion over what state the device was in when rendering a type 3 font for OCR.

Note that we do not cache glyphs of this type (see bug #708548 for information on why this is relevant), so these glyphs are not subject to OCR.