708547 – Crash when performing OCR on certain PDF pages

Summary: Crash when performing OCR on certain PDF pages

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	10.05.1
Hardware:	PC Windows 10

Importance:	P2 normal
Assignee:	Default assignee

URL:
Keywords:

Depends on:
Blocks:

Reported:	2025-05-18 11:24 UTC by Pavel Hanak
Modified:	2025-06-03 15:29 UTC
CC List:	0 users

See Also:
Customer:
Word Size:	---

Attachments
OCR crash samples (6.36 MB, application/x-zip-compressed) 2025-05-18 11:24 UTC, Pavel Hanak
reduced file (7.57 KB, application/pdf) 2025-06-03 15:22 UTC, Ken Sharp

Description Pavel Hanak 2025-05-18 11:24:06 UTC

Created attachment 26806
OCR crash samples

The pdfwrite device can perform OCR while preserving the vector content of the source file:

https://ghostscript.readthedocs.io/en/latest/Devices.html#vector-pdf-output-with-ocr-unicode-cmaps

This did not work at all in 10.03.1 because of a bug, but it was fixed in the current AGPL release, 10.05.1. I am now trying to use OCR to repair PDF files with garbled text encoding, but GS crashes when processing certain pages. I have encountered this crash on more than a dozen files so far. No error message appears in the console, and the output PDF is corrupted. I'm using Tesseract 5.4.0, because that is the most recent version available as a Windows installer. Note that I'm using the Czech language pack, because the input PDFs are in Czech. The exact command is:

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage="ces" -sOutputFile=Sample_full_out.pdf Sample_full.pdf -c quit
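For anyone scripting this reproduction, the command above could be wrapped roughly as follows. This is only a sketch: the helper names build_ocr_command and run_ocr are made up for illustration, and the assumption that a crash shows up as a non-zero exit code has not been verified against this bug. The Ghostscript switches are exactly the ones in the command above.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="ces", executable="gswin64c"):
    """Return the argument list for the pdfwrite OCR run quoted in this report.

    The shell quotes around "ces" are not needed here, because each list
    element is passed to the process as a single argument.
    """
    return [
        executable,
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",
        f"-sOCRLanguage={language}",
        f"-sOutputFile={output_pdf}",
        input_pdf,
        "-c",
        "quit",
    ]


def run_ocr(input_pdf, output_pdf, **kwargs):
    """Run Ghostscript and return (exit_code, console_output).

    Assumption: a crash is visible as a non-zero exit code.
    """
    result = subprocess.run(
        build_ocr_command(input_pdf, output_pdf, **kwargs),
        capture_output=True,
        text=True,
        errors="replace",  # console output may not decode cleanly
    )
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    code, log = run_ocr("Sample_full.pdf", "Sample_full_out.pdf")
    print(f"exit code: {code}")
    print(log)
```

Capturing the console output this way also makes it easy to attach the log to a report even when GS prints nothing before it exits.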

I'm attaching two sample PDF files: the full 46-page document, on which GS always crashes at page 17, and a shortened version containing only pages 16 to 18, on which GS again crashes at page 17. I'm also attaching the GS console output and the output PDF files, though both output PDFs are corrupted and unreadable.
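One possible way to locate the failing page in other affected files is to run the same OCR command one page at a time using Ghostscript's -dFirstPage and -dLastPage switches. The sketch below is not how the attached samples were produced; the helper names, the per-page output file names, and the assumptions that the crash reproduces on a single-page run and yields a non-zero exit code are all illustrative.

```python
import subprocess


def page_command(input_pdf, output_pdf, page, language="ces", executable="gswin64c"):
    """Build the OCR command from this report, restricted to a single page."""
    return [
        executable,
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",
        f"-sOCRLanguage={language}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={output_pdf}",
        input_pdf,
        "-c",
        "quit",
    ]


def default_runner(cmd):
    """Run the command and return its exit code (assumed non-zero on a crash)."""
    return subprocess.run(cmd, capture_output=True).returncode


def find_failing_pages(input_pdf, page_count, runner=default_runner):
    """Return the 1-based page numbers for which the runner reports failure."""
    failing = []
    for page in range(1, page_count + 1):
        # Hypothetical naming scheme for the per-page output files.
        cmd = page_command(input_pdf, f"page_{page:03d}.pdf", page)
        if runner(cmd) != 0:
            failing.append(page)
    return failing


if __name__ == "__main__":
    print(find_failing_pages("Sample_full.pdf", 46))
```

The runner is passed in as a parameter so the page loop can be checked without Ghostscript installed, for example with a stub that fails only on a chosen page.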

Comment 1 Ken Sharp 2025-06-03 15:22:17 UTC

Created attachment 26838
reduced file

Added a much smaller sample file.

Comment 2 Ken Sharp 2025-06-03 15:29:11 UTC

Fixed in commit ebe87bf7b7971e6ec636216ec9bce9168ee83f40

The problem was confusion over what state the device was in when rendering a type 3 font for OCR.

Note that we do not cache glyphs of this type (see bug #708548 for information on why this is relevant), so these glyphs are not subject to OCR.