Created attachment 26806 [details]
OCR crash samples

The pdfwrite device can perform OCR while preserving the vector content of the source file:
https://ghostscript.readthedocs.io/en/latest/Devices.html#vector-pdf-output-with-ocr-unicode-cmaps

This did not work at all in 10.03.1 due to a bug, but it is fixed in the current AGPL release, 10.05.1.

I am now trying to use OCR to repair PDF files with garbled text encoding, but GS crashes when processing certain pages. I have encountered this crash with more than a dozen files so far. No error message appears in the console, and the output PDF is corrupted.

I'm using Tesseract 5.4.0, because that is the last version available as a Windows installer. Note that I'm using the Czech language pack, because the input PDFs are in Czech.

The exact command is:

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage="ces" -sOutputFile=Sample_full_out.pdf Sample_full.pdf -c quit

I'm attaching two sample PDF files. The first is the full 46-page file, on which GS always crashes on page 17. The second is shortened to pages 16 to 18, and GS again crashes on page 17. I'm also attaching the GS console output and the output PDF files; both output PDFs are corrupted and unreadable.
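When a crash like this produces no console message, one way to narrow it to a single page is to rerun the same pdfwrite/OCR command once per page using Ghostscript's -dFirstPage and -dLastPage switches and record which runs exit abnormally. The following is only a minimal sketch under stated assumptions: gswin64c is assumed to be on PATH, a crash is assumed to show up as a nonzero exit status, -dBATCH is used in place of the trailing "-c quit", and the helper names build_gs_command and find_failing_pages are hypothetical, not part of Ghostscript.

```python
import subprocess


def build_gs_command(input_pdf, output_pdf, page, gs_exe="gswin64c", language="ces"):
    """Build the argument list for a single-page pdfwrite run with OCR.

    Hypothetical helper: it mirrors the command from the report and adds
    -dFirstPage/-dLastPage so that only one page is processed.
    """
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",  # exit when done, used here instead of "-c quit"
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",
        f"-sOCRLanguage={language}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={output_pdf}",
        input_pdf,
    ]


def find_failing_pages(input_pdf, first_page, last_page):
    """Run Ghostscript once per page and return the pages whose run failed.

    Assumption: a crash is reported as a nonzero process exit status.
    """
    failing = []
    for page in range(first_page, last_page + 1):
        cmd = build_gs_command(input_pdf, f"page_{page}_out.pdf", page)
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failing.append(page)
    return failing
```

For the full sample, a call such as find_failing_pages("Sample_full.pdf", 1, 46) would process each page separately and write one output file per page. Processing pages in isolation can change device state compared with a full run, so a page that crashes only in the full document may not be caught this way.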
Created attachment 26838 [details]
reduced file

Added a much smaller sample file.
Fixed in commit ebe87bf7b7971e6ec636216ec9bce9168ee83f40.

The problem was confusion over what state the device was in when rendering a type 3 font for OCR.

Note that we do not cache glyphs of this type (see bug #708548 for information on why this is relevant), so these glyphs are not subject to OCR.