Bug 705187

Summary: Ghostscript 9.56 removes hidden (e.g. OCR) text layers when refrying with NEWPDF=true
Product: Ghostscript
Reporter: James R Barlow <jim>
Component: PDF Interpreter
Assignee: Ken Sharp <ken.sharp>
Status: RESOLVED FIXED
Severity: normal
Priority: P4
Version: 9.56.0
Hardware: All
OS: All
Customer:
Word Size: ---
Attachments: pdf demonstrating issue

Description James R Barlow 2022-04-04 06:41:42 UTC
Created attachment 22368
pdf demonstrating issue

$ pdftotext graph_ocred.pdf -
<has some recognized text, mostly OCR gibberish>

$ ./gs-9560-linux-x86_64 -sDEVICE=pdfwrite -o graph_ocred_refry.pdf graph_ocred.pdf

$ pdftotext graph_ocred_refry.pdf -
<all text removed>

With Ghostscript 9.55.0, the text is not removed.
With Ghostscript 9.56.0 and -dNEWPDF=false, the text is also not removed.
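
For anyone who wants to script the comparison above, here is a rough Python sketch. It only wraps the two commands already shown (Ghostscript with -sDEVICE=pdfwrite, then pdftotext writing to stdout). The function names and the `gs` argument are my own placeholders, not part of either tool, and it assumes both binaries are on the PATH or given as full paths.

```python
import subprocess


def refry_command(gs, src, dst, newpdf=None):
    """Build the Ghostscript command line used in this report.

    gs     -- path to the Ghostscript binary (placeholder, e.g. the 9.56 build)
    newpdf -- None leaves the interpreter default; True/False adds -dNEWPDF=...
    """
    cmd = [gs, "-sDEVICE=pdfwrite"]
    if newpdf is not None:
        cmd.append("-dNEWPDF=" + ("true" if newpdf else "false"))
    cmd += ["-o", dst, src]
    return cmd


def extracted_text(pdf):
    """Return whatever pdftotext prints for the file, as raw bytes."""
    result = subprocess.run(
        ["pdftotext", pdf, "-"], capture_output=True, check=True
    )
    return result.stdout


def text_survives_refry(gs, src, dst, newpdf=None):
    """True if the source has extractable text and the refried file still does."""
    subprocess.run(refry_command(gs, src, dst, newpdf), check=True)
    # strip() also drops the form feeds pdftotext emits between pages
    return bool(extracted_text(src).strip()) and bool(extracted_text(dst).strip())
```

Calling `text_survives_refry("./gs-9560-linux-x86_64", "graph_ocred.pdf", "graph_ocred_refry.pdf")` and then again with `newpdf=False` should mirror the two 9.56.0 cases described above.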

The content stream from the -dNEWPDF=true version is reduced to an image draw - all of the text operations are removed.

The OCR text in this file was generated using hocr2pdf/hocrtransform or some variant. The script uses Python ReportLab to draw the text (possibly with text rendering mode 3), and then the image is drawn over the top. This is a pretty common method because it allows "select to highlight OCR" in some viewers.
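
To make the pattern concrete, here is a small Python sketch. The two content streams are hand-written illustrations, not extracted from the attachment: one has invisible text (3 Tr) followed by an image draw, the other has only the image draw, which is what the report describes for the -dNEWPDF=true output. The helper names are hypothetical, and the tokenizer is deliberately naive (it assumes an already-decoded stream and does not handle nested parentheses or inline images).

```python
import re

# Illustrative stream: invisible text (rendering mode 3), then the page image.
OCR_STYLE_STREAM = b"""\
BT
3 Tr
/F1 8 Tf
1 0 0 1 72 700 Tm
(recognized words) Tj
ET
q
612 0 0 792 0 0 cm
/Im0 Do
Q
"""

# Illustrative stream: only the image draw remains.
IMAGE_ONLY_STREAM = b"""\
q
612 0 0 792 0 0 cm
/Im0 Do
Q
"""

# Literal strings such as (some text), with backslash escapes, no nesting.
_LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)")

# PDF text-showing operators.
TEXT_SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}


def _tokens(stream):
    """Split a decoded content stream on whitespace after blanking strings."""
    return _LITERAL.sub(b" ", stream).split()


def count_text_show_ops(stream):
    """Count text-showing operators in the stream."""
    return sum(tok in TEXT_SHOW_OPS for tok in _tokens(stream))


def uses_invisible_text(stream):
    """True if the stream sets text rendering mode 3 ("3 Tr")."""
    toks = _tokens(stream)
    return any(a == b"3" and b == b"Tr" for a, b in zip(toks, toks[1:]))
```

Running `count_text_show_ops` on a decompressed page stream from the original and from the refried file is a quick way to confirm that the text operators themselves are gone, rather than pdftotext merely failing to extract them.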

Issue was found on Windows 10 with Ghostscript 9.56 (provenance unknown) and confirmed on Linux using Artifex's released binary for 9.56.

Thanks.

Comment 1 Ken Sharp 2022-04-04 13:30:10 UTC
Thanks for the report. This is fixed in commit fa895673a942caefb81efe1c922407a46d6780c9 and will probably be pulled into a forthcoming 9.56.1 patch release.