Bug 705187 - Ghostscript 9.56 removes hidden (e.g. OCR) text layers when refrying with NEWPDF=true
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: 9.56.0
Hardware: All All
Importance: P4 normal
Assignee: Ken Sharp
 
Reported: 2022-04-04 06:41 UTC by James R Barlow
Modified: 2022-04-04 13:30 UTC (History)

Attachments
pdf demonstrating issue (84.43 KB, application/pdf)
2022-04-04 06:41 UTC, James R Barlow

Description James R Barlow 2022-04-04 06:41:42 UTC
Created attachment 22368
pdf demonstrating issue

$ pdftotext graph_ocred.pdf -
<has some recognized text, mostly OCR gibberish>

$ ./gs-9560-linux-x86_64 -sDEVICE=pdfwrite -o graph_ocred_refry.pdf graph_ocred.pdf

$ pdftotext graph_ocred_refry.pdf -
<all text removed>

With Ghostscript 9.55.0, the text is not removed.
With Ghostscript 9.56.0 and -dNEWPDF=false, the text is also not removed.
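The comparison above can be scripted. The following is only a sketch: the function names are hypothetical, it assumes `pdftotext` and a Ghostscript binary are available, and it uses only the options already shown in this report (`-sDEVICE=pdfwrite`, `-o`, `-dNEWPDF=...`).

```python
import subprocess
from typing import List, Optional


def build_refry_command(gs: str, src: str, dst: str,
                        new_pdf: Optional[bool] = None) -> List[str]:
    """Build the pdfwrite command line used in the report.

    new_pdf=None leaves the interpreter choice to Ghostscript's default;
    True/False adds -dNEWPDF=true or -dNEWPDF=false explicitly.
    """
    cmd = [gs, "-sDEVICE=pdfwrite"]
    if new_pdf is not None:
        cmd.append("-dNEWPDF=" + ("true" if new_pdf else "false"))
    cmd += ["-o", dst, src]
    return cmd


def extract_text(pdf_path: str) -> str:
    """Return whatever text pdftotext finds ("-" sends it to stdout)."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            check=True, capture_output=True, text=True)
    return result.stdout


def text_layer_lost(original_text: str, refried_text: str) -> bool:
    """True when the input had extractable text and the output has none.

    Whitespace (including the form feed pdftotext emits per page) is ignored.
    """
    return bool(original_text.strip()) and not refried_text.strip()
```

A possible use: run `subprocess.run(build_refry_command("./gs-9560-linux-x86_64", "graph_ocred.pdf", "graph_ocred_refry.pdf"), check=True)`, then pass `extract_text()` of both files to `text_layer_lost()`. Repeating the run with `new_pdf=False` would exercise the workaround described above.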

The page content stream produced by the -dNEWPDF=true run is reduced to an image draw; all of the text operations are removed.
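For anyone wanting to check a content stream the same way, here is a rough sketch that counts text objects (`BT` ... `ET`) and XObject draws (`Do`) in an already-decompressed stream. The function name and the two sample streams are made up for illustration; they are not taken from the attachment. The regex scan is naive and would be fooled by operator-like bytes inside string literals or inline image data.

```python
import re

# BT ... ET brackets a text object; "/Name Do" paints an XObject such as an image.
_TEXT_OBJECT = re.compile(
    rb"(?<![A-Za-z])BT(?![A-Za-z]).*?(?<![A-Za-z])ET(?![A-Za-z])", re.DOTALL)
_XOBJECT_DRAW = re.compile(rb"/\S+\s+Do(?![A-Za-z])")


def summarize_content_stream(stream: bytes) -> dict:
    """Count text objects and XObject draws in a decompressed content stream."""
    return {
        "text_objects": len(_TEXT_OBJECT.findall(stream)),
        "xobject_draws": len(_XOBJECT_DRAW.findall(stream)),
    }


# Illustrative streams only: hidden text followed by a full-page image,
# and the same page with the text object stripped.
with_text = (b"BT 3 Tr /F1 12 Tf 1 0 0 1 72 700 Tm (hello) Tj ET\n"
             b"q 612 0 0 792 0 0 cm /Im0 Do Q\n")
image_only = b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"

print(summarize_content_stream(with_text))
print(summarize_content_stream(image_only))
```

A stream showing zero text objects but one XObject draw would match the symptom described here.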

The OCR text in this file was generated using hocr2pdf/hocrtransform or some variant. The script uses Python ReportLab to draw the text (possibly with text rendering mode 3, i.e. invisible) and then draws the image over the top. This is a pretty common method because it allows "select to highlight OCR" in some viewers.
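To make the page layout concrete, the sketch below builds the kind of content stream described: text shown in rendering mode 3, followed by a full-page image painted over it. It is an illustration only, not the output of hocrtransform or ReportLab. The function names and the resource names `/F1` and `/Im0` are hypothetical, and it assumes mode 3 is actually in use, which the report leaves open.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def hidden_text_page_stream(words, page_width, page_height,
                            font="/F1", image="/Im0") -> str:
    """Return a content stream: invisible text first, then the page image.

    words is an iterable of (text, x, y, font_size) tuples in PDF user space.
    font and image are hypothetical resource names that the page's
    /Resources dictionary would have to define.
    """
    lines = ["BT", "3 Tr"]  # 3 Tr: neither fill nor stroke, so text is invisible
    for text, x, y, size in words:
        lines.append(f"{font} {size:g} Tf")          # select font and size
        lines.append(f"1 0 0 1 {x:g} {y:g} Tm")      # position the text
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    # Scale the unit square to the page and paint the scanned image on top.
    lines.append("q")
    lines.append(f"{page_width:g} 0 0 {page_height:g} 0 0 cm")
    lines.append(f"{image} Do")
    lines.append("Q")
    return "\n".join(lines) + "\n"


stream = hidden_text_page_stream([("graph", 72, 700, 12)], 612, 792)
print(stream)
```

The text object comes before the image draw, so a viewer can still select the words even though only the scan is visible.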

Issue was found on Windows 10 with Ghostscript 9.56 (provenance unknown) and confirmed on Linux using Artifex's released binary for 9.56.

Thanks.
Comment 1 Ken Sharp 2022-04-04 13:30:10 UTC
Thanks for the report. This is fixed in commit fa895673a942caefb81efe1c922407a46d6780c9 and will probably be pulled into a forthcoming 9.56.1 patch release.