707831 – Text copied (Ctrl+A, Ctrl+C) from a compressed scanned PDF is garbled when pasted (Ctrl+V) into a text editor

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	Build Process
Version:	master
Hardware:	PC Windows 11

Importance:	P2 normal
Assignee:	Default assignee

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-06-19 15:16 UTC by cnlixs
Modified:	2024-06-21 13:18 UTC
CC List:	0 users

See Also:
Customer:
Word Size:	---

Attachments
This is my compressed PDF file. The text copied from it is garbled. (9.12 MB, application/pdf) 2024-06-19 15:19 UTC, cnlixs
simplified file (272.29 KB, application/pdf) 2024-06-19 16:06 UTC, Ken Sharp

Description cnlixs 2024-06-19 15:16:07 UTC

After compressing a scanned PDF, the text I copy from it (Ctrl+A, Ctrl+C) and paste (Ctrl+V) into a text editor is garbled.

I set "-dSubsetFonts=false", but the text copied from the compressed PDF is still garbled.

This is the command I run:

gs_command = f'"C:\\Users\\cnlix\\PycharmProjects\\PDF\\gs\\gs10.03.1\\bin\\gswin64c.exe" -dDEBUG -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dSubsetFonts=false -sOutputFile={output_path_str}\\output.pdf {input_pdf} > {output_path_str}\\debug.log 2>&1'

Comment 1 cnlixs 2024-06-19 15:19:46 UTC

Created attachment 25737
This is my compressed PDF file. The text copied from it is garbled.

The text in the PDF file is in Chinese.

Comment 2 cnlixs 2024-06-19 15:26:49 UTC

I am using 10.03.1

Comment 3 cnlixs 2024-06-19 15:38:15 UTC

https://www.dropbox.com/scl/fi/hurnj0wxkprlivlh6lt47/231004.pdf?rlkey=5323c3o1xa6oz1si6besxtx51&st=t7qo8bx6&dl=0

Comment 4 Ken Sharp 2024-06-19 16:06:42 UTC

Created attachment 25738
simplified file

A much simplified file. The command line can also be simplified to:

gs -sDEVICE=pdfwrite -o out.pdf 7.pdf
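
Since the original report builds its command in Python, the simplified command can be sketched the same way. This is only an illustration: the helper names `simplified_gs_args` and `run_gs` are hypothetical, and the executable name `gs` is an assumption (on Windows it would be the full path to `gswin64c.exe`). Passing an argument list instead of a single shell string avoids the quoting and redirection issues of the original f-string.

```python
import subprocess


def simplified_gs_args(input_pdf, output_pdf, gs_exe="gs"):
    """Build the simplified pdfwrite command as an argument list.

    gs_exe is an assumption: use the full path to gswin64c.exe on Windows.
    """
    # -o sets the output file and also implies -dBATCH -dNOPAUSE.
    return [gs_exe, "-sDEVICE=pdfwrite", "-o", output_pdf, input_pdf]


def run_gs(input_pdf, output_pdf, gs_exe="gs"):
    """Run Ghostscript and return the completed process.

    stdout and stderr are captured as text, replacing the
    '> debug.log 2>&1' redirection used in the original command.
    """
    return subprocess.run(
        simplified_gs_args(input_pdf, output_pdf, gs_exe),
        capture_output=True,
        text=True,
    )
```

For example, `run_gs("7.pdf", "out.pdf")` would reproduce the command above, with any warnings available in the returned object's `stderr` attribute.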

Comment 5 cnlixs 2024-06-19 16:17:22 UTC

Yes, this way we can study why the garbled characters appear using this simplified PDF.

Comment 6 Ken Sharp 2024-06-21 13:18:05 UTC

So....

The actual problem is that the original file is invalid. It has multiple ToUnicode CMaps which include sections of the form:

1450 beginbfchar
<00ff> <2014>

Adobe Technical Note 5014, The CMap and CIDFonts specification, states (on page 74) that:

"The beginbfchar and endbfchar operators map int number of individual
input codes (srcCode) to a corresponding number of individual character
codes (dstCode) or character names (dstCharname), where int can be ≤ 100."

1450 clearly isn't less than or equal to 100, and is therefore illegal.

The PDF interpreter was rejecting such ToUnicode CMaps, which meant that the Unicode code points weren't available to pdfwrite, and so the information was lost.

This commit:

2d7b268236fe4086f3fd24beeac32f7586554766

treats the out-of-range count as a warning rather than an error, and continues to process the CMap. Note that a sufficiently large value will still (eventually) trigger an error, as we won't be able to increase the size of the stack we are using.
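
To see whether a given ToUnicode CMap has this problem, something like the following could be used. It is a minimal sketch, not part of Ghostscript: the function name `oversized_bfchar_counts` is hypothetical, and it assumes the CMap stream has already been extracted and decompressed (ToUnicode streams are usually Flate-compressed inside the PDF).

```python
import re

# Limit quoted above from Adobe Technical Note 5014.
BFCHAR_LIMIT = 100

# Matches the declared count in front of a beginbfchar operator.
_BFCHAR_RE = re.compile(rb"(\d+)\s+beginbfchar")


def oversized_bfchar_counts(cmap_data, limit=BFCHAR_LIMIT):
    """Return every declared beginbfchar count that exceeds the limit.

    cmap_data must be the decoded (uncompressed) CMap stream as bytes.
    """
    counts = (int(m.group(1)) for m in _BFCHAR_RE.finditer(cmap_data))
    return [n for n in counts if n > limit]
```

Run against the fragment quoted above, this would report the count 1450; a CMap whose blocks are all split into groups of at most 100 entries yields an empty list.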

So that solves that problem. The file has other faults; the xref entry for object 0 is not the head of the list of free objects:

xref 0 142
0000000016 00000 n

Object 0 *must* be a free object and should look something like this:

xref
0 142
0000000000 65535 f

The offset given for object 0 (16 bytes) is also nonsense; it points part way into the definition of object 1. Finally, the file actually attempts to *use* object 0:

141 0 obj<</Type/Catalog/Pages 140 0 R/Outlines 0 0 R>>
endobj

There is no definition of object 0 in the file so the Outlines are also invalid.
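
The two object 0 faults described above can be spotted with a rough scan of the raw file bytes. This is a minimal sketch under stated assumptions, not a PDF parser: the function name `check_object_zero` is hypothetical, it only looks at the first classic `xref` table (not cross-reference streams), and the reference scan can give false positives if binary stream data happens to contain the same byte sequence.

```python
import re

# First xref subsection header followed by its first entry:
# "xref <first> <count> <10-digit offset> <5-digit generation> <n|f>".
# The lookbehind skips the "xref" inside "startxref".
_XREF_HEAD = re.compile(
    rb"(?<!start)xref\s+(\d+)\s+(\d+)\s+(\d{10}) (\d{5}) ([nf])"
)

# An indirect reference "0 0 R" that is not the tail of a larger
# object number such as "140 0 R".
_REF_TO_ZERO = re.compile(rb"(?<!\d)0\s+0\s+R\b")


def check_object_zero(pdf_bytes):
    """Return a list of problems found with object 0 (empty if none)."""
    problems = []
    m = _XREF_HEAD.search(pdf_bytes)
    if m and int(m.group(1)) == 0 and m.group(5) != b"f":
        problems.append("xref entry for object 0 is not a free entry")
    if _REF_TO_ZERO.search(pdf_bytes):
        problems.append("file contains an indirect reference to object 0")
    return problems
```

On the fragments quoted in this comment, the scan would flag both the in-use xref entry for object 0 and the `/Outlines 0 0 R` reference.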

All in all, not a great quality PDF file.