After compressing a scanned PDF, I found that text selected and copied from it (Ctrl+A, Ctrl+C) is garbled when pasted (Ctrl+V) into a text editor. I set "-dSubsetFonts=false", but text copied from the compressed PDF is still garbled. This is the code I run:

gs_command = f'"C:\\Users\\cnlix\\PycharmProjects\\PDF\\gs\\gs10.03.1\\bin\\gswin64c.exe" -dDEBUG -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dSubsetFonts=false -sOutputFile={output_path_str}\\output.pdf {input_pdf} > {output_path_str}\\debug.log 2>&1'
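For anyone reproducing this from Python, here is a minimal sketch of the same invocation built as an argument list rather than a single shell string, which avoids quoting problems with paths and with the shell redirection. The helper names `build_gs_args` and `run_gs` are hypothetical and not part of the original report; the Ghostscript switches are the ones from the command above.

```python
import subprocess
from pathlib import Path


def build_gs_args(gs_exe, input_pdf, output_dir):
    """Return the argument list for the pdfwrite run described in the report.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    output_pdf = Path(output_dir) / "output.pdf"
    return [
        str(gs_exe),
        "-dDEBUG", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/printer",
        "-dSubsetFonts=false",
        f"-sOutputFile={output_pdf}",
        str(input_pdf),
    ]


def run_gs(gs_exe, input_pdf, output_dir):
    """Run Ghostscript, sending stdout and stderr to debug.log in output_dir."""
    log_path = Path(output_dir) / "debug.log"
    with open(log_path, "wb") as log:
        completed = subprocess.run(
            build_gs_args(gs_exe, input_pdf, output_dir),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=False,  # inspect the return code instead of raising
        )
    return completed.returncode
```

Passing a list to `subprocess.run` means no shell is involved, so the `> debug.log 2>&1` redirection is replaced by the `stdout`/`stderr` arguments.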
Created attachment 25737
This is my compressed PDF file. Text copied from it is garbled. The text in the PDF file is in Chinese.
I am using 10.03.1
https://www.dropbox.com/scl/fi/hurnj0wxkprlivlh6lt47/231004.pdf?rlkey=5323c3o1xa6oz1si6besxtx51&st=t7qo8bx6&dl=0
Created attachment 25738
simplified file

A much simplified file. The command line can also be simplified to:

gs -sDEVICE=pdfwrite -o out.pdf 7.pdf
Yes, this way we can study why garbled characters appear in this simplified PDF.
So.... The actual problem is that the original file is invalid. It has multiple ToUnicode CMaps which include sections of the form:

1450 beginbfchar
<00ff> <2014>

Adobe Technical Note 5014, The CMap and CIDFonts specification, states (on page 74) that:

"The beginbfchar and endbfchar operators map int number of individual input codes (srcCode) to a corresponding number of individual character codes (dstCode) or character names (dstCharname), where int can be ≤ 100."

1450 clearly isn't less than or equal to 100, and is therefore illegal. The PDF interpreter was rejecting such ToUnicode CMaps, which meant that the Unicode code points weren't available to pdfwrite, and so the information was lost.

This commit:

2d7b268236fe4086f3fd24beeac32f7586554766

treats the out-of-range count as a warning rather than an error, and continues to process the CMap. Note that a sufficiently large value will still (eventually) trigger an error, as we won't be able to increase the size of the stack we are using.

So that solves that problem.

The file has other faults; the xref entry for object 0 is not the head of the list of free objects:

xref
0 142
0000000016 00000 n

Object 0 *must* be a free object and should look something like this:

xref
0 142
0000000000 65535 f

The offset given for object 0 (16 bytes) is also nonsense; it points part way into the definition of object 1.

Finally, the file actually attempts to *use* object 0:

141 0 obj<</Type/Catalog/Pages 140 0 R/Outlines 0 0 R>>
endobj

There is no definition of object 0 in the file, so the Outlines are also invalid.

All in all, not a great quality PDF file.
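The two faults described above can be checked mechanically. The following is a rough sketch only, not how Ghostscript does it: it assumes you have already extracted and decompressed a ToUnicode CMap stream into bytes, and that you have the raw xref line for object 0. The function names are hypothetical, and the generation number 65535 for the free-list head is taken from the PDF specification's cross-reference table rules.

```python
import re

# Matches the count that precedes a beginbfchar operator, e.g. b"1450 beginbfchar".
BFCHAR_COUNT = re.compile(rb"(\d+)\s+beginbfchar")

# Matches the fixed-width part of a classic xref entry: offset, generation, type.
XREF_ENTRY = re.compile(rb"^(\d{10}) (\d{5}) ([nf])")


def oversized_bfchar_blocks(cmap_bytes, limit=100):
    """Return every beginbfchar count in the CMap that exceeds the limit."""
    counts = (int(m.group(1)) for m in BFCHAR_COUNT.finditer(cmap_bytes))
    return [count for count in counts if count > limit]


def object_zero_is_free_head(entry):
    """Check that the xref entry for object 0 is free with generation 65535."""
    match = XREF_ENTRY.match(entry)
    if match is None:
        return False  # not a well-formed xref entry at all
    generation = int(match.group(2))
    kind = match.group(3)
    return kind == b"f" and generation == 65535


if __name__ == "__main__":
    print(oversized_bfchar_blocks(b"1450 beginbfchar\n<00ff> <2014>\nendbfchar"))
    print(object_zero_is_free_head(b"0000000016 00000 n"))
```

Run against the fragments quoted in this report, the first check flags the 1450-entry block and the second rejects the in-use entry for object 0.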