Summary: | Regression in 9.0x pdfwrite: copy'n'paste of OCR-ed text does no longer work well. | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | pipitas |
Component: | PDF Writer | Assignee: | Ken Sharp <ken.sharp> |
Status: | RESOLVED WONTFIX | ||
Severity: | normal | ||
Priority: | P4 | ||
Version: | 9.02 | ||
Hardware: | PC | ||
OS: | Linux | ||
Customer: | | Word Size: | --- |
Attachments: |
OCR-ed PDF file used as input for Ghostscript
modified sample |
Description
pipitas
2011-05-16 22:37:21 UTC
Created attachment 7527
modified sample

I've looked into this, and I do not see the described difference with 8.71. Using the Windows 32-bit release of 8.71, I see that the area where the problem occurs:

> ( Deutsche)Tj
> 0 Tc
> 36.235 0 Td %% extra Td ('move text current point' operator)
> (r)Tj

is exactly the same on 8.71 as on 9.02 and 9.03 (pre-release). I do see one difference between 8.71 and 9.02: 9.02 copies the bullet point between (e.g.) 'JAHRGANG' and 'HEFT 1', 8.71 does not.

The file supplied in the GS bug report could not have been made by pdfwrite alone, as it has been linearized, which pdfwrite will not do. This has altered a number of sections in the layout, making it hard to compare the files. However, I do not see any differences in the content stream.

The fonts are all variants of Helvetica and are not embedded, nor are any /Widths arrays present, so the files do not contain any font metrics.

Since I cannot reproduce the described difference in the 8.71 and 9.02 output files, I can't see this as a bug or regression. It seems to me that, if anything, this is a bug in Adobe Reader, since it appears to be the only application incapable of correctly copying the text.

BTW pipitas, why has this moved from stackoverflow.com to superuser.com?

(In reply to comment #1)
> is exactly the same on 8.71 as for 9.02 and 9.03 (pre release). I do see a
> difference between 8.71 and 9.02, 9.02 copies the bullet point between (eg)
> 'JAHRGANG' and 'HEFT 1', 8.71 does not.
>
> The file supplied in the GS bug report

Should have read Stack Overflow report here, not GS...

I have now been able to reproduce the issue, and it is not a regression from 8.71, it's a progression (and an Adobe change).

8.71 shipped with a bug which caused it to write invalid ToUnicode CMaps. Misleading and contradictory Adobe documentation led to the ToUnicode CMap being written as a regular CMap, when in fact ToUnicode CMaps have their own, incompatible, rules.
ToUnicode CMaps are normally only used for searching and copy/paste. As the name implies, they are used to map character codes to Unicode code points. The ToUnicode CMap in the 8.71 PDF file is not used, because it is invalid; the one in later versions is valid, and Acrobat is known to use it.

It appears that in Acrobat Reader up to and including 9.2 the existence of the ToUnicode data makes no difference. At some point after 9.2 the search mechanism changed, and Acrobat appears to use two different mechanisms depending on whether a ToUnicode CMap is present. I don't have access to Acrobat Pro after 9.2 and only recently installed Reader X; I have nothing in between.

The 'no Unicode' method works on all versions of Acrobat, the 'Unicode' method fails on newer versions. I showed this by overwriting with white space the reference to the ToUnicode CMap from the FontDescriptor. If required I can make the various files available, but they are large as they are decompressed.

Since search is a heuristic effort in PDF, it is not going to be possible to guarantee a result. The change in behaviour is due to Acrobat, not Ghostscript, and the change in Ghostscript was to fix a real bug, so it is a progression, not a regression. I'm therefore closing this as 'WONTFIX'.

In reply to the question in Comment #1: 'Why was the topic moved from stackoverflow.com to superuser.com?'

Users with enough 'credits' can decide to move questions/topics they deem off-topic to one of the sister websites (I believe they need to find some other people to vote for the move, and then it just happens). Sometimes off-topic questions stay where they are (because no one bothers). On-topic on stackoverflow are questions related to programming...

In reply to Comment #2:

Thanks for the thorough explanation (and the meticulous investigation of the issue at hand), Ken. I appreciate that very much. However, it will be hard to make users understand that.
To those using a recent Acrobat or Reader version it will look like Ghostscript is at fault.

Update:
-------

I tested the original file again with the current version from Git ("GIT PRERELEASE 9.08 (2013-01-29)"). It now works (again) the same as GS-8.71 did.

Even after reconsidering Ken's comment #2 (which, admittedly, I didn't grok fully), I still think there must have been a glitch within Ghostscript 9.02. After all, version GS-9.02 produced this PDF code:

    ( Deutsche)Tj
    0 Tc
    36.235 0 Td %% extra Td ('move text current point' operator)
    (r)Tj
    2.16501 0 Td %% Td ('move text current point' instead of Tm)
    3.569 Tw
    0.706 Tc
    ( Gymnastik-Schulleite)Tj

which copy'n'pasted this text:

    »Bun d Deutsche r GymnastikSchulleite r «

while GS-8.71 (and GS-9.08Git) produce this PDF code:

    ( Deutsche) Tj
    0 Tc
    (r) Tj
    1 0 0 1 143.236 265.140 Tm %% Tm ('text matrix' operator)
    3.569 Tw
    0.706 Tc
    ( Gymnastik-Schulleite) Tj

which copy'n'pastes the respective text snippet as:

    »Bund Deutscher Gymnastik-Schulleiter«

The glitch was that throughout the *complete* text 99% of words had an extra blank inserted before their last character, and about 5% of word-spacing blanks were removed.

Anyway, I'm glad that the problem currently is gone again, despite the 'WONTFIX' :-)
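To make the difference between the two fragments quoted in the update easier to see, here is a minimal Python sketch that splits each fragment into operators and operands and lists only the positioning operators (`Td`, `Tm`). The names `GS_902`, `GS_871`, `operations` and `positioning` are made up for this illustration, the fragments are copied from the comment with their `%%` remarks dropped, and the tokenizer only handles the simple syntax seen here (numbers, unnested literal strings, operators). It does not reproduce any reader's text-extraction heuristic and says nothing about which reader behaviour is correct.

```python
import re

# Fragments as quoted in the update above (comments dropped).
GS_902 = """
( Deutsche)Tj
0 Tc
36.235 0 Td
(r)Tj
2.16501 0 Td
3.569 Tw
0.706 Tc
( Gymnastik-Schulleite)Tj
"""

GS_871 = """
( Deutsche) Tj
0 Tc
(r) Tj
1 0 0 1 143.236 265.140 Tm
3.569 Tw
0.706 Tc
( Gymnastik-Schulleite) Tj
"""

# A literal string in parentheses (no nesting), or any run of other
# non-space characters (a number or an operator name).
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")
NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def operations(fragment):
    """Return a list of (operator, operands) pairs for a simple fragment."""
    ops, operands = [], []
    for tok in TOKEN.findall(fragment):
        if tok.startswith("(") or NUMBER.fullmatch(tok):
            operands.append(tok)          # strings and numbers are operands
        else:
            ops.append((tok, operands))   # anything else ends an operation
            operands = []
    return ops


def positioning(fragment):
    """Return only the text-positioning operators, in order."""
    return [op for op, _ in operations(fragment) if op in ("Td", "Tm")]


if __name__ == "__main__":
    print("9.02:", positioning(GS_902))
    print("8.71:", positioning(GS_871))
```

Run on the two fragments, this shows that the strings passed to `Tj` are identical and that the fragments differ only in how the text position is moved between them (relative `Td` moves in one, an absolute `Tm` in the other).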