Bug 692206

Summary: Regression in 9.0x pdfwrite: copy'n'paste of OCR-ed text no longer works well.
Product: Ghostscript
Reporter: pipitas
Component: PDF Writer
Assignee: Ken Sharp <ken.sharp>
Status: RESOLVED WONTFIX    
Severity: normal    
Priority: P4    
Version: 9.02   
Hardware: PC   
OS: Linux   
Customer:
Word Size: ---
Attachments: OCR-ed PDF file used as input for Ghostscript
modified sample

Description pipitas 2011-05-16 22:37:21 UTC
Created attachment 7507 [details]
OCR-ed PDF file used as input for Ghostscript

I found this problem posted over at stackoverflow.com ( http://stackoverflow.com/questions/6018237/pdf-has-an-extra-blank-in-all-words-after-running-through-ghostscript ) and thought it was interesting enough to have a closer look...


Summary:
========

  A scanned PDF OCR-ed by Abbyy Finereader 10 allows for a perfectly working
  copy'n'paste of the text.
  
  After running this file through Ghostscript 9.02, copy'n'paste no longer
  works well with the pdfwrite output. The pasted text shows extra spaces
  inside most words.
  
  When running the file through Ghostscript 8.71, the text paragraphs in the
  output can be copied and pasted without any problem.


Details:
========

  The original file, "from_abbyy.pdf", is attached. Ghostscript's output was
  created by these commandlines (on Windows):

      c:\gs\gs9.02\bin\gswin32c.exe ^
          -o from_ghostscript_902.pdf ^
          -sDEVICE=pdfwrite ^
           from_abbyy.pdf

      c:\gs\gs8.71\bin\gswin32c.exe ^
          -o from_ghostscript_871.pdf ^
          -sDEVICE=pdfwrite ^
           from_abbyy.pdf

  For the original "from_abbyy.pdf" as well as for "from_ghostscript_871.pdf", 
  copy'n'paste gives this text output:
  
     Der »Bund Deutscher Gymnastik-Schulleiter« wurde am 20. November 1955 anläßlich einer Zusammenkunft
  
  For the Ghostscript output "from_ghostscript_902.pdf", copy'n'paste gives
  this text output:
  
     Der »Bun d Deutsche r GymnastikSchulleiter
     « wurd e a m 20 . Novembe r 195 5 anläßlic h eine r Zusammenkunf t
  

Findings:
=========

First, I used the `qpdf` commandline tool to un-compress the PDF data streams so I could better see the source code of both files:

    qpdf.exe ^
       --qdf ^
         from_abbyy.pdf ^
         qdf--from_abbyy.pdf

    qpdf.exe ^
       --qdf ^
         after_ghostscript_902.pdf ^
         qdf--after_ghostscript_902.pdf

Looking at one of the first occurrences where an extra space gets inserted (it is the original string *"Bund Deutscher Gymnastik-Schulleiter"* turning into *"Bun d Deutsche r GymnastikSchulleiter"*), I find the following PDF snippets:

In qdf--from_abbyy.pdf:
-----------------------

    ( Deutsche) Tj
    0 Tc
    (r) Tj
    1 0 0 1 143.236 265.140 Tm   %% Tm ('text matrix' operator)
    3.569 Tw
    0.706 Tc
    ( Gymnastik-Schulleite) Tj

In qdf--after_ghostscript_902.pdf and qdf--after_ghostscript_871.pdf :
----------------------------------------------------------------------

    ( Deutsche)Tj
    0 Tc
    36.235 0 Td              %% extra Td ('move text current point' operator)
    (r)Tj
    2.16501 0 Td             %% Td ('move text current point' instead of Tm)
    3.569 Tw
    0.706 Tc
    ( Gymnastik-Schulleite)Tj


List of PDF graphic operators:

    Tj - show text
    Tc - set character spacing
    Tm - set text matrix
    Tw - set word spacing
    Td - move text current point

It looks like Ghostscript replaced the original "Tm" (text matrix) operator with a "Td" (move text current point) operator ("2.16501 0 Td"), and it also added an extra "36.235 0 Td" before the "(r)"... I don't know why this is, but it is probably OK.
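
The operator differences between the two snippets can also be listed mechanically. The following is only a minimal sketch in Python, not a real PDF parser: the function name `operators` and the constants `ORIGINAL` and `GS902` are made up for this illustration, and the tokenizer assumes simple content streams (no nested or escaped parentheses, no hex strings, no arrays, no `%` inside strings).

```python
import re

NUMBER = r"[-+]?\d*\.?\d+"
# A literal string "(...)", a number, or an operator name, in that order.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|" + NUMBER + r"|[A-Za-z'\"*]+")

ORIGINAL = """
( Deutsche) Tj
0 Tc
(r) Tj
1 0 0 1 143.236 265.140 Tm
3.569 Tw
0.706 Tc
( Gymnastik-Schulleite) Tj
"""

GS902 = """
( Deutsche)Tj
0 Tc
36.235 0 Td
(r)Tj
2.16501 0 Td
3.569 Tw
0.706 Tc
( Gymnastik-Schulleite)Tj
"""

def operators(stream):
    """Return a list of (operator, operands) pairs for a simple snippet."""
    result, operands = [], []
    for line in stream.splitlines():
        line = line.split("%", 1)[0]  # drop trailing comments
        for tok in TOKEN.findall(line):
            if tok.startswith("(") or re.fullmatch(NUMBER, tok):
                operands.append(tok)      # operands precede their operator
            else:
                result.append((tok, operands))
                operands = []
    return result

if __name__ == "__main__":
    for name, stream in (("original", ORIGINAL), ("gs 9.02", GS902)):
        print(name, [op for op, _ in operators(stream)])
```

Comparing the two operator lists shows the same thing as reading the snippets by eye: the Tm disappears and two Td operators appear, while the Tj, Tc and Tw operators and their operands are unchanged.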

Note, however, that this text-garbling problem in the 9.02 output does not occur when I use the (Linux or Windows) Acrobat Reader 9.4.2 and go through the menu action "File -> Save as Text...". In this case, there are no additional spaces (a few extra linebreaks, however, are added). Despite this, in Acrobat Reader the text is still not correctly searchable via the Acrobat search box, and the text output always shows the extra spaces when doing *copy'n'paste*....
Comment 1 Ken Sharp 2011-05-23 17:58:37 UTC
Created attachment 7527 [details]
modified sample

I've looked into this, and I do not see the described difference with 8.71. Using the Windows 32-bit release of 8.71 I see that the area where the problem occurs:

>     ( Deutsche)Tj
>     0 Tc
>     36.235 0 Td              %% extra Td ('move text current point' operator)
>     (r)Tj

is exactly the same on 8.71 as for 9.02 and 9.03 (pre release). I do see one difference between 8.71 and 9.02: 9.02 copies the bullet point between (eg) 'JAHRGANG' and 'HEFT 1', 8.71 does not.

The file supplied in the GS bug report could not have been simply made from pdfwrite, as it has been linearized, which pdfwrite will not do. This has altered a number of sections in the layout, making it hard to compare the files. However I do not see any differences in the content stream. The fonts are all variants of Helvetica and are not embedded, nor are any /Widths arrays present, so the files do not contain any font metrics.

Since I cannot reproduce the described difference in the 8.71 and 9.02 output files, I can't see this as a bug or regression. It seems to me that if anything this is a bug in Adobe Reader, since it appears to be the only application incapable of correctly copying the text.

BTW pipitas, why has this moved from stackoverflow.com to superuser.com ?
Comment 2 Ken Sharp 2011-05-24 12:48:47 UTC
(In reply to comment #1)

> is exactly the same on 8.71 as for 9.02 and 9.03 (pre release). I do see a
> difference between 8.71 and 9.02, 9.02 copies the bullet point between (eg)
> 'JAHRGANG' and 'HEFT 1', 8.71 does not.
> 
> The file supplied in the GS bug report

Should have read Stack Overflow report here, not GS...

I have now been able to reproduce the issue, and it is not a regression from 8.71, it's a progression (and an Adobe change).

8.71 shipped with a bug which caused it to write invalid ToUnicode CMaps. Misleading and contradictory Adobe documentation led to the CMap being written as a CMap, when in fact ToUnicode CMaps have their own, incompatible, rules.

ToUnicode CMaps are normally only used for searching and copy/paste. As the name implies, they are used to map character codes to Unicode code points. The ToUnicode CMap in the 8.71 PDF file is not used, because it is invalid. The one in later versions is valid, and Acrobat is known to use it.
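
To illustrate the mapping a ToUnicode CMap provides, here is a minimal Python sketch. The CMap fragment `SAMPLE_CMAP` is hypothetical and not taken from the attached files, and the helper names `parse_tounicode` and `decode` are made up for this example. It only handles `bfchar` entries and the simple three-value form of `bfrange`, and assumes single-byte character codes and destinations inside the Basic Multilingual Plane.

```python
import re

# Hypothetical ToUnicode fragment, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<01> <0044>
<02> <0065>
endbfchar
1 beginbfrange
<03> <05> <0072>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} for a simple ToUnicode CMap."""
    mapping = {}
    # bfchar: one source code mapped to one UTF-16BE encoded destination.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive codes mapped to consecutive code points.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping

def decode(codes, mapping):
    """Translate single-byte character codes to text; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

if __name__ == "__main__":
    table = parse_tounicode(SAMPLE_CMAP)
    print(decode(b"\x01\x02\x03", table))
```

A viewer that trusts such a table gets its Unicode text from the mapping, while a viewer that ignores it (or finds it invalid) has to fall back on other information, such as the font's encoding and glyph names.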

It appears that in Acrobat Reader up to and including 9.2 the existence of the ToUnicode data makes no difference. At some point after 9.2 the search mechanism changed, and Acrobat appears to use two different mechanisms depending on whether a ToUnicode CMap is present. I don't have access to Acrobat Pro after 9.2 and only recently installed Reader X, I have nothing in between.

The 'no Unicode' method works on all versions of Acrobat, the 'Unicode' method fails on newer versions. 

I showed this by overwriting the reference to the ToUnicode CMap from the FontDescriptor with white space. If required I can make the various files available, but they are large as they are decompressed.
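
For anyone who wants to repeat that experiment, the blanking step could look roughly like the Python sketch below. This is an assumption about how to do it, not the procedure actually used here: the function name `blank_tounicode` is made up, and the approach only works if the dictionaries holding the `/ToUnicode` entry are stored as plain, uncompressed objects (for example in a file expanded with `qpdf --qdf`). Replacing the entry with the same number of spaces keeps every byte offset in the file unchanged, so the cross-reference table stays valid.

```python
import re

# Matches an indirect reference such as "/ToUnicode 12 0 R".
TOUNICODE_REF = re.compile(rb"/ToUnicode\s+\d+\s+\d+\s+R")

def blank_tounicode(pdf_bytes):
    """Overwrite every /ToUnicode reference with spaces of equal length."""
    return TOUNICODE_REF.sub(lambda m: b" " * len(m.group(0)), pdf_bytes)

if __name__ == "__main__":
    # Hypothetical dictionary fragment, for illustration only.
    sample = b"<< /Type /Font /ToUnicode 12 0 R /BaseFont /Helvetica >>"
    print(blank_tounicode(sample))
```

To apply it to a whole file, read the PDF in binary mode, pass the bytes through `blank_tounicode`, and write the result to a new file; the output has exactly the same length as the input.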

Since search is a heuristic effort in PDF it is not going to be possible to guarantee a result. The change in behaviour is due to Acrobat, not Ghostscript, and the change in Ghostscript was to fix a real bug, so a progression, not a regression.

I'm therefore closing this as 'WONTFIX'
Comment 3 pipitas 2011-05-26 09:21:04 UTC
In reply to question in Comment #1: 'Why was the topic moved from stackoverflow.com to superuser.com?'

Users with enough 'credits' can decide to move questions/topics they deem off-topic to one of the sister-websites (I believe they need to find some other people to vote for the move, and then it just happens). Sometimes off-topic questions stay where they are (because no-one bothers). On-topic on stackoverflow are questions related to programming...
Comment 4 pipitas 2011-05-26 09:26:29 UTC
In reply to Comment #2:

Thanks for the thorough explanation of the issue at hand (and the meticulous investigation behind it), Ken. I appreciate that very much.

However, it will be hard to make users understand that. To those using a recent Acrobat or Reader version it will look like Ghostscript is at fault.
Comment 5 pipitas 2013-02-25 09:27:00 UTC
Update:
-------

I tested the original file again with the current version from Git ("GIT PRERELEASE 9.08 (2013-01-29)").

It now works (again) the same as GS-8.71 did.

Even after re-considering Ken's comment #2 (which, admittedly, I didn't grok fully), I still think there must have been a glitch within Ghostscript 9.02. After all, version GS-9.02 produced this PDF code:

    ( Deutsche)Tj
    0 Tc
    36.235 0 Td              %% extra Td ('move text current point' operator)
    (r)Tj
    2.16501 0 Td             %% Td ('move text current point' instead of Tm)
    3.569 Tw
    0.706 Tc
    ( Gymnastik-Schulleite)Tj

which copy'n'pasted this text:

    »Bun d Deutsche r GymnastikSchulleite r
        «

while GS-8.71 (and GS-9.08Git) produce this PDF code:

    ( Deutsche) Tj
    0 Tc
    (r) Tj
    1 0 0 1 143.236 265.140 Tm   %% Tm ('text matrix' operator)
    3.569 Tw
    0.706 Tc
    ( Gymnastik-Schulleite) Tj

which copy'n'pastes the respective text snippet as:

    »Bund Deutscher Gymnastik-Schulleiter«

The glitch was that throughout the *complete* text 99% of words had an extra blank inserted before their last character, and about 5% of word spacing blanks were removed.

Anyway, I'm glad that the problem currently is gone again, despite the 'WONTFIX'   :-)