Bug 703591 - Conversion PDF->PDFA inserts spaces in OCR text
Summary: Conversion PDF->PDFA inserts spaces in OCR text
Status: RESOLVED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.27
Hardware: PC Linux
Importance: P4 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-22 11:36 UTC by Graham Seaman
Modified: 2021-02-23 20:30 UTC
0 users

See Also:
Customer:
Word Size: ---


Attachments
Single page from source pdf (73.17 KB, application/pdf)
2021-02-22 11:36 UTC, Graham Seaman
x.pdf (91.07 KB, application/pdf)
2021-02-22 22:03 UTC, Ray Johnston
ar_703591.pdf (567 bytes, text/plain)
2021-02-22 22:09 UTC, Ray Johnston
gs_703591.txt (546 bytes, text/plain)
2021-02-22 22:11 UTC, Ray Johnston

Description Graham Seaman 2021-02-22 11:36:44 UTC
Created attachment 20643
Single page from source pdf

Converting some PDF files consisting of scanned page images plus embedded OCR text from PDF to PDF/A results in a space being inserted between the majority of characters in the embedded OCR text.

The problem was originally found using OCRmyPDF (which depends on Ghostscript), but I have reproduced it using Ghostscript directly:

gs -dPDFA -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=page6_pdfa.pdf page6.pdf

A paragraph of the original OCR text:

It will be seen that the North is makingan imperative call
upon him and that he answers with whole-hearted eager­
ness, for the time giving himself up almost entirely to the
delight o f this new interest, coming face to face with the
Northern literature that until now he had but known in
translations and abstracts; his mind is in a ferment with
all this fresh material urging him to fresh production; and
“ The Earthly Paradise ” work goes on steadily, while the
business claims his close personal attention.

The same paragraph in the PDF/A text:

It w ill b e se e n th a t th e N o r th is m a k in g a n im p era tiv e call
u p o n h im an d th at h e a n sw ers w ith w h o le -h e a r te d ea ^\ ^A
n e ss, fo r th e tim e g iv in g h im s e lf u p a lm o st e n tir e ly to th e
d e lig h t o f th is n ew in te r e st, c o m in g face to face w ith th e
N o r th e r n litera tu re th at u n til n o w h e had b u t k n o w n in
tra n sla tio n s and a b str a cts; his m in d is in a fe r m e n t w ith
all th is fresh m aterial u r g in g h im to fresh p r o d u c tio n ; and
“ T h e E a r th ly P a r a d ise ” w o r k g o e s o n ste a d ily , w h ile th e
b u sin e ss cla im s h is c lo s e p erson al a tte n tio n .
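
The symptom shown in the two paragraphs above can be checked for automatically. The following is a minimal sketch, not part of Ghostscript or OCRmyPDF: the function names and the 0.3 threshold are assumptions chosen for illustration. It measures what fraction of whitespace-separated tokens are single letters, which is low for normal English text and high for the letter-spaced output.

```python
def single_letter_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are a single letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for tok in tokens if len(tok) == 1 and tok.isalpha())
    return singles / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: True when single-letter tokens exceed the threshold.

    The default threshold is an assumption, not a tuned value.
    """
    return single_letter_ratio(text) > threshold


# First line of the paragraph, before and after conversion (from the report).
original = "It will be seen that the North is makingan imperative call"
converted = (
    "It w ill b e se e n th a t th e N o r th is m a k in g a n "
    "im p era tiv e call"
)

if __name__ == "__main__":
    print("original looks letter-spaced: ", looks_letter_spaced(original))
    print("converted looks letter-spaced:", looks_letter_spaced(converted))
```

Running such a check over the text extracted from each page would be one way to confirm that the behaviour is consistent across a whole document.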

I am attaching the relevant page from the original PDF (the behaviour is consistent across several hundred pages in the full PDF).
Comment 1 Ray Johnston 2021-02-22 19:08:13 UTC
9.27 is VERY old. Please re-test with Ghostscript 9.53.3.

Note 9.54 is coming soon (in March). There is a very good chance this has been
corrected.

Also, if you are trying to make a PDF/A file, you need to give it a PDF/A
definition file (PDFA_def.ps) as described in:
   https://www.ghostscript.com/doc/current/VectorDevices.htm#PDFA

Testing with:
gswin64c --permit-file-read=./iccprofiles/ -dPDFA -sDEVICE=pdfwrite -o x.pdf -sColorConversionStrategy=UseDeviceIndependentColor -sProcessColorModel=DeviceRGB PDFA_def.ps Bug703591.pdf

Results in:

GPL Ghostscript GIT PRERELEASE 9.54.0 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Warning:  File has an invalid xref entry:  2.  Rebuilding xref table.
Processing pages 1 through 1.
Page 1
Querying operating system for font files...
Substituting font Times-Roman for TimesNewRomanPSMT.
GPL Ghostscript GIT PRERELEASE 9.54.0:
 **** A font missing from the input PDF has been substituted with a different font.
        Widths may differ, reverting to normal PDF output!
Loading NimbusRoman-Regular font from %rom%Resource/Font/NimbusRoman-Regular... 4818604 3444457 2676944 1272173 4 done.
Substituting font Times-Bold for TimesNewRomanPS-BoldMT.
GPL Ghostscript GIT PRERELEASE 9.54.0:
 **** A font missing from the input PDF has been substituted with a different font.
        Widths may differ, reverting to normal PDF output!
Loading NimbusRoman-Bold font from %rom%Resource/Font/NimbusRoman-Bold... 5026476 3689679 2819560 1393809 4 done.
GPL Ghostscript GIT PRERELEASE 9.54.0: Requested glyph not present in source font,
 not permitted in PDF/A, reverting to normal PDF output

The "widths may differ" condition is probably the cause of some applications
not showing the text with the expected spacing.

I will attach the output from the above command with the latest Ghostscript,
and the text for the second paragraph when I use Adobe Acrobat Pro 9 to "cut"
the relevant paragraph.

I will also attach the text output from the 'x.pdf' created by Ghostscript
using the 'txtwrite' device:
gswin64c  -sDEVICE=txtwrite -o gs_703591.txt x.pdf

As you can see, Ghostscript and Adobe differ in the text they provide,
but to me, the Ghostscript 'txtwrite' output is superior since the heuristics
used to reconstruct the text result in output that matches the scanned PDF
better. Note, I have not tried any newer versions of Acrobat.

Note that txtwrite can be used on the original document as well without needing
processing through 'pdfwrite'.
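
As a rough illustration of running 'txtwrite' directly on an original document, here is a minimal Python sketch that builds and runs the same kind of command shown above. The helper names and the example file names are hypothetical, and it assumes a Ghostscript executable named `gs` is on the PATH (use `gswin64c` on Windows).

```python
import subprocess
from typing import List


def txtwrite_command(input_pdf: str, output_txt: str, gs: str = "gs") -> List[str]:
    """Build the argument list for a txtwrite run.

    Mirrors the command in this comment: -sDEVICE=txtwrite -o <out> <in>.
    """
    return [gs, "-sDEVICE=txtwrite", "-o", output_txt, input_pdf]


def extract_text(input_pdf: str, output_txt: str, gs: str = "gs") -> None:
    """Run Ghostscript and raise CalledProcessError if it exits non-zero."""
    subprocess.run(txtwrite_command(input_pdf, output_txt, gs), check=True)


if __name__ == "__main__":
    # Hypothetical file names; substitute your own scanned PDF.
    extract_text("page6.pdf", "page6.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.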

In order to create a PDF/A file, you will need to provide the correct fonts (see
https://www.ghostscript.com/doc/current/Use.htm#Font_lookup).
Comment 2 Ray Johnston 2021-02-22 22:03:19 UTC
Created attachment 20645
x.pdf

Output from 9.54 PRE-RELEASE Ghostscript pdfwrite from input file.
Comment 3 Ray Johnston 2021-02-22 22:09:48 UTC
Created attachment 20646
ar_703591.pdf

Acrobat Pro 9 "cut" output of the relevant paragraph
Comment 4 Ray Johnston 2021-02-22 22:11:51 UTC
Created attachment 20647
gs_703591.txt

Output from Ghostscript -sDEVICE=txtwrite
Comment 5 Graham Seaman 2021-02-23 20:30:15 UTC
(In reply to Ray Johnston from comment #1)
> 9.27 is VERY old. Please re-test with 9.53.3 with Ghostscript.
> 

OK, done that. Results unchanged.

> Note 9.54 is coming soon (in March). There is a very good chance this has
> been corrected.
> 

> 
> Testing with:
> gswin64c --permit-file-read=./iccprofiles/ -dPDFA -sDEVICE=pdfwrite -o x.pdf -sColorConversionStrategy=UseDeviceIndependentColor -sProcessColorModel=DeviceRGB PDFA_def.ps Bug703591.pdf
> 

>  **** A font missing from the input PDF has been substituted with a
> different font.
>         Widths may differ, reverting to normal PDF output!

> 
> The "widths may differ" condition is probably the cause of some applications
> not showing the text with the expected spacing.

So you suggest that the root of the problem is missing fonts being replaced by fonts with different widths. This is pretty insoluble for me as a general problem, but knowing the reason is very helpful and saves me from wasting more time.

> 
> I will attached the output from the above command with the latest
> Ghostscript,
> and the text for the second paragraph when I use Adobe Acrobat Pro 9 to "cut"
> the relevant paragraph.
> 
> I will also attach the text output from the 'x.pdf' created by Ghostscript
> using the 'txtwrite' device:
> gswin64c  -sDEVICE=txtwrite -o gs_703591.txt x.pdf
> 
> As you can see, Ghostscript and Adobe differ as far as the text they provide,
> but to me, the Ghostscript 'txtwrite' output is superior since the heuristics
> used to reconstruct the text results in output that matches the scanned PDF
> better. Note, I have not tried any newer versions of Acrobat.
> 

I have been using pdftotext from poppler, which gives the same poor results (spaces between most letters) on your new x.pdf file, created with your newer gs, as I found with my versions. I will now switch to txtwrite!

More of a problem is that Linux PDF viewers (i.e. non-Adobe) such as evince also have the same problem with the file, which makes search based on the OCR contents unusable.

> 
> In order to create a PDFA, you will need to provide the correct fonts (see
> https://www.ghostscript.com/doc/current/Use.htm#Font_lookup

I had (mistakenly) assumed that conversion of existing scanned PDFs to PDF/A was a relatively straightforward process; now I see that because of the need to have many fonts available and because of the inability of older viewers to provide search on the PDF/A files if the fonts are not correct, PDF/A is not really a practical option for me. Thank you for your time and help in looking into this.