Summary: | Conversion PDF->PDFA inserts spaces in OCR text | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | Graham Seaman <graham> |
Component: | PDF Writer | Assignee: | Default assignee <ghostpdl-bugs> |
Status: | RESOLVED WORKSFORME | ||
Severity: | normal | ||
Priority: | P4 | ||
Version: | 9.27 | ||
Hardware: | PC | ||
OS: | Linux | ||
Customer: | | Word Size: | --- |
Attachments: |
Single page from source pdf
x.pdf ar_703591.pdf gs_703591.txt |
9.27 is VERY old. Please re-test with Ghostscript 9.53.3. Note that 9.54 is coming soon (in March). There is a very good chance this has been corrected.

Also, if you are trying to make a PDF/A file, you need to give it a PDF/A definition file (PDFA_def.ps) as described in:
https://www.ghostscript.com/doc/current/VectorDevices.htm#PDFA

Testing with:

gswin64c --permit-file-read=./iccprofiles/ -dPDFA -sDEVICE=pdfwrite -o x.pdf -sColorConversionStrategy=UseDeviceIndependentColor -sProcessColorModel=DeviceRGB PDFA_def.ps Bug703591.pdf

Results in:

GPL Ghostscript GIT PRERELEASE 9.54.0 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
**** Warning: File has an invalid xref entry: 2. Rebuilding xref table.
Processing pages 1 through 1.
Page 1
Querying operating system for font files...
Substituting font Times-Roman for TimesNewRomanPSMT.
GPL Ghostscript GIT PRERELEASE 9.54.0: **** A font missing from the input PDF has been substituted with a different font.
Widths may differ, reverting to normal PDF output!
Loading NimbusRoman-Regular font from %rom%Resource/Font/NimbusRoman-Regular... 4818604 3444457 2676944 1272173 4 done.
Substituting font Times-Bold for TimesNewRomanPS-BoldMT.
GPL Ghostscript GIT PRERELEASE 9.54.0: **** A font missing from the input PDF has been substituted with a different font.
Widths may differ, reverting to normal PDF output!
Loading NimbusRoman-Bold font from %rom%Resource/Font/NimbusRoman-Bold... 5026476 3689679 2819560 1393809 4 done.
GPL Ghostscript GIT PRERELEASE 9.54.0: Requested glyph not present in source font,
not permitted in PDF/A, reverting to normal PDF output

The "widths may differ" condition is probably the cause of some applications not showing the text with the expected spacing.
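The log above reports "reverting to normal PDF output" more than once, which means the resulting file is not PDF/A even though -dPDFA was given. As a minimal sketch, assuming the Ghostscript output has been captured as a string, a batch script could flag such runs by looking for that phrase. The function name is hypothetical, and the assumption that other Ghostscript versions word the message identically is untested.

```python
from typing import List

# Phrase copied from the log in this report; other Ghostscript versions
# may word the message differently (assumption).
REVERT_MARKER = "reverting to normal PDF output"


def pdfa_fallback_lines(gs_output: str) -> List[str]:
    """Return the log lines in which Ghostscript reports abandoning PDF/A."""
    return [
        line.strip()
        for line in gs_output.splitlines()
        if REVERT_MARKER in line
    ]


if __name__ == "__main__":
    sample = (
        "Substituting font Times-Roman for TimesNewRomanPSMT.\n"
        "Widths may differ, reverting to normal PDF output!\n"
    )
    for line in pdfa_fallback_lines(sample):
        print("PDF/A not produced:", line)
```

An empty result only means the phrase was not found; it is not proof that the output is valid PDF/A.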
I will attach the output from the above command with the latest Ghostscript, and the text for the second paragraph when I use Adobe Acrobat Pro 9 to "cut" the relevant paragraph.

I will also attach the text output from the 'x.pdf' created by Ghostscript using the 'txtwrite' device:

gswin64c -sDEVICE=txtwrite -o gs_703591.txt x.pdf

As you can see, Ghostscript and Adobe differ as far as the text they provide, but to me, the Ghostscript 'txtwrite' output is superior since the heuristics used to reconstruct the text result in output that matches the scanned PDF better. Note, I have not tried any newer versions of Acrobat.

Note that txtwrite can be used on the original document as well, without needing processing through 'pdfwrite'.

In order to create a PDF/A, you will need to provide the correct fonts (see https://www.ghostscript.com/doc/current/Use.htm#Font_lookup).

Created attachment 20645 [details]
x.pdf
Output from 9.54 PRE-RELEASE Ghostscript pdfwrite from input file.
Created attachment 20646 [details]
ar_703591.pdf
Acrobat Pro 9 "cut" output of the relevant paragraph
Created attachment 20647 [details]
gs_703591.txt
Output from Ghostscript -sDEVICE=txtwrite
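For anyone scripting the txtwrite extraction from comment #1, here is a minimal Python sketch. The -sDEVICE=txtwrite and -o options are the ones quoted in this report; the function names are hypothetical, and the default executable name "gs" is an assumption (the comment used gswin64c on Windows).

```python
import subprocess
from typing import List


def txtwrite_command(input_pdf: str, output_txt: str,
                     gs_executable: str = "gs") -> List[str]:
    """Build the argument list for a txtwrite text extraction."""
    return [gs_executable, "-sDEVICE=txtwrite", "-o", output_txt, input_pdf]


def extract_text(input_pdf: str, output_txt: str,
                 gs_executable: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(
        txtwrite_command(input_pdf, output_txt, gs_executable),
        check=True,
    )
```

Calling extract_text("x.pdf", "gs_703591.txt", "gswin64c") would run the same command line as in comment #1. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.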
(In reply to Ray Johnston from comment #1)
> 9.27 is VERY old. Please re-test with 9.53.3 with Ghostscript.

OK, done that. Results unchanged.

> Note 9.54 is coming soon (in March). There is a very good chance this has
> been corrected.
>
> Testing with:
> gswin64c --permit-file-read=./iccprofiles/ -dPDFA -sDEVICE=pdfwrite -o x.pdf -sColorConversionStrategy=UseDeviceIndependentColor -sProcessColorModel=DeviceRGB PDFA_def.ps Bug703591.pdf
>
> **** A font missing from the input PDF has been substituted with a
> different font.
> Widths may differ, reverting to normal PDF output!
>
> The "widths may differ" condition is probably the cause of some applications
> not showing the text with the expected spacing.

So you suggest that the root of the problem is missing fonts being replaced with fonts that have different widths. This is pretty insoluble for me as a general problem, but knowing the reason is very helpful and saves me from wasting more time.

> I will attach the output from the above command with the latest
> Ghostscript, and the text for the second paragraph when I use Adobe
> Acrobat Pro 9 to "cut" the relevant paragraph.
>
> I will also attach the text output from the 'x.pdf' created by Ghostscript
> using the 'txtwrite' device:
> gswin64c -sDEVICE=txtwrite -o gs_703591.txt x.pdf
>
> As you can see, Ghostscript and Adobe differ as far as the text they provide,
> but to me, the Ghostscript 'txtwrite' output is superior since the heuristics
> used to reconstruct the text result in output that matches the scanned PDF
> better. Note, I have not tried any newer versions of Acrobat.

I have been using pdftotext from poppler, which gives the same poor results (spaces between most letters) on your new x.pdf file, created with your newer gs, as I found with my versions. I will now switch to txtwrite! More of a problem is that Linux PDF viewers (i.e. non-Adobe) such as evince also have the same problem with the file, which makes search based on the OCR contents unusable.

> In order to create a PDFA, you will need to provide the correct fonts (see
> https://www.ghostscript.com/doc/current/Use.htm#Font_lookup

I had (mistakenly) assumed that conversion of existing scanned PDFs to PDF/A was a relatively straightforward process. Now I see that, because of the need to have many fonts available and because of the inability of older viewers to provide search on the PDF/A files if the fonts are not correct, PDF/A is not really a practical option for me.

Thank you for your time and help in looking into this. |
Created attachment 20643 [details]
Single page from source pdf

Converting some PDF files consisting of scanned page images plus embedded OCR text from PDF to PDF/A results in a space being inserted between the majority of characters in the embedded OCR text.

The problem was originally found using OCRmyPDF (which depends on Ghostscript) but was reproduced using Ghostscript directly:

gs -dPDFA -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=page6_pdfa.pdf page6.pdf

A paragraph of the original OCR text:

It will be seen that the North is makingan imperative call upon him and that he answers with whole-hearted eager ness, for the time giving himself up almost entirely to the delight o f this new interest, coming face to face with the Northern literature that until now he had but known in translations and abstracts; his mind is in a ferment with all this fresh material urging him to fresh production; and “ The Earthly Paradise ” work goes on steadily, while the business claims his close personal attention.

The same paragraph in the PDF/A text:

It w ill b e se e n th a t th e N o r th is m a k in g a n im p era tiv e call u p o n h im an d th at h e a n sw ers w ith w h o le -h e a r te d ea ^\ ^A n e ss, fo r th e tim e g iv in g h im s e lf u p a lm o st e n tir e ly to th e d e lig h t o f th is n ew in te r e st, c o m in g face to face w ith th e N o r th e r n litera tu re th at u n til n o w h e had b u t k n o w n in tra n sla tio n s and a b str a cts; his m in d is in a fe r m e n t w ith all th is fresh m aterial u r g in g h im to fresh p r o d u c tio n ; and “ T h e E a r th ly P a r a d ise ” w o r k g o e s o n ste a d ily , w h ile th e b u sin e ss cla im s h is c lo s e p erson al a tte n tio n .

I am attaching the relevant page from the original PDF (the behaviour is consistent across several hundred pages in the full PDF).
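The symptom described above (a space between most characters) can be detected automatically when checking several hundred pages. The following is a rough Python sketch, not anything Ghostscript or OCRmyPDF provides: it measures the share of whitespace-separated tokens that are a single letter. The function names and the 0.3 threshold are assumptions chosen for illustration.

```python
def single_letter_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are one alphabetic character."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for tok in tokens if len(tok) == 1 and tok.isalpha())
    return singles / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: ordinary English prose has few one-letter words."""
    return single_letter_ratio(text) > threshold


if __name__ == "__main__":
    original = "It will be seen that the North is making an imperative call"
    converted = "It w ill b e se e n th a t th e N o r th is m a k in g"
    print(looks_letter_spaced(original))
    print(looks_letter_spaced(converted))
```

The threshold would need tuning on real pages; text with many initials, formulas, or short-word languages could trigger false positives.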