Bug 701417 - Text is garbage after processing PDF
Status: RESOLVED INVALID
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.27
Hardware: PC other
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2019-08-12 10:10 UTC by clark
Modified: 2019-08-13 06:31 UTC


Attachments
test (134.87 KB, application/pdf)
2019-08-12 10:10 UTC, clark
output (21.25 KB, application/pdf)
2019-08-12 12:20 UTC, clark

Description clark 2019-08-12 10:10:56 UTC
Created attachment 17962
test

After running this the PDF is unreadable

gs -dPDFSETTINGS=/screen -dColorImageResolution=200 -dGrayImageResolution=200 -dMonoImageResolution=200 \
-dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sFONTPATH=/usr/share/fonts/truetype/msttcorefonts \
-dCompressFonts=true -dSubsetFonts=true -dDetectDuplicateImages=true -dOverrideICC=true -dColorConversionStrategy=/Gray -dDoThumbnails=false \
-o out.pdf test.pdf
Comment 1 Ken Sharp 2019-08-12 11:38:23 UTC
(In reply to clark from comment #0)
> Created attachment 17962 [details]
> test
> 
> After running this the PDF is unreadable
> 
> gs -dPDFSETTINGS=/screen -dColorImageResolution=200
> -dGrayImageResolution=200 -dMonoImageResolution=200 \
> -dBATCH -dNOPAUSE -sDEVICE=pdfwrite
> -sFONTPATH=/usr/share/fonts/truetype/msttcorefonts \
> -dCompressFonts=true -dSubsetFonts=true -dDetectDuplicateImages=true
> -dOverrideICC=true -dColorConversionStrategy=/Gray -dDoThumbnails=false \
> -o out.pdf test.pdf

It would be helpful if you could try reducing the command line to just what is necessary to reproduce the problem.

Your file has a problem straight away. Ghostscript says:

   **** Error: Encountered 'obj' while expecting 'endobj'.
               Treating this as a missing 'endobj', output may be incorrect.

So you've got a warning of potential trouble. The object in question is the first object in the file (object 5 0) which is the Info dictionary and (as Ghostscript tells you) is lacking an endobj. You should probably report this to the authors of whichever tool was used to create this file.


I don't see the text in the output as being 'unreadable'. I do see that a small number of non-ASCII (in this case Scandinavian) characters are not being rendered correctly. Is this your problem? If so, it would again really help if you could be specific about your problem when reporting bugs.


If so then this is because your original file uses two CIDFonts; Arial and Arial,Bold. The PDF specification is quite clear that CIDFonts must be embedded. 

In the absence of the required CIDFonts, and presumably also in the absence of a defined substitute CIDFont (setting -sFONTPATH only includes Fonts, not CIDFonts), Ghostscript is forced to use the fallback font. Again this is signalled in the output:

Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from %rom%Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
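
A small sketch of how those messages could be picked out of a longer Ghostscript log. The function name cid_font_warnings is made up for illustration, and the third sample line is an invented non-matching line; the other two sample lines are copied from the log above.

```shell
#!/bin/sh
# Print only the CID font substitution messages from a log read on stdin.
cid_font_warnings() {
    grep -E "Can't find CID font|substitute CID font|to emulate a CID font"
}

# Sample input. With Ghostscript installed, the same filter could instead be
# fed from:  gs -sDEVICE=pdfwrite -o out.pdf test.pdf 2>&1 | cid_font_warnings
printf '%s\n' \
    "Can't find CID font \"Arial\"." \
    "Loading a TT font from %rom%Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done." \
    "some unrelated log line" \
    | cid_font_warnings
```

Redirecting with 2>&1 is a precaution so the filter sees the messages whichever stream they are written to.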

Whenever you substitute a font it is possible that there will be problems. If you don't want that to happen then you must embed the font. Alternatively you must supply a suitable substitute font to use. If I edit /ghostpdl/Resource/Init/cidfmap and define Arial as using the arial.ttf font, then I get a PDF file which matches the original (the Bold variant is still incorrect, of course, as it is a manufactured bold CIDFont, using the regular weight CIDFont).
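
For illustration, a record of the kind described might be written as below. This is only a sketch: the font directory, the scratch directory gs-cidfmap, the /SubfontID 0 value and the /CSI [(Identity) 0] entry are assumptions and are not taken from this report. The comments in the stock cidfmap describe the actual record format.

```shell
#!/bin/sh
# Assumed locations; adjust both to match your system.
FONT_DIR="${FONT_DIR:-/usr/share/fonts/truetype/msttcorefonts}"
CIDFMAP_DIR="${CIDFMAP_DIR:-./gs-cidfmap}"

mkdir -p "$CIDFMAP_DIR"

# One record mapping the missing CIDFont "Arial" to a TrueType file.
# /CSI [(Identity) 0] is an assumption suited to an Identity-ordered font.
cat > "$CIDFMAP_DIR/cidfmap" <<EOF
/Arial << /FileType /TrueType /Path ($FONT_DIR/arial.ttf) /SubfontID 0 /CSI [(Identity) 0] >> ;
EOF

# Show what was written.
cat "$CIDFMAP_DIR/cidfmap"
```

Writing to a scratch directory keeps the sketch from touching an installed file; the record itself would go into whichever cidfmap Ghostscript actually reads.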



If you are seeing a different problem, please reopen the report, but please be much more specific about what you see as the actual fault. You can upload the output file as well if this makes things more obvious.
Comment 2 clark 2019-08-12 12:20:39 UTC
Thanks for the good answer. I have now reduced the command line, but I still get an unreadable PDF file. I will upload a new attachment with the description "output" so you can see for yourself.

gs -dPDFSETTINGS=/screen \
-dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sFONTPATH=/usr/share/fonts/truetype/msttcorefonts \
-dCompressFonts=true -dSubsetFonts=true -dDetectDuplicateImages=true \
-o out.pdf test.pdf
Comment 3 clark 2019-08-12 12:20:59 UTC
Created attachment 17972
output
Comment 4 clark 2019-08-12 12:24:00 UTC
And you are talking about CID fonts

Is it possible to install core CID fonts in the same way as TTF fonts with the command below? And how do you set the CID font path in gs?

apt-get install ttf-mscorefonts-installer
Comment 5 Ken Sharp 2019-08-12 13:47:07 UTC
(In reply to clark from comment #2)
> Thanks for the good answer. I have now reduced the command line, but I
> still get an unreadable PDF file. I will upload a new attachment with the
> description "output" so you can see for yourself.

Well, that's utterly unlike what I get with your command line (though I dropped the -sFONTPATH since I won't have the same fonts, or path). You might try it without the FONTPATH. I do notice that if you select the text and copy it (in Acrobat) the pasted result is correct.

I've tried release and debug code on Linux and Windows and am completely unable to reproduce your file. For me the only faults are a few (fewer than a dozen) glyphs such as oslash and ae which are rendered with the wrong glyph.

Given that you are using some flavour of Linux (though your report says Windows 10) my guess would be that you are using a version supplied by a package, and that some changes have been made by the packager, such as not including all the CMaps or Decoding resources which we supply.


(In reply to clark from comment #4)
> And you are talking about CID fonts

Yes, very specifically CIDFonts.

 
> Is it possible to install core CID fonts in the same way as TTF fonts with
> the command below?

No, because CIDFonts are not TrueType fonts. A CIDFont may have TrueType outlines, but that's not the same thing I'm afraid. Microsoft does not supply CIDFonts.


> And how do you set the CID font path in gs?

There is no CIDFont path. Well, actually that's not exactly true: the path for CIDFonts is /ghostpdl/Resource/CIDFont. But I doubt you have a CIDFont to put there. I suspect what you want to do is use a non-CIDFont as a replacement for a missing CIDFont, in the same way that Ghostscript will often allow you to use (for example) a TrueType font as a substitute for a missing PostScript font.

CIDFonts are more complicated than regular Fonts. Where you are defining a non-CIDFont substitute for a missing CIDFont (e.g. using the TrueType font arial.ttf instead of a CIDFont named Arial), you need to supply additional information. This is more true of Far Eastern fonts than Latin ones, but Ghostscript can't know what the font is to be used for, and cannot guess at the missing information. If we could, we would.

You need to edit cidfmap (/ghostpdl/Resource/Init/cidfmap) as I mentioned in Comment #1. The format of the CIDFont records is described in the file.

If you are using a version of Ghostscript which uses a ROM file system, then you will need to use the -I switch to Include the directory containing cidfmap in the search path, otherwise GS will use the built-in ROM version of the file, which obviously won't contain your substitute.
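
As a rough sketch of the -I usage described above: the directory ./gs-cidfmap is a hypothetical location for the edited cidfmap, and the input and output names are simply the ones used earlier in this report.

```shell
#!/bin/sh
CIDFMAP_DIR="${CIDFMAP_DIR:-./gs-cidfmap}"   # hypothetical directory holding the edited cidfmap
INPUT="${INPUT:-test.pdf}"
OUTPUT="${OUTPUT:-out.pdf}"

# -I adds the directory to Ghostscript's search path, so that this cidfmap
# is found instead of the copy built into the ROM file system.
gs_cmd="gs -I$CIDFMAP_DIR -sDEVICE=pdfwrite -o $OUTPUT $INPUT"
echo "$gs_cmd"

if command -v gs >/dev/null 2>&1 && [ -f "$INPUT" ] && [ -f "$CIDFMAP_DIR/cidfmap" ]; then
    # Left unquoted on purpose so the string splits into arguments;
    # this assumes none of the paths contain whitespace.
    $gs_cmd
else
    echo "not run: need gs, $INPUT and $CIDFMAP_DIR/cidfmap" >&2
fi
```

Printing the command before running it makes it easy to confirm which directory is being passed to -I.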
Comment 6 clark 2019-08-12 14:37:11 UTC
I'm on Debian 10 Buster.. have installed it via apt-get install ghostscript
Comment 7 Ken Sharp 2019-08-12 14:49:17 UTC
(In reply to clark from comment #6)
> I'm on Debian 10 Buster.. have installed it via apt-get install ghostscript

Then yes, you are getting a package. There may or may not be differences in what the package supplies, compared to vanilla Ghostscript.

Obviously we only support our code. When I run your test (but without -sFONTPATH) I don't get the same result you do (it's not completely correct but it's not as wildly wrong either). This suggests that there's something about the package which is causing the difference, or just barely possibly the presence of -sFONTPATH. I can't help you with either of those since I don't have the specific package and don't know what was done to it, nor do I have your specific fonts.

You can always download the Ghostscript source and build from that, which is what I'm doing, or you can try without -sFONTPATH and see if that helps, or you can try supplying a correct substitute for the missing CIDFonts.
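
One way the "without -sFONTPATH" check could be sketched is to produce two outputs that differ only in that switch and compare them by eye. The helper name gs_variant and the output file names are made up for illustration; the switches are the ones from the reduced command line in comment 2.

```shell
#!/bin/sh
INPUT="${INPUT:-test.pdf}"
FONT_DIR="${FONT_DIR:-/usr/share/fonts/truetype/msttcorefonts}"

# Print the gs command for one variant: $1 is the output file,
# any remaining arguments are extra switches.
gs_variant() {
    out="$1"; shift
    echo gs -sDEVICE=pdfwrite -dPDFSETTINGS=/screen "$@" -o "$out" "$INPUT"
}

with_path=$(gs_variant with-fontpath.pdf "-sFONTPATH=$FONT_DIR")
without_path=$(gs_variant without-fontpath.pdf)
echo "$with_path"
echo "$without_path"

if command -v gs >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # Unquoted expansion assumes the paths contain no whitespace.
    $with_path
    $without_path
fi
```

If the two outputs look the same, -sFONTPATH is unlikely to be the cause and the packaged resources become the more likely suspect.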

There's nothing further we can do with this, though; the basic explanation for your problem is in comment #1.
Comment 8 Chris Liddell (chrisl) 2019-08-12 15:22:29 UTC
The DroidSansFallback.ttf on Ubuntu (I assume it's the same as Debian's) isn't the same as the one we ship, so I'm guessing the glyph ordering is different, hence the difference.
Comment 9 clark 2019-08-12 21:32:06 UTC
How do I find the maintainer of the repository package? :)
Comment 10 Chris Liddell (chrisl) 2019-08-13 06:31:45 UTC
(In reply to clark from comment #9)
> How do I find the maintainer of the repository package? :)

Google? It lands on:
https://packages.debian.org/sid/ghostscript

FWIW, as Ken observed, even with our DroidSans, there are still glyphs that are not as intended. You would *probably* get closer to the intended output by creating a cidfmap file with substitutions for Arial and Arial,Bold (IIRC).