Bug 706074 - 10.0.0 gs fails to show JPN characters properly
Summary: 10.0.0 gs fails to show JPN characters properly
Status: RESOLVED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: 10.0.0
Hardware: PC Linux
Importance: P4 normal
Assignee: Default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-11-12 14:31 UTC by becker.rg
Modified: 2022-11-14 15:25 UTC (History)

See Also:
Customer:
Word Size: ---


Attachments
Contains test pdf & jpeg outputs for 9.56.1 & 10.0.0 (452.65 KB, application/zip)
2022-11-12 14:31 UTC, becker.rg

Description becker.rg 2022-11-12 14:31:03 UTC
Created attachment 23479
Contains test pdf & jpeg outputs for 9.56.1 & 10.0.0

I use gs to create jpeg images to check/compare reportlab pdf outputs.

Previously I used 9.56.1 in arch linux and it worked well.

After update to 10.0.0 released Nov 1 for arch I noticed that I was getting wrong jpeg outputs for some of our documents. These were previously generated correctly.

Since a major change was the PDF interpreter, I tried the new flag

-dNEWPDF=false

but I then get errors

/usr/bin/gs -q -dSAFER -dNOPAUSE -dBATCH   -dNEWPDF=false -sOutputFile=test_multibyte_jpn-page%04dold.jpg -sDEVICE=jpeg -r72x72 -f test_multibyte_jpn.pdf
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
......
and no jpegs are produced.

I attach a zip containing the PDF file used and the output images in the hope that it may be of use.
Comment 1 Ken Sharp 2022-11-12 14:55:51 UTC
The problem is 'probably' that the PDF file uses CIDFonts which are not embedded (CIDFonts should be embedded).

There are 3 instances of HeiSeiMin-W3 using different encodings and none of the CIDFonts are embedded. The file additionally uses MS-Mincho, which *is* embedded, and two regular fonts, Helvetica and ZapfDingbats, which are also not embedded.
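
One way to see the embedding status described above is a rough sketch like the following. It assumes the pdffonts tool from poppler-utils is installed and that its table has the usual trailing columns (encoding, emb, sub, uni, object ID); the function name unembedded_fonts is hypothetical, and font names containing spaces are not handled.

```shell
# Read `pdffonts` output on stdin and print the name and encoding of every
# font that is referenced but not embedded.
# Columns are counted from the right because the "type" column can contain
# spaces (for example "CID Type 0"); this layout is an assumption about
# pdffonts' usual table format.
unembedded_fonts() {
    # NR > 2 skips the header line and the dashed separator line.
    awk 'NR > 2 && $(NF-4) == "no" { print $1, $(NF-5) }'
}

# Example use (requires poppler-utils):
#   pdffonts test_multibyte_jpn.pdf | unembedded_fonts
```

Each line printed is a font for which Ghostscript has to find a substitute.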

Anyway, this means that Ghostscript has to use a substitute font and create a 'suitable' CIDFont from it, by using the CIDSystemInfo.

So having said all that... I can't see a problem. The file appears to render correctly with current code and with the 10.00.0 release for me, also when using -dNEWPDF=false. Also, 9.56.1 actually uses the new PDF interpreter as well.

Are you able to get the source code and compile Ghostscript yourself? It may be that the Arch Linux packaging has a problem with 10.00.0; this might explain why -dNEWPDF=false would fail, since that code (the old PostScript-based PDF interpreter) is more or less unchanged since 9.56.1.
Comment 2 becker.rg 2022-11-14 13:09:50 UTC
I'm a bit puzzled about the fonts issue. These are as you say non-embedded fonts and are using various internal encodings about which I am not an expert. I will need to ask my colleague Andy about these.

However, given that I have not altered the generating code or the rest of the system (including fonts) it's hard to see why the new version should behave differently when I re-install the older version. 

Especially as you say the interpreter has not actually changed.

I will try a compile locally directly from source in case there's a packaging issue. Something that looks a bit odd in the latest PKGBUILD is this

# Remove internal CMaps (CMaps from poppler-data are used instead)
  rm -r Resource/CMap
Comment 3 Ken Sharp 2022-11-14 13:38:16 UTC
(In reply to becker.rg from comment #2)
> I'm a bit puzzled about the fonts issue. These are as you say non-embedded
> fonts and are using various internal encodings about which I am not an
> expert. I will need to ask my colleague Andy about these.

It's more than a simple Encoding with CIDFonts, but the data all appears consistent, however....


> I will try a compile locally directly from source in case there's a
> packaging issue. Something that looks a bit odd in the latest PKGBUILD is
> this

We don't do packaging for any Linux system, which is why I suggest using the source as we supply it.

 
> # Remove internal CMaps (CMaps from poppler-data are used instead)
>   rm -r Resource/CMap

OK that sounds like a really bad plan. In the absence of a CIDFont we substitute our fallback CIDFont, NotoSans, and in order to get the CID->GID mapping correct we use the CIDSystemInfo and the relevant CMap.

I'm not at all convinced that using someone else's CMaps is going to work, and even if they do, it's entirely possible that Poppler's files don't have the same coverage as ours.

My (rather elderly) version of Poppler includes 9 CMap files, Ghostscript has 181 so pretty clearly the coverage of Poppler's files is nowhere near sufficient for Ghostscript's potential needs.


> However, given that I have not altered the generating code or the rest of
> the system (including fonts) it's hard to see why the new version should
> behave differently when I re-install the older version. 

I'm not sure what you mean here. Are you suggesting that if you re-install 9.56.1, 10.00.0 suddenly starts working as you would expect? I don't see that mentioned in the original report.

Of course, if the packaging of 9.56.1 included our CMaps, and the two versions are using the same install directory for the support files (and are not using the ROM file system) then installing the old version would re-instate the CMaps, which (if that's the problem) would then mean that 10.00.0 would work, because the CMaps it wants would magically be available. FWIW for me, on Ubuntu with 9.54.0 installed as packaged, the /usr/share/ghostscript/Resource/CMap folder is a link to /usr/lib/ghostscript/CMap. Presumably if I were to install a newer package then it would update the same folder. So yes, I can see how installing the old version *after* installing the new version could suddenly cause this to work.
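
To check where a packaged CMap folder actually points, something along these lines may help. It is only a sketch: the function name cmap_dir_info is hypothetical, the path has to be supplied because install locations differ between distributions, and readlink -f is assumed to be available (it is in GNU coreutils).

```shell
# Print the real location of a CMap resource directory and how many
# entries it holds. Returns non-zero if the path does not exist.
cmap_dir_info() {
    dir=$1
    if [ ! -e "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Follow any chain of symlinks to the final target.
    target=$(readlink -f "$dir")
    # Count directory entries, including dotfiles but not "." and "..".
    count=$(ls -A "$target" | wc -l | tr -d ' ')
    echo "$dir -> $target ($count entries)"
}

# Example use (path as seen on the Ubuntu install mentioned above):
#   cmap_dir_info /usr/share/ghostscript/Resource/CMap
```

If two installed versions report the same target, they share one set of CMaps, which is the situation described above.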


The fonts in your document use 90ms-RKSJ-V, EUC-V and UniJIS-UCS2-H. All those CMaps are present in Ghostscript's Resource/CMap folder; none of them are present in Poppler's cMap folder. (Plus, of course, Ghostscript can't simply use Poppler's folder anyway; presumably there's a symlink or something to Resource/CMap.)
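
A quick way to confirm whether those three CMaps are present on a given install is sketched below. The function name missing_cmaps is hypothetical, and the directory in the example is an assumption that should be adjusted to wherever the package puts Ghostscript's resources.

```shell
# Report which of the named CMap files are absent from a directory.
# Usage: missing_cmaps DIR NAME...
# Returns 0 if every name is present, 1 if at least one is missing.
missing_cmaps() {
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $name"
            status=1
        fi
    done
    return $status
}

# Example use (directory is an assumption, not a fixed location):
#   missing_cmaps /usr/share/ghostscript/Resource/CMap \
#       90ms-RKSJ-V EUC-V UniJIS-UCS2-H
```

Any name it reports as missing is a CMap this document needs but the install does not provide.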

While I can see a potential benefit in sharing CMaps, replacing the more comprehensive set with a reduced set makes no sense to me at all. It also seems to have caused problems in the past since I see a Red Hat report dating back to 2012 where the reporter asks (reasonably I feel) :

"Are these few MB overhead from ghostscript really worth the effort of all
this trouble and debugging and so on? Not really. Please fix..."

Why the packagers think saving < 7MB of space is worth risking Ghostscript not working escapes me.
Comment 4 becker.rg 2022-11-14 15:17:53 UTC
No worries regarding versions; it is a packaging issue. Regarding the observed difference with the Arch package, I meant only that the package was the only thing I was changing, so the difference seemed to lie there.

I used archlinux latest PKGBUILD and patch as detailed here 

https://github.com/archlinux/svntogit-packages/tree/packages/ghostscript/trunk

but I commented out line 60 (rm -r Resource/CMap). After I do that, the rendering of the Heisei fonts seems OK. I looked in the PKGBUILD and see some other references to Poppler.
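
For anyone repeating this, the change can be scripted roughly as follows. This is a sketch only: the function name keep_internal_cmaps is hypothetical, and it matches the line by its content rather than by number, since line 60 may move in later revisions of the PKGBUILD.

```shell
# Comment out the "rm -r Resource/CMap" line in a PKGBUILD so that
# Ghostscript's own CMaps are kept in the built package.
# Usage: keep_internal_cmaps PATH_TO_PKGBUILD
keep_internal_cmaps() {
    pkgbuild=$1
    tmp=$(mktemp) || return 1
    # Insert "#" after the leading whitespace of any line that consists
    # solely of the removal command. Lines that are already commented
    # do not match, so running this twice is harmless.
    if sed 's|^\([[:space:]]*\)\(rm -r Resource/CMap[[:space:]]*\)$|\1#\2|' \
            "$pkgbuild" > "$tmp"; then
        # Write back through the original file to keep its permissions.
        cat "$tmp" > "$pkgbuild"
    fi
    rm -f "$tmp"
}

# Example use, from the directory containing the PKGBUILD:
#   keep_internal_cmaps PKGBUILD
```

Other references to poppler-data in the PKGBUILD are left untouched.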

I will try and convince the pkg maintainer to take a look at this.
Comment 5 Ken Sharp 2022-11-14 15:25:41 UTC
(In reply to becker.rg from comment #4)
> No worries regarding versions it is a packaging issue. Regarding the
> observed difference with the arch pkg I meant only that the only thing I was
> changing was the package so the difference seemed to lie there.

Ah, I see. Pardon my confusion; I'm not an expert on packaging for Linux.

 
> but I commented line 60 rm -r Resource/CMap. After I do that the render of
> the heisei fonts seems OK. Looked in the PKGBUILD and see some other refs to
> poppler. 
> 
> I will try and convince the pkg maintainer to take a look at this.

Thanks for taking that task on. If the package maintainer wants to discuss this with us, there is information on contacting us here:

https://ghostscript.com/resources/index.html

Also as a last resort this bug report could be used. We're certainly open to explaining why using Poppler resources for Ghostscript isn't an ideal solution.

I'm going to close this (for now at least) as works for me.