Bug 706316 - pdfwrite can no longer unify external fonts of embedded PDFs
Status: RESOLVED FIXED
Product: Ghostscript
Component: PDF Writer
Version: 10.0.0
Hardware: PC Linux
Importance: P4 enhancement
Assignee: Ken Sharp
 
Reported: 2023-01-10 09:08 UTC by Werner Lemberg
Modified: 2023-02-24 06:55 UTC


Attachments
gs.tar.gz (111.66 KB, application/gzip)
2023-01-10 09:08 UTC, Werner Lemberg
pdftk output file 1 (4.07 KB, application/pdf)
2023-01-11 04:58 UTC, Werner Lemberg
pdftk output file 2 (3.77 KB, application/pdf)
2023-01-11 04:59 UTC, Werner Lemberg

Description Werner Lemberg 2023-01-10 09:08:14 UTC
Created attachment 23689
gs.tar.gz

[ghostpdl 462efa959c5d7df1dd3fd6ea411522062d1a6c3b]

Consider the attached archive.  The input file, `lilypond.pdf`, was created by pdfTeX and includes two PDF images without embedded fonts.  `pdffonts lilypond.pdf` shows the following:

```
name           type    encoding  emb sub uni object ID
-------------- ------- --------- --- --- --- ---------
PBUMLG+CMR10   Type 1  Builtin   yes yes yes     22  0
ZHSIID+CMTT10  Type 1  Builtin   yes yes yes     23  0
C059-Roman     Type 1  WinAnsi   no  no  no       6  0
Emmentaler-20  Type 1  Custom    no  no  no       7  0
C059-Roman     Type 1  WinAnsi   no  no  no      14  0
Emmentaler-20  Type 1  Custom    no  no  no      15  0
```

Calling the script `call-gs.sh` with gs 9.56.1 and a self-compiled version of a recent git commit, then comparing the output, it becomes obvious that current git has lost the ability to unify identical fonts.

`pdffonts lilypond-gs-9.56.1.pdf`:

```
name                  type     encoding  emb sub uni object ID
--------------------- -------- --------- --- --- --- ---------
IVMHDO+CMTT10         Type 1C  WinAnsi   yes yes no       9  0
UAHBEY+Emmentaler-20  Type 1C  Custom    yes yes no      11  0
SDVPDT+C059-Roman     Type 1C  WinAnsi   yes yes no      13  0
MTVBEO+CMR10          Type 1C  Custom    yes yes no       7  0
```

`pdffonts lilypond-gs-10.01.0.pdf`:

```
name                  type     encoding  emb sub uni object ID
--------------------- -------- --------- --- --- --- ---------
YXTLOZ+CMTT10         Type 1C  WinAnsi   yes yes no       9  0
UAHBEY+Emmentaler-20  Type 1C  Custom    yes yes no      11  0
FJWDPV+C059-Roman     Type 1C  WinAnsi   yes yes no      13  0
UZONGY+Emmentaler-20  Type 1C  Custom    yes yes no      15  0
VGHJZY+C059-Roman     Type 1C  WinAnsi   yes yes no      17  0
NLOVNX+CMR10          Type 1C  Custom    yes yes no       7  0
```

For large documentation files like LilyPond's Notation Reference, which includes more than a thousand such PDF images, this causes a severe degradation: The size of the PDF file increases from about 8MByte to 29MByte...
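
The duplication can also be checked mechanically. The following is a minimal Python sketch, not part of the attached archive, that counts how often each base font name occurs in `pdffonts` output once the six-letter subset tag is removed. The function name `duplicated_fonts` is made up for illustration, and the sample text is the gs 10.01.0 table from above.

```python
from collections import Counter


def duplicated_fonts(pdffonts_output):
    """Return base font names that occur more than once in pdffonts output."""
    counts = Counter()
    # The first two lines are the column header and the dashed separator.
    for line in pdffonts_output.strip().splitlines()[2:]:
        name = line.split()[0]
        # Drop a subset tag such as "UAHBEY+" (six uppercase letters, then "+").
        prefix, plus, rest = name.partition("+")
        if plus and len(prefix) == 6 and prefix.isalpha() and prefix.isupper():
            name = rest
        counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}


SAMPLE = """\
name                  type     encoding  emb sub uni object ID
--------------------- -------- --------- --- --- --- ---------
YXTLOZ+CMTT10         Type 1C  WinAnsi   yes yes no       9  0
UAHBEY+Emmentaler-20  Type 1C  Custom    yes yes no      11  0
FJWDPV+C059-Roman     Type 1C  WinAnsi   yes yes no      13  0
UZONGY+Emmentaler-20  Type 1C  Custom    yes yes no      15  0
VGHJZY+C059-Roman     Type 1C  WinAnsi   yes yes no      17  0
NLOVNX+CMR10          Type 1C  Custom    yes yes no       7  0
"""

print(duplicated_fonts(SAMPLE))
```

For the table above this reports `Emmentaler-20` and `C059-Roman` twice each. The sketch assumes font names without embedded spaces.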
Comment 1 Ken Sharp 2023-01-10 09:47:52 UTC
(In reply to Werner Lemberg from comment #0)

> Calling the script `call-gs.sh` with gs 9.56.1 and a self-compiled version
> of a recent git commit, then comparing the output, it becomes obvious that
> current git has lost the ability to unify identical fonts.

They are not identical; they have different object numbers. In PDF terms they are different fonts.

They have different LastChar entries, different Encodings and differing Widths.


> For large documentation files like LilyPond's Notation Reference, which
> includes more than a thousand such PDF images, this causes a severe
> degradation: The size of the PDF file increases from about 8MByte to
> 29MByte...

I see that this is a problem for you, but essentially you have been taking advantage of a limitation in the old PDF interpreter, not a feature. Because the old interpreter was written in PostScript it used the font names as the key to index on (because that's how PostScript works). This meant that if you had already defined a font called (e.g.) Emmentaler-20 then that is the font that would be found when the next font object was used.

For PDF this is, of course, incorrect; the interpreter should use the object number. Over the years this has led to a large number of bug reports and considerable effort to try and distinguish between two fonts with the same name but otherwise in some way different. I'll be frank; if this was still working for you then you were lucky, we've been disambiguating fonts for years, because treating two different fonts as the same leads to incorrect output (rendering the wrong glyphs from a font).

The new PDF interpreter uses the object number, which is the unique identifier in PDF files, to key the fonts and as a result there is no longer any possibility of confusing two different fonts as being the same.

And no possibility of 'unifying' different fonts with the pdfwrite device.

I've moved this to an enhancement which we'll consider in the future, but the current behaviour is as expected. The input has 6 fonts, the output has 6 fonts.
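
The difference between the two keying schemes can be illustrated with a small Python sketch. This is a toy model, not Ghostscript code: the object numbers and names are taken from the `pdffonts` listing of the input file, while the LastChar values and the function names are invented for illustration.

```python
# Toy font records: (object number, font name, LastChar).
# Object numbers and names match the input file; LastChar values are made up.
fonts = [
    (6, "C059-Roman", 116),
    (7, "Emmentaler-20", 2),
    (14, "C059-Roman", 121),
    (15, "Emmentaler-20", 3),
]


def cache_by_name(records):
    """Old behaviour (simplified): the first font defined under a name wins."""
    cache = {}
    for obj, name, last_char in records:
        cache.setdefault(name, (obj, last_char))
    return cache


def cache_by_object(records):
    """New behaviour (simplified): every font object is a distinct font."""
    return {obj: (name, last_char) for obj, name, last_char in records}


by_name = cache_by_name(fonts)
by_object = cache_by_object(fonts)
print(len(by_name), len(by_object))
```

Keyed by name, objects 14 and 15 silently resolve to the earlier objects 6 and 7, which looks like unification but ignores their differing LastChar values. Keyed by object number, all four fonts stay separate.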
Comment 2 Werner Lemberg 2023-01-10 10:37:06 UTC
Thanks for the analysis.  This is extremely disappointing, since this unification process was the only reason for us to use GS as a postprocessor.  And we definitely know that the fonts are always the same, irrespective of the object number.

So: What can we do to fix this?  Or rather, what must be done to fix this?  We are certainly not the only people who include zillions of small PDF images in a master document – think of a geometry book with lots of labeled images.
Comment 3 Ken Sharp 2023-01-10 10:53:23 UTC
(In reply to Werner Lemberg from comment #2)

> And we definitely know that the fonts are always the same, irrespective of
> the object number.

You may know that in your case (and in fact they aren't the same, as I noted, they are just subsets that can be aggregated into a superset without collisions) but in the general case we cannot know that.

Many applications (LibreOffice for example) write font subsets with the same name even though they are not compatible.

 
> So: What can we do to fix this?  Or rather, what must be done to fix this? 
> We are certainly not the only people who include zillions of small PDF
> images in a master document – think of a geometry book with lots of labeled
> images.

Off the top of my head, I don't know. Not a clue currently I'm afraid.

At a guess we'd need to add some code to pdfwrite to see if there is already a font defined with the same name. If there is then we would need to try and decide whether the new font can be safely combined with the existing one to form a superset.

That's all going to be new code and very heuristic (i.e. guessing) so it'll be prone to errors initially. Obviously we'll err on the side of caution and only create a superset if we are certain that it is safe to do so.
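
One conservative form such a check could take is sketched below in Python. It is only an illustration of the idea and says nothing about how pdfwrite would implement it: each font is modelled as a mapping from character code to a (glyph name, width) pair, all sample values are hypothetical, and two fonts are considered mergeable only if every code they share agrees on both.

```python
def can_merge(a, b):
    """True if every character code used by both fonts maps to the same
    (glyph name, width) pair, so a superset has no collisions."""
    return all(a[code] == b[code] for code in a.keys() & b.keys())


# Hypothetical subsets: character code -> (glyph name, width).
subset1 = {65: ("A", 722), 66: ("B", 667)}
subset2 = {66: ("B", 667), 67: ("C", 722)}
clash = {66: ("C", 722)}  # same code as in subset1, different glyph

print(can_merge(subset1, subset2))
print(can_merge(subset1, clash))
```

The first pair shares only code 66 with identical data, so it passes; the second pair disagrees on code 66 and is rejected. A real check would also have to compare the font programs themselves, which this model leaves out.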
Comment 4 Werner Lemberg 2023-01-10 11:20:39 UTC
> Many applications (LibreOffice for example) write font subsets with
> the same name even though they are not compatible.

I can imagine that, but in our case we don't embed the font(s) in question, which implies no subsetting...

In our current setup, the PDF images are created by GS, which converts an EPS file created by LilyPond, which in turn contains `/NeverEmbed`.  Is there an option to GS that disables even subsetting of the `/Widths` array and friends?  If so, we could probably postprocess the master file, unifying the PDF object numbers.
Comment 5 Ken Sharp 2023-01-10 11:32:02 UTC
(In reply to Werner Lemberg from comment #4)

> In our current setup, the PDF images are created by GS, which converts an
> EPS file created by LilyPond, which in turn contains `/NeverEmbed`.  Is
> there an option to GS that disables even subsetting of the `/Widths` array
> and friends?

Well you can set SubsetFonts to false, but I've no idea what effect that will have on a font which is not being embedded.


>  If so, we could probably postprocess the master file, unifying
> the PDF object numbers.

I'm not entirely certain what you are proposing (partly because I have a limited knowledge of the actual process) but that sounds like an awfully hacky way to proceed, and possibly wouldn't resolve the problem of the FirstChar and LastChar entries, and the fact that the Encoding wouldn't be correct.

You would need to alter the 'final' font object so that the LastChar was the last character actually used by any of the fonts, and also create a new Encoding which was populated with all the entries from the various fonts scattered through the document.
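
What that alteration amounts to can be sketched in Python. This is a simplified model rather than a PDF-editing tool: each font is again a hypothetical mapping from character code to (glyph name, width), the function name `merge_subsets` is made up, and the result only mimics the FirstChar, LastChar, Widths and Encoding entries of a real font object.

```python
def merge_subsets(subsets):
    """Combine per-image subsets into one superset font description."""
    merged = {}
    for subset in subsets:
        for code, entry in subset.items():
            # setdefault returns the existing entry if the code is already known.
            if merged.setdefault(code, entry) != entry:
                raise ValueError(f"character code {code} is used inconsistently")
    first, last = min(merged), max(merged)
    return {
        "FirstChar": first,
        "LastChar": last,
        # Codes inside the range that no subset uses get a width of 0.
        "Widths": [merged[c][1] if c in merged else 0
                   for c in range(first, last + 1)],
        "Encoding": {c: merged[c][0] for c in sorted(merged)},
    }


# Hypothetical subsets: character code -> (glyph name, width).
image1 = {65: ("A", 722), 66: ("B", 667)}
image2 = {65: ("A", 722), 68: ("D", 722)}

print(merge_subsets([image1, image2]))
```

LastChar ends up as the highest code used by any of the inputs, and the Encoding collects the entries from all of them. The sketch assumes at least one non-empty subset.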
Comment 6 Werner Lemberg 2023-01-10 17:43:31 UTC
> I'm not entirely certain what you are proposing (partly because I have
> a limited knowledge of the actual process) but that sounds like an
> awfully hacky way to proceed, and possibly wouldn't resolve the problem
> of the FirstChar and LastChar entries, and the fact that the Encoding
> wouldn't be correct.

Well, there could be a GS option that sets the `FirstChar` and `LastChar` entries, together with the `Widths` array, to span the complete font.  Similarly, the encoding object could contain all glyphs in the font.  This would make the PDF images much larger, but after inclusion in the master document the fonts (descriptors) would be truly identical and could be merged rather easily.  As a temporary solution, a script could even adjust the PDF object IDs before calling GS to actually embed and subset the fonts.
Comment 7 Robin Watts 2023-01-11 00:00:58 UTC
So, I go out to the store and buy 20 copies of the same jigsaw.

I throw away a random 80% of the pieces from each jigsaw.

Then I photocopy all the jigsaws.

It would seem silly of me to complain to the manufacturer of my photocopier that it's failed to make a single jigsaw from all the parts.

If I wanted a single jigsaw, I shouldn't have thrown all the pieces away in the first place.

Similarly, the problem here is that pdfTeX shouldn't be subsetting the fonts in the first place. If it didn't subset them, there would be no need for gs to try to reassemble them.

And it looks to be possible to tell pdfTeX not to subset fonts:

https://tex.stackexchange.com/questions/24002/turning-off-font-subsetting-in-pdftex
Comment 8 Werner Lemberg 2023-01-11 04:58:00 UTC
> Similarly, the problem here is that pdfTeX shouldn't be subsetting
> the fonts in the first place. If it didn't subset them, there would
> be no need for gs to try to reassemble them.

As mentioned previously: the images are created by *Ghostscript*!  I've attached two examples that differ only marginally in their use of `Emmentaler-20`.  pdfTeX takes these files as-is (at least with respect to the external font data), not modifying them.

The original code LilyPond produces is EPS data, where no subsetting happens at all.  So it is this EPS-to-PDF step done by GS that creates subsetting – in most cases this is exactly the right thing; however, for our special case it would be better if that didn't happen.  Of course, the best solution would be for GS to handle that by itself, as Ken described above.

You are right that we could do more optimization by telling TeX not to subset stuff, but this is not the problem here.
Comment 9 Werner Lemberg 2023-01-11 04:58:50 UTC
Created attachment 23696
pdftk output file 1
Comment 10 Werner Lemberg 2023-01-11 04:59:23 UTC
Created attachment 23697
pdftk output file 2
Comment 11 Robin Watts 2023-01-11 10:52:48 UTC
(In reply to Werner Lemberg from comment #8)
> The original code LilyPond produces is EPS data, where no subsetting happens
> at all.  So it is this EPS-to-PDF step done by GS that creates subsetting –
> in most cases this is exactly the right thing, however, for our special case
> it would be better if that didn't happen.  Of course, the best solution
> would be that GS were able to handle that by itself, as Ken described above.

So... what happens if you tell gs NOT to subset fonts?

In call-gs.sh, add '-dSubsetFonts=false' to the list of options?

In general, I think you want to avoid all font subsetting at every stage until (possibly) the very last one.
Comment 12 Robin Watts 2023-01-11 16:59:28 UTC
An alternative idea to be tried if -dSubsetFonts=false doesn't solve it...

Make gs embed fonts when it converts EPS to PDF, i.e. edit call-gs.sh to pass '-dEmbedAllFonts=true -dSubsetFonts=false'.

Then, when you come to call gs for the final time, every one of the fonts that gs meets will be a full, unadulterated copy of the font. I think that gs will correctly spot them as being identical then.

You can choose NOT to embed the font in that final gs run if you want. That should get you back to the desired filesize.
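
The two gs invocations this suggestion implies could look roughly like the command lines built by the Python sketch below. The contents of call-gs.sh are not shown in this report, so the file names and the helper `gs_pdfwrite_args` are assumptions; only the gs options themselves (`-dBATCH`, `-dNOPAUSE`, `-sDEVICE=pdfwrite`, `-dEmbedAllFonts`, `-dSubsetFonts`, `-sOutputFile`) are standard. Whether this sequence actually restores font unification is not established here.

```python
import shlex


def gs_pdfwrite_args(src, dst, embed_all, subset):
    """Build a gs pdfwrite command line (hypothetical helper, not call-gs.sh)."""
    def flag(value):
        return "true" if value else "false"

    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
        f"-dEmbedAllFonts={flag(embed_all)}",
        f"-dSubsetFonts={flag(subset)}",
        f"-sOutputFile={dst}",
        src,
    ]


# Step 1: EPS -> PDF image, embedding complete fonts (file names are made up).
step1 = gs_pdfwrite_args("image.eps", "image.pdf", embed_all=True, subset=False)
# Step 2: final pass over the master document, subsetting only at the very end.
step2 = gs_pdfwrite_args("master.pdf", "master-gs.pdf", embed_all=True, subset=True)

print(shlex.join(step1))
print(shlex.join(step2))
```

The lists can be passed to `subprocess.run` unchanged; printing them first makes it easy to compare against the options already present in an existing script.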
Comment 13 Ken Sharp 2023-01-12 04:47:45 UTC
(In reply to Robin Watts from comment #12)

> Then, when you come to call gs for the final time, every one of the fonts
> that gs meets will be a full, unadulterated copy of the font. I think that
> gs will correctly spot them as being identical then.

I'm inclined to disagree. The specific condition that the PostScript interpreter uses which causes pdfwrite to think the fonts are the same font simply won't happen for multiple different embedded fonts, IMO.

Let's just leave this until I have a chance to tackle it properly.
Comment 14 Robin Watts 2023-01-13 13:58:59 UTC
(In reply to Ken Sharp from comment #13)
> I'm inclined to disagree. The specific condition that the PostScript
> interpreter uses which causes pdfwrite to think the fonts are the same font
> simply won't happen for multiple different embedded fonts, IMO.
> 
> Let's just leave this until I have a chance to tackle it properly.

Fair enough. I bow to your greater knowledge and will butt out.
Comment 15 Robin Watts 2023-01-13 14:15:16 UTC
(In reply to Robin Watts from comment #14)
> Fair enough. I bow to your greater knowledge and will butt out.

Of course, having said that, I think mutool clean should collate multiple identical font objects if that helps as a workaround for now?

 mutool clean -gggg in.pdf out.pdf
Comment 16 Chris Liddell (chrisl) 2023-02-23 13:11:22 UTC
This should be fixed or, at least, much improved with:

https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=42a4ff9ac99e
Comment 17 Werner Lemberg 2023-02-24 06:55:06 UTC
Indeed it is!  LilyPond's notation reference (`notation.pdf`) is now back to 8MByte.  Thanks a lot.