691912 – Potential for improvements: don't embed same font multiple times


Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	9.00
Hardware:	PC Linux

Importance:	P4 enhancement
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-01-27 21:21 UTC by pipitas
Modified:	2016-03-08 03:53 UTC
CC List:	1 user

See Also:
Customer:
Word Size:	---

Attachments
Output of command 't1testpage /usr/share/fonts/X11/Type1/n019003l.pfb' (178.55 KB, application/postscript) 2011-01-27 21:41 UTC, pipitas
pdfwrite result when run input with '".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams" (65.34 KB, application/pdf) 2011-01-27 21:44 UTC, pipitas
pdfwrite result when run input with '".setpdfwrite<</NeverEmbed[/NimbusSanL-Regu]>>setdistillerparams" (31.49 KB, application/pdf) 2011-01-27 21:45 UTC, pipitas
Modified 't1testpage' PS output: removed the 'NimbusSanL-Regu' font with text editor (23.83 KB, application/postscript) 2011-01-27 21:47 UTC, pipitas
pdfwrite result when run input with '-dSubsetFonts=true -c ".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams' (65.37 KB, application/pdf) 2011-01-27 22:20 UTC, pipitas
pdfwrite result when run input with '-dSubsetFonts=false -c ".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams' (156.89 KB, application/pdf) 2011-01-27 22:21 UTC, pipitas

Description pipitas 2011-01-27 21:21:35 UTC

I've been playing with 't1testpage', which takes a .pfb font file as its parameter and creates PostScript output representing a (supposedly complete) sample of glyphs.

't1testpage' has this "--help" description:

   "'t1testpage' creates a PostScript proof document for the specified Type 1
    font file and writes it to the standard output. The proof shows every glyph
    in the font, including its glyph name and encoding."

(The PostScript output has the font embedded.)

OK, I pipe this result to a Ghostscript command line that uses pdfwrite to produce an output PDF. This works, so far so good.

I tweaked the Ghostscript parameters a bit to get full vs. subset embedding:

  (1) -dEmbedAllFonts=true -dSubsetFonts=false \
      -c ".setpdfwrite <</AlwaysEmbed [/${current_font}]>> setdistillerparams"
  (2) -dEmbedAllFonts=true -dSubsetFonts=true \
      -c ".setpdfwrite <</AlwaysEmbed [/${current_font}]>> setdistillerparams"

(1) indeed seems to have the font fully embedded (confirmed by Acroread's file properties display); file size is 216k.
(2) seems to have the font subsetted (confirmed by Acroread's file properties display); file size is 77k.

This same thing happens for every single font I try.

But there is no visual difference at all between the 2 files (as you'd expect). 

My questions are:

 * Why would the "subset" file be so much smaller than the "full embed" file?
   (Remember, since the PDF is supposed to contain the complete set of the
   font's glyphs, I'd assume 'subset == fullset'.)

 * Should I assume that the glyph sample is not completely contained after all?

Comment 1 pipitas 2011-01-27 21:25:46 UTC

More info: 't1testpage' is part of the 'lcdf-typetools' package on Debian/Ubuntu (upstream URL: http://www.lcdf.org/type/index.html).

Comment 2 pipitas 2011-01-27 21:38:21 UTC

I found out a few more things...

The following comes from testing with a font file that Ghostscript knows very well: NimbusSanL-Regu.

The resulting PDF contains a different sample of glyphs on each page (130 for each page). Ghostscript embeds a different subset of the font for each page of the PDF. So far, so comprehensible...

However, Ghostscript uses the same 'unique' font name, 'AIOKJG+NimbusSanL-Regu', for each of these subset fonts, even though each subset contains a different range of glyphs for its page.

Isn't this a bug? Shouldn't the subset names be different?! (Indeed, page no. 5, the *last* page of the PDF contains only 43 glyphs, and the subset name is different here: 'TNAUUZ+NimbusSanL-Regu'.)

So I have:

 - Page 1:  130 glyphs,  subset no. 1,  subset name: 'AIOKJG+NimbusSanL-Regu'
 - Page 2:  130 glyphs,  subset no. 2,  subset name: 'AIOKJG+NimbusSanL-Regu'
 - Page 3:  130 glyphs,  subset no. 3,  subset name: 'AIOKJG+NimbusSanL-Regu'
 - Page 4:  130 glyphs,  subset no. 4,  subset name: 'AIOKJG+NimbusSanL-Regu'
 - Page 5:   43 glyphs,  subset no. 5,  subset name: 'TNAUUZ+NimbusSanL-Regu'

Comment 3 pipitas 2011-01-27 21:41:23 UTC

Created attachment 7157
Output of command 't1testpage /usr/share/fonts/X11/Type1/n019003l.pfb'

Comment 4 pipitas 2011-01-27 21:44:26 UTC

Created attachment 7158
pdfwrite result when run input with '".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams"

Comment 5 pipitas 2011-01-27 21:45:06 UTC

Created attachment 7159
pdfwrite result when run input with '".setpdfwrite<</NeverEmbed[/NimbusSanL-Regu]>>setdistillerparams"

Comment 6 pipitas 2011-01-27 21:47:07 UTC

Created attachment 7160
Modified 't1testpage' PS output: removed the 'NimbusSanL-Regu' font with text editor

Comment 7 pipitas 2011-01-27 22:14:27 UTC

I removed the embedded font file from the original PostScript file produced by 't1testpage'. (Otherwise I would not have managed to make pdfwrite produce a PDF with un-embedded NimbusSanL-Regu.)

'pdffonts' utility shows this for the different PDFs:

 $ pdffonts NimbusSanL-Regu-with-unembedded-Type1.pdf 
 name                  type         emb sub uni object ID
 --------------------- ------------ --- --- --- ---------
 Helvetica-Bold        Type 1       no  no  no       8  0
 Helvetica             Type 1       no  no  no       9  0
 NimbusSanL-Regu       Type 1       no  no  yes     10  0
 NimbusSanL-Regu       Type 1       no  no  yes     17  0
 NimbusSanL-Regu       Type 1       no  no  yes     24  0
 NimbusSanL-Regu       Type 1       no  no  yes     31  0
 NimbusSanL-Regu       Type 1       no  no  yes     38  0


 $ pdffonts NimbusSanL-Regu-with-unembedded-full-Type1_REFERENCE.pdf 
 name                  type         emb sub uni object ID
 --------------------- ------------ --- --- --- ---------
 Helvetica-Bold        Type 1       no  no  no       8  0
 Helvetica             Type 1       no  no  no       9  0
 NimbusSanL-Regu       Type 1       yes no  yes     10  0
 NimbusSanL-Regu       Type 1       yes no  yes     17  0
 NimbusSanL-Regu       Type 1       yes no  yes     24  0
 NimbusSanL-Regu       Type 1       yes no  yes     31  0
 NimbusSanL-Regu       Type 1       yes no  yes     38  0


 $ pdffonts NimbusSanL-Regu-with-embedded-subset-Type1_REFERENCE.pdf 
 name                    type       emb sub uni object ID
 ----------------------- ---------- --- --- --- ---------
 Helvetica-Bold          Type 1     no  no  no       8  0
 Helvetica               Type 1     no  no  no       9  0
 AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     10  0
 AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     17  0
 AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     24  0
 AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     31  0
 TNAUUZ+NimbusSanL-Regu  Type 1C    yes yes yes     38  0
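
The reuse of one subset name across several font objects can also be checked mechanically. What follows is only a minimal sketch: it assumes the column layout of the 'pdffonts' listing above and that font names contain no spaces, and the helper name 'objects_by_font_name' is made up for illustration.

```python
from collections import defaultdict

# Sample copied from the subset-embedding pdffonts listing above.
PDFFONTS_OUTPUT = """\
name                    type       emb sub uni object ID
----------------------- ---------- --- --- --- ---------
Helvetica-Bold          Type 1     no  no  no       8  0
Helvetica               Type 1     no  no  no       9  0
AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     10  0
AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     17  0
AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     24  0
AIOKJG+NimbusSanL-Regu  Type 1C    yes yes yes     31  0
TNAUUZ+NimbusSanL-Regu  Type 1C    yes yes yes     38  0
"""


def objects_by_font_name(listing):
    """Map each font name to the PDF object numbers listed under it."""
    groups = defaultdict(list)
    for line in listing.splitlines()[2:]:  # skip the header and the ruler
        fields = line.split()
        if len(fields) < 6:
            continue  # not a font row
        name = fields[0]            # assumption: no spaces in the font name
        obj_num = int(fields[-2])   # last two fields: object number, generation
        groups[name].append(obj_num)
    return dict(groups)


groups = objects_by_font_name(PDFFONTS_OUTPUT)
# Names that are shared by more than one font object.
reused = {name: objs for name, objs in groups.items() if len(objs) > 1}
print(reused)
```

For the listing above, this reports 'AIOKJG+NimbusSanL-Regu' with objects 10, 17, 24 and 31, which are the four pages that share one subset name.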

Comment 8 pipitas 2011-01-27 22:20:27 UTC

Created attachment 7161
pdfwrite result when run input with '-dSubsetFonts=true -c ".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams'

Comment 9 pipitas 2011-01-27 22:21:34 UTC

Created attachment 7162
pdfwrite result when run input with '-dSubsetFonts=false -c ".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams'

Comment 10 pipitas 2011-01-27 22:22:41 UTC

BTW, I tested this with v8.71 as well as with SVN trunk.

Comment 11 pipitas 2011-01-27 22:27:27 UTC

So to me, the most grave of the problems looks like this:

 * when Ghostscript is asked to do no subsetting when embedding the font...

 * it does indeed embed the font fully, but it does so 5 times, because there
   seems to be a reference to the font on each of the 5 pages.

This seems to be a field where more efficiency could be achieved. (Imagine the file had not 5, but 500 pages...)
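
A rough back-of-the-envelope estimate of that growth, using only the attachment sizes listed on this bug. It rests on a loud assumption: that the 'NeverEmbed' PDF (31.49 KB) and the '-dSubsetFonts=false' PDF (156.89 KB) differ only by the five embedded font copies, which need not be exactly true. The function name is hypothetical.

```python
FULL_EMBED_KB = 156.89  # attachment: '-dSubsetFonts=false' result
NO_EMBED_KB = 31.49     # attachment: 'NeverEmbed' result
PAGES = 5               # one font copy per page in the test file

# Assumption: the size difference is entirely due to the embedded copies.
per_copy_kb = (FULL_EMBED_KB - NO_EMBED_KB) / PAGES


def font_overhead_kb(pages, per_copy=per_copy_kb):
    """Estimated font data in KB if every page embeds its own full copy."""
    return pages * per_copy


print(round(per_copy_kb, 2))         # estimated KB per embedded copy
print(round(font_overhead_kb(500)))  # estimated KB of font data for 500 pages
```

Under that assumption each copy costs about 25 KB, so 500 pages would carry roughly 12.5 MB of repeated font data.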

Comment 12 Ken Sharp 2016-03-08 03:53:08 UTC

The bug report is rather rambling and seems to want to raise several points, which makes it difficult to address.

Overview
----------
The font on each page is a re-encoded version of NimbusSanL-Regu. Because each font instance has a different Encoding, each instance is in fact a different font, even though each has been given the same name. In general this is distinctly bad practice. For example, it means that the font defined on page 1 as 'NimbusSanL-Regu' can't be used on subsequent pages, because the name is reused.
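
To illustrate the point about identity, here is a toy model only, not Ghostscript's data structures: if a font instance is identified by its name together with its Encoding, then instances that share a name but differ in Encoding compare as different fonts. The class 'FontInstance' and the glyph names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FontInstance:
    """Toy font identity: the name plus the Encoding, compared together."""
    name: str
    encoding: tuple  # glyph names in code-position order


# Hypothetical encodings: each page re-encodes the same base font so that
# a different range of glyphs becomes reachable.
page1 = FontInstance("NimbusSanL-Regu", ("A", "B", "C"))
page2 = FontInstance("NimbusSanL-Regu", ("D", "E", "F"))
page2_again = FontInstance("NimbusSanL-Regu", ("D", "E", "F"))

same_name = page1.name == page2.name  # the names match
same_font = page1 == page2            # the instances do not

# Only instances with an identical name AND Encoding collapse into one.
distinct = {page1, page2, page2_again}
print(same_name, same_font, len(distinct))
```

In this model only 'page2' and 'page2_again' collapse into one entry; 'page1' stays separate even though the name matches.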


From comment 0
---------------
The reason the non-subset font is larger than the subset is that *all* the glyphs from the font are included each time. That's why it's better to subset.

I don't know what is meant by "(Remember, since the PDF is supposed to contain the complete set of the font's glyphs, I'd assume 'subset == fullset'", not least because there are 6 different fonts here.

From comment 2
---------------
Current code, at least, embeds 5 different subsets of NimbusSanL-Regu, each with a different subset prefix.

From comment 11
----------------
While it would be possible to spot the fact that the base font is the same, the cost of such identification, in terms of performance, would be prohibitive, especially since this is actually a very rare case.
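
To give a feel for what such identification involves, here is a generic sketch of content-based de-duplication. It is not how pdfwrite works, and the function name and placeholder bytes are invented for illustration. Every candidate font program has to be read and hashed in full before two copies can be declared identical, and that per-font work is where the cost comes from.

```python
import hashlib


def dedupe_font_programs(font_programs):
    """Keep one copy per distinct byte content; return (unique, refs).

    'unique' maps a SHA-256 digest to the stored bytes, and 'refs' lists
    the digest that each input font program would point at.
    """
    unique = {}
    refs = []
    for data in font_programs:
        # Hashing touches every byte of every font program.
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, data)
        refs.append(digest)
    return unique, refs


# Placeholder bytes standing in for five embedded copies of one font program.
copies = [b"%!PS-AdobeFont-1.0: NimbusSanL-Regu (placeholder)"] * 5
unique, refs = dedupe_font_programs(copies)
print(len(copies), "embedded ->", len(unique), "stored")
```

Note that a byte-for-byte comparison like this only catches exact duplicates; font instances that have been re-encoded would need a deeper comparison, which is costlier still.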