I've been playing with 't1testpage', which takes a .pfb font file as its parameter and writes PostScript output representing a (supposedly complete) sample of the font's glyphs. Its "--help" description reads:

    "'t1testpage' creates a PostScript proof document for the specified Type 1 font file and writes it to the standard output. The proof shows every glyph in the font, including its glyph name and encoding."

(The PostScript output has the font embedded.)

I pipe this output into a Ghostscript command line that uses pdfwrite to produce a PDF. This works, so far so good. I then tweaked the Ghostscript parameters a bit to compare full and subset embedding:

(1)  -dEmbedAllFonts=true -dSubsetFonts=false \
     -c ".setpdfwrite <</AlwaysEmbed [/${current_font}]>> setdistillerparams"

(2)  -dEmbedAllFonts=true -dSubsetFonts=true \
     -c ".setpdfwrite <</AlwaysEmbed [/${current_font}]>> setdistillerparams"

(1) does seem to have the font fully embedded (confirmed by Acroread's file properties display); the file size is 216k.

(2) seems to have the font subsetted (confirmed by Acroread's file properties display); the file size is 77k.

The same thing happens with every single font I try. There is no visual difference at all between the two files (as you'd expect).

My questions are:

* Why would the "subset" file be so much smaller than the "full embed" file? (Remember, since the PDF is supposed to contain the complete set of the font's glyphs, I'd assume 'subset == fullset'.)
* Should I assume that the glyph sample is not completely contained after all?
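For anyone trying to reproduce this, the pipeline described above could be wrapped roughly as follows. This is only a sketch under assumptions: the function name make_proof_pdf and the GS variable (a way to point at a different Ghostscript binary) are made up for illustration, and the options other than those quoted in the report (-q, -dBATCH, -dNOPAUSE, -sOutputFile, "-f -" to read PostScript from standard input) are standard Ghostscript switches that the report does not show explicitly. The '.setpdfwrite' operator is kept because the report uses it with 8.71; other Ghostscript versions may not need or accept it.

```shell
#!/bin/sh
# Sketch: render a t1testpage proof to PDF with a chosen subsetting mode.
#   $1 = path to the .pfb file
#   $2 = PostScript name of the font (for AlwaysEmbed)
#   $3 = true or false, passed to -dSubsetFonts
#   $4 = output PDF path
# GS is a hypothetical override for the Ghostscript executable name.
make_proof_pdf() {
    pfb=$1 font_name=$2 subset=$3 out_pdf=$4

    # t1testpage writes PostScript to stdout; "-f -" makes gs read it from stdin.
    t1testpage "$pfb" |
        "${GS:-gs}" -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
            "-sOutputFile=$out_pdf" \
            -dEmbedAllFonts=true "-dSubsetFonts=$subset" \
            -c ".setpdfwrite <</AlwaysEmbed [/$font_name]>> setdistillerparams" \
            -f -
}
```

Variant (1) would then be "make_proof_pdf font.pfb FontName false full.pdf" and variant (2) the same call with "true" and a different output name.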
More info: 't1testpage' is part of the 'lcdf-typetools' package on Debian/Ubuntu (upstream URL: http://www.lcdf.org/type/index.html).
I found out a few more things...

This comes from testing with a font file that Ghostscript knows very well: NimbusSanL-Regu. Each page of the resulting PDF contains a different sample of glyphs (130 per page), and Ghostscript embeds a different subset of the font for each page. So far, so comprehensible...

However, Ghostscript uses the same 'unique' font name, 'AIOKJG+NimbusSanL-Regu', for each of these subset fonts, even though each subset contains a different range of glyphs for its page. Isn't this a bug? Shouldn't the subset names be different?! (Indeed, page 5, the *last* page of the PDF, contains only 43 glyphs, and the subset name is different there: 'TNAUUZ+NimbusSanL-Regu'.)

So I have:

- Page 1: 130 glyphs, subset no. 1, subset name: 'AIOKJG+NimbusSanL-Regu'
- Page 2: 130 glyphs, subset no. 2, subset name: 'AIOKJG+NimbusSanL-Regu'
- Page 3: 130 glyphs, subset no. 3, subset name: 'AIOKJG+NimbusSanL-Regu'
- Page 4: 130 glyphs, subset no. 4, subset name: 'AIOKJG+NimbusSanL-Regu'
- Page 5: 43 glyphs, subset no. 5, subset name: 'TNAUUZ+NimbusSanL-Regu'
Created attachment 7157 Output of command 't1testpage /usr/share/fonts/X11/Type1/n019003l.pfb'
Created attachment 7158 pdfwrite result when the input is run with '".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams"'
Created attachment 7159 pdfwrite result when the input is run with '".setpdfwrite<</NeverEmbed[/NimbusSanL-Regu]>>setdistillerparams"'
Created attachment 7160 Modified 't1testpage' PS output: removed the 'NimbusSanL-Regu' font with a text editor
I removed the embedded font file from the original PostScript file produced by 't1testpage'. (Otherwise I would not have managed to make pdfwrite produce a PDF with un-embedded NimbusSanL-Regu.)

The 'pdffonts' utility shows this for the different PDFs:

$ pdffonts NimbusSanL-Regu-with-unembedded-Type1.pdf
name                   type         emb sub uni object ID
---------------------- ------------ --- --- --- ---------
Helvetica-Bold         Type 1       no  no  no       8  0
Helvetica              Type 1       no  no  no       9  0
NimbusSanL-Regu        Type 1       no  no  yes     10  0
NimbusSanL-Regu        Type 1       no  no  yes     17  0
NimbusSanL-Regu        Type 1       no  no  yes     24  0
NimbusSanL-Regu        Type 1       no  no  yes     31  0
NimbusSanL-Regu        Type 1       no  no  yes     38  0

$ pdffonts NimbusSanL-Regu-with-unembedded-full-Type1_REFERENCE.pdf
name                   type         emb sub uni object ID
---------------------- ------------ --- --- --- ---------
Helvetica-Bold         Type 1       no  no  no       8  0
Helvetica              Type 1       no  no  no       9  0
NimbusSanL-Regu        Type 1       yes no  yes     10  0
NimbusSanL-Regu        Type 1       yes no  yes     17  0
NimbusSanL-Regu        Type 1       yes no  yes     24  0
NimbusSanL-Regu        Type 1       yes no  yes     31  0
NimbusSanL-Regu        Type 1       yes no  yes     38  0

$ pdffonts NimbusSanL-Regu-with-embedded-subset-Type1_REFERENCE.pdf
name                   type         emb sub uni object ID
---------------------- ------------ --- --- --- ---------
Helvetica-Bold         Type 1       no  no  no       8  0
Helvetica              Type 1       no  no  no       9  0
AIOKJG+NimbusSanL-Regu Type 1C      yes yes yes     10  0
AIOKJG+NimbusSanL-Regu Type 1C      yes yes yes     17  0
AIOKJG+NimbusSanL-Regu Type 1C      yes yes yes     24  0
AIOKJG+NimbusSanL-Regu Type 1C      yes yes yes     31  0
TNAUUZ+NimbusSanL-Regu Type 1C      yes yes yes     38  0
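To make the repetition in these listings easier to spot, the pdffonts output could be summarised per font name. The following is a rough sketch, not part of any tool mentioned here: the function name count_font_objects is made up, and it assumes the two header lines shown in the listings and font names that contain no spaces.

```shell
#!/bin/sh
# Sketch: read `pdffonts` output on stdin and print "<count> <font name>",
# one line per distinct name, so reused names and subset prefixes stand out.
# Assumes two header lines and that the name is the first whitespace-separated field.
count_font_objects() {
    awk 'NR > 2 && NF > 0 { seen[$1]++ }
         END { for (name in seen) print seen[name], name }' | sort
}
```

It would be used as "pdffonts some.pdf | count_font_objects"; a count above 1 for a prefixed name means several font objects share one subset name.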
Created attachment 7161 pdfwrite result when the input is run with '-dSubsetFonts=true -c ".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams"'
Created attachment 7162 pdfwrite result when the input is run with '-dSubsetFonts=false -c ".setpdfwrite<</AlwaysEmbed[/NimbusSanL-Regu]>>setdistillerparams"'
BTW, I tested this with v8.71 as well as with SVN trunk.
So to me, the gravest of the problems looks like this:

* when Ghostscript is asked to do no subsetting when embedding the font...
* it does indeed embed the font fully, but it does so 5 times, because there seems to be a reference to the font on each of the 5 pages.

This seems to be an area where more efficiency could be achieved. (Imagine the file had not 5, but 500 pages...)
The bug report is rather rambling and seems to want to raise several points, which makes it difficult to address.

Overview
----------
The font on each page is a re-encoded version of NimbusSanL-Regu. Because each font instance has a different Encoding, each instance is in fact a different font, even though each has been given the same name. In general this is distinctly bad practice; for example, it means that the font defined on page 1 as 'NimbusSanL-Regu' can't be used on subsequent pages because the name is reused.

From comment 0
---------------
The reason the non-subset font is larger than the subset is that *all* the glyphs from the font are included each time; that's why it's better to subset.

I don't know what is meant by "(Remember, since the PDF is supposed to contain the complete set of the font's glyphs, I'd assume 'subset == fullset'", not least because there are 6 different fonts here.

From comment 2
---------------
Current code, at least, embeds 5 different subsets of NimbusSanL-Regu, each with a different subset prefix.

From comment 11
----------------
While it would be possible to spot the fact that the base font is the same, the cost of such identification, in terms of performance, would be prohibitive, especially since this is actually a very rare case.