Bug 695728 - -dSubSetfonts=false not obeyed for some fonts
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: master
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2014-12-11 14:38 UTC by Knut Petersen
Modified: 2014-12-17 23:52 UTC

Attachments
The Emmentaler-18 font in this ps will be subsetted by gs / pdfwrite (436.31 KB, application/postscript)
2014-12-11 14:38 UTC, Knut Petersen
simple example (68.81 KB, application/postscript)
2014-12-13 02:48 UTC, Ken Sharp
gs not obeying encoding (172.12 KB, application/postscript)
2014-12-16 03:38 UTC, Knut Petersen
gs generated pdf that causes evince and pdffonts to complain (5.70 MB, application/pdf)
2014-12-17 14:08 UTC, Knut Petersen

Description Knut Petersen 2014-12-11 14:38:59 UTC
Created attachment 11362 [details]
The Emmentaler-18 font in this ps will be subsetted by gs / pdfwrite

Tested gs versions: 9.06, 9.15, git master, all are affected.

Lilypond is free music engraving software. By default it produces
PostScript files and calls gs to convert them to PDFs. For single documents
that works fine.

If someone writes a musicological document often pdfTeX or more
modern TeX variants like luaTeX are used. It's easy to include the pdfs
generated by lilypond/gs, and everything looks fine. But the pdfs are large,
much too large.

The reason is that gs
* does font subsetting,
* is unable to merge differing subsets of the same base font,
* and it has a habit to silently ignore -dSubsetFonts=false for some fonts.

If you include 100 lilypond pdf snippets in a pdfTeX document, you typically end
up with about 300 to 400 subsets of the same 3 to 4 fonts. Typically most of the
subsets slightly differ, and so

   gs -dBATCH -dNOPAUSE  -q -sDEVICE=pdfwrite ...

will clean the *TeX generated pdf from some identical subsets, but unfortunately
you still have a few hundred subsets of the few original fonts.

Well, there's gs -dSubsetFonts=false. That increases the size of the
lilypond/gs generated pdfs. pdffonts shows that all fonts are embedded,
and that the included fonts are no subsets.

Unfortunately that's _not_ the truth. gs gives no warnings, and it does not
mark the included font as a subset, but all emmentaler fonts (these are
the fonts that contain the music glyphs) only contain subsets of the original
otf font. Because of that a subsequent run of ghostscript will remove a lot
of duplicate fonts, but it fails to remove emmentaler subsets as long as they
differ. According to Murphy's law they differ.

In an ideal world gs would do subsetting and merge differing subsets of the same font. That would help a lot of people with similar problems, it would not be specific to lilypond users.

I could easily accept a gs that would obey -dSubsetFonts=false, as this would allow the current gs to remove duplicates of included fonts and subset them. That also would be a general advantage, not specific to lilypond users.

Maybe it would be sufficient if someone could explain to me why ghostscript silently refrains from embedding the full emmentaler* font found in the postscript documents, as these fonts are part of the lilypond project and could be changed easily. Of course, that would be a solution that would only help lilypond users.

I have to admit that I'm quite unhappy with the current situation. A real world example: Building the lilypond documentation with the new -dNoOutputFonts parameter should be expected to blow up file sizes ... in fact it saves more than 50 MB. 


cu,
 Knut
Comment 1 Hin-Tak Leung 2014-12-11 15:42:08 UTC
(In reply to Knut Petersen from comment #0)
...
> Unfortunately that's _not_ the truth. gs gives no warnings, and it does not
> mark the included font as a subset, but all emmentaler fonts (these are
> the fonts that contain the music glyphs) only contain subsets of the original
> otf font...

Running

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o output.pdf input.ps

does generate a warning:
"Can't embed the complete font CMUSerif-Roman as it is too large, embedding a subset."

Not sure why it doesn't for Emmentaler-18 .
Comment 2 Knut Petersen 2014-12-12 01:09:37 UTC
(In reply to Hin-Tak Leung from comment #1)
> (In reply to Knut Petersen from comment #0)
> ...
> > Unfortunately that's _not_ the truth. gs gives no warnings, and it does not
> > mark the included font as a subset, but all emmentaler fonts (these are
> > the fonts that contain the music glyphs) only contain subsets of the original
> > otf font...
> 
> Running
> 
> gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o output.pdf input.ps
> 
> does generate a warning:
> "Can't embed the complete font CMUSerif-Roman as it is too large, embedding
> a subset."
> 
> Not sure why it doesn't for Emmentaler-18 .

The reason for that warning is found in pdf_base_font_alloc() in devices/vector/gdevpdtb.c. There you'll see:

    if (pbfont->num_glyphs > 2048 && !is_standard) { ... }

As there already is a constant MAX_NO_SUBSET_GLYPHS it probably would be a good idea to use it here instead of the magic value 2048. It's a question whether this code should be active if -dSubsetFonts=false is given - I don't think so, and I believe that this is another bug.

The Emmentaler-18 font has only 565 glyphs, so it passes that test. 

Increasing the constant above to eg 8192 as a quick fix for the CMUSerif-Roman font is not a problem.

Knut
Comment 3 Ken Sharp 2014-12-12 02:10:58 UTC
This relates to the PDF writer, not the font API.

I'm not really sure I understand this one, probably this is because the problem seems to be concatenation of multiple PDF files, and the only thing that has been supplied is a single, over-complicated (3 fonts, pointless pdfmarks etc), PostScript file. I don't even have a Ghostscript command line to work with.

Using the supplied PostScript file I do not see pdfwrite subsetting any fonts when -dSubsetFonts=false. I also don't see the message Hin-Tak claims in comment #1, whether subsetting is permitted or not.

The Emmentaler font in the supplied PostScript file is not, as far as I can tell, subset in the PDF *if* you set -dSubsetFonts=false. Of course it is subset if you don't set that switch as the default is to subset fonts.

But, see the end of this comment for what I *think* is actually happening here.


Taking the points individually:

"In an ideal world gs would do subsetting and merge differing subsets of the same font. That would help a lot of people with similar problems, it would not be specific to lilypond users."

That's not really possible, because it can't be guaranteed that the glyphs in 2 different subsets are the same without extensive (think: slow) checking of all the glyph data in each font. Even if we can identify such cases, we would need to alter the encoding used for the text in at least one of the fonts, because it's likely that each font uses glyphs numbered from 1, so we would have multiple glyphs using (eg) encoding position 1. And for TrueType (type 42) fonts we would need to rebuild new TT tables for the merged subsets, which our TrueType font code is not currently capable of.

Having said that, where we can identify compatible subsets we *do* combine fonts. What we don't do is check each glyph in each subset font against every glyph in every other (same name) subset font to see if subsets are compatible at that level.

The particular problem outlined seems to me to be *quite* specific to Lilypond; there are few (if any) people who have multiple PDF files with incompatible fonts running into the hundreds of instances.


"The reason is that gs
* does font subsetting,"
Yes, unless you tell it not to. In general people prefer smaller PDF files.

"* is unable to merge differing subsets of the the same base font,"
Not completely true, in general it can and does merge subsets, sometimes it can't. It does not do an extended search and compare operation on glyph data as this would hurt performance.

"* and it has a habit to silently ignore -dSubsetFonts=false for some fonts."
No it doesn't. While it is true that pdfwrite will occasionally subset fonts, it doesn't do it silently. I'm well prepared to believe that this could happen, and if it does I'll fix it (the silence), but I certainly don't see any evidence of it here.


"Unfortunately that's _not_ the truth. gs gives no warnings, and it does not
mark the included font as a subset, but all emmentaler fonts (these are
the fonts that contain the music glyphs) only contain subsets of the original
otf font."

Can you supply some evidence of this ? It does not appear to me to be the case in the supplied PostScript file *provided* you set -dSubsetFonts=false obviously.


"I could easily accept a gs that would obey -dSubsetFonts=false, as this would allow the current gs to remove duplicates of included fonts and subset them. "

With the stated exception, this is already the case. But note that if you set -dSubsetFonts=false, then pdfwrite (not Ghostscript, the pdfwrite device) will not then subset the fonts (obviously).


"Maybe it would be sufficient if someone could explain to me why ghostscript silently refrains to embed the full emmentaler* font found in the postscript documents"

As far as I can see, that's not what is happening. See below for more details.


"Of course, that would a solution that would only help lilypond users."

Well, it seems to be a problem only for Lilypond users. Which is fine, I'm not averse to helping niche groups, but I really can't understand the problem here.


Taking a wild stab in the dark, it seems likely to me that the real problem is, as I'm fairly sure I've mentioned before, the bad habit Lilypond has of using glyphshow instead of proper font encoding and standard show operators. There is *NO* equivalent of glyphshow in PDF, so we cannot embed 'the whole font' in the PDF file and use glyphshow.

Instead we create a brand new font to hold the glyphs which are executed with a glyphshow operation (normal show operations proceed, well, normally, but the PostScript never uses the font normally). Usually glyphshow would not be used on a font which had the glyph mapped into the Encoding, so we cannot simply use the original font for the glyphshow operation; the whole point is that it's using glyphs outside the Encoding, *that's* why we need a new font.

Note that we put all the glyphs used by glyphshow in the same brand new font. Since the file never uses the font in a normal fashion it 'looks like' we've subset the original font, but we haven't. We've made a new font to mimic a PostScript operation which PDF can't handle, if you look at the font you will see that it lacks a subset prefix, which is clear evidence that it isn't a subset! The original font never gets embedded, because it is not used (the glyphshow operations don't count because we use a different font for those)

So it's not the Emmentaler font which is the problem, it's the way Lilypond uses it.

Note that glyphshow is normally a *rare* operation in PostScript, which is one reason that this does not cause a problem for people other than Lilypond users.


So having said all that, it's way too much effort, and would adversely impact too many people, to have pdfwrite check every font with the same name at the glyph level to see if it can assemble a new superset font from the supplied fonts. Sure it might make some files a little smaller, and it would obviously help you, but for the vast majority of cases it would slow performance for little or no gain.

Without scanning the whole of the input (which isn't realistically possible for PostScript anyway) we can't be certain that we can create a suitable Encoding for a font using glyphshow which will allow all the glyphs actually used in the program (by any show variant) to be encoded. So we can't reasonably change the way we handle glyphshow.

So I don't see any way for us to deal with this. In my opinion the correct solution is for Lilypond to output more normal PostScript, ie code which doesn't use glyphshow but creates a properly encoded font (or fonts if 255 glyphs is not enough for you) and uses normal show operations. It's not as if it's hard.


Of course, I could be mistaken as to the problem. If you can supply me with more evidence that there is a problem here for us I'll be willing to look at it again (just reopen the bug report and attach the new information), but bear in mind the points above.
Comment 4 Ken Sharp 2014-12-12 02:39:08 UTC
As I suspected, I have commented on Lilypond's usage of glyphshow before, see:

http://bugs.ghostscript.com/show_bug.cgi?id=695259

On that occasion we were fortunately able to work around the problem.

http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=64dd281abf84ba7383aa85c99599b5aebea3998a

provides more background information on how glyphshow is handled. I don't see any realistic opportunity to work around it this time, and I'm still of the opinion that Lilypond should stop using glyphshow because it's easy and expecting us to fix it for them.
Comment 5 Knut Petersen 2014-12-12 05:22:49 UTC
(In reply to Ken Sharp from comment #4)
> As I suspected, I have commented on Lilypond's usage of glyphshow before,
> see:
> 
> http://bugs.ghostscript.com/show_bug.cgi?id=695259
> 
> On that occasion we were fortunately able to work around the problem.
> 
> http://git.ghostscript.com/?p=ghostpdl.git;a=commit;
> h=64dd281abf84ba7383aa85c99599b5aebea3998a
> 
> provides more background information on how glyphshow is handled, I don't
> see any realistic opportunity to work around it this time, and I'm still of
> the opinion that Lilypond should stop using glyphshow because its easy and
> expect us to fix it for them.


Thanks for your quick reaction .... I have to think a bit about your
comments.

Knut
Comment 6 Hin-Tak Leung 2014-12-12 06:14:06 UTC
(In reply to Ken Sharp from comment #3)
...
> Using the supplied PostScript file I do not see pdfwrite subsetting any
> fonts when -dSubsetFonts=false. I also don't see the message Hin-Tak claims
> in comment #1, whether subsetting is permitted or not.
...

The message appeared with both gs 9.14 that came with the system (linux), as well as the dev head debug build.
Comment 7 Knut Petersen 2014-12-12 10:19:42 UTC
(In reply to Hin-Tak Leung from comment #6)
> (In reply to Ken Sharp from comment #3)
> ...
> > Using the supplied PostScript file I do not see pdfwrite subsetting any
> > fonts when -dSubsetFonts=false. I also don't see the message Hin-Tak claims
> > in comment #1, whether subsetting is permitted or not.
> ...
> 
> The message appeared with both gs 9.14 that came with the system (linux), as
> well the dev head debug build.

9.06, 9.07, 9.15 and master all show that message, probably Ken tried something else.

I read source code and experimented with edited ps files and found that using glyphshow is not the sole problem. An edited ps file with no drawing command except

20.000 -10.000 moveto magfontemmentaler-18mHABo /one glyphshow

is handled well by git master. No subsetting. If this command is changed to 

20.000 -10.000 moveto magfontemmentaler-18mHABo /clefs.G glyphshow

the font is subsetted. Any glyph  below 256 works, a single reference with glyphshow to the glyphs in the range 0xe000+ exposes the problem.
Comment 8 Ken Sharp 2014-12-12 11:28:06 UTC
(In reply to Knut Petersen from comment #7)

> I read source code and experimented with edited ps files and found that
> using glyphshow is not the sole problem. An edited ps file with no drawing
> command except
> 
> 20.000 -10.000 moveto magfontemmentaler-18mHABo /one glyphshow
> 
> is handled well by git master. No subsetting. If this command is changed to 

Unless you're planning to rename all 565 glyphs to have standard names, which are present in StandardEncoding, this won't help. And you can't do that anyway, because an Encoding can only have 256 entries.

 
> 20.000 -10.000 moveto magfontemmentaler-18mHABo /clefs.G glyphshow
> 
> the font is subsetted.

It's not subset, it's a different font altogether. It has the same name as the original font but that's not really the point. It's a font created specifically for the purpose of drawing glyphs which are rendered with a glyphshow.

Other magic takes place as well, I haven't tried to expose the full awfulness of text handling when converting from PostScript to PDF, and the myriad of places where we have heuristics to try and do a 'better' job than a simple approach would create.

As was pointed out in the commit message I pointed at in comment #4, we cannot use an unencoded font, so if we get one the first thing we have to do is apply an Encoding. If it so happens that the Encoding we apply has the named glyph in it, then we simply use that encoding point, we don't have to care that it's a glyphshow. The fact that it's a glyphshow only matters if the glyphs you are using are not in the font's Encoding. But then, if they *are* in the Encoding, why on Earth are you using a glyphshow anyway ?


> Any glyph  below 256 works

Glyphs don't have numbers they have names (except in CIDFonts where they have CIDs, TrueType or CFF fonts where they have glyph IDs, the complexity is never ending in the general case.....). Possibly you mean 'any glyph which is present in one of the standard encodings'.


> glyphshow to the glyphs in the range 0xe000+ exposes the problem.

Again, glyphs don't have numbers, I'm not at all certain where that number comes from, perhaps it's a Unicode code point. PostScript doesn't deal with Unicode.

From my POV the problem *is* the use of glyphshow. Finding cases where we've come up with heuristics to work around similar problems in the past doesn't really help. If you continue to draw text by use of glyphshow you will continue to run up against these kinds of problems, because you simply can't do a glyphshow in PDF.

I could apply an Encoding to your font that would 'work' from your perspective for this file, because I can arrange that all the glyphs it uses will be in the Encoding. This would mean that we wouldn't have to spawn a new font to hold the unencoded glyphs executed by glyphshow (actually it's more complicated than that, but I don't want to get bogged down in detail).

But this wouldn't help you in the general case because your font contains more than 256 glyphs, so I can't create an Encoding which contains all the glyphs.

I'm not clear on why you need so many glyphs, but it seems to me that you could reasonably group them by purpose or something, and create an encoded font for each purpose. Then this problem wouldn't arise. In addition your PostScript files would be smaller, and on at least some PostScript interpreters would execute faster.

Really, avoid the use of glyphshow.
Comment 9 Knut Petersen 2014-12-12 16:45:13 UTC
> > 
> > the font is subsetted.
> 
> Its not subset, its a different font altogether. It has the same name as the
> original font but that's not really the point. Its a font created
> specifically for the purpose of drawing glyphs which are rendered with a
> glyphshow.

Ok. It's a different font. 

> 
> As was pointed out in the commit message I pointed at in comment #4, we
> cannot use an unencoded font, so if we get one the first thing we have to do
> is apply an Encoding. If it so happens that the Encoding we apply has the
> named glyph in it, then we simply use that encoding point, we don't have to
> care that its a glyphshow. The fact that its a glyphshow only matters if the
> glyph you are using are not in the font's Encoding. But then, if they *are*
> in the Encoding, why on Earth are you using a glyphshow anyway ?

How should we use a "show"? Only a few standard letters and numbers defined in the Emmentaler fonts could be typed on a keyboard. Lilypond uses a markup language to define the musical structure of a score and then it generates a graphical representation of it. "show" asks for a string. There is no such string for 540+ of those glyphs. 

> > glyphshow to the glyphs in the range 0xe000+ exposes the problem.
> 
> Again, glyphs don't have numbers, I'm not at all certain where that number
> comes from, perhaps its a Unicode code point. PostScript doesn't deal with
> Unicode.

When I load the font in fontforge I see those numbers associated to the glyphs.

> Really, avoid the use of glyphshow.

Maybe I'm blind, but I don't see how. What would be a reasonable encoding?

A brutal hack is to glyphshow all Emmentaler symbols used in a
project before any other output is done in all the postscript files to be converted to pdf. That forces identical subsets of Emmentaler in all
the pdfs included in the pdftex document, and a subsequent run of ghostscript
can then eliminate the duplicate fonts from the pdf generated by pdftex.

I would prefer to learn how to use "show" ... 

Thanks for your patience,

Knut
Comment 10 Ken Sharp 2014-12-13 02:48:33 UTC
Created attachment 11367 [details]
simple example
Comment 11 Ken Sharp 2014-12-13 02:48:55 UTC
(In reply to Knut Petersen from comment #9)

> > care that its a glyphshow. The fact that its a glyphshow only matters if the
> > glyph you are using are not in the font's Encoding. But then, if they *are*
> > in the Encoding, why on Earth are you using a glyphshow anyway ?
> 
> How should we use a "show"? Only a few standard letters and numbers defined
> in the Emmentaler fonts could be typed on a keyboard.

PostScript does not care in the slightest whether you type the letters on a keyboard.


> Lilypond uses a markup
> language to define the musical structure of a score and then it generates a
> graphical representation of it. "show" asks fo a string. There is no such
> string for 540+ of those glyphs. 

OK you *really* need to read a tutorial on font encoding, I'd suggest the excellent articles by John Deubert of Acumen Training. You can find his articles (Acumen Journal) here:

http://www.acumentraining.com/acumenjournal.html

I'd particularly recommend the November and December 2001 issues which deal extensively with encoding fonts.

However, here is an outline of how fonts work in PostScript.

The first thing you need to do is forget any relationship between character codes and the glyphs that get drawn. This should be clear, for example the hex value 0x23 is a # in ASCII, but it's a Q (or 1) in Baudot. So the string argument to show that you refer to is *NOT* a series of characters typed on a keyboard, it's a series of character *codes*.

(Sure, to an English speaker they look like 'text' because PostScript was created by Americans and the default encoding is, essentially, ASCII, but it doesn't have to be)

Now, those character codes have to be interpreted in the context of an encoding (ASCII, Baudot, etc), it could be ASCII, but it need not be. PostScript has the neat ability to store the encoding used as part of the font and this (tada) is what the Encoding array is all about, it divorces the glyph data from the character codes and allows a way to map from one to the other.

When the interpreter processes a show operation, it reads one byte from the string (I won't go into multi-byte fonts here, let's stick with the easy cases), it then uses the numeric value of that byte as an index into the Encoding array. The Encoding array is just a big list of glyph names. So let's say we have an ASCII compatible Encoding array, the string argument to show contains the byte 0x41, we look up index 0x41 in the Encoding array and it says '/A'.

Notice we now have a name. Next we go to the CharStrings dictionary in the font. Dictionaries contain key/value pairs, and are indexed by the key, which is usually a name. In the case of a CharStrings dictionary they are always names (caveat: this is another simplification). So we look up the key '/A' and retrieve the value. That value is a glyph program which we then execute to draw the glyph.
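
A minimal sketch of that two-step lookup, runnable in a PostScript interpreter. Helvetica is only a stand-in here (an assumption, so the snippet works without the Emmentaler font), and the snippet assumes a Type 1 style font that uses StandardEncoding and has a CharStrings dictionary:

```postscript
%!PS
% Step 1: character code -> glyph name, via the font's Encoding array.
% Step 2: glyph name -> glyph program, via the font's CharStrings dictionary.
/Helvetica findfont dup            % font font
/Encoding get 16#41 get            % font name   (the name stored at index 0x41)
dup ==                             % print the name, keep it on the stack
exch /CharStrings get exch known   % is there a glyph program under that name?
==                                 % print the boolean
```

Under those assumptions this should print /A followed by true.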

So you can index the font any way you like. You want 0x41 to mean /clefs.G ? No problem, simply put the name /clefs.G in the Encoding array at index 0x41, then do "(A) show" using that encoded font. The result will be clefs.G drawn on the output.
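
As a hedged sketch of that re-encoding step: the snippet below uses Helvetica and /ampersand as stand-ins (assumptions, so it runs without the Emmentaler font), and /Helvetica-Demo is a hypothetical name. With Emmentaler the name put at index 0x41 would be /clefs.G.

```postscript
%!PS
% Copy the font dictionary (everything except FID), give the copy a
% private Encoding array, and register it under a new name.
/Helvetica findfont
dup length dict begin
  { 1 index /FID ne { def } { pop pop } ifelse } forall
  /Encoding Encoding 256 array copy def   % private copy of the Encoding
  Encoding 16#41 /ampersand put           % code 0x41 now selects /ampersand
  currentdict
end
/Helvetica-Demo exch definefont pop

/Helvetica-Demo findfont 24 scalefont setfont
72 720 moveto (A) show                    % draws the ampersand glyph, not "A"
showpage
```

The original font is left untouched; only the copy registered as /Helvetica-Demo maps 0x41 to the other glyph.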

Simples.....


> > Again, glyphs don't have numbers, I'm not at all certain where that number
> > comes from, perhaps its a Unicode code point. PostScript doesn't deal with
> > Unicode.
> 
> When I load the font in fontforge I see those numbers associated to the
> glyphs.

As I think I said, you can't have a font without an Encoding. If the font has no other Encoding it has StandardEncoding. I imagine FontForge is showing you which glyph is encoded at which position in the default Encoding of the font. I can't really be bothered to look, it's not important since you can trivially re-encode the font to some other scheme. More likely it's some internal scheme that FontForge is using to keep track of (index) the glyphs.

Given that 0xe000 is outside the range of permitted Encodings I would say the latter is the case, but it might be a Unicode value. It's not really relevant in any case, and you won't find that value used anywhere in the actual font data.

 
> > Really, avoid the use of glyphshow.
> 
> Maybe I'm blind, but I don't see how. What would be a reasonable encoding?

To be honest, as a non-musician, I don't know. As I said before you cannot use a single Encoding, since that is limited to 256 glyphs (and one of those needs to be /.notdef). But there is no reason you can't have the basic font encoded three (or more) ways at the same time, as different font instances.

So you could have (for example, this probably doesn't make sense) fonts called:

Emmentaler-scripts
Emmentaler-Noteheads
Emmentaler-accidentals
Emmentaler-rests

etc.

Or perhaps the Emmentaler font contains different styles (it sort of looks like it does, from the names of some of the glyphs), so you might have:

Emmentaler-medicaea
Emmentaler-vaticana
Emmentaler-mensural

I have no idea what would be a convenient grouping, that's your specialist area :-)
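
A sketch of how several differently encoded instances of one base font could be defined. Everything named here is an assumption for illustration: reencode is a hypothetical helper, Helvetica stands in for Emmentaler, and /Demo-Punct and /Demo-Math are made-up instance names. Each glyph list must hold at most 255 names.

```postscript
%!PS
% /NewName /BaseName [ /glyph1 /glyph2 ... ] reencode -
% Defines NewName as a copy of BaseName whose Encoding holds the given
% glyph names at codes 1, 2, 3, ... and /.notdef everywhere else.
/reencode {
  3 dict begin
    /names exch def
    /base exch findfont def
    /newname exch def
    base dup length dict begin
      { 1 index /FID ne { def } { pop pop } ifelse } forall
      /Encoding 256 array def
      0 1 255 { Encoding exch /.notdef put } for
      Encoding 1 names putinterval
      currentdict
    end
    newname exch definefont pop
  end
} bind def

/Demo-Punct /Helvetica [ /ampersand /at /section ] reencode
/Demo-Math  /Helvetica [ /plus /equal /less /greater ] reencode

/Demo-Punct findfont 24 scalefont setfont 72 720 moveto <010203> show
/Demo-Math  findfont 24 scalefont setfont 72 680 moveto <01020304> show
showpage
```

Both instances share the same glyph programs; only the mapping from character codes to glyph names differs.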

 
> A brutal hack is to glyphshow all Emmentaler symbols used in a
> project before any other output is done in all the postscript files to be
> converted to pdf. That forces identical subsets of Emmentaler in all
> the pdfs included in the pdftex document, and a subsequent run of ghostscript
> can then eliminate the duplicate fonts from the pdf generated by pdftex.

I'm not convinced that would work, because there are 565 glyphs in the font, and any given Encoding is limited to 255. So you would need at least 3 fonts to hold all the glyphs. I suppose it 'might' work if you did it as the first item, because the fonts will always contain the same glyph pattern. It's ugly and will be slow though.

 
> I would prefer to learn how to use "show" ... 

Well you cannot use a single font with show, you will need at least 3, but as I pointed out that isn't a huge problem because you can decide the arrangement you need.

Another somewhat horrifying approach would be to take advantage of the fact that PostScript is a programming language and write a PostScript program to do the job. You could create an array of fonts holding all the glyphs then redefine 'glyphshow' so that it takes the glyph name and searches the array of fonts looking for the existence of that glyphname in the font's Encoding. When it finds a match it sets that font as the current font and 'show's a string which contains the index retrieved from the Encoding. Kind of horrible, and rather slow, but I believe it would work. It does require some PostScript programming though.
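
A rough sketch of that approach, under stated assumptions: mkfont, byname, fontlist and the /Demo-A and /Demo-B instances are all hypothetical names, Helvetica stands in for Emmentaler, and the fonts in the list are pre-scaled to one size for simplicity. Redefining glyphshow in userdict only affects code that looks the name up afterwards, not procedures that were already bound.

```postscript
%!PS
/mkfont {                          % /NewName [ names ] mkfont -
  /Helvetica findfont dup length dict begin
    { 1 index /FID ne { def } { pop pop } ifelse } forall
    /Encoding 256 array def
    0 1 255 { Encoding exch /.notdef put } for
    Encoding exch 1 exch putinterval
    currentdict
  end
  definefont pop
} bind def

/Demo-A [ /ampersand /at ] mkfont
/Demo-B [ /plus /equal ] mkfont
/fontlist [ /Demo-A findfont 24 scalefont
            /Demo-B findfont 24 scalefont ] def
/codestr 1 string def

/byname {                          % /glyphname byname -
  /wanted exch def
  /found false def
  fontlist {
    found { pop } {
      dup /Encoding get            % font enc
      0 1 255 {                    % font enc i
        2 copy get wanted eq
        { codestr 0 3 -1 roll put /found true def exit }
        { pop } ifelse
      } for
      pop                          % font
      found { setfont codestr show } { pop } ifelse
    } ifelse
  } forall
} def

/glyphshow { byname } def          % shadow the operator with the lookup

72 720 moveto /at glyphshow /plus glyphshow /ampersand glyphshow
showpage
```

The linear search over every Encoding is what makes this slow; a dictionary mapping glyph names to a font and code would avoid the scan, at the cost of building it up front.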

If I were doing it I would create a number of different fonts encoded suitably and then select the font and show the character codes, but obviously I know absolutely nothing at all about Lilypond, so I've no idea how you would go about that.

As an example of how this can be done (not a prescription, just an example) I've attached a small PostScript program which re-encodes the Emmentaler font and uses show to draw some glyphs from it.
Comment 12 Knut Petersen 2014-12-15 00:33:35 UTC
Thanks a lot for your patience and explanations, they were _very_ helpful.

Currently I don't know which way will be the best as the most efficient would require a lot of changes to the lilypond sources and break some low-level interfaces, but I have enough information to find a proper solution. 

Knut
Comment 13 Knut Petersen 2014-12-16 03:34:44 UTC
(In reply to Ken Sharp from comment #11)

> Another somewhat horrifying approach would be to take advantage of the fact
> that PostScript is a programming language and write a PostScript program to
> do the job. You could create an array of fonts holding all the glyphs then
> redefine 'glyphshow' so that it takes the glyph name and searches the array
> of fonts looking for the existence of that glyphname in the font's Encoding.
> When it finds a match it sets that font as the current font and 'show's a
> string which contains the index retrieved from the Encoding. Kind of
> horrible, and rather slow, but I believe it would work. It does require some
> PostScript programming though.

A pretty fast solution should be to define 565 commands like 

    /clefs.G {<01> show} def

corresponding with the encodings and then to use it like

    50 -20 moveto magfontemmentaler-18mPYo-Clefs clefs.G

There is one problem: When there are encodings A and B of font F and gs sees
e.g. some show commands for glyphs of encoding A, it will put these into a new font for that encoding. That's ok. 

Now we switch to encoding B. 

If the first show command indexes a place in the encoding that is already used, everything works as expected - the pdf contains two copies of the emmentaler font with the expected encodings.

If the first show command indexes an unused place in the encoding vector A, it will put the glyph that should go into the B encoding into the A encoding, messing things up. 

A workaround would be to show e.g. <01> for every encoding, scaled to nothing, white on white, at coordinates 0 0, before any other show operators. This collision seems to help.
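
A sketch of what such a priming step might look like. The procedure name prime and the 0.001 scale factor are assumptions, and the two standard fonts are only stand-ins so the snippet runs as written; in real use the array would list the re-encoded Emmentaler instances. Whether pdfwrite then keeps the encodings apart is exactly the behaviour observed above, not something this snippet guarantees.

```postscript
%!PS
% [ /FontA /FontB ... ] prime -
% Shows code 0x01 once in every listed font, tiny and in white at 0 0,
% so that each font instance has been used before any real text.
/prime {
  gsave
    1 setgray
    {
      findfont 0.001 scalefont setfont
      0 0 moveto <01> show
    } forall
  grestore
} bind def

% Stand-in font names; replace with the encoded instances actually used.
[ /Helvetica /Times-Roman ] prime
```

The gsave/grestore pair restores the previous font, colour and current point, so the priming step does not disturb the drawing code that follows.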

I don't know if this should be considered a bug - at least it is inconsistent and unexpected behaviour.

I'll upload a ps demonstrating the problem.

cu,
 Knut
Comment 14 Knut Petersen 2014-12-16 03:38:10 UTC
Created attachment 11377 [details]
gs not obeying encoding

use the following command line:

/usr/bin/gs -dNOPAUSE -sDEVICE=pdfwrite -dSubsetFonts=false -o bad.pdf bad.ps
Comment 15 Ken Sharp 2014-12-16 04:05:09 UTC
(In reply to Knut Petersen from comment #14)
> Created attachment 11377 [details]
> gs not obeying encoding

Ghostscript doesn't guarantee to preserve *any* encodings; it is free to re-encode the font as it sees fit. It does this a *lot* and we can't change it now.

I think this particular case is caused by pdfwrite attempting to merge separate fonts which appear to be the same (I did mention we try to do this I believe). That's because the two base fonts are the same (Emmentaler-18), so if it sees an empty Encoding position, and a glyph gets used that isn't in the Encoding, it places it in the Encoding at that point and 'merges' the two font instances.

Does this really cause a problem ? I doubt there is anything we can do about it. Too many people rely on the existing behaviour.

The bottom line, as I meant to mention earlier but forgot, is that pdfwrite is not intended to 'merge' or otherwise join PDF files together (or PostScript files for that matter).
Comment 16 Knut Petersen 2014-12-16 05:43:07 UTC
> 
> Does this really cause a problem ? I doubt there is anything we can do about
> it. Too many people rely on the existing behaviour.

Well, the stable branch notation.pdf has e.g 2011 subsets of the emmentaler font.
Without the workaround described above that boils down to a few hundred, with the workaround to exactly 27. 

> The bottom line, as I meant to mention earlier but forgot, is that pdfwrite
> is not intended to 'merge' or otherwise join PDF files together (or
> PostScript files for that matter).

I remember reading something like this ;-) But do you know an open source tool available for Linux that is better suited for that purpose?

cu,
 Knut
Comment 17 Ken Sharp 2014-12-16 05:52:08 UTC
(In reply to Knut Petersen from comment #16)

> Well, the stable branch notation.pdf has e.g 2011 subsets of the emmentaler
> font.
> Without the workaround described above that boils down to a few hundred,
> with the workaround to exactly 27. 

I guess then the work-around makes sense.

I'm not entirely certain why the font behaves as it does; it's a very murky area of the code, not least because it has loads and loads of heuristics to try and optimise the output (by combining fonts if we think we can).

I could look into it, but it will take me a long time, and right at the moment I'm utterly swamped.

 
> > The bottom line, as I meant to mention earlier but forgot, is that pdfwrite
> > is not intended to 'merge' or otherwise join PDF files together (or
> > PostScript files for that matter).
> 
> I remember to have read something like this ;-) But do you know an open
> source tool available for linux that is better suited for that purpose?

pdftk, but do not count on it merging fonts *at all*.....

To be honest, PDF isn't really meant for this purpose either; it's not meant to be a convenient 'container' for work in progress.
Comment 18 Knut Petersen 2014-12-16 06:41:54 UTC
(In reply to Ken Sharp from comment #17)

> I could look into it, but it will take me a long time, and right at the
> moment I'm utterly swamped.

As there is an easy workaround, it is not necessary if you feel that nobody else will benefit from it.


>  
> > > The bottom line, as I meant to mention earlier but forgot, is that pdfwrite
> > > is not intended to 'merge' or otherwise join PDF files together (or
> > > PostScript files for that matter).
> > 
> > I remember to have read something like this ;-) But do you know an open
> > source tool available for linux that is better suited for that purpose?
> 
> pdftk, but do not count on it merging fonts *at all*.....

I agree. It is a fast tool for splitting pdfs and for uncompressing them for inspection and editing by hand (gs will repair them later on; yes, I know, pdfs are not meant to be edited by hand ;-)

> To be honest, PDF isn't really meant for this purpose either, its not meant
> to be a convenient 'container' for work in progress.

As Ken Olsen said in 1977: "There is no reason for any individual to have a computer in his home." ;-)))

cu,
 Knut
Comment 19 Ken Sharp 2014-12-16 06:51:37 UTC
(In reply to Knut Petersen from comment #18)
> (In reply to Ken Sharp from comment #17)
> 
> > I could look into it, but it will take me a long time, and right at the
> > moment I'm utterly swamped.
> 
> As there is an easy workaround it is not be necessary if you feel that
> nobody else will benefit from it.

Well, never say never, but nobody else has complained, so it seems less likely.


> > To be honest, PDF isn't really meant for this purpose either, its not meant
> > to be a convenient 'container' for work in progress.
> 
> As Ken Olsen said in 1977: "There is no reason for any individual to have a
> computer in his home." ;-)))

"To a man with a hammer, everything looks like a nail"

He should have known better by 1977 too, I'm inclined to feel. I already had a personal computer by then.
Comment 20 Hin-Tak Leung 2014-12-16 07:10:55 UTC
(In reply to Knut Petersen from comment #16)
...
> I remember to have read something like this ;-) But do you know an open
> source tool available for linux that is better suited for that purpose?
...

Since you are on Linux you could also use pdfjam (and its family of tools). It is LaTeX-based. It does not do font merging either, though; I mention it mostly as an alternative to pdftk (and it may not be a better choice, depending on the case).
Comment 21 jsmeix 2014-12-17 00:38:03 UTC
FYI:

Regarding what Ken Olsen said in 1977, see
http://en.wikipedia.org/wiki/Ken_Olsen
--------------------------------------------------------------------------
Two quotes of his are frequently taken out of context:

from 1977:
There is no reason for any individual to have a computer in his home.

Referred to having the computer run the house, with automated doors,
voice-activated faucets et cetera.
He had a computer in his home for general use.
[http://www.snopes.com/quotes/kenolsen.asp]
--------------------------------------------------------------------------

Merry Christmas and a happy New Year!
Comment 22 Knut Petersen 2014-12-17 14:04:56 UTC
(In reply to Ken Sharp from comment #19)

> 
> "To a man with a hammer, everything looks like a nail"
> 

The ghostscript hammer optimizes my intermediate pdf from 116 MB down to 5.9 MB; the old code generates a 26 MB pdf.

But: This 5.9 MB pdf generated by pdfwrite seems to be perfectly readable by acroread and okular, there is no warning or error. Other tools like evince and pdffonts complain with tons of warnings:

   "Syntax Warning: Illegal annotation destination"

gs (a few days old git master) was called with nothing but

    gs -dNOPAUSE -sDEVICE=pdfwrite -r1200 -dBATCH -o outfile infile

I'll upload the pdf - maybe you can decide if ghostscript generates a bad pdf or if evince and pdffonts are faulty.

cu,
 Knut
Comment 23 Knut Petersen 2014-12-17 14:08:26 UTC
Created attachment 11383 [details]
gs generated pdf that causes evince and pdffonts to complain
Comment 24 Ken Sharp 2014-12-17 23:52:47 UTC
(In reply to Knut Petersen from comment #22)

> But: This 5.9 MB pdf generated by pdfwrite seems to be perfectly readable by
> acroread and okular, there is no warning or error. Other tools like evince
> and pdffonts complain with tons of warnings:
> 
>    "Syntax Warning: Illegal annotation destination"
> 
> gs (a few day old git master) was called with nothing but 
> 
>     gs -dNOPAUSE -sDEVICE=pdfwrite -r1200 -dBATCH -o outfile infile
> 
> I'll upload the pdf - maybe you can decide if ghostscript generates a bad
> pdf or if evince and pdffonts are faulty.

This is a completely different problem (if it is a problem), please don't keep adding to closed bugs, and heading off in random directions. If you want me to look at this please open a new report.

The likely problem is gluing lots of PDF files together: when I have 4 files which all have annotations pointing to 'page 1', where should they all point in the final PDF file?

We don't really try to address that.
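
To make the ambiguity concrete, here is a hedged sketch (again, not something pdfwrite does) of four hypothetical 3-page files, each carrying one annotation that points to its own page 1. Copying the destinations unchanged leaves all four pointing at the same page; remapping them works only under the assumption that the boundaries of the source files are known.

```python
def concatenate(files):
    """Naively concatenate files; each file is (page_count, [dest_page, ...]).

    Returns (total_pages, naive_dests, remapped_dests). Page numbers are
    1-based. `naive_dests` copies destinations unchanged; `remapped_dests`
    shifts each one by the number of pages that precede its source file.
    """
    offset = 0
    naive, remapped = [], []
    for page_count, dests in files:
        for dest in dests:
            naive.append(dest)
            remapped.append(dest + offset)
        offset += page_count
    return offset, naive, remapped


# Four hypothetical 3-page files, each with one annotation pointing to page 1.
total, naive, remapped = concatenate([(3, [1])] * 4)
print(total, naive, remapped)
```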