This is a placeholder for future work. The summary below is mainly to assist my memory.

The code in scan_cmap_text, which handles large fonts by creating a number of subset fonts, could use improvement. This is especially true for the attached file, which has a single large CFF font composed with a CMap to produce a CIDKeyed instance with a single large descendant.

First, the fonts created to hold the subsets default to a preferred encoding, which results in us not filling every possible position from 0-255. We could create fewer fonts if we filled them completely.

Secondly, we currently break when we detect a switch of descendant font, but not if we switch subsets. This means that all the glyphs in a given text string must be in the same font. I *think* this means that if some glyphs are already encoded in earlier subsets we won't detect that, and will embed them again in the new subset. Also, if we encounter more than 255 glyphs in a single text string which have not previously been encoded, I think we will fail to emit the text. We should allow for a break to switch subsets when required, which should lead to fewer embedded subsets. Needs more investigation.

Finally, Acrobat works differently for these fonts. It embeds a single large FontFile and a single FontDescriptor, plus multiple Type 1 fonts, each of which contains only a subset of the glyphs in its Encoding. This is still more efficient, and would be 'nice to have'. However, it may well be impossible without extensively rewriting the font handling code.
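
To make the first two proposals concrete, here is a rough model of the behaviour I have in mind. It is a toy sketch in Python, not the scan_cmap_text code: the class name SubsetPacker, its methods, and the assumption that all 256 codes of a subset are usable are mine, purely for illustration. It fills each subset completely, reuses the encoding of any glyph that is already in an earlier subset, and breaks the text run whenever the subset changes.

```python
from typing import Dict, List, Tuple

SLOTS_PER_SUBSET = 256  # assumption: every code 0-255 of a subset font is usable


class SubsetPacker:
    """Toy model of packing CIDs into 8-bit subset fonts (hypothetical, not Ghostscript code)."""

    def __init__(self, slots: int = SLOTS_PER_SUBSET) -> None:
        self.slots = slots
        # cid -> (subset index, code within that subset)
        self.location: Dict[int, Tuple[int, int]] = {}
        # number of codes already used in each subset
        self.fill: List[int] = []

    def _encode(self, cid: int) -> Tuple[int, int]:
        # Reuse an existing encoding so a glyph is never embedded twice.
        if cid in self.location:
            return self.location[cid]
        # Open a new subset only when the newest one is completely full.
        if not self.fill or self.fill[-1] == self.slots:
            self.fill.append(0)
        subset = len(self.fill) - 1
        code = self.fill[subset]
        self.fill[subset] += 1
        self.location[cid] = (subset, code)
        return subset, code

    def split_text(self, cids: List[int]) -> List[Tuple[int, List[int]]]:
        """Return runs of (subset index, codes), breaking where the subset changes."""
        runs: List[Tuple[int, List[int]]] = []
        for cid in cids:
            subset, code = self._encode(cid)
            if runs and runs[-1][0] == subset:
                runs[-1][1].append(code)
            else:
                runs.append((subset, [code]))
        return runs
```

In this model a text string with more new glyphs than one subset can hold is simply emitted as several runs, one per subset, instead of failing. Whether the real code can break a string this way without disturbing the text positioning still needs investigation.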
Created attachment 4617: LargeCFF-Font.zip

This test file demonstrates some of the issues, and contains a usefully large CFF font to serve as the basis for constructing further test files.