695869 – missing glyph errors

Bug 695869 - missing glyph errors

Summary: missing glyph errors

Status:	RESOLVED INVALID

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	master
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-03-15 06:50 UTC by brian.zbr
Modified:	2015-03-31 17:50 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Attachments
first file from example (85.76 KB, application/x-pdf) 2015-03-15 06:50 UTC, brian.zbr
interesting portions of the PDF (character mapping, etc) (1.06 KB, text/plain) 2015-03-20 15:28 UTC, breidenbach
ttf representation of the font (3.54 KB, application/octet-stream) 2015-03-20 15:35 UTC, breidenbach
XML representation of font (from TTX) (57.21 KB, text/plain) 2015-03-20 15:35 UTC, breidenbach
output of MS Font Validator tool (61.96 KB, application/xml) 2015-03-21 02:23 UTC, Ken Sharp
zip file containing XML reports and PDF versions of those reports (349.62 KB, application/octet-stream) 2015-03-24 06:19 UTC, Ken Sharp
slight revision to 128 entry font (3.54 KB, application/octet-stream) 2015-03-27 16:45 UTC, breidenbach

Description brian.zbr 2015-03-15 06:50:15 UTC

Created attachment 11516 [details]
first file from example

I have encountered a problem which someone else also reported for pdfsandwich, a wrapper which uses gs. (https://sourceforge.net/p/pdfsandwich/bugs/6/) I'm on Ubuntu 14.04.

pdfsandwich runs commands like this: 

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=612 -dDEVICEHEIGHTPOINTS=792 -dPDFFitPage -o /tmp/pdfsandwich4f3c37.pdf /tmp/pdfsandwich0889c7.pdf

This frequently results in errors like this: 

GPL Ghostscript 9.10: Missing glyph CID=0, glyph=0076 in the font GlyphLessFont . The output PDF may fail with some viewers.

I am attaching the first of the two temporary PDFs from the example given above. A further result is that when the output pdf is viewed in the Linux pdf viewer evince, the text can be selected and copied, but it is not visible when selected. I have also tried this with 9.15 and gotten similar results. 

Sorry if this is an unhelpful report or a known issue, but any information would be appreciated.

Comment 1 Ken Sharp 2015-03-15 08:59:23 UTC

Is the attached file the input file, or the *result* of running a different input file through Ghostscript ? If this isn't the input file then please supply the original, we can't tell anything from the output.

Note that the message you quote is a warning, not an error. The file you've supplied appears to be a scanned page with text in rendering mode 3 applied by an OCR package. Tr3 is not visible, the aim with pdfwrite is that the visual appearance of the resulting PDF file should be the same as the input. Anything else is nice, but not essential.

In particular the pdfwrite device will reprocess the text and embed font subsets. If the original font is lacking a ToUnicode CMap then the resulting 'text' will not be searchable and copy/paste will probably result in gibberish.

Given that the font is called 'GlyphLessFont' it suggests that the original font has no glyphs (presumably to reduce storage space), so this sort of message is indeed quite likely. Since the font does not mark the page, the warning can almost certainly be safely ignored. This is a special case (the text doesn't render), the warning is present so that you get some indication that this *might* be an error.

I notice that you quote warning messages from GS 9.10 but the creator of the PDF is GS 9.15, and the bug report is against master, which one are you actually using ?

Comment 2 Ken Sharp 2015-03-15 09:17:30 UTC

This looks likely to be a duplicate of 695461, the ToUnicode CMap in the original file probably has a single glyph which refers to 2 or more Unicode code points (eg an fi ligature), which our ToUnicode processing can't currently cope with.

However, without seeing the original file I can't be sure.

Comment 3 brian.zbr 2015-03-15 20:55:30 UTC

I'm just a pdfsandwich user with no experience using gs directly, so 90% of this is totally above my head here. But it sounds like gs is most likely behaving as it should, which is helpful to know. Thanks for your response, and sorry for the trouble.

Comment 4 Ken Sharp 2015-03-16 01:11:17 UTC

(In reply to brian.zbr from comment #3)
> I'm just a pdfsandwich user with no experience using gs directly, so 90% of
> this is totally above my head here. But it sounds like gs is most likely
> behaving as it should, which is helpful to know.

That's not completely the case. While we only promise to reconstruct the appearance of the input in a PDF file, we do go considerably further in trying to preserve other aspects too.

In this case we do read the ToUnicode CMap, if present, from the input (this tells us what Unicode code point a given character code represents) and we do preserve it into the output PDF file. However, our implementation was written against an older version of the specification and can't handle some complex cases.

The example is an 'fi' ligature. A ligature is a single shape which represents 2 characters, drawn as one shape for typographical reasons. Although it only has one character code, when you search for it you don't look for 'fi', you look for 'f' followed by 'i'. So the ToUnicode CMap has one character code with 2 Unicode code points. Our code specifically can't handle that.
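
To make the one-code-to-two-code-points case concrete, here is a minimal Python sketch. It assumes only that ToUnicode destinations are written as UTF-16BE hex digits; the character codes and the `entries` dictionary are hypothetical and are not taken from the attached file.

```python
def decode_tounicode_target(hex_string: str) -> str:
    """Decode a ToUnicode destination given as UTF-16BE hex digits."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


# Hypothetical bfchar-style entries: character code -> destination hex string.
entries = {
    0x0001: "0066",      # one code, one code point: 'f'
    0x0002: "00660069",  # one code, two code points: 'f' then 'i' (ligature)
}

decoded = {code: decode_tounicode_target(dst) for code, dst in entries.items()}

for code, text in decoded.items():
    # A ligature shows up as a single code whose decoded text is longer than 1.
    print(f"code {code:#06x} -> {text!r} ({len(text)} code point(s))")
```

A reader that assumes exactly one code point per character code has nowhere to store the second one, which is the limitation described above.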

Now it looks to me, from examining the output file, that this is the problem you are experiencing. However I can't be certain without seeing the input file. If you supply that I'll be able to tell for sure, and if it is the same problem then it'll get added to the first bug as a duplicate.

Why is that useful ? The more instances we see of a problem the more likely we are to take it seriously and fix it. In short if it starts to seem more common we'll give it a higher priority. If I can't be certain it's the same problem, then I won't add it to the existing bug, which means it'll be longer before it gets fixed.

Comment 5 brian.zbr 2015-03-16 03:13:07 UTC

I have realized that *any* pdf that I make with Tesseract 3.03 triggers similar missing glyph messages from Ghostscript, regardless of what version I run it through. Here is an example you could look at: https://tesseract-ocr.googlecode.com/issues/attachment?aid=14340000001&name=output.pdf&token=ABZ6GAfh5XT--00RfaenlBX0GSYdFvEfCw%3A1426498417505

That example is attached to this bug report I filed with Tesseract. This seems to have nothing to do with Ghostscript, but just to provide some context: https://code.google.com/p/tesseract-ocr/issues/detail?id=1434&thanks=1434&ts=1426490574

Comment 6 Ken Sharp 2015-03-16 06:58:25 UTC

OK now that we have a file to look at it's possible to see what the problem is :-)

It looks like Tesseract have decided to embed a 'dummy' font to hold the text. Because the text is in Tr mode 3 it isn't rendered, so it doesn't have to be a real font, a dummy one is good enough. I suspect they cut the font down so that it is as small as possible. They are already embedding a full page image in the PDF so they don't want to include a huge font as well.

Sadly, the font they embed is broken in a number of ways. The Microsoft Font Validator lists 4 missing *required* tables in the embedded TrueType font, and the checksums on three of the tables which are present are incorrect. The glyf table is also broken (they've made it length 0, which is illegal).
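
For anyone wanting to check table checksums by hand, this is a minimal Python sketch of the TrueType table checksum: the table is zero-padded to a multiple of 4 bytes and summed as big-endian 32-bit words, modulo 2^32. The function name is illustrative, and the sketch does not parse the table directory for you.

```python
import struct


def table_checksum(data: bytes) -> int:
    """Sum a table as big-endian uint32 words, modulo 2**32."""
    # Tables are treated as zero-padded to a 4-byte boundary.
    padded = data + b"\0" * (-len(data) % 4)
    total = 0
    for (word,) in struct.iter_unpack(">I", padded):
        total = (total + word) & 0xFFFFFFFF
    return total


# Usage: compare against the checksum stored in the table directory entry.
print(hex(table_checksum(b"\x00\x00\x00\x01\x00\x00\x00\x02")))
```

Note that the head table is a special case: its checkSumAdjustment field has to be treated as 0 when the table checksum is computed.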

There are a number of other problems which we won't worry about, but the important one is that the maxp table declares the number of glyphs as 117. Since we start counting from 0, that means there are glyphs from GID 0 to GID 116. But the PDF file uses a glyph with CID 117, and in the absence of a cmap table we assume that CID = GID. So we try to reference GID 117 and that, obviously, fails. I suspect the developers didn't realise the GIDs start from 0.

Now for rendering none of this matters, for the simple reason that the glyphs are never actually rendered, which means that simply drawing the PDF file (à la Acrobat Reader) works. For PDF processors like Ghostscript/pdfwrite, however, we can't know in advance whether a font is actually used to render or not. So we have to assume it is, which is why you get a warning: it's trying to access a glyph which isn't in the font.

In addition, the broken TrueType font messes up reading the ToUnicode CMap which means that some of the glyphs don't get copied correctly.

There really isn't anything we can do about this, we have to rely on *some* part of the information in the PDF file being correct. If it isn't we'll do our best to recover, but yes, stuff goes missing in this case.

I still suspect that your original file has a ToUnicode CMap which we can't process properly, and that is our bug, but the warning you are getting is because of the font embedded by Tesseract. It's broken.....

Comment 7 brian.zbr 2015-03-16 07:12:02 UTC

Hopefully all this information will help the Tesseract developers if they ever get around to resolving the issue, but it looks like there is a long backlog of issues over there.

I don't suppose you can suggest any sort of quick fix to try in the meantime? Is there some way I could try to replace the bad font?

In any case, thanks again.

Comment 8 Ken Sharp 2015-03-16 07:28:36 UTC

(In reply to brian.zbr from comment #7)

> I don't suppose you can suggest any sort of quick fix to try in the
> meantime? Is there some way I could try to replace the bad font?

Not easily. PDF is a binary format with an index (the xref table), so any change to the content would mean rebuilding the xref table, unless the new content was smaller than the old, in which case it could be padded with white space.
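
The padding idea can be sketched in Python as below. This is only an illustration on raw bytes with a hypothetical helper name; it assumes the bytes being replaced sit somewhere that tolerates trailing white space, and it does nothing about stream /Length values or compressed streams, which is part of why a real repair is much harder.

```python
def replace_same_length(data: bytes, old: bytes, new: bytes) -> bytes:
    """Swap `old` for `new`, space-padding so no byte offsets move."""
    if len(new) > len(old):
        raise ValueError("replacement is longer; xref offsets would shift")
    if data.count(old) != 1:
        raise ValueError("expected exactly one occurrence of the old bytes")
    # Pad on the right so everything after the edit keeps its offset.
    return data.replace(old, new.ljust(len(old), b" "))


# Usage with made-up content: the object after the edit stays at its offset.
original = b"1 0 obj (old value) endobj\n2 0 obj (x) endobj\n"
patched = replace_same_length(original, b"(old value)", b"(new)")
print(patched.index(b"2 0 obj") == original.index(b"2 0 obj"))
```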

Trying to fix this manually looks like a lot of effort to me I'm afraid.

Comment 9 Ken Sharp 2015-03-16 08:24:26 UTC

I had a quick go at hacking this up and it's non-trivial, as our code relies on both the length of the loca table and the numGlyphs entry in the font being correct and self-consistent. Changing either one alone just causes other errors; you have to change both, and that's not as easy as it sounds.

Sorry but there's no simple solution here.

Comment 10 Chris Liddell (chrisl) 2015-03-16 09:55:13 UTC

I'd also note that, for reference, fontforge complains about the font, as does freetype's ftlint tool, and freetype's ftview tool only lists the 117 glyphs (indices 0 - 116) that Ghostscript sees.

So this is not Ghostscript being overly picky about the font, it is, quite clearly, broken.

Comment 11 breidenbach 2015-03-20 15:23:29 UTC

I am the author of Tesseract's PDF module and am responsible for GlyphLessFont; the intent is complete coverage of the entire basic multilingual plane with pass through mapping. A design goal is terrific search and copy-paste performance, and for many PDF renderers, we get it. 

My font manipulation tool was TTX/FontTools with the modifications done in the XML representation. However, I am not a font guru and was unaware of some of the lint tools available. Ken, if you are willing I'd very much like to work with you to help make this font more valid. Tesseract is a relatively important program that produces quite a few PDFs. I'm also using the technique elsewhere. This is a big deal to me and any assistance is appreciated.

Comment 12 breidenbach 2015-03-20 15:28:40 UTC

Created attachment 11530 [details]
interesting portions of the PDF (character mapping, etc)

These are the interesting bits of the PDF that shows the pass through character set mapping. This is fairly unusual for PDF; we do it this way to support an enormous range of languages with a very simple and small invisible font.

Comment 13 breidenbach 2015-03-20 15:35:10 UTC

Created attachment 11531 [details]
ttf representation of the font

Comment 14 breidenbach 2015-03-20 15:35:55 UTC

Created attachment 11532 [details]
XML representation of font (from TTX)

Comment 15 Ken Sharp 2015-03-21 02:23:31 UTC

Created attachment 11533 [details]
output of MS Font Validator tool

(In reply to breidenbach from comment #12)

> These are the interesting bits of the PDF that shows the pass through
> character set mapping. This is fairly unusual for PDF; we do it this way to
> support an enormous range of languages with a very simple and small
> invisible font.

I mentioned earlier in the thread that I *do* understand the goal, and there's nothing in particular about the font that's going to cause a problem for a PDF rendering engine, because the font doesn't actually ever get rendered :-)

However, pdfwrite is a much more complex beast, it's capable of subsetting fonts, altering font representation and a host of other stuff. But to do this (with fonts) it expects to get a font with all the information, it can't tell that the font isn't actually ever going to be rendered.

So, the font you've attached isn't the same as the one that came out of the PDF file, it has 128 glyphs instead of the 117 in the PDF file. Can I ask where you got it from ?

I still have a sneaking suspicion that there's an invalid assumption somewhere in our code, but I haven't had the time to go back and track it down. It would only exhibit in the case where the PDF file used the absolute last glyph in the font, which is extremely rare.

FWIW I use the Microsoft Font Validator tool on TrueType fonts, I've attached the XML output from that tool for the font you attached.

As I said, I'm suspicious that there's an 'off by one' in the PostScript code used to read TT fonts for PDF interpretation, but it'll take some time to track down. At the moment I'm involved in a different project, and we are also in the middle of a release, so it will be a short while before anyone can get back to this.

Comment 16 breidenbach 2015-03-23 12:19:35 UTC

I just cracked open a PDF produced with the Tesseract 3.03.02-3 being distributed on Ubuntu 14.04. The font situation does not look pretty. Even my old friend fontTools/TTX is unhappy.

   IndexError: array assignment index out of range

The Tesseract source repository, however, has an updated font (Oct 9, 2014) with 128 entries that appears to be in better shape. That is what I uploaded to this bug. I would highly appreciate if folks could run various lint tools on it and tell me if there are any problems. Especially if those problems are somehow fixable!

   https://code.google.com/p/tesseract-ocr/source/browse/#git%2Ftessdata

My working theory is I noticed something last year, fixed it, then forgot about it. But the revised font got stuck in Tesseract's interminable release cycle. If you guys give the revised font a clean bill of health, I will find a way to make it ship.

Note that no matter what, the Tesseract PDF will very likely have entries outside the font's range, because that's what happens when you use a tiny font to cover the entire basic multilingual plane. You'll certainly see this outside of English. But other than that I am not expecting any weirdness whatsoever.

Comment 17 Ken Sharp 2015-03-24 06:13:45 UTC

(In reply to breidenbach from comment #16)
> I just cracked open a PDF produced with the Tesseract 3.03.02-3 being
> distributed on Ubuntu 14.04. The font situation does not look pretty. Even
> my old friend fontTools/TTX is unhappy.
> 
>    IndexError: array assignment index out of range

That 'may' be the same problem as we're seeing, but it's hard to be sure.

> with 128 entries that appears to be in better shape. That is what I uploaded
> to this bug. I would highly appreciate if folks could run various lint tools
> on it and tell me if there are any problems. Especially if those problems
> are somehow fixable!

OK so I did upload the report from the MS Font Validator Tool, but apparently I missed the .xsl style sheet. I'll put the reports for both these fonts (and the style sheet) in an attachment shortly.

> about it. But the revised font got stuck in Tesseract's interminable release
> cycle. If you guys give the revised font a clean bill of health, I will
> find a way to make it ship.

It's tricky to say for sure if this is better (though the font report is certainly improved). I implanted it into the original PDF in place of the font that was embedded, and GS no longer complained about the glyphs. However, that's not as good news as it may sound. If I'm correct and *we* have a problem, then it only exhibits when the character code is the same as the maximum glyph in the font. For the original font that was 117, the new font has 128 entries, so we will not try to index the 128th glyph, and so the problem isn't triggered.

I did modify one of the character codes from 0x0020 to 0x0080 and the same problem recurred, even with the new font, which does suggest that we might have a problem here, but it'll be some time before I get back to looking at this I'm afraid.

Comment 18 Ken Sharp 2015-03-24 06:19:01 UTC

Created attachment 11539 [details]
zip file containing XML reports and PDF versions of those reports

Comment 19 breidenbach 2015-03-24 10:07:35 UTC

Is there a recommended 'tidy' tool or similar that I can use to further improve the font? My faith in FontTools/TTX is waning. I'd like to get things in as good a shape as possible before shipping a revision.

Comment 20 Ken Sharp 2015-03-24 10:18:17 UTC

(In reply to breidenbach from comment #19)
> Is there a recommended 'tidy' tool or similar that I can use to further
> improve the font? My faith in FontTools/TTX is waning. I'd like to get
> things in as good a shape as possible before shipping a revision.

I tend to use multiple tools for different purposes, including ttfdump and reading the TT spec.....

While there are complaints from the MS tool, it's *very* picky, I don't believe I've ever seen a font where it didn't have something to say.

As I said, I'm suspicious that we have a problem here, in that numGlyphs is *supposed* to be the number of loca entries - 1, and I'm not certain that's the number we're coming up with. The font validator says that this is what is in the font.

As far as the other 'errors' flagged up by the tool, they are probably not serious for the intended purpose of the font, I'm not sure what effect they will have on pdfwrite, as it will try to subset the font, and that might cause bad things to happen...

I will get back to this, but not until after our release is completed, and I've fixed my current problem, which may take some time, sorry :-(

Comment 21 Ken Sharp 2015-03-25 01:39:33 UTC

OK so I took some time to trek through our code, and I *think* we are OK, we do correctly calculate the number of glyphs from the LOCA table, and this seems to be consistent with the value of numGlyphs from the MAXP table.
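
The consistency check between the two tables can be sketched in Python as follows. It relies on the TrueType rule that loca holds numGlyphs + 1 offsets, each 2 bytes wide in the short format (indexToLocFormat 0) or 4 bytes wide in the long format (indexToLocFormat 1). The function name and the byte lengths in the usage lines are illustrative, not measured from the attached fonts, and this is not Ghostscript's actual code.

```python
def glyphs_from_loca(loca_length: int, index_to_loc_format: int) -> int:
    """Derive the glyph count implied by the byte length of the loca table."""
    if index_to_loc_format not in (0, 1):
        raise ValueError("indexToLocFormat must be 0 (short) or 1 (long)")
    entry_size = 2 if index_to_loc_format == 0 else 4
    if loca_length < entry_size or loca_length % entry_size:
        raise ValueError("loca length is not a whole number of entries")
    # loca has one more entry than there are glyphs.
    return loca_length // entry_size - 1


# Usage: a short-format loca of 236 bytes has 118 entries, hence 117 glyphs,
# which should match numGlyphs in maxp.
print(glyphs_from_loca(236, 0))
print(glyphs_from_loca(516, 1))
```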

The other tables don't, in this case, interest us because the glyphs are never actually rendered. We preserve the TrueType font embedded in the PDF file more or less 'as is' when creating a new output font when outputting a PDF file. The problems with the table checksums and missing tables don't bother us as we copy these verbatim from the original and don't worry about any errors, we assume the original is correct.

The one thing that *does* cause us a problem is the fact that, for the supplied file, the actual number of glyphs is 'incorrect'. If you look at the output from the Microsoft Font Validator tool for the font extracted from the PDF file you can see that numGlyphs is 117, and this is consistent with the size of the LOCA table.

So that is 117 glyphs numbered 0 to 116. No problem there, but the PDF file uses glyphs with character codes ranging from 1 to 117. GID 0 is the .notdef glyph and since the character codes match the glyph IDs, we see that character code 0 isn't used.

The problem arises when the PDF file uses character code 117, we try to lookup that glyph in the font using GID 117 and that can't be done, the maximum GID is 116.

Replacing the font with one which contains 128 glyphs resolves the problems for character codes 0-127 but again will fail when the character code reaches 128.
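
The range problem described above can be sketched in a few lines of Python. The sketch assumes, as stated earlier in this thread, that character codes are used directly as glyph IDs when there is no cmap table; the function name is hypothetical.

```python
def out_of_range_codes(codes, num_glyphs):
    """Return the character codes that have no glyph when code == GID."""
    # Valid GIDs run from 0 to num_glyphs - 1 inclusive.
    return sorted({c for c in codes if not 0 <= c < num_glyphs})


used_in_pdf = range(1, 118)  # character codes 1..117, as in the supplied file

print(out_of_range_codes(used_in_pdf, 117))   # original font, 117 glyphs
print(out_of_range_codes(used_in_pdf, 128))   # revised font, 128 glyphs
print(out_of_range_codes([0x20, 0x80], 128))  # code 0x0080 is 128
```

With 117 glyphs only code 117 falls outside the font; with 128 glyphs the same file is fine, but any code of 128 or above fails in the same way.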

I don't know enough about Tesseract to have an opinion on whether this is sufficient, does it perform OCR for glyphs other than Latin ? Can it handle, for example, accented characters ? If so, I'm not clear how you can encode the requisite characters with only 127 codes available to you. NB since you are using a CIDFont rather than a simple TrueType font you could potentially use 65535 character codes, if the font was able to handle this.

In fact from earlier comments it seems you want to be able to cover 'the entire multilingual plane' and I really don't see any way you can do that with only 127 codes available in the font. You can create a PDF file which will render correctly, because the text is never actually rendered, but it will always fail any analysis by tools like pdfwrite. Of course you could use multiple fonts, but that seems like hard work to me.

Comment 22 breidenbach 2015-03-25 19:56:02 UTC

Thanks for the clear explanation. Tesseract supports dozens of languages, way outside of Latin. Pretty much anything up to 0xFFFF is fair game. I'm also using this exact same technique in some other important places, which we can discuss off-bug.

This is the first compatibility problem I've encountered so far (most of my testing used the 128 entry font.) Unfortunately it sounds like we won't be able to resolve the core compatibility problem in any easy way. I definitely don't want the complexity of dynamic or multiple fonts. I may look into a glyphless font with 65K entries if I can convince it to compress really well.

In the immediate term, I'd still like to clean up the non-critical flaws such as broken checksums. That's still progress even if it doesn't solve the main compatibility problem. Any help or more suggestions (from anyone!) are hugely appreciated as I still don't have the tools in hand to make repairs.

Comment 23 Ken Sharp 2015-03-26 01:20:40 UTC

(In reply to breidenbach from comment #22)

> This is the first compatibility problem I've encountered so far (most of my
> testing used the 128 entry font.) Unfortunately it sounds like we won't be
> able to resolve the core compatibility problem in any easy way. I definitely
> don't want the complexity of dynamic or multiple fonts. I may look into a
> glyphless font with 65K entries if I can convince it to compress really well.

The only legitimate way to use character codes up to 0xFFFF with a font which only contains 128 glyphs is to use the CMAP table to map the codes to the 128 glyphs.

NB you could easily do the same by having a large CMAP table which maps each code to one single glyph.
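
As a rough illustration of the many-codes-to-one-glyph idea, here is a small Python model of a range-based code-to-glyph lookup. The `SEGMENTS` list and glyph ID 1 are hypothetical; the sketch models the mapping only and does not build an actual cmap subtable, and which subtable formats a given consumer accepts is a separate question.

```python
from bisect import bisect_right

# Hypothetical segments: (first_code, last_code, glyph_id), sorted by first_code.
# One segment sends every two-byte code to the same empty glyph.
SEGMENTS = [(0x0000, 0xFFFF, 1)]


def glyph_for_code(code, segments=SEGMENTS):
    """Look up the glyph ID for a character code; 0 (.notdef) if unmapped."""
    starts = [first for first, _, _ in segments]
    i = bisect_right(starts, code) - 1
    if i >= 0 and code <= segments[i][1]:
        return segments[i][2]
    return 0


# Usage: Latin, accented and CJK codes all resolve to the same glyph.
for code in (0x0041, 0x00E9, 0x4E2D):
    print(f"{code:#06x} -> GID {glyph_for_code(code)}")
```

With a mapping like this the font needs only a couple of glyphs, and no character code ever has to equal a glyph ID.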

 
> In the immediate term, I'd still like to clean up the non-critical flaws
> such as broken checksums. That's still progress even if it doesn't solve the
> main compatibility problem. Any help or more suggestions (from anyone!) are
> hugely appreciated as I still don't have the tools in hand to make repairs.

None of these are particularly a problem, especially the ones in the 128 glyph font. As I said, I've never seen a font which that tool didn't complain about to some degree. 

You can pick up the majority of the tools I'm using by going to the MS typography website (there's a link at the end of the font reports).

Quickly looking at the report:

You can probably ignore the missing POST table, that's for PostScript compatibility and I can't see that you need it (even though it's technically required). However if you use a version 3.0 post table you can make it pretty minimal.

The OS/2 table contains metrics information which isn't really relevant, but it's trivial enough to add dummy values.

I believe it should be simple to add a NAME table.

I'd have to actually examine the glyf data to see why the tool is whining about control points and intersecting contours, I had rather assumed the glyphs were empty.

If you turn off the non-linear scaling flag in the head table that'll get rid of two more warnings.

Making the Descender value -1 would get rid of an error in the hhea table.

The remaining error is in the maxp table where the maxComponentElements is 8, but the calculated value is 0, changing the value to 0 should solve that.


Note that the checksums in this (the 128 glyph font) all appear to be correct.

Comment 24 breidenbach 2015-03-27 16:45:46 UTC

Created attachment 11550 [details]
slight revision to 128 entry font

maxComponentElements = 0, descender = -1, new flags in head table. Otherwise no changes, and certainly no big compatibility changes.

Comment 25 breidenbach 2015-03-31 17:50:50 UTC

Working on improvements.