Bug 693825 - gs adds spaces and breaks formatting of original pdf file
Summary: gs adds spaces and breaks formatting of original pdf file
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.06
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-29 10:43 UTC by freecorvette
Modified: 2013-04-23 07:17 UTC

See Also:
Customer:
Word Size: ---


Attachments
Input PDF file for gs (582.84 KB, application/pdf)
2013-03-29 10:43 UTC, freecorvette
Reduced test file (1.38 MB, application/pdf)
2013-04-05 07:41 UTC, Ken Sharp
reduced test file with Widths override removed (1.38 MB, application/pdf)
2013-04-05 07:44 UTC, Ken Sharp

Description freecorvette 2013-03-29 10:43:04 UTC
Created attachment 9472
Input PDF file for gs

I'm using gs for reducing the size of PDF files. I'm running the following command:

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=output.pdf -dBATCH input.pdf

against the attached file. As you can see, the output file has a lot of randomly added spaces, some spaces removed, and the justified alignment of the text is lost.

Any ideas why this is happening?

Thank you.
Comment 1 Ken Sharp 2013-03-29 10:56:20 UTC
This is obviously relevant to the PDF Writer, nothing to do with images.
Comment 2 Ken Sharp 2013-04-04 15:59:10 UTC
The problem appears to have multiple causes. The Widths array for the DejaVuSans CIDFont does not appear to match the actual widths of the glyphs in the font, which is causing pdfwrite to emit most of the characters with small horizontal kerning.

This in itself is not a problem. However, there is a work-around (presumably for early versions of Acrobat) which limits the number of movements in a single 'TJ' operation to 50. When we reach this limit we stop emitting the text and start a new line of text, which is offset from the start of the previous line by a fixed 'x' amount.

It appears that the cumulative difference between the widths in the font and the Widths declared in the array is causing a miscalculation of the x-offset of the starting point of the additional line.
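
To illustrate the effect described above, here is a rough Python sketch. It is not taken from the Ghostscript sources: the function name, the glyph count, the width values and the 0.4-unit discrepancy are all invented for the example. It computes where each chunk of at most 50 glyphs starts, once using the widths in the font program and once using the declared Widths, and shows how the difference grows at each split.

```python
# Hypothetical sketch: how a small per-glyph width mismatch accumulates
# into a visible error at the point where a long TJ run is split.
# Nothing here is taken from the Ghostscript sources.

MAX_MOVES = 50  # the per-TJ limit mentioned above


def chunk_starts(widths, max_moves=MAX_MOVES):
    """Return the x position (in 1/1000 text-space units) at which each
    chunk of at most max_moves glyphs starts, given per-glyph widths."""
    starts = []
    x = 0.0
    for i, w in enumerate(widths):
        if i % max_moves == 0:
            starts.append(x)
        x += w
    return starts


# 120 glyphs; assume the font program says 600.4 and the /W array says 600.
font_widths = [600.4] * 120
declared_widths = [600.0] * 120

from_font = chunk_starts(font_widths)
from_declared = chunk_starts(declared_widths)

# Error in the starting point of each chunk if the wrong set of widths is used.
drift = [a - b for a, b in zip(from_font, from_declared)]
print([round(d, 1) for d in drift])
```

With these made-up numbers the starting point is off by about 20 units at the first split and about 40 at the second, even though each individual glyph is only 0.4 units out.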

Of course the Widths are *supposed* to match the actual glyph widths in the font....

This is going to take some time to sort out.
Comment 3 freecorvette 2013-04-04 18:00:18 UTC
Ken,

Thanks for the investigation. The original document was generated with DOMPdf, a PHP library that converts HTML to PDF. DejaVu Serif is one of the few UTF-8 fonts they provide along with the library, and we're using the font to generate PDFs in a lot of languages that use UTF-8 characters.

I will forward this information to the DOMPdf team as well, in case it's easier to fix the font, instead of fixing ghostscript to accommodate broken fonts.
Comment 4 Ken Sharp 2013-04-05 07:32:54 UTC
(In reply to comment #3)
> Ken,
> 
> Thanks for the investigation. The original document was generated with
> DOMPdf, a PHP library that converts HTML to PDF. DejaVu Serif is one of the
> few UTF-8 fonts they provide along with the library, and we're using the
> font to generate PDFs in a lot of languages that use UTF-8 characters.
> 
> I will forward this information to the DOMPdf team as well, in case it's
> easier to fix the font, instead of fixing ghostscript to accommodate broken
> fonts.

Strictly speaking there is nothing wrong with the font. The 'problem' is simply that the /W array (which declares the widths of the glyphs) contains entries which apparently don't precisely match the values in the font. They are very similar but not the same. Note that the /W array is a PDF entity and not part of the font.
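
For readers unfamiliar with the /W array, the Python sketch below shows how its two entry forms map CIDs to widths, with /DW as the fallback. It assumes the array has already been parsed into a Python list; the function names and the sample values are hypothetical and have nothing to do with the attached file.

```python
def expand_w(w_array):
    """Expand a parsed /W array into a {cid: width} dict.

    The PDF Reference allows two entry forms:
      c [w1 w2 ... wn]   -> widths for CIDs c, c+1, ..., c+n-1
      c_first c_last w   -> the same width w for every CID in the range
    """
    widths = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        if isinstance(w_array[i + 1], list):
            # Form 1: a start CID followed by a list of consecutive widths.
            for offset, w in enumerate(w_array[i + 1]):
                widths[first + offset] = w
            i += 2
        else:
            # Form 2: first CID, last CID, one shared width.
            last, w = w_array[i + 1], w_array[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = w
            i += 3
    return widths


def glyph_width(cid, widths, dw=1000):
    """Width for a CID, falling back to /DW (default 1000) when absent."""
    return widths.get(cid, dw)


# Made-up example: CIDs 1 and 2 listed individually, CIDs 10..12 share a width.
sample = expand_w([1, [500, 600], 10, 12, 250])
print(sample)
print(glyph_width(99, sample))  # not listed, so the /DW default applies
```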

Some applications use modified /W entries to perform special effects such as kerning, which is not recommended (CorelDraw is a culprit here if I remember correctly). In cases where the Widths array does not match the values in the font, the Reference Manual says the widths from the /W array are used.

From the PDF Reference (page 639 in the 1.7 version, under "Glyph Metrics in CIDFonts"):

"These widths must be consistent with the actual widths given in the CIDFont program."


So I still need to fix this, and I need to understand first why the position tracking is incorrect.

By the way, pdfwrite isn't intended for making PDF files smaller. While it *may* do so there is no guarantee that it will, and for certain kinds of content it may well increase the file size.
Comment 5 Ken Sharp 2013-04-05 07:41:06 UTC
Created attachment 9489
Reduced test file
Comment 6 Ken Sharp 2013-04-05 07:44:10 UTC
Created attachment 9490
reduced test file with Widths override removed

In fact it appears there might actually be a problem with the font. I haven't investigated in detail, but removing the /W and /DW entries from the font results in a PDF file where some of the glyphs collide, suggesting that the widths in the font program are incorrect.
Comment 7 Ken Sharp 2013-04-05 10:36:59 UTC
It transpires that the metrics differences are due to rounding errors in the conversion from the TrueType design grid (2048x2048) to the PDF design grid (1000x1000). This explains why the differences are minuscule. I intend to address this, but I want to finish solving the underlying problem first.
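
The size of that rounding error is easy to see with a little arithmetic. The Python sketch below scales an advance width from a 2048-unit grid to the 1000-unit grid and reports how far the rounded value is from the exact one. The function names and the sample advance widths are arbitrary and are not read from the DejaVu font.

```python
UNITS_PER_EM = 2048  # TrueType design grid, as stated above
PDF_UNITS = 1000     # PDF glyph space: 1000 units per em


def to_pdf_units(advance, units_per_em=UNITS_PER_EM):
    """Exact (unrounded) width in PDF glyph-space units."""
    return advance * PDF_UNITS / units_per_em


def rounding_error(advance, units_per_em=UNITS_PER_EM):
    """Difference between the integer-rounded width and the exact width."""
    exact = to_pdf_units(advance, units_per_em)
    return round(exact) - exact


# Arbitrary sample advance widths in font design units.
for advance in (1024, 1303, 651):
    exact = to_pdf_units(advance)
    print(advance, exact, round(exact), rounding_error(advance))
```

The error per glyph can never exceed half a unit, which is why the discrepancies are tiny, but it is non-zero for most advance widths, so almost every glyph picks up a small adjustment.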

Part of this is that we are un-applying the word spacing (Tw) value to the CID with a value of 0x0020, which we should not do, as word spacing is only applied to single-byte fonts. Altering that improves the situation, in that the text appears correctly spaced, but the spaces themselves are too large. More to do.
Comment 8 Ken Sharp 2013-04-08 18:23:31 UTC
I do finally have a fix for this which I will commit tomorrow.

The problem arises because we split the line: we have a limit of 50 'move' operations per line, presumably to satisfy some ancient version of Acrobat, and the rounding error on the TrueType metrics results in a lot of frankly spurious minuscule movements for each glyph, which means we hit this limit far more often than we should.

When we split the line we need to make sure that the new line starts from the correct point, so we track the current position as we do so. The problem was that we were adding the word spacing (Tw) whenever we encountered a 0x20 byte. Normally this is correct, but for multi-byte fonts it is not: we never apply word spacing to a character code which is defined by more than one byte. As an added complication we apply this value to two similar, but different, metrics, and it's important to make sure both are the same.
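
The word-spacing rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the pdfwrite code: the function names, the widths and the Tw value are invented, and horizontal scaling and TJ adjustments are left out for brevity.

```python
def advance(code, code_length, width, font_size, tc=0.0, tw=0.0):
    """Horizontal displacement for one glyph.

    width is in 1/1000 units; tc is character spacing, tw is word spacing.
    Word spacing is added only for the single-byte character code 0x20.
    """
    dx = width / 1000.0 * font_size + tc
    if code_length == 1 and code == 0x20:
        dx += tw
    return dx


def track_position(codes, code_length, widths, font_size, tc=0.0, tw=0.0):
    """Sum the advances of a run of character codes of the given byte length."""
    x = 0.0
    for code in codes:
        x += advance(code, code_length, widths.get(code, 1000),
                     font_size, tc, tw)
    return x


# Made-up widths for three codes; the middle one is the space.
widths = {0x48: 750, 0x20: 250, 0x49: 250}
codes = [0x48, 0x20, 0x49]

single_byte = track_position(codes, 1, widths, font_size=10, tw=2.0)
two_byte = track_position(codes, 2, widths, font_size=10, tw=2.0)
print(single_byte, two_byte)
```

With a single-byte encoding the space picks up the extra Tw, while the same codes read as two-byte codes do not. Adding Tw in the two-byte case is the kind of tracking error described above.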

The fix I've been testing today solves this problem, exhibits small progressions in 2 of our test files and doesn't break any of the others.

I want to sort out the pointless tiny movements tomorrow, and will commit the fix then.

Well found by the way. This was a very subtle bug, though I strongly suspect the increasing use of CIDFonts will result in more files like this, so it's good to get this fixed now.
Comment 9 Ken Sharp 2013-04-09 14:19:10 UTC
Commit f7567c53867f01e9dd33a1f882bb489dc765b869 fixes the actual underlying problem here. If you plan to incorporate this and build Ghostscript, then I would also recommend that you take commit 5f5524b1f2ab76aff70b2b4a896b9474bdfb9501 as well, as this results in a small but significant (~5%) decrease in file size and a slightly improved match with the original file.
Comment 10 freecorvette 2013-04-23 07:17:21 UTC
Confirming that the patch fixes the original issue. Thanks so much for looking into this and fixing it so fast!