Bug 696679 - Invalid characters when merging some PDF files
Status: RESOLVED DUPLICATE of bug 694537
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.18
Hardware: PC MacOS X
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2016-03-27 13:13 UTC by Michael
Modified: 2016-04-08 06:24 UTC

Attachments
Input and output files referred to in the bug description (2.51 MB, application/zip)
2016-03-27 13:13 UTC, Michael

Description Michael 2016-03-27 13:13:37 UTC
Created attachment 12411
Input and output files referred to in the bug description

Merging some PDF files results in missing characters in the output file (squares rendered).

Attached are four 1-page input files (1.pdf, 2.pdf, 3.pdf and 4.pdf). In case it is relevant, each one has been generated with wkhtmltopdf.

I'm trying to merge those 4 input files into one PDF document using the following command:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=merged_BUG.pdf 1.pdf 2.pdf 3.pdf 4.pdf 

The command gives the following warnings:

GPL Ghostscript 9.18: Missing glyph CID=49, glyph=0031 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=50, glyph=0032 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=52, glyph=0034 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=53, glyph=0035 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=54, glyph=0036 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=55, glyph=0037 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=56, glyph=0038 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=57, glyph=0039 in the font HelveticaNeue . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=61, glyph=003d in the font HelveticaNeue . The output PDF may fail with some viewers.

The output file (merged_BUG.pdf, attached) contains invalid characters on the last page.

----
The interesting fact is that when I merge 1.pdf with 2.pdf (1_and_2.pdf) and 3.pdf with 4.pdf (3_and_4.pdf) and then merge the outputs, the resulting file is generated correctly (merged_OK.pdf).

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=1_and_2.pdf 1.pdf 2.pdf
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=3_and_4.pdf 3.pdf 4.pdf
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=merged_OK.pdf 1_and_2.pdf 3_and_4.pdf
Comment 1 Ken Sharp 2016-03-28 00:19:19 UTC
(In reply to Michael from comment #0)

> Merging some PDF files results in missing characters in the output file
> (squares rendered).

Ghostscript and the pdfwrite device do not perform 'merging' of PDF files. Please download the current version (9.19) and read the 'Overview' in ghostpdl/doc/VectorDevices.htm which attempts to explain what really happens.


> The command gives following warnings:
> 
> GPL Ghostscript 9.18: Missing glyph CID=49, glyph=0031 in the font
> HelveticaNeue . The output PDF may fail with some viewers.

Which immediately tells you there's a problem.....


> The interesting fact is that when I merge 1.pdf with 2.pdf (1_and_2.pdf) and
> 3.pdf with 4.pdf (3_and_4.pdf) and then merge the outputs, the resulting
> file is generated correctly (merged_OK.pdf).

Your files contain subset fonts with differing subsets, but the same subset prefix (1.pdf and 3.pdf). Because of the way Ghostscript and pdfwrite work, the pdfwrite device believes these are the *same* font and so the second subset is simply ignored.

(We do have some sketchy plans to improve this by using the object number from the PDF file, but this is still not guaranteed unique and in your case would not help since the fonts in the 2 files have the same name *and* the same object number)

Because the subsets are different, this leads to the 3rd file attempting to use glyphs which are not present in the first subset font, and thus gives rise to the warning above.
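
To check whether a set of input files is likely to hit this collision, one option is to list the font names in each file and look for names that occur in more than one file. The following is only a sketch: it assumes the pdffonts utility (from Poppler or Xpdf, not part of Ghostscript) is installed, and the PDFFONTS variable and the shared_font_names function are hypothetical names introduced for illustration. A shared name does not prove that the subsets differ; it only flags the files worth a closer look.

```shell
#!/bin/sh
# Assumption: pdffonts (Poppler/Xpdf) is on the PATH; override PDFFONTS otherwise.
PDFFONTS="${PDFFONTS:-pdffonts}"

# Print every font name that is reported for more than one of the given files.
shared_font_names() {
    for f in "$@"; do
        # pdffonts prints two header lines; the font name is the first column.
        # sort -u keeps one copy of each name per file.
        "$PDFFONTS" "$f" | awk 'NR > 2 && $1 != "" { print $1 }' | sort -u
    done | sort | uniq -d
}

# Example (commented out so the sketch has no side effects):
# shared_font_names 1.pdf 2.pdf 3.pdf 4.pdf
```

Any name printed by the function appears in at least two of the inputs and is therefore a candidate for the collision described above.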

Because the 1st and 3rd files have the same font names, but the 1st and 2nd don't, there is no collision when you process the first and second files. In addition, the pdfwrite device writes the fonts for the output file using a better heuristic for the subset name, which makes it extremely unlikely that two fonts will share the same prefix, and yet be different.

Now, because the fonts have been renamed, when you process the third file there is no longer a name collision and so you do not see a warning.

There is nothing we can do about the name collision; it's caused by poor practice on the part of the PDF creation tool you are using. As a work-around you can pre-process each PDF file individually, and then process all the outputs in one go. Because pdfwrite will have renamed all the font subsets sensibly, there should be no further problems of this nature. However, multiple passes over files do increase the probability of quality degradation.
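
The work-around can be sketched as a small shell function. This is a sketch under assumptions, not a tested recipe: the gs flags are the ones from the original report, while the GS variable, the combine_pdfs function name and the temporary file layout are hypothetical and introduced only for illustration. It also assumes mktemp -d is available.

```shell
#!/bin/sh
# Assumption: gs is on the PATH; override GS to point at another binary.
GS="${GS:-gs}"

# combine_pdfs OUTPUT INPUT...
# Pass 1 rewrites each input on its own, so pdfwrite assigns new subset names.
# Pass 2 processes the rewritten files together into OUTPUT.
combine_pdfs() {
    out=$1
    shift
    tmpdir=$(mktemp -d) || return 1
    i=0
    for f in "$@"; do
        i=$((i + 1))
        "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
            -sOutputFile="$tmpdir/$i.pdf" "$f" || { rm -rf "$tmpdir"; return 1; }
    done
    # Rebuild the argument list from the rewritten files, in input order.
    set --
    j=1
    while [ "$j" -le "$i" ]; do
        set -- "$@" "$tmpdir/$j.pdf"
        j=$((j + 1))
    done
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile="$out" "$@"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

# Example (commented out so the sketch has no side effects):
# combine_pdfs merged.pdf 1.pdf 2.pdf 3.pdf 4.pdf
```

Every page goes through pdfwrite twice with this approach, so the caveat about quality degradation applies.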

In any event, this is not a bug; it's a limitation of the way pdfwrite works, and there is (currently) nothing we can do about it.
Comment 2 Michael 2016-03-28 01:52:28 UTC
(In reply to Ken Sharp from comment #1)

> Ghostscript and the pdfwrite device do not perform 'merging' of PDF files.

Ghostscript seems to be the de facto solution for merging PDF files (making one PDF file out of many, or however you want to put it) on Unix-based systems.

> Your files contain subset fonts with differing subsets, but the same subset
> prefix (1.pdf and 3.pdf). Because of the way Ghostscript and pdfwrite work,
> the pdfwrite device believes these are the *same* font and so the second
> subset is simply ignored.

Perhaps it would be a good idea if the pdfwrite did not *believe* the fonts are the same across files. It could treat fonts found in each input file as different despite the same subset prefixes.

> Now, because the fonts have been renamed, when you process the third file
> there is no longer a name collision and so you do not see a warning.

Could *renaming* fonts for each input file solve this issue?
Comment 3 Ken Sharp 2016-03-28 02:17:53 UTC
(In reply to Michael from comment #2)

> Ghostscript seems to be the de facto solution for merging (making one PDF
> file out of many or however you want to put it) PDF files on Unix based
> systems.

That's as may be, but it's not what it's intended for and 'merging' isn't what it does. Why don't you use pdftk? It *is* intended for doing this kind of operation and indeed is what *I* thought was the 'de facto solution' for this on Unix systems.

Of course, I'm assuming that pdftk doesn't have a problem with multiple files with subset fonts with the same name and object number; I have no idea whether this is true.


> Perhaps it would be a good idea if the pdfwrite did not *believe* the fonts
> are the same across files. It could treat fonts found in each input file as
> different despite the same subset prefixes.

No, it can't do that. The device has **NO** idea which 'input file' a marking operation has come from (and the situation is even more complex when it comes to fonts). In fact the PostScript interpreter itself (and therefore the PDF interpreter) has no concept of input files at all; it simply processes data from the input stream. It's the application driving the interpreter which knows where the input comes from, not the interpreter. And before you ask, no, we can't change that; it's the way PostScript is defined as working.

As noted in the documentation I directed you to, there are disadvantages to working this way, but also advantages.

Even if we could do what you suggest, we would then face complaints from the many users who already use Ghostscript to process multiple files and expect that fonts will be merged, in order to reduce size.


> Could *renaming* fonts for each input file solve this issue?

Yes, and indeed it does, because that's *exactly* what pdfwrite does if you process each input file separately to a new file. If Cairo (I'm assuming here that wkhtmltopdf is using Cairo) named its subset fonts sensibly then this wouldn't be a problem.

Note that the PDF interpreter cannot rename the fonts, if that was about to be your suggestion. You need to rename the fonts in the input files *before* sending them to Ghostscript.
Comment 4 Hin-Tak Leung 2016-03-28 14:48:17 UTC
> That's as may be, it's not what it's intended for and 'merging' isn't what
> it does. Why don't you use pdftk ? It *is* intended for doing this kind
> of operation and indeed is what *I* thought was the 'de facto solution'
> for this on Unix systems.

FWIW, many Unix/Linux systems no longer ship pdftk these days because it has become difficult to build: the latest version of pdftk depends on an up-to-date version of the Java-based iText library, which in turn requires gcj (the Java-to-native-code compiler in the GCC family). That is too painful to maintain when many Linux distros no longer even ship gcj.

However, there are other tools for merging PDFs, such as the pdfjam family of tools (which depends on TeX/pdflatex), qpdf (C++), and Apache PDFBox (Java).
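
As an illustration of one of these, qpdf can concatenate whole files with its --pages option. The wrapper below is a sketch: the QPDF variable and the merge_with_qpdf function are hypothetical names, and it has not been checked against the files attached to this bug, so whether it avoids the subset-font problem here is an assumption.

```shell
#!/bin/sh
# Assumption: qpdf is on the PATH; override QPDF to point at another binary.
QPDF="${QPDF:-qpdf}"

# merge_with_qpdf OUTPUT INPUT...
merge_with_qpdf() {
    out=$1
    shift
    # --empty starts from a document with no pages; everything between
    # --pages and -- is taken as input, and all pages of each file are appended.
    "$QPDF" --empty --pages "$@" -- "$out"
}

# Example (commented out so the sketch has no side effects):
# merge_with_qpdf merged.pdf 1.pdf 2.pdf 3.pdf 4.pdf
```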
Comment 5 Ken Sharp 2016-04-08 05:52:33 UTC

*** This bug has been marked as a duplicate of bug 694537 ***