Bug 694537 - Processing multiple PDF input files in one command may lead to character omissions
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.06
Hardware: PC All
Importance: P4 enhancement
Assignee: Ken Sharp
URL:
Keywords:
Duplicates: 695506 696064 696679
Depends on:
Blocks:
 
Reported: 2013-08-22 16:50 UTC by pipitas
Modified: 2016-04-12 05:17 UTC

See Also:
Customer:
Word Size: ---


Attachments
Tarball containing input and output PDFs as well as PDFs with annotations pinpointing the problem spots. (8.10 MB, application/x-gzip)
2013-08-22 16:54 UTC, pipitas
Four very simple PDF's that show the problem when concatenated (182.64 KB, application/x-zip-compressed)
2016-04-07 07:50 UTC, Arrigo Marchiori

Description pipitas 2013-08-22 16:50:07 UTC
This is the resurrection of bug report #692541. (Ken closed it but asked: "if you can come up with an example of text going missing please [...] open a new report.")
 
So here we go.
 
I attached a tarball containing...

 *  ...all input files as well as 
 *  ...my results plus
 *  ...some PDFs with annotations (to help spot the problem zones more easily).
 
There were 36 input files, each containing 1 page. I used the following command to create the output file:
 
  gs                    \
    -o gs906-merged.pdf \
    -sDEVICE=pdfwrite   \
     faktura*.pdf

The resulting PDF contains character omissions on pages 13, 16, 24, 25, 26, 28 and 33.

I also tested with GS v9.06 and GS v9.10GIT (as of ca. yesterday). The results are slightly different, but also wrong (on the same pages), so the problem is very likely also present in GS v9.07/08/09.

The problem seems to be caused by the fact that each input file contains subset fonts with the same names as those in the other input files, but not necessarily the same subsets of the font data. (I know a workaround is to first re-distill the files with GS, which leads to different and unique prefixes for the font names, which considerably lowers the chance of name clashes, but IMHO the current behaviour is still a bug...)
Comment 1 pipitas 2013-08-22 16:54:24 UTC
Created attachment 10135 [details]
Tarball containing input and output PDFs as well as PDFs with annotations pinpointing the problem spots.
Comment 2 Ken Sharp 2013-08-23 01:47:18 UTC
As far as I can see the characters are not missing, and indeed cut/paste into a text file displays the original text.

This seems simply to be yet another manifestation of the same old problem: if you supply multiple files containing fonts with the same name and prefix, but different subsets, we can't tell the difference between them. As we already know, this results in us using the glyph at Encoding position 'x' from the first font of that name that we see, for all fonts of the same name.
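One way to see whether a set of input files is exposed to this clash is to list the font names that occur in more than one file. The sketch below is only an illustration and rests on an assumption that is not part of this report: that Poppler's `pdffonts` utility is installed. The function name `shared_font_names` is made up. A shared name only flags a candidate; it does not prove that the subsets differ.

```sh
# Sketch only: assumes Poppler's pdffonts is available (not mentioned in this report).
# Prints every font name (first pdffonts column) that occurs in more than one input file.
shared_font_names() {
    for f in "$@"; do
        # Skip the two header lines; count each name once per file.
        pdffonts "$f" | awk 'NR > 2 { print $1 }' | sort -u
    done | sort | uniq -d
}
```

For the second attachment this would be called as `shared_font_names 1.pdf 2.pdf 3.pdf 4.pdf`.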

In the case of the missing glyphs, they aren't missing; it just so happens that the glyph we are using from the first font is a space glyph.

As I've said before I really don't see a way around this given the way that Ghostscript and pdfwrite work. In previous examples even using the object number was not sufficient, because some of the fonts had the same name *and* the same object number. We would need to know that we were dealing with a different input file, and there's no way currently for us to know that.

I'm not entirely sure how we would deal with it even if we did. What about the case where someone interleaves pages from multiple files (not simply appending)? In that case we would embed the same font from each file as many times as there were pages which used it.

What this really shows is that, when merging PDF files, you should use a tool intended for merging PDF files.

I know you're aware of the work-around already, but for anyone else reading this thread: if you first process each input file with pdfwrite individually, and *then* process all the resulting files into one, you will get the desired result. This is because pdfwrite creates font subsets whose prefix is derived from an MD5 hash of the font's actual contents, which is highly likely to be unique.
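The two-pass work-around could be scripted roughly as follows. This is a minimal sketch, not an official recipe: the function name `redistill_and_merge` is invented, `mktemp -d` is assumed to be available, and the only Ghostscript options used are the `-o` and `-sDEVICE=pdfwrite` switches already shown in this report.

```sh
# Sketch of the work-around: re-distil each input on its own, then combine.
# Usage: redistill_and_merge OUTPUT.pdf INPUT1.pdf INPUT2.pdf ...
redistill_and_merge() {
    out=$1
    shift
    tmpdir=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        # Pass 1: one pdfwrite run per input file.
        # Zero-padded names keep the glob in pass 2 in input order.
        gs -o "$tmpdir/$(printf '%04d' "$n").pdf" -sDEVICE=pdfwrite "$f" ||
            { rm -rf "$tmpdir"; return 1; }
    done
    # Pass 2: combine the re-distilled files into the final output.
    gs -o "$out" -sDEVICE=pdfwrite "$tmpdir"/*.pdf
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

For the original report this would be invoked as `redistill_and_merge gs906-merged.pdf faktura*.pdf`.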

While the current behaviour is undesirable, the bug lies in the creation of the original files. I will continue to bear this problem in mind, in case inspiration should strike, but at the moment I don't see any likely way to address this.
Comment 3 Ken Sharp 2013-08-23 04:51:06 UTC
Noticed in passing; in fact the fonts defined in the input PDF files are CIDFonts, not regular fonts, and although the descendant fonts have subset prefixes on their names, the CIDFont /BaseFont does not, so we can't immediately identify the CIDFont as a subset anyway.
Comment 4 Ken Sharp 2014-09-23 00:20:30 UTC
*** Bug 695506 has been marked as a duplicate of this bug. ***
Comment 5 Ken Sharp 2015-06-30 09:38:50 UTC
*** Bug 696064 has been marked as a duplicate of this bug. ***
Comment 6 Arrigo Marchiori 2016-04-07 07:50:36 UTC
Created attachment 12420 [details]
Four very simple PDF's that show the problem when concatenated

The archive contains four PDFs, together with the OpenOffice files from which they were generated.

Each PDF contains a bold number on a single line, and another line in regular text. The font is Times New Roman.

The concatenated PDF shows a bold `1' on the fourth page, where the `4' is expected.

The files were generated by OpenOffice, and they were concatenated by GPL Ghostscript 9.16 (2015-03-30) under FreeBSD 9-STABLE with the following command:

$ gs -sDEVICE=pdfwrite -sOutputFile=concatenated.pdf -DBATCH -DNOPAUSE 1.pdf 2.pdf 3.pdf 4.pdf

The problem does not show, for instance, if the font is changed from bold to regular, or if the first line of the fourth file is changed from `4' to `04'.
Comment 7 Arrigo Marchiori 2016-04-07 07:52:58 UTC
IMHO, this bug should be treated as extremely serious, because it leads to silent data corruption. I personally bumped into it while processing administrative documents!
Comment 8 Ken Sharp 2016-04-07 07:58:44 UTC
(In reply to Arrigo Marchiori from comment #7)
> IMHO, this bug should be treated as extremely serious, because it leads to
> silent data corruption. I personally bumped into it while processing
> administrative documents!

There is a stated workaround for the problem (see my comment #2), which is in any event caused by using Ghostscript and the pdfwrite device for a purpose for which they are not intended ('merging' PDF files). In addition, the problem is caused by the authoring software, which does a terribly poor job of selecting font names.

While we have (as indicated previously) some ideas for addressing this problem, none of them are easy, none of them are guaranteed to work under all conditions, and they *will* cause complaints from other users because they will cause PDF files with sensibly named fonts to increase in size.

In short, you need to understand what is going on when you run multiple files through Ghostscript and the pdfwrite device (it absolutely does *NOT* concatenate its input), and be aware of this situation.
Comment 9 Ken Sharp 2016-04-08 05:52:33 UTC
*** Bug 696679 has been marked as a duplicate of this bug. ***
Comment 10 Ken Sharp 2016-04-08 11:20:44 UTC
I believe that commit 0ec0f1627b7f7f5ffa1347123a926cd1e32c9f19 will resolve this problem for *PDF* input only. The details of how this works are in the commit log and I don't propose to cover them again here. This probably marks the limit of our ability to address this problem.

However, this will not have any impact on PostScript (or EPS) input, as the fix works by using the input filename and the PDF object number. Since PostScript need not have any input file, and does not contain object numbers, clearly we cannot use the same approach. In fact there is nothing we can do in that case.
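The difference between identifying a font by name alone and identifying it by input filename plus object number can be illustrated with a toy model. This is not Ghostscript's code (the real details are in the commit log): the function name `resolve_subsets`, the record layout, the object number 8 and the font name are all invented for illustration.

```sh
# Toy model only: NOT Ghostscript's implementation.
# Reads records "<file> <object-number> <font-name> <glyph-in-subset>" on stdin
# and reports which cached subset a first-seen-wins lookup would use.
# $1 selects the cache key: "name" (font name only) or "file+object".
resolve_subsets() {
    awk -v mode="$1" '
        {
            key = (mode == "name") ? $3 : ($1 SUBSEP $2)
            if (!(key in cache)) cache[key] = $4   # first font seen wins
            print $1, "wanted=" $4, "used=" cache[key]
        }'
}

# Hypothetical records modelled on the 1.pdf ... 4.pdf example in comment 6.
records='1.pdf 8 BAAAAA+TimesNewRoman-Bold 1
4.pdf 8 BAAAAA+TimesNewRoman-Bold 4'

printf '%s\n' "$records" | resolve_subsets name
printf '%s\n' "$records" | resolve_subsets file+object
```

With the name-only key, the record for 4.pdf is served the subset cached from 1.pdf, which mirrors the bold `1' appearing where `4' was expected. With the file-plus-object key, each file gets its own subset even though the name and object number are identical.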

For PostScript input, the only solution we can see is the work-around of converting each EPS/PostScript file into a PDF file, and then generating a single output file from the total set of input files. In the case of bug 695506 the easier solution is simply to use Ghostscript's eps2write device to create the EPS from the original PDF files, in which case the font names in the EPS files will already be unique.
Comment 11 Arrigo Marchiori 2016-04-12 05:17:52 UTC
Thank you very much, Ken!