Bug 696064 - pdfwrite: wrong characters in merged PDF file
Status: RESOLVED DUPLICATE of bug 694537
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: master
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2015-06-30 08:10 UTC by Michael Weghorn
Modified: 2015-06-30 09:49 UTC

Attachments
input PDF files for merge (5.76 MB, application/zip)
2015-06-30 08:10 UTC, Michael Weghorn
result of merging the single PDF files (5.02 MB, application/pdf)
2015-06-30 08:11 UTC, Michael Weghorn
PostScript file processed by Ghostscript (25.63 MB, application/postscript)
2015-06-30 08:12 UTC, Michael Weghorn
result of converting the PostScript file (4.30 MB, application/pdf)
2015-06-30 08:12 UTC, Michael Weghorn

Description Michael Weghorn 2015-06-30 08:10:52 UTC
Created attachment 11770
input PDF files for merge

When merging the PDF files in the attached zip file "input-files.zip", wrong characters appear on several pages of the resulting output document.
A page where this is very obvious is page 247 of the attached document "merged.pdf". The content of the input PDF files differs only in the first line (different names).

The following command is used to merge the files:
gs -q -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -sOutputFile=merged.pdf -f input-files/*


In our real scenario, the PDF files are not merged directly by Ghostscript. The input PDF files are created by LibreOffice's mail merge (form letter) feature and then sent to the CUPS-PDF printer in one print job. The PDF printer expects PostScript as its input format. The CUPS filter chain creates the attached PostScript file "generated_postscript.ps" from the individual PDF files and sends it to the PDF printer. The CUPS-PDF printer then invokes Ghostscript to convert it to PDF.

When I use Ghostscript to manually convert the PostScript file to PDF, the result looks similar to the PDF file that I get when merging the PDF files as they are (see above). However, the wrong characters are not yet present in the PostScript file when I look at it with "gv".

Command used to convert the PostScript file to PDF:
gs -q -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -sOutputFile=converted_from_postscript.pdf -f generated_postscript.ps


The bug occurs with different versions of Ghostscript. The attached PDF files "merged.pdf" and "converted_from_postscript.pdf" were generated using a current master build (Git commit: 2d54c07d327f8d7b3eabc8a15c80127dc847e949) on Debian Jessie.


To me, this looks a bit similar to #694537, but I do not know whether it is actually related.
Comment 1 Michael Weghorn 2015-06-30 08:11:32 UTC
Created attachment 11771
result of merging the single PDF files
Comment 2 Michael Weghorn 2015-06-30 08:12:28 UTC
Created attachment 11772
PostScript file processed by Ghostscript
Comment 3 Michael Weghorn 2015-06-30 08:12:57 UTC
Created attachment 11773
result of converting the PostScript file
Comment 4 Ken Sharp 2015-06-30 09:38:50 UTC
First thing to understand is that Ghostscript does *NOT* merge PDF files. It interprets its input, converts it to marking operations, and then sends those to the device. When the device is pdfwrite those marking operations are then written out as a PDF file.

This means that the output does not bear any particular resemblance to the input, other than visually.

As I keep saying to people, you should avoid re-processing the output from Ghostscript: do it once and don't do it again.

Your problem is that LibreOffice names the font subsets it embeds in the same way, no matter what the font name or content is. Ghostscript assumes that two font subsets with the same name are the same font. In this case LibreOffice is behaving very poorly indeed.
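
(Editorial sketch, not part of the original comment.) One rough way to check whether several input files reuse the same subset font name is to search each file for /BaseFont entries and report names that occur in more than one file. The function name list_shared_font_names is made up for illustration, and the sketch assumes the font dictionaries are stored uncompressed; names hidden inside compressed object streams will not be found this way.

```shell
#!/bin/sh
# Print font names that appear in more than one PDF of a directory.
# Assumption: /BaseFont entries are visible to a plain byte search
# (i.e. not inside compressed object streams).
list_shared_font_names() {
    dir=$1
    for f in "$dir"/*.pdf; do
        [ -f "$f" ] || continue
        # -a: treat binary data as text, -o: print only the matched part.
        LC_ALL=C grep -a -o '/BaseFont */[A-Za-z0-9+,._#-]*' "$f" \
            | sed 's|/BaseFont */||' \
            | sort -u
    done | sort | uniq -c | awk '$1 > 1 { print $2 }'
}

# Run directly as: sh script.sh input-files
if [ "$#" -gt 0 ]; then
    list_shared_font_names "$1"
fi
```

Each name is counted at most once per file, so any name printed is shared by at least two input files and is a candidate for the collision described above.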

As noted in bug #694537 there is nothing much we can do about this, the damage is done before we see the file. The work-around is to re-fry each PDF file before passing them to pdfwrite. The reason is that Ghostscript will produce a new subset font and will name it with a sensible name, which should reduce name collisions. Though since you have 250 subset fonts, all with practically the same glyph coverage but different encodings, that may not help you.

Basically you are trying to use Ghostscript in a way it's not intended to be used, from an application which frankly isn't very good at what it's doing.

*** This bug has been marked as a duplicate of bug 694537 ***
Comment 5 Michael Weghorn 2015-06-30 09:49:13 UTC
(In reply to Ken Sharp from comment #4)
> As noted in bug #694537 there is nothing much we can do about this, the
> damage is done before we see the file. The work-around is to re-fry each PDF
> file before passing them to pdfwrite. The reason is that Ghostscript will
> produce a new subset font and will name it with a sensible name, which
> should reduce name collisions. Though since you have 250 subset fonts, all
> with practically the same glyph coverage but different encodings, that may
> not help you.
> 

Thank you very much for your quick reply. I had actually tried to first process each PDF file individually before passing all files to pdfwrite. The result was better than without first processing each file, but there were still wrong characters.

When I tried this workaround last time, I had used version 9.06 of Ghostscript. With the current master build, the resulting PDF is indeed OK.

In fact, we do not only have 250 documents, but many more...
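
(Editorial sketch.) The work-around discussed in comments 4 and 5, running each PDF through pdfwrite on its own and only then passing the results to pdfwrite together, could be scripted roughly as follows. The Ghostscript options are the ones from the report. The function name refry_and_merge, the GS variable, and the assumption that the inputs end in ".pdf" and contain no "%" in their names are illustrative choices, not something the thread specifies.

```shell
#!/bin/sh
# Re-process each PDF individually, then feed the results to pdfwrite in one run.
# GS is a hypothetical override for selecting a particular Ghostscript build.
GS=${GS:-gs}

refry_and_merge() {
    in_dir=$1
    out_file=$2
    work_dir=$(mktemp -d) || return 1

    for f in "$in_dir"/*.pdf; do
        [ -f "$f" ] || continue
        # First pass: one pdfwrite run per input file.
        "$GS" -q -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dSAFER \
            -sDEVICE=pdfwrite -sOutputFile="$work_dir/$(basename "$f")" \
            -f "$f" || { rm -rf "$work_dir"; return 1; }
    done

    # Second pass: all re-processed files in a single pdfwrite run.
    "$GS" -q -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite -sOutputFile="$out_file" \
        -f "$work_dir"/*.pdf
    status=$?
    rm -rf "$work_dir"
    return "$status"
}

# Run directly as: sh script.sh input-files merged.pdf
if [ "$#" -eq 2 ]; then
    refry_and_merge "$1" "$2"
fi
```

As comment 5 reports, this helped only partially with Ghostscript 9.06 and gave a correct result with the master build of the time, so the outcome depends on the Ghostscript version in use.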