Bug 692795 - concatenating PDF files doesn't always correctly load the fonts from the PDF file
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.04
Hardware: PC Linux
Importance: P4 major
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-17 08:19 UTC by quamis
Modified: 2014-02-17 04:40 UTC
CC List: 2 users

See Also:
Customer:
Word Size: ---


Attachments
archive with test files and results (501.44 KB, application/zip)
2012-01-17 08:19 UTC, quamis
tests.pdf - generated by pdftk (347.76 KB, application/gzip)
2012-01-22 18:33 UTC, quamis

Description quamis 2012-01-17 08:19:47 UTC
Created attachment 8285 [details]
archive with test files and results

I attached two PDF files (internal test cases for a module in our application, which uses http://xmlgraphics.apache.org/fop/).
If I run

gs -sPAPERSIZE=a4 -dNOPAUSE -sDEVICE=pdfwrite -dBATCH -dNOPLATFONTS -dSAFER -sOutputFile='20files.pdf' 't2/9.pdf' 't2/3.pdf' 

I get valid output.


If I run

gs -sPAPERSIZE=a4 -dNOPAUSE -sDEVICE=pdfwrite -dBATCH -dNOPLATFONTS -dSAFER -sOutputFile='20files.pdf' 't2/3.pdf' 't2/9.pdf' 


I get warnings in the output such as:
GPL Ghostscript 9.04: Missing glyph CID=48, glyph=0030 in the font EAAAAD+LiberationSans . The output PDF may fail with some viewers.


Basically, the order in which the files are added matters, and the fonts embedded in the files aren't handled correctly (the original files render correctly).
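To check every input ordering rather than just the two above, the runs can be scripted. This is only a sketch: it reuses the exact flags from the commands in this report, assumes `gs` is on the PATH, and the output file names (`order0.pdf`, ...) and function names are made up for illustration.

```python
import itertools
import subprocess

# Flags copied from the commands in this report; nothing else is added.
GS_FLAGS = [
    "-sPAPERSIZE=a4", "-dNOPAUSE", "-sDEVICE=pdfwrite",
    "-dBATCH", "-dNOPLATFONTS", "-dSAFER",
]

def gs_command(inputs, output):
    """Build the gs argument list for concatenating `inputs` into `output`."""
    return ["gs", *GS_FLAGS, f"-sOutputFile={output}", *inputs]

def orderings_with_warnings(inputs):
    """Run gs once per ordering of `inputs`; return the orderings that warn.

    Looks for the 'Missing glyph' text quoted in this report. Requires gs.
    """
    failing = []
    for n, order in enumerate(itertools.permutations(inputs)):
        result = subprocess.run(
            gs_command(order, f"order{n}.pdf"),  # hypothetical output name
            capture_output=True, text=True, errors="replace",
        )
        if "Missing glyph" in result.stdout + result.stderr:
            failing.append(order)
    return failing
```

Calling `orderings_with_warnings(["t2/3.pdf", "t2/9.pdf"])` would run both orderings shown above and list those that produce the warning.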


$ gs -v
GPL Ghostscript 9.04 (2011-08-05)
Copyright (C) 2011 Artifex Software, Inc.  All rights reserved.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 11.10
Release:        11.10
Codename:       oneiric
Comment 1 Ken Sharp 2012-01-17 08:40:33 UTC
Can we have some more detail about your company, please? Are you an Artifex licensee?

How are you using Ghostscript in your application?

Feel free to mail support at artifex.com if you would rather not post this information publicly.
Comment 2 quamis 2012-01-17 09:02:41 UTC
(In reply to comment #1)
> Can we have some more detail about your company please ? Are you an Artifex
> licensee ? 
> 
> How are you using Ghostscript in your application ? 
> 
> Feel free to mail support at artifex.com if you would rather not post this
> information publicly.

No, we're not an Artifex licensee.

I'm using it to concatenate a list of PDF files into one big file.
Comment 3 Ken Sharp 2012-01-17 09:12:35 UTC
Well, it isn't going to work. The files contain embedded subsets with the same name and prefix. 

The point of the prefix is to minimise name clashes between subsets, which is why prefixes are supposed to be random. It's quite clear that 'Apache FOP 1.0' is creating them according to a sequential scheme with the same initial value each time, which results in the fonts having the same prefixes.

This makes it impossible for pdfwrite to differentiate between two different subsets, which is why you get the warning, and why your glyphs disappear.

You should take this up with the Apache maintainers.
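A rough way to see whether two input files share subset names is to scan them for tagged font names. The sketch below is an assumption-laden illustration, not part of Ghostscript: it only finds names stored in uncompressed objects (it will miss font dictionaries inside compressed object streams), and the function names are invented here.

```python
import re
from collections import defaultdict

# A subset font name is a six-uppercase-letter tag, '+', then the base name.
SUBSET_NAME = re.compile(
    rb"/(?:BaseFont|FontName)\s*/([A-Z]{6})\+([^\s/<>\[\]()]+)"
)

def subset_names(pdf_bytes):
    """Return the set of 'TAG+Name' strings found in uncompressed PDF data."""
    return {
        (m.group(1) + b"+" + m.group(2)).decode("latin-1")
        for m in SUBSET_NAME.finditer(pdf_bytes)
    }

def clashing_names(files):
    """Return {subset name: sorted labels} for names seen in 2+ files.

    `files` maps a label (for example a path) to that file's raw bytes.
    """
    seen = defaultdict(set)
    for label, data in files.items():
        for name in subset_names(data):
            seen[name].add(label)
    return {n: sorted(labels) for n, labels in seen.items() if len(labels) > 1}
```

A shared name does not prove the subsets differ, but any name reported here is a candidate for the clash described above.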
Comment 4 quamis 2012-01-17 09:26:42 UTC
(In reply to comment #3)
> Well, it isn't going to work. The files contain embedded subsets with the same
> name and prefix. 
> 
> The point of the prefix is to minimise name clashes between subsets, which is
> why they are supposed to be random. Its quite clear that 'Apache FOP 1.0' is
> creating these according to a sequential scheme with the same initial value
> each time which results in the fonts having the same prefixes.
> 
> This makes it impossible for pdfwrite to differentiate between two different
> subsets, which is why you get the warning, and why your glyphs disappear.
> 
> You should take this up with the Apache maintainers.

If gs treats embedded subsets with the same name and prefix in different files as one and the same font, wouldn't this eventually lead to whole files being rendered with a different font from the one used in the original file?

Even if FOP makes these names more random, the possibility of a clash would still exist in gs.

I can't test other tools at the moment to see how they handle this case, but I'll get back to you.
Comment 5 Ken Sharp 2012-01-17 10:09:17 UTC
Fonts are *supposed* to be uniquely identified by name. Obviously this is a problem with subsets, which is what the prefix is about. If the font has the same name and prefix then it is the same font.

Rendering is a different issue to creating a brand new PDF file and attempting to embed fonts. For a start, each new file is treated separately: the fonts from the previous file are discarded before beginning the new file, which is what a job server loop is for. Obviously this doesn't work in the context of combining files.
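The effect of identifying fonts purely by name can be illustrated with a toy model. This is not Ghostscript's actual code or data structures; the class, its methods, and the glyph values are hypothetical, and only the font name and CID 48 come from the warning in the description.

```python
class NameKeyedFontCache:
    """Toy model: a font is identified purely by its name (tag included)."""

    def __init__(self):
        self._glyphs_by_name = {}

    def register(self, name, glyphs):
        # First definition wins: a later font with the same name is assumed
        # to be the same font, so its glyph data is ignored.
        self._glyphs_by_name.setdefault(name, dict(glyphs))

    def lookup(self, name, cid):
        """Return glyph data, or None when the cached subset lacks the CID."""
        return self._glyphs_by_name[name].get(cid)


cache = NameKeyedFontCache()
cache.register("EAAAAD+LiberationSans", {36: "glyph A"})  # subset from file 1
cache.register("EAAAAD+LiberationSans", {48: "glyph 0"})  # subset from file 2
missing = cache.lookup("EAAAAD+LiberationSans", 48)       # None: glyph is lost
```

In this model the second file's subset is never stored, so a glyph that only it contains cannot be found, which mirrors the "Missing glyph" warning.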

Please do not alter the resolution of this issue.
Comment 6 quamis 2012-01-18 07:50:25 UTC
(In reply to comment #5)
> Fonts are *supposed* to be uniquely identified by name. Obviously this is a
> problem with subsets, which is what the prefix is about. If the font has the
> same name and prefix then it is the same font.
> 
> Rendering is a different issue to creating a brand new PDF file and attempting
> to embed fonts. For a start, each new file is treated separately, the fonts
> from the previous file are discarded before beginning the new file, that's what
> a job server loop is for. Obviously this doesn't work in the context of
> combining files.
> 
> Please do not alter the resolution of this issue.

I reported this on the FOP bug tracker ( https://issues.apache.org/bugzilla/show_bug.cgi?id=52477 ), and everybody seems to think they are doing things right. Except that things don't actually work as advertised.

I gave pdftk a try last night (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/), and it worked flawlessly. pdftk is supposed to use iText internally, so it might be the iText library that handles multiple fonts correctly.
Comment 7 Hin-Tak Leung 2012-01-22 12:35:30 UTC
(In reply to comment #5)
> Fonts are *supposed* to be uniquely identified by name. Obviously this is a
> problem with subsets, which is what the prefix is about. If the font has the
> same name and prefix then it is the same font.

I agree that it is laziness on the part of the FOP developers. On the other hand, genuine prefix collisions can happen when concatenating PDF files, especially for subsets of common fonts like Times. Would it make sense for Ghostscript to rewrite the subsetting prefixes, flattening the namespace, when concatenating? Just a thought. Maybe that's what pdftk does.
Comment 8 Ray Johnston 2012-01-22 17:50:36 UTC
It would be interesting to look at the output from pdftk.

There are two approaches that it might be using:

1) create a new (unique) subset name if the subset comes from a different
   file.

More involved, and thus less likely:

2) use the subset to create a merged subset AS LONG AS GLYPHS ARE NOT
   REPLACED -- and then use approach 1 only if required due to glyph
   collision.

The latter may result in a slightly reduced file size.
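Approach 2 can be sketched as a merge that refuses to replace glyphs. This is a simplified illustration, assuming each subset is represented as a plain mapping from CID to glyph data; the function name and representation are hypothetical.

```python
def try_merge_subsets(existing, incoming):
    """Merge two {cid: glyph_data} subsets of a font with the same name.

    Returns the merged mapping, or None if any shared CID maps to different
    glyph data (a collision). On None the caller would fall back to
    approach 1 and give the incoming subset a new unique name.
    """
    for cid, glyph in incoming.items():
        if cid in existing and existing[cid] != glyph:
            return None
    merged = dict(existing)
    merged.update(incoming)
    return merged
```

Neither input is modified, so a failed merge leaves both subsets available for the fallback.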

IMHO, it wouldn't be too difficult for the PDF interpreter to name mangle
the subset prefix to factor in the input filename (or even just a number
for the input file). It could even be done so that the first file does not
change the subset prefix (today's behaviour), but only subsequent files
do the name mangling.

Adding Alex to the CC list for this bug in case he agrees with that approach
and wants to re-open this bug.

Since the pdfwrite device doesn't know which file a font comes from, this is
not possible (afaict) to do in pdfwrite.
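The name mangling idea could look roughly like the following. It is a sketch only, not the PDF interpreter's code: the function name is invented, and deriving the new tag from a SHA-256 hash of the file index and original name is an assumption made here (a simple offset could collide with tags generated by a sequential scheme).

```python
import hashlib
import string

def mangle_subset_name(font_name, file_index):
    """Return `font_name` with its subset tag made specific to the input file.

    The first input (file_index 0) keeps its names unchanged, as do names
    without a six-uppercase-letter tag. Otherwise a new six-letter tag is
    derived from a hash of the file index and the original name.
    """
    tag, plus, base = font_name.partition("+")
    is_subset = (
        plus == "+"
        and len(tag) == 6
        and all(c in string.ascii_uppercase for c in tag)
    )
    if file_index == 0 or not is_subset:
        return font_name
    digest = hashlib.sha256(f"{file_index}:{font_name}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    letters = []
    for _ in range(6):
        value, digit = divmod(value, 26)
        letters.append(string.ascii_uppercase[digit])
    return "".join(letters) + "+" + base
```

Names without a subset tag pass through unchanged, so this does nothing for fonts that are not subsetted but share a name.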
Comment 9 quamis 2012-01-22 18:31:45 UTC
(In reply to comment #8)
> It would be interesting to look at the output from pdftk.
> 
> 2) use the subset to create a merged subset AS LONG AS GLYPHS ARE NOT
>    REPLACED -- and then use approach 1 only if required due to glyph
>    collision.

Not sure, but I think this is how pdftk does it. I will attach the 2 files generated by pdftk.
The files were generated with:

pdftk 't3/1.pdf' 't3/2.pdf' cat output 'test 1,2.pdf';
pdftk 't3/2.pdf' 't3/1.pdf' cat output 'test 2,1.pdf';

I used Okular (I'm using Linux at the moment) to look at the font info, and it seems to report the same font as if it were used twice. Not sure how that's possible, given that the font names should be unique... It might be an Okular bug though :)
Comment 10 quamis 2012-01-22 18:33:26 UTC
Created attachment 8303 [details]
tests.pdf - generated by pdftk
Comment 11 Alex Cherepanov 2012-01-22 19:01:50 UTC
I can mangle the prefix indeed, but this approach doesn't solve all problems.
Fonts that are not subsetted can have the same problem.
Comment 12 Hin-Tak Leung 2012-01-23 02:57:37 UTC
Comment on attachment 8303 [details]
tests.pdf - generated by pdftk

tgz'ed
Comment 13 Hin-Tak Leung 2012-01-23 03:00:23 UTC
It does not appear that pdftk does any mangling - it seems that it just preserves the order and reference of the objects and modifies the inputs less.