690355 – pdfwrite ignores the "#copies" setting in PostScript input

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PS Interpreter
Version:	8.64
Hardware:	All All

Importance:	P4 enhancement
Assignee:	Ken Sharp

URL:	https://bugs.launchpad.net/bugs/320391
Keywords:

Depends on:
Blocks:

Reported:	2009-03-24 15:11 UTC by Till Kamppeter
Modified:	2009-04-05 14:51 UTC
CC List:	1 user

See Also:
Customer:
Word Size:	---

Attachments
OOo-2copies.ps (312.75 KB, application/postscript) 2009-03-24 15:16 UTC, Till Kamppeter
tiger2.pdf (96.83 KB, application/pdf) 2009-03-29 21:25 UTC, Ray Johnston
page_copies.pdf (3.03 KB, application/pdf) 2009-04-02 03:59 UTC, Ken Sharp
A 3-copies Tiger that seems to work. (42.67 KB, application/pdf) 2009-04-02 12:42 UTC, SaGS
690355.patch (4.09 KB, patch) 2009-04-03 07:25 UTC, Ken Sharp

Description Till Kamppeter 2009-03-24 15:11:30 UTC

If a PostScript file contains

/#copies 2 def

it is expected that the PostScript interpreter displays/prints it twice. This
works perfectly when I send the file unfiltered to my HP LaserJet P3005
(PostScript printer). If I display the file with Ghostscript or print it on a
non-PostScript printer with Ghostscript I get only one copy.

See the Ubuntu bug report linked under "URL". OpenOffice.org uses "#copies" as
its standard method to request multiple copies (most other programs send an IPP
attribute to CUPS for the number of copies). As a result, printing multiple
copies from OpenOffice.org works only on PostScript printers; it fails for
non-PostScript printers on every Linux distribution.
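
To make the construct concrete, here is a minimal Python sketch that looks for a literal "/#copies N def" line in PostScript data. It is only an illustration: the function name find_copies is hypothetical, and a plain text scan like this is an assumption for demonstration, not how Ghostscript or any CUPS filter handles the setting (a real interpreter evaluates the definition at run time).

```python
import re

# Matches a literal definition such as "/#copies 2 def".
# Assumption: the value is a plain integer literal; PostScript that
# computes the value at run time is not detected by this scan.
COPIES_RE = re.compile(rb"/#copies\s+(\d+)\s+def")


def find_copies(ps_data: bytes) -> int:
    """Return the last literal /#copies value found, or 1 if there is none."""
    matches = COPIES_RE.findall(ps_data)
    return int(matches[-1]) if matches else 1


sample = b"%!PS-Adobe-3.0\n/#copies 2 def\nshowpage\n"
print(find_copies(sample))
```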

Comment 1 Till Kamppeter 2009-03-24 15:16:47 UTC

Created attachment 4863
OOo-2copies.ps

PostScript output generated by OpenOffice.org when printing. 2 copies were
requested and therefore the file contains the line "/#copies 2 def".

Comment 2 Marcos H. Woehrmann 2009-03-26 10:00:01 UTC

This appears to be a driver problem.  When I convert the attached file to TIFF it works as expected (each 
page is repeated twice).

The command line I'm using:

  bin/gs -sDEVICE=tiff24nc -o test.tif ./OOo-2copies.ps

Comment 3 Till Kamppeter 2009-03-26 10:30:22 UTC

Then this must be checked for each output device of Ghostscript. I can at least
say that this bug is valid for all X output devices and for the pdfwrite output
device.

Comment 4 Marcos H. Woehrmann 2009-03-26 10:47:04 UTC

I'm changing the title and assigning this to Ken to fix the pdfwrite issue and I've opened a new bug for the 
X11 device.  Please open additional bugs for any other devices that fail.

Comment 5 Ken Sharp 2009-03-26 13:28:30 UTC

Making multiple copies in a PDF file doesn't make any sense (any more than it
does for a screen display). Additionally, Acrobat Distiller ignores both the
/#copies operator and the /NumCopies page device parameter.

In my opinion our pdfwrite behaviour is correct.

Comment 6 Till Kamppeter 2009-03-26 13:50:01 UTC

We are entering a new dimension here. PDF is taking over the role of PostScript
in the printing workflow. It is becoming the standard print job format. See

https://www.linuxfoundation.org/en/OpenPrinting/PDF_as_Standard_Print_Job_Format

I have a CUPS environment (in Ubuntu Jaunty) where the standard print job format
is PDF. This especially means that all jobs get turned to PDF and processed by
the pdftopdf CUPS filter.

Unfortunately some applications (like OpenOffice.org) send PostScript and select
the number of copies by embedding "/#copies 2 def" in the PostScript. All jobs
get converted to PDF, PostScript jobs by Ghostscript with the "pdfwrite" output
device. For this situation I need the PDF containing the requested number of
copies (or containing some parameter telling that this PDF has to be rendered X
times). To not break other things I suggest that you make this functionality
optional, like only being active if "-dUSENUMCOPIES" is set on the Ghostscript
command line.

Comment 7 Ralph Giles 2009-03-26 14:51:51 UTC

The number of copies to print is job metadata. That's what /#copies (and
NumCopies in the page device dictionary) represents in Postscript, and how
pdfwrite should propagate it. As far as I can tell, PDF relegates that sort of
metadata to the various Job Ticket format extensions, all of which are
unfortunately very complicated.

So ideally, pdfwrite would check the number of copies and generate a JDF for
PDTF or whatever which reproduces it, and then support would be added for that
feature to the PDF interpreter to set NumCopies again when rendering the PDF. A
hack to duplicate the actual high-level output, as Till suggests, is worthwhile
if we're not willing to do that, although I suspect it would be a similar amount
of work for less maintainable code. Or if there's something simple and
equivalent to NumCopies in PDF, let's do that instead!

Till, you say this is an artefact of OpenOffice relying on this feature in its
Postscript driver, presumably because it doesn't talk to cups directly. What's
the normal way the number of copies is propagated, e.g. from the common
printing dialog? Can you describe in more detail what happens when an
application using cairo generates pdf directly, for example? Can gs, or the gs
wrapper, somehow convert this to an IPP attribute when it hands the pdf back to
cups? That might be easier than trying to implement the heavier embedded job
ticket formats.

Comment 8 Till Kamppeter 2009-03-26 16:20:57 UTC

The normal way is that the number of copies accompanies the job as an IPP
attribute and is not embedded in the PostScript. Apps link with libcups and use
functions from this library to poll printer lists and PPDs and also to set the
options and send the job. OpenOffice.org also links against the CUPS library,
loads the list of available printers and the PPDs from CUPS, and even sends
the option settings as IPP attributes; only the number of copies is sent as
embedded PostScript.

All CUPS filters of a filter chain to process a print job are called with the
same command line, where the fourth argument is the number of copies and the
fifth argument a string of space-separated key=value pairs for the options. In
the case of OpenOffice.org the fourth argument is always 1, as OpenOffice.org
does not send the IPP attribute for the copies; it expects that the PostScript
interpreter (regardless of whether it runs on the CUPS server or in the
printer) generates the copies. It is not possible for a CUPS filter (in our case
pstopdf) to modify the command line of the following filters. So pstopdf cannot
search for /#copies and then set the fourth argument for the rest of the filters
to the appropriate number. The only way a CUPS filter can react to something in
the input data is to modify the output data. This is also the only way a CUPS
filter can communicate with the subsequent filters.
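
As an illustration of the command line described above, the following Python sketch unpacks the arguments a filter receives. It is a sketch under assumptions, not CUPS code: the names FilterArgs and parse_filter_args are hypothetical, the sample job values are made up, and treating an option without "=" as "true" is a simplification.

```python
import shlex
from typing import Dict, List, NamedTuple


class FilterArgs(NamedTuple):
    job_id: str
    user: str
    title: str
    copies: int
    options: Dict[str, str]


def parse_filter_args(argv: List[str]) -> FilterArgs:
    """Split a filter command line into its job arguments."""
    # argv[0] is the filter name; the job arguments follow it.
    job_id, user, title, copies, options = argv[1:6]
    parsed = {}
    for token in shlex.split(options):
        key, sep, value = token.partition("=")
        # Assumption: an option given without "=value" is a boolean flag.
        parsed[key] = value if sep else "true"
    return FilterArgs(job_id, user, title, int(copies), parsed)


# Hypothetical job as OpenOffice.org would submit it: copies is always 1.
argv = ["pstopdf", "42", "till", "report", "1",
        "media=A4 sides=two-sided-long-edge"]
args = parse_filter_args(argv)
print(args.copies, args.options)
```

Because every filter in the chain receives the same copies value, a "/#copies 2 def" inside the job data is invisible at this level, which is the problem described above.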

I hope we do not need to wait for the JTAPI (Job Ticket API) library
implementation to be able to fix the problem with OpenOffice.org (OOo will
probably start sending print jobs as PDF before then).

Comment 9 Ken Sharp 2009-03-27 02:10:42 UTC

Ralph is correct, the number of copies is something which is job metadata, and
should be sent as Adobe PJTF, CIP4 JDF or CIP3 PPF or similar.

Adding support for a switch which emitted multiple copies of each page would be
non-trivial, as well as making the PDF file much larger (potentially very much
larger). That's not to say it's impossible, merely difficult, because we would
need to reserve more entries in the pages tree and the xref table, and emit each
page content stream multiple times with different object numbers.

I think we would be better to add the ability to embed PJTF in the PDF file
(Acrobat Distiller can do this), and put the copies parameter in there. This is
the nearest thing there is in a PDF file to NumCopies in PostScript. The PDF
Interpreter could then optionally read the PJTF and set #copies from it. However
that also raises the question of what to do with all the other content of a
PJTF, such as resolution.

This really is a workflow problem, and I think should be tackled by adopting
workflow solutions rather than hacking the behaviour of Ghostscript and pdfwrite.

Presumably the OpenOffice developers will face the same problem themselves if
they emit PDF files directly: they will need to either embed multiple copies of
the pages in the PDF file (inefficient), embed a PJTF in the PDF file, or set
the IPP parameters to CUPS properly....

Comment 10 Ray Johnston 2009-03-29 21:24:11 UTC

In general, I agree with Ken that this is really a workflow issue, not something
that is ideally solved in Ghostscript because it provides support for an older
(PS) printing workflow for something that is not supported with the PDF workflow.

Regarding Ken's comment (in comment #9):
> Adding support for a switch which emitted multiple copies of each page would
> be non-trivial, as well as making the PDF file much larger (potentially very
> much larger).

The PDF would not be much larger. The contents for the copies would be shared,
as would the image, font and other resources. Essentially what would be
needed would be to duplicate the indirect reference in the 'Kids' array of
Pages (and of course double the Count). I generated a 'tiger.pdf' using
Ghostscript, inflated it with toolbin/pdfinflt.ps and then changed:

2 0 obj
<</Kids [ 5 0 R
]
/Count 1
/Type /Pages
>>
endobj

to:

2 0 obj
<</Kids [ 5 0 R 5 0 R
]
/Count 2
/Type /Pages
>>
endobj

and, sure enough, the tiger shows up on both pages. I've attached this file,
with the caveat that it has a broken xref, but Ghostscript is able to repair
it without a problem.

I'm still not sure whether or not this is a good idea, but it doesn't seem that
hard and sure doesn't increase the file size appreciably.
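
The manual edit shown above can be written as a small text transformation. The Python sketch below is only an illustration and not part of pdfwrite: the function name duplicate_kids is hypothetical, it assumes an uncompressed Pages object with a flat /Kids array, and like the hand-edited file it leaves the xref offsets unrepaired.

```python
import re


def duplicate_kids(pages_obj: str, copies: int) -> str:
    """Repeat each reference in /Kids `copies` times and scale /Count."""

    def expand_kids(match):
        refs = re.findall(r"\d+\s+\d+\s+R", match.group(1))
        repeated = [ref for ref in refs for _ in range(copies)]
        return "/Kids [ " + " ".join(repeated) + "\n]"

    out = re.sub(r"/Kids\s*\[(.*?)\]", expand_kids, pages_obj, flags=re.S)
    # /Count must match the new number of leaf pages.
    return re.sub(r"/Count\s+(\d+)",
                  lambda m: "/Count %d" % (int(m.group(1)) * copies), out)


original = "2 0 obj\n<</Kids [ 5 0 R\n]\n/Count 1\n/Type /Pages\n>>\nendobj\n"
print(duplicate_kids(original, 2))
```

Note that later comments in this report find that several PDF consumers do not accept the same Page object being referenced more than once from the Pages tree.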

Comment 11 Ray Johnston 2009-03-29 21:25:09 UTC

Created attachment 4879
tiger2.pdf

A doubled 'tiger' PDF

Comment 12 Ken Sharp 2009-03-30 00:21:23 UTC

Ray, you beat me to reporting it. I'd just discovered at the weekend that
Acrobat was happy with simply modifying the Pages tree, I'd expected it to be
upset....

So it's feasible, but it's still ugly. I'll try and see how much work it'll be to
implement when I finally get my current problem resolved. It may not be too bad
if it only involves hacking the Pages tree. Not sure what to do about producing
a balanced tree, might be a little more effort.

Comment 13 SaGS 2009-03-30 14:50:46 UTC

This bug report reminded me of this old post from comp.text.pdf:

http://groups.google.com/group/comp.text.pdf/browse_thread/thread/28db6c2ee5dd8d46#eaa2bdf31e867403
(Message-ID: <3dec7fcb.1431823813@reading.news.pipex.net>)

I didn't check whether current Adobe Acrobat/Reader versions still have this
anomaly. Anyway, I think it's safer to create multiple PDF Page dictionaries
that share the values (the /Contents, /Resources, etc.), instead of just
referring to the same PDF Page dictionary multiple times from the Pages tree.

Comment 14 Ken Sharp 2009-03-30 23:58:13 UTC

Well, I did check a couple of recent versions of Acrobat, and they do seem to
handle this situation acceptably. Also the PDF file will only (I think ?) be
used inside CUPS, so as long as Ghostscript handles it correctly it's probably
mostly OK anyway.

I haven't thoroughly checked GS yet to find out what it does with this, but I
think it's OK.

In passing I checked with an ex-colleague who works on a different PS/PDF rip,
and he had co-incidentally recently been working with a PDF file which exhibited
exactly this setup.

All the same, thanks for the pointer to the old Usenet postings. I'm reluctant
to duplicate the page content streams  because they can comprise quite large
amounts of the PDF content. Also it does significantly increase the complexity
of the code in pdfwrite keeping track of all the objects.

Comment 15 Ken Sharp 2009-04-02 03:07:14 UTC

Till, I need your input with regard to CUPS and how you see this being used.

I've made a quick change for the purpose of investigation which simply
duplicates the entries in the pages tree enough times to satisfy the #copies or
NumCopies values in force at the time the page is completed. There is a new
switch for controlling this behaviour which defaults to false.

The resulting PDF file works well with Ghostscript, which seems to be perfectly
happy with the resulting PDF file, and in my tests so far correctly produces the
expected number of pages on output of a PDF file produced with NumCopies > 1.

However no version of Acrobat that I've checked (I've tried 4 versions ranging
from acrobat 4.0 to 9.0) is completely happy with PDF files created like this.
In general they will only display copy #1 correctly and produce varying errors,
and blank pages, when trying to view later duplicate pages. In general the first
copy of each page is OK, the subsequent copies do not display. 

Clearly a file which doesn't display well in Acrobat is not a very useful PDF
file, I don't think we should produce such files except for the very specific
reason of workflow problems. So we should only produce these files if they are
not intended as the final output, but merely an intermediate stage.

So, is this acceptable for your purposes ? That is, can you determine whether a
PostScript file is intended to terminate at producing a PDF file (in which case
do not preserve NumCopies) or is intended for some kind of further processing
with GS ? Is there any circumstance under which these PDF files could be sent to
a different PDF consumer such as Acrobat or xpdf ?

NB if you take the PDF file with the duplicated pages and run it back through GS
using the pdfwrite device it will happily duplicate the content streams
producing a PDF file which Acrobat is then happy with.

If this is not an acceptable solution, then we would need to duplicate the page
content streams instead of the page tree entries, which will lead to bigger PDF
files and take considerably more effort to code for.

Let me know what you think please.

Comment 16 Till Kamppeter 2009-04-02 03:37:21 UTC

The switch to activate the new functionality will only be used in the pstopdf
CUPS filter, so Ghostscript/pdfwrite called by other applications will not
suffer any regressions.

The further workflow is to pass the PDF through the pdftopdf filter, a
Poppler-based page management filter (rearranging of pages for N-up, reverse
order, selected pages, ...). After that it goes to the driver. For
non-PostScript printers it is usually Ghostscript that renders the PDF, but it
could also be Poppler, as there is for example a
Poppler-based pdftoopvp CUPS filter under development. For PostScript printers
the PDF gets converted back to PostScript by the pdftops filter, which is
based on Poppler in CUPS 1.3.x and optionally based on Ghostscript in CUPS 1.4.x
(in Ubuntu Jaunty it is based on Ghostscript).

So you see that both Ghostscript and Poppler are used to render PDF, and which
one is actually used depends on the printer/driver in use.

So if both Ghostscript and Poppler work with the output, the whole workflow
should work.

Comment 17 Ken Sharp 2009-04-02 03:59:50 UTC

Created attachment 4882
page_copies.pdf

I'm afraid I don't know how to drive Poppler, I've attached a PDF file here,
could you try it ? It should produce 3 copies of each of two pages. Page one
says 'Test', page 2 says 'Test1'.

I did try the file with xpdf, which I think Poppler is based on, and it does
not work, it gives four errors 'Loop in Pages tree' and one 'Page count in top
level pages object is incorrect'. It displays each page once only.

So it looks to me like this is not going to be a solution. I have briefly
looked at what would be required for duplicating the page content streams and I
think this will be several days work, possibly more. It'll take me a day or so
just to work out what needs to be done. If this is required I don't think it
will happen soon.

Comment 18 SaGS 2009-04-02 12:25:47 UTC

> ... what would be required for duplicating the page content streams ...

My understanding of that news:// post I mentioned in comment #13 is that 
Reader has trouble when the same PDF Page object is referenced more than once
from the Pages tree, not when multiple PDF Page objects share values like 
the /Contents stream.
- PDF Page = the dictionary described in table 3.27 ‘Entries in a page object’,
  (PDF1.7 page 145).
- I understand there's no problem with sharing indirect objects referenced 
  from these PDF Page. Of course direct objects cannot be shared, but the 
  contents stream, being a PDF Stream, can never be a direct object.

So what I think you need is to output the same PDF Page dictionary multiple 
times, each time with a different object #, and reference all these copies 
from the Pages tree. There won't be much duplication, as most content 
(the /Contents and the resources themselves - fonts, XObjects, etc) will be 
shared by all copies.
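
The layout described here can be sketched in Python as follows. This is a toy generator under stated assumptions, not pdfwrite's implementation: build_pdf is a hypothetical name, the object numbering, the 200x200 MediaBox and the one-line content stream are made up, and each copy gets an empty direct /Resources dictionary. It shows one Page dictionary per copy, all pointing at the same /Contents stream, with xref offsets computed from the bytes actually written.

```python
def build_pdf(copies: int) -> bytes:
    """Build a tiny PDF whose `copies` Page objects share one content stream."""
    content = b"0 0 m 100 100 l S"  # shared content: one diagonal line
    first_page = 4                  # objects 1-3: catalog, pages tree, contents
    page_nums = range(first_page, first_page + copies)

    objects = {
        1: b"<< /Type /Catalog /Pages 2 0 R >>",
        2: b"<< /Type /Pages /Count %d /Kids [ %s ] >>"
           % (copies, b" ".join(b"%d 0 R" % n for n in page_nums)),
        3: b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
    }
    for n in page_nums:
        # A separate Page dictionary per copy; /Contents is shared by reference.
        objects[n] = (b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] "
                      b"/Resources << >> /Contents 3 0 R >>")

    out = bytearray(b"%PDF-1.4\n")
    offsets = {}
    for num in sorted(objects):
        offsets[num] = len(out)     # byte offset of this object for the xref
        out += b"%d 0 obj\n%s\nendobj\n" % (num, objects[num])

    xref_pos = len(out)
    size = len(objects) + 1         # +1 for the free entry, object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for num in sorted(objects):
        out += b"%010d 00000 n \n" % offsets[num]
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (size, xref_pos))
    return bytes(out)


pdf = build_pdf(3)
print(pdf.count(b"/Contents 3 0 R"), "page dictionaries share object 3")
```

Only the small Page dictionaries are repeated, so the file grows by a few dozen bytes per copy rather than by the size of the page content.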

Maybe I'll test more this weekend to be sure. I have about 30 versions of 
Reader for _WIN32, starting with 3.01. I don't have any non-Windows version.

Comment 19 Till Kamppeter 2009-04-02 12:41:14 UTC

The file attached to comment #17 really does not get rendered correctly with
Poppler. The command line converter pdftops gives the same result as XPDF:

till@till-laptop:~/ghostscript/gpl/testfiles$ pdftops page_copies.pdf
Error: Loop in Pages tree
Error: Loop in Pages tree
Error: Loop in Pages tree
Error: Loop in Pages tree
Error: Page count in top-level pages object is incorrect
till@till-laptop:~/ghostscript/gpl/testfiles$ 

The resulting PostScript file displays each page only once.

Comment 20 SaGS 2009-04-02 12:42:42 UTC

Created attachment 4887
A 3-copies Tiger that seems to work.

There are 3 PDF Page objects (#4, #9, #10), and these share the /Contents
stream (#5) and resources (here a single one, PDF ExtGState #8). Direct objects
like the /MediaBox PDF array, the /Resources PDF dictionary, and the PDF array
that's the value for /ProcSet have to be duplicated (unless replaced with
indirect objects), but these are not large.

Comment 21 Till Kamppeter 2009-04-02 12:52:17 UTC

In principle the 3-copies tiger works, it displays three times in XPDF, but it
gives the following console messages:

till@till-laptop:~/ghostscript/gpl/testfiles$ xpdf tiger-3-copies.pdf
Error: PDF file is damaged - attempting to reconstruct xref table...
XtUngrabButton(drawArea,3,0)
Warning: Attempt to remove nonexistent passive grab
till@till-laptop:~/ghostscript/gpl/testfiles$ 

Also gv shows it three times but with the following console message:

till@till-laptop:~/ghostscript/gpl/testfiles$ gv tiger-3-copies.pdf
   **** Warning:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
till@till-laptop:~/ghostscript/gpl/testfiles$ 

gs gives:

till@till-laptop:~/ghostscript/gpl/testfiles$ gs tiger-3-copies.pdf
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
Processing pages 1 through 3.
Page 1
   **** Warning: stream Length incorrect.
>>showpage, press <return> to continue<<

Page 2
   **** Warning: stream Length incorrect.
>>showpage, press <return> to continue<<

Page 3
   **** Warning: stream Length incorrect.
>>showpage, press <return> to continue<<


   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> GPL Ghostscript SVN PRE-RELEASE 8.64 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

GS>quit
till@till-laptop:~/ghostscript/gpl/testfiles$ 

Console output of pdftops:

till@till-laptop:~/ghostscript/gpl/testfiles$ pdftops tiger-3-copies.pdf
Error: PDF file is damaged - attempting to reconstruct xref table...
till@till-laptop:~/ghostscript/gpl/testfiles$ 

After that gv shows the resulting PostScript file without any further console
output.

Comment 22 SaGS 2009-04-02 13:01:34 UTC

> Error: PDF file is damaged - attempting to reconstruct xref table...
> ... etc

That's normal, because I edited the file in a text editor, and haven't 
recomputed object offsets to fix the xref. (Yes, text editor; the file uses 
only ASCII characters, because the contents stream is ASCII85-encoded.)

Comment 23 Ken Sharp 2009-04-03 00:27:23 UTC

OK, first up the 'quick hack' isn't going to work since few PDF consumers like
it. Thanks for checking for me Till, I was pretty sure it wasn't going to work,
but it was quick to code.

SaGS' suggestion of manufacturing new page dictionaries is feasible, but
probably about as much work as duplicating the content streams, though it has
the obvious advantage of not massively increasing the file size.

As I said, I'll look into what it will take to do this or, if there's no other
solution, to duplicate the entire content stream. Since I think it's several
days' work either way, it's not going to be soon, sorry.

Comment 24 Ken Sharp 2009-04-03 07:25:28 UTC

Created attachment 4891
690355.patch

OK, here is a preliminary patch to implement a new switch 'DoNumCopies', this
switch is only relevant to the pdfwrite device, and causes it to emit multiple
copies of each page. It 'should' keep track of both /#copies and /NumCopies
through the course of jobs, so you can have different numbers of copies of each
page.

I haven't finished testing it yet, but so far it seems OK. Documentation to
follow when I check it in. 

You should *not* use this with any file containing pdfmarks which refer to
pages (eg /Dest) as these definitely will not work as expected. Obviously if
(like CUPS) you only intend to print/process the file this isn't a concern
either.

The output PDF file is slightly bigger, as the code currently duplicates the
Resources dictionary for each copy of each page (the resources themselves are
not duplicated though). I don't think this is a major concern so I'm unlikely
to try and address it.

Comment 25 Ralph Giles 2009-04-03 17:09:08 UTC

Ken, if you're going to do the plumbing to rewrite the page tree and/or content
streams, it may be worth going a little bit further to support some of the
reordering and imposition features? After all, half of this is just so they can
call the poppler-based impose filter.

Comment 26 Ken Sharp 2009-04-04 00:42:49 UTC

Ralph, I'm not rewriting or re-ordering the content streams, all I'm doing is
duplicating the individual page dictionaries and adding the extra dictionaries
to the Pages tree.

This is fairly straightforward, modulo some fiddling to reproduce all the
required resource dictionaries. But there's no reordering going on, each page is
added to the tree in order, then a number of duplicates added, then we move on
to the next page. Re-ordering would be an additional task. Not impossible, but a
fair degree harder because we don't know when we start how many pages there are
(unless the input is PDF or DSC compliant PostScript).
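
The page ordering described above (each page, then its duplicates, then the next page) can be illustrated with a short Python sketch. The function name expand_pages and the page labels are hypothetical; the per-page copy count stands in for whatever /#copies or /NumCopies value is in force when that page is completed.

```python
from typing import Iterable, List, Tuple


def expand_pages(pages: Iterable[Tuple[str, int]]) -> List[str]:
    """Emit each page followed by its duplicates, keeping input order."""
    order = []
    for name, copies in pages:
        # All copies of a page are added before moving on to the next page.
        order.extend([name] * copies)
    return order


print(expand_pages([("page1", 3), ("page2", 2)]))
```

The result is uncollated output; producing collated sets would need the reordering that is noted above as a separate task.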

I think we should add that as a different enhancement if we want to do it.

Comment 27 Ken Sharp 2009-04-04 03:19:35 UTC

Enhancement added in revision 9615, patch here:

http://ghostscript.com/pipermail/gs-cvs/2009-April/009190.html

Please note that this differs from the patch in comment #24 above, the flag has
changed name from 'DoNumPgaes' to 'DoNumCopies'

Till, it would be useful if you could test this with CUPS & Poppler, my testing
has been limited to GS and Acrobat, and not very many files with multiple copies
set. Because I'd changed code which affects general file writing I spent most
time checking that there were no regressions.

Comment 28 Ray Johnston 2009-04-04 09:38:49 UTC

On Ralph's comment #25, it seems that the only reason that poppler is part of
the workflow is to take the PDF (possibly created by gs from a PS file), and
munge it to apply page ordering, N-up and the like.

If we want to replace Poppler in the pipeline, the gs rendering which is the
recipient of the munged file from Poppler, would perform the page ordering,
N-up etc. Page re-ordering is simple (reversal of the 'dopdfpages' loop in
pdf_main.ps). Unfortunately, there is a bug against gsnup that uses 'BeginPage',
'EndPage' to perform N-up that makes it not work with PDF (bug 688318).

Comment 29 Till Kamppeter 2009-04-05 12:33:59 UTC

Ray, CUPS always has a page management filter, which does the N-up, selected
pages, multiple copies, software collate, ... Before the introduction of the PDF
printing workflow this was pstops, completely written by the CUPS developers and
not using any renderer libraries like libpoppler or libgs. For the PDF printing
workflow we replace pstops by pdftopdf, which does the same on a PDF data
stream. Currently it uses libpoppler because Poppler's API allows easy
manipulation of pages. To get rid of Poppler, a Ghostscript-based program has to
replace the current pdftopdf filter.

Comment 30 Till Kamppeter 2009-04-05 14:51:15 UTC

Thank you very much for the fix.

It works great. I have tested it with a real CUPS workflow (see the Ubuntu bug
report). For that I have taken the patched Ghostscript 8.64 and I have added
"-dDoNumCopies" to the ps2pdf13 command line in the pstopdf CUPS filter.

All this will appear in Ubuntu Jaunty.
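
For reference, a Ghostscript invocation with the new switch could be assembled as in the Python sketch below. This is not the pstopdf filter itself (which calls ps2pdf13): the function name pdfwrite_command and the file names are hypothetical, and the flags other than -dDoNumCopies are assumed typical batch-conversion options rather than anything taken from this report.

```python
from typing import List


def pdfwrite_command(ps_path: str, pdf_path: str,
                     do_num_copies: bool = True) -> List[str]:
    """Return an argv list that converts PostScript to PDF with pdfwrite."""
    cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-dSAFER", "-sDEVICE=pdfwrite"]
    if do_num_copies:
        # Switch from this report: honour /#copies and /NumCopies in pdfwrite.
        cmd.append("-dDoNumCopies")
    cmd += ["-sOutputFile=" + pdf_path, ps_path]
    return cmd


print(" ".join(pdfwrite_command("OOo-2copies.ps", "out.pdf")))
```

The list can be passed to subprocess.run on a system where a Ghostscript build containing the fix is installed; without the switch the behaviour stays as before.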