Bug 692886 - Using -dFirstPage/-dLastPage with Portfolio PDFs
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: master
Hardware: PC All
Importance: P1 enhancement
Assignee: Alex Cherepanov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-28 19:49 UTC by Marcos H. Woehrmann
Modified: 2014-02-17 04:44 UTC

See Also:
Customer: 531
Word Size: ---


Description Marcos H. Woehrmann 2012-02-28 19:49:43 UTC
As a follow-up to Bug 692859, the customer would like -dFirstPage= and -dLastPage= to work with Portfolio PDF files.  The customer suggests the following behaviour:

For example, say the portfolio contained three PDF files, where:
File 1 has one page
File 2 has 10 pages
File 3 has 5 pages

Say I want pages 1 through 5, so the command would be something like:
  -dFirstPage=1 -dLastPage=5

From the situation described above, I think we should expect page 1
from file 1, and pages 1 through 4 from file 2.
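The requested behaviour can be sketched as follows. This is only an illustration of the sequential numbering described above, not Ghostscript code; the function name split_page_range and the page_counts list are hypothetical and simply mirror the three-file example.

```python
from typing import List, Tuple


def split_page_range(page_counts: List[int], first: int, last: int) -> List[Tuple[int, int, int]]:
    """Map a 1-based page range over the whole portfolio onto
    (file_number, first_page, last_page) tuples, one per affected file."""
    selections = []
    offset = 0  # pages contained in the files already walked
    for file_number, count in enumerate(page_counts, start=1):
        lo = max(first, offset + 1)      # first global page that falls in this file
        hi = min(last, offset + count)   # last global page that falls in this file
        if lo <= hi:
            # Convert global page numbers back to page numbers within this file.
            selections.append((file_number, lo - offset, hi - offset))
        offset += count
    return selections


# File 1 has 1 page, file 2 has 10 pages, file 3 has 5 pages.
print(split_page_range([1, 10, 5], first=1, last=5))
# prints [(1, 1, 1), (2, 1, 4)]
```

That is, page 1 of file 1 followed by pages 1 through 4 of file 2, as the customer expects.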
Comment 1 Alex Cherepanov 2012-02-29 01:01:26 UTC
I can implement this approach, but it violates the principle of least
astonishment.

An alternative, selecting the page range(s) by file number
and a range within each file, will likely grow into a JCL-like language:
-sPages=(,,(FirstPage=3,LastPage=5),2(,LastPage=1))
where the number before the bracket is a repeat count.

Does anybody have better ideas?
Comment 2 Robin Watts 2012-02-29 13:27:11 UTC
Personally, I'd prefer a -dPortfolioEntry to say what portfolio entry to use, where 0 (the default) means 'the top level file', 1 means the first file, 2 means the second etc. Then FirstPage and LastPage select within them.
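A minimal sketch of the selection rule proposed here, assuming the semantics exactly as stated in this comment. -dPortfolioEntry is only a proposal in this thread; the function pages_for_entry and its parameters are hypothetical and are not part of Ghostscript.

```python
from typing import List, Optional


def pages_for_entry(cover_pages: int, embedded_page_counts: List[int],
                    entry: int = 0, first_page: int = 1,
                    last_page: Optional[int] = None) -> range:
    """Return the 1-based page numbers selected within one portfolio entry.

    entry 0 (the default) is the top-level file; entry n is the n-th embedded file.
    first_page and last_page are interpreted within the chosen entry only.
    """
    if entry < 0 or entry > len(embedded_page_counts):
        raise ValueError(f"no portfolio entry {entry}")
    total = cover_pages if entry == 0 else embedded_page_counts[entry - 1]
    first = max(first_page, 1)
    last = total if last_page is None else min(last_page, total)
    return range(first, last + 1)


# Pages 1 through 4 of the second embedded file (which has 10 pages).
print(list(pages_for_entry(1, [1, 10, 5], entry=2, first_page=1, last_page=4)))
# prints [1, 2, 3, 4]
```

Under this model a page range never crosses from one embedded file into the next, so each entry has to be requested separately.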
Comment 3 Marcos H. Woehrmann 2012-02-29 21:19:06 UTC
I think the -dPortfolioEntry option makes sense, but I'm not sure how it would be used to extract the pages as in the Description (i.e. page 1 of the first document and pages 1 through 4 of the second document).

Perhaps:

  -dPortfolioEntry=1 -dPortfolioEntry=2 -dFirstPage=1 -dLastPage=4

(the first -dPortfolioEntry doesn't need FirstPage/LastPage since all the pages from that document will be extracted).
Comment 4 Marcos H. Woehrmann 2012-03-01 16:06:42 UTC
The customer suggests that they would prefer we treat Portfolio PDFs as multi-page PDFs:

Thinking of this purely from a RIP standpoint, a portfolio PDF is simply a collection of PDF files (actually can contain files that ARE NOT PDF too but....) that have, at the end of the day, x PDF pages total. Very similar to a multipage PDF.

Specifying the "part" of the portfolio on which to operate seems to defeat the purpose of making a simple pagerange request against the whole thing.  Then again, one could argue the point that each PDF within the portfolio should be treated as a separate entity entirely.

I think of these portfolios more or less as a zip file or similar.....

I will discuss this with some people on this side to get more opinions, but wanted to offer my two cents.
Comment 5 Customer 531 2012-03-02 17:28:59 UTC
(In reply to comment #4)
> The customer suggests that they would prefer we treat Portfolio PDFs as
> multi-page PDFs:
> 
> Thinking of this purely from a RIP standpoint, a portfolio PDF is simply a
> collection of PDF files (actually can contain files that ARE NOT PDF too
> but....) that have, at the end of the day, x PDF pages total. Very similar to a
> multipage PDF.
> 
> Specifying the "part" of the portfolio on which to operate seems to defeat the
> purpose of making a simple pagerange request against the whole thing.  Then
> again, one could argue the point that each PDF within the portfolio should be
> treated as a separate entity entirely.
> 
> I think of these portfolios more or less as a zip file or similar.....
> 
> I will discuss this with some people on this side to get more opinions, but
> wanted to offer my two cents.

Another thing to perhaps further this point: 
In using a standard command, e.g. -sOutputFile=FOLIO-%03d.tif, against a portfolio document that contains multiple PDFs (some of which may be multipage or not) we get the expected number of output pages total, without having to do anything special.

Because of this, I think -dFirstPage/-dLastPage should follow the same model, which may actually honor the principle of least astonishment.
Comment 6 Robin Watts 2012-03-02 17:59:01 UTC
I'm going to disagree here, sorry.

> Thinking of this purely from a RIP standpoint, a portfolio PDF is simply a
> collection of PDF files (actually can contain files that ARE NOT PDF too
> but....)

No buts. A Portfolio PDF is a 'cover' PDF, and several embedded files, which may or may not be PDFs.

To attempt to treat the whole thing as a single PDF is destined to fail.

Consider the fact that I could make a Portfolio PDF that contains a PostScript file, a PCL file, an XPS file and another PDF file.

GhostPDL can cope with all those formats; is it reasonable to expect us to attempt to extract pages from all those subtypes?

Even if they *are* all PDFs, what happens if one is malformed or corrupt? Do we suddenly lose the ability to access the other ones? And what if they have different permissions (one might be forbidden to print, for example)?

> Specifying the "part" of the portfolio on which to operate seems to defeat the
> purpose of making a simple pagerange request against the whole thing.

My argument is that we shouldn't allow a simple pagerange request against the whole thing. Any pagerange request against a portfolio PDF will access just the 'cover' document, as you'd expect.

If you want to access one of the embedded files, then you explicitly specify which file to access, and (optionally) the page range within that document.

> Then again, one could argue the point that each PDF within the portfolio
> should be treated as a separate entity entirely.

That's exactly what I'd argue.

> I think of these portfolios more or less as a zip file or similar...

Spot on. Just because I put a bunch of Word documents into a zipfile doesn't mean it's reasonable for me to expect Word to support printing from them all at once.
Comment 7 Marcos H. Woehrmann 2012-03-04 23:04:38 UTC
I've been looking into how Adobe Acrobat 10.1.2 handles Portfolio PDF files and my conclusion is that it handles the 74.pdf file similarly to how Customer 531 expects Ghostscript to act.

First of all, the cover PDF page is not accessible, so Robin's statement that "Any pagerange request against a portfolio PDF will access just the 'cover' document, as you'd expect." doesn't match what a user of Acrobat would expect.  The only reference to the cover document is in the Document Properties screen, which reports the Page Size as 7 x 5, which is the size of the cover document.  It's possible that there is a way of viewing the cover PDF file in Acrobat, but I couldn't find it and it's certainly not the default.

Second, Acrobat is perfectly happy printing a Portfolio PDF as a series of pages.  In fact, if you have none of the individual documents selected when you choose Print, the only option is to print "All PDF files".  If you select a subset of the documents (either by dragging or command-clicking), the default print option is to print "Selected PDF files", with "All PDF files" being an option.  There doesn't seem to be a way of printing a subset of pages from within a Portfolio PDF using the Adobe Print dialog box; however, if you select the Printer Dialog Box you can specify a range of pages.  I don't have a Portfolio PDF file with multiple pages per document to test, but going back to the example in the original description of this bug report, I presume that if I told Acrobat to print all the PDF documents and then set the page range in the Printer Dialog Box from 1 to 5, it would print the first document followed by pages 1 through 4 of the second document.

I can't predict how Acrobat would handle a Portfolio PDF file that consists of multiple different document types, but I'm not sure that such a thing exists in the real world, so perhaps it's a moot point.  I also don't see how it matters from the command line interface point of view; whether we support FirstPage/LastPage or PortfolioEntry, it's going to come down to the same thing: either we RIP the document or we don't.

Similarly, I don't see how the choice between the FirstPage/LastPage and PortfolioEntry command line options affects how we handle damaged PDF files within a Portfolio PDF.  If you specify a FirstPage/LastPage range that includes a damaged file, presumably Ghostscript will fail when the applicable document is printed; you could then specify a starting page that skips that page or that document entirely and try printing again.

I don't see why we can't support both command line options:

1.  If you have a Portfolio PDF and specify only FirstPage and/or LastPage, Ghostscript treats the document as a multiple-page PDF document and acts the way Customer 531 prefers.  This emulates how Acrobat works and therefore, I suggest, "honors the principle of least astonishment".

2.  If you specify PortfolioEntry on the command line, Ghostscript acts as Robin prefers.  This is an improvement over Acrobat in that you have a much richer way of specifying what pages to print (e.g. print page 2 of document 3 followed by pages 4 through 5 of document 2).
Comment 8 Customer 531 2012-03-09 21:22:45 UTC
In speaking with colleagues and industry partners, we would like to see portfolios treated as multipage.

Another thing to perhaps further this point: 
In using a standard command, e.g. -sOutputFile=FOLIO-%03d.tif, against a
portfolio document that contains multiple PDFs (some of which may be multipage
or not) we get the expected number of output pages total, without having to do
anything special. If the portfolio contains other bits (I made a few that contain subfolders containing xls, PCL, etc.), GS happily ignores them and simply carves out the pages.
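As a rough illustration of the point about output page counts, the sketch below lists the file names one would expect from a FOLIO-%03d.tif template when only the embedded PDFs contribute pages. The entries list, the output_names function and the use of the file extension to recognise PDFs are all assumptions made for the example; it does not show how Ghostscript itself detects or skips non-PDF entries, and it assumes output numbering starts at 1.

```python
from typing import List, Tuple


def output_names(entries: List[Tuple[str, int]], template: str = "FOLIO-%03d.tif") -> List[str]:
    """Return one output file name per page, counting pages of PDF entries only.

    Each entry is (file_name, page_count); non-PDF entries are skipped entirely.
    """
    total_pages = sum(pages for name, pages in entries if name.lower().endswith(".pdf"))
    return [template % page for page in range(1, total_pages + 1)]


# Hypothetical portfolio: three PDFs plus two entries that are not PDFs.
portfolio = [("a.pdf", 1), ("sheet.xls", 0), ("b.pdf", 10), ("job.pcl", 0), ("c.pdf", 5)]
names = output_names(portfolio)
print(len(names), names[0], names[-1])
# prints 16 FOLIO-001.tif FOLIO-016.tif
```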

> First of all the cover PDF page is not accessible, so Robin's statement that
> "Any pagerange request against a portfolio PDF will access just the 'cover'
> document, as you'd expect." doesn't match what a user of Acrobat would expect

This is truth.

> Second, Acrobat is perfectly happy printing a Portfolio PDF as a series of
> pages.  In fact if you have none of the individual documents selected when you
> choose Print, the only option is to print "All PDF files".  You can select a
> subset of the documents (either by dragging or command-clicking) then the
> default print option is to print "Selected PDF files", with "All PDF files"
> being an option.  There doesn't seem to be a way of printing a subset of pages
> from within a Portfolio PDF using the Adobe Print dialog box.

This is also truth.
Comment 9 SaGS 2012-03-10 05:23:36 UTC
> First of all the cover PDF page is not accessible

Does the PDF contain a Collection dictionary (referenced from the document's Catalog) with a /D entry? This /D entry defines which of the embedded files to display initially, and may give the impression the cover is skipped. The default is to display the cover PDF.
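The check being asked about can be sketched like this, with the document Catalog modelled as a plain Python dictionary. This is an assumption made to keep the example self-contained; reading the Catalog from a real file would need a PDF library, and the function name initial_document is hypothetical.

```python
from typing import Optional


def initial_document(catalog: dict) -> Optional[str]:
    """Return the name of the embedded file displayed initially,
    or None when the cover PDF itself is the initial document."""
    collection = catalog.get("/Collection")
    if not isinstance(collection, dict):
        return None  # no Collection dictionary: nothing redirects the viewer
    name = collection.get("/D")
    # A missing (or non-string) /D entry means the cover PDF is displayed.
    return name if isinstance(name, str) else None


print(initial_document({"/Collection": {"/D": "first.pdf"}}))  # prints first.pdf
print(initial_document({"/Collection": {}}))                   # prints None
```

If the first call matches the file in question, the viewer opens an embedded file by default, which could explain why the cover appears to be skipped.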
Comment 10 Alex Cherepanov 2012-03-29 06:26:33 UTC
Use sequential page numbering for -dFirstPage and -dLastPage parameters
when they are used with PDF Collections.

A patch for this enhancement has been committed as:
http://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=23e8552bb2c1849c118d9f5d81f5629ebe436acb