692250 – Add -sPages option

Summary: Add -sPages option

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Interpreter
Version:	master
Hardware:	All All

Importance:	P4 enhancement
Assignee:	Alex Cherepanov

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-06-01 23:59 UTC by Robin Watts
Modified:	2015-06-15 05:13 UTC
CC List:	3 users

See Also:
Customer:
Word Size:	---

Description Robin Watts 2011-06-01 23:59:52 UTC

Currently, it's impossible to use Ghostscript to extract more than a simple page range from a PDF, or to reorder a PDF.

We can extract a simple page range using:

  gs -sDEVICE=pdfwrite -o out.pdf -dFirstPage=5 -dLastPage=10 in.pdf

but we have no way to specify a reversal of pages, or a more complex set of extractions.

Would it be possible to allow specifications such as:

 -sPages="-4,7,9,10,23,33-37,45-"

(to mean, pages 1 to 4, 7, 9, 10, 23, 33 to 37 inclusive and 45 onwards)

We could even allow things like:

 -sPages="odds,evens"

to mean "all the odd pages followed by all the even ones"

And maybe (for bonus credit)

 -sPages="20-10"

to mean pages 20 to 10 in reverse order.

(If we adopt the last one, we might also want to allow N to be used to mean the highest numbered page in a pdf, so we can do -sPages="N-1", and maybe even -sPages="revodds,revevens" for reverse odds and evens respectively).
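A minimal sketch of how the syntax proposed above could be expanded into an ordered page list, written in Python purely for illustration. The function name expand_pages and the decision to silently drop out-of-range pages are assumptions, not part of the proposal; this is not an existing Ghostscript feature.

```python
def expand_pages(spec, page_count):
    """Expand a page specification into an ordered list of 1-based page numbers.

    Hypothetical helper: it only illustrates the syntax proposed in this report.
    """
    def to_int(token):
        # 'N' stands for the highest numbered page, as suggested above.
        return page_count if token == "N" else int(token)

    pages = []
    for item in spec.split(","):
        item = item.strip()
        if item == "odds":
            pages.extend(range(1, page_count + 1, 2))
        elif item == "evens":
            pages.extend(range(2, page_count + 1, 2))
        elif item == "revodds":
            pages.extend(reversed(range(1, page_count + 1, 2)))
        elif item == "revevens":
            pages.extend(reversed(range(2, page_count + 1, 2)))
        elif "-" in item:
            # "A-B" is a range; an empty side means "from page 1" or "to the end".
            left, right = item.split("-", 1)
            first = to_int(left) if left else 1
            last = to_int(right) if right else page_count
            # A descending range such as "20-10" yields pages in reverse order.
            step = 1 if first <= last else -1
            pages.extend(range(first, last + step, step))
        else:
            pages.append(to_int(item))

    # Assumption: pages outside the document are dropped rather than reported.
    return [p for p in pages if 1 <= p <= page_count]
```

For example, expand_pages("odds,evens", 5) lists the odd pages followed by the even ones. Whether out-of-range pages should be dropped or treated as an error is exactly the kind of detail that would need discussion.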

The exact functionality etc. can be discussed here, but this does seem to be a hole in Ghostscript's skill set that should be relatively simple to plug.

Comment 1 Robin Watts 2011-06-02 00:01:48 UTC

Of course, there is no reason this needs to be restricted to the PDF interpreter - if we come up with a sensible way of working, we can implement the same thing in the PCL engine too.

Page reversal etc may be useful for printers with collating engines etc.

Comment 2 Ray Johnston 2011-06-02 04:06:32 UTC

While it is very simple to do the -sPages type of thing for PDF (and probably
XPS) where the document has a page oriented structure, doing this in a more
general fashion for PS or PCL would require the use of the 'saved page' type
of mechanism which would work for 'gx_device_printer' sub-class devices that
use the clist. This (plus other advanced features such as imposition, tiling and
copy collation) was part of what I used at CalComp.

We are planning on discussing saved page design at the next staff meeting.

Handling page reordering in the pdfwrite device also wouldn't be too difficult
(and might be possible for ps2write since the temp files are assembled
when the device closes).

Comment 3 Alex Cherepanov 2011-06-02 04:53:38 UTC

We have the pdf2dsc script, which converts PDF to DSC-conforming PS.
The latter can be easily processed using DSC parsers or scripting
languages and converted back to PDF.

Has any customer asked for this enhancement?

Comment 4 Ken Sharp 2011-06-02 07:42:58 UTC

Having looked at the IRC log, I think the requested behaviour can be achieved in 2 different ways.

1) Use of a custom EndPage, which only transfers the rasterised data to the output if a particular criterion is met. You can program this any way you want, but it requires some PostScript. I'm not sure this will work reliably with pdfwrite, but I *think* it does.

2) The PDF interpreter is capable of extracting an arbitrary page from the document, which is how FirstPage and LastPage work now, as well as the regular page rendering in fact. You can trivially write PostScript code which will extract pages as required, in any order, and send them to the device. This *will* work reliably with pdfwrite. I did this for someone on comp.lang.postscript some time back to take groups of n pages from a PDF file, trim them to a given size, and output each page to pdfwrite.

For PostScript input this is not possible in some cases, as later pages can depend on the content of earlier pages, unless the file is DSC compliant. For PCL there isn't any way to find a page without interpreting the data (unless I'm mistaken?). And again, I think it's possible that, for example, bitmap fonts can be declared on one page and used on later pages. So this probably wouldn't be a useful feature for those languages (and for PostScript you can use the EndPage trick, though you might have to process the file multiple times).

Sorry I wasn't around on irc at the time to explain this stuff, but I don't think this is worthy of an enhancement myself.

Comment 5 Robin Watts 2011-06-02 11:46:20 UTC

Ken: I am sure that what you are describing is possible using PostScript, but it's beyond most users (I certainly wouldn't be able to do it easily, and I count myself as above average in technical competence).

Having a 'simple' mechanism to achieve the result seems like a win to me.

alexcher: No customer has asked yet, as far as I know, but we had a user asking for it yesterday.

Using a 2 stage process might be suitable for a hack, but it's not as nice as being able to do it directly. (Does going to DSC postscript and back lose us transparency or metadata for instance?)

ray: I appreciate that it may not be possible for all languages (or may not be possible trivially, or using the same mechanism at least), but having a standard way of specifying the pages we want seems sensible.

Comment 6 Henry Stiles 2011-06-02 14:56:47 UTC

> 
> Sorry I wasn't around on irc at the time to explain this stuff, but I don't
> think this is worthy of an enhancement myself.

I was not in favor of this either. The pages can be extracted with FirstPage and LastPage in multiple passes using a script, and then other tools can be used to concatenate the results. Feature bloat.
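The multi-pass approach can be sketched as follows (Python, for illustration only). It groups a page list into ascending runs and builds one gs command per run, using only the -dFirstPage/-dLastPage form shown in the description. The helper names and the out_prefix naming scheme are hypothetical, and concatenating the resulting files is left to whichever other tool is chosen.

```python
def contiguous_runs(pages):
    """Group an ordered page list into ascending (first, last) runs."""
    runs = []
    for page in pages:
        if runs and page == runs[-1][1] + 1:
            # Page directly follows the current run: extend it.
            runs[-1] = (runs[-1][0], page)
        else:
            # Gap, repeat, or descending step: start a new run (one more pass).
            runs.append((page, page))
    return runs


def extraction_commands(pages, in_path, out_prefix):
    """Build one gs argument list per run; nothing is executed here."""
    commands = []
    for index, (first, last) in enumerate(contiguous_runs(pages), start=1):
        commands.append([
            "gs", "-sDEVICE=pdfwrite",
            "-o", f"{out_prefix}-{index}.pdf",  # hypothetical naming scheme
            f"-dFirstPage={first}", f"-dLastPage={last}",
            in_path,
        ])
    return commands
```

Note that a reversed range such as 3, 2, 1 degenerates into one pass per page, which gives a feel for the cost of this workaround.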

Comment 7 Ray Johnston 2011-06-02 16:19:39 UTC

The PostScript sequence to process pages from a PDF in arbitrary order relies 
on two simple Ghostscript specific operators:

   (filename.pdf) (r) file runpdfbegin % this loads the tables for the PDF

After the above, the number of pages will be in 'pdfpagecount'

   <first> <last> dopdfpages % processes page ranges in arbitrary order

Examples:

   1 pdfpagecount dopdfpages % the default order

   pdfpagecount -1 1 { dup dopdfpages } for % process pages in reverse order

   1 2 pdfpagecount { dup dopdfpages } for % process odd pages
   2 2 pdfpagecount { dup dopdfpages } for % process even pages

   [ 1 5 9 2 6 10 3 7 4 8 ] { dup dopdfpages } forall % arbitrary order

I guess that we can just tell folks to use 'pdfwrite' on whatever the input
file is (PS, PDF, PCL, XPS), then run again with the above "trick" to
whatever output format (-sDEVICE) is desired.
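As a rough sketch, the fragments above can be assembled into a complete gs invocation from a page list. The Python below only builds the argument list and does not run Ghostscript. The function names are hypothetical, the operators (runpdfbegin, dopdfpages, runpdfend) are taken from this thread, and their availability in any given Ghostscript version is an assumption.

```python
def ps_string(text):
    """Quote text as a PostScript string literal, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def reorder_command(pages, in_path, out_path, device="pdfwrite"):
    """Build a gs argument list that sends the given pages, in order, to a device."""
    page_array = " ".join(str(p) for p in pages)
    # Same PostScript as in the examples above: open the PDF, process each
    # page as a one-page range, then close the PDF.
    program = (
        f"{ps_string(in_path)} (r) file runpdfbegin "
        f"[ {page_array} ] {{ dup dopdfpages }} forall runpdfend"
    )
    return ["gs", f"-sDEVICE={device}", "-o", out_path, "-c", program]
```

The resulting list can be handed to subprocess.run, which avoids shell quoting problems with the parentheses and braces in the PostScript program.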

Comment 8 cryptopsy 2011-06-23 13:50:45 UTC

Assume a 10-page document called original.pdf; splitting it into page ranges is then very easy.

The basic split command is:

  gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -q -dFirstPage=$page -dLastPage=$page -o original.pdf-$page original.pdf

and the ranges are managed in the shell (for loops over specific ranges). For automated splitting there is still a problem: the error code is reported incorrectly. Assume the document is 10 pages in length and attempt to extract the 12th page from a loop such as:

  while [[ "$?" == "0" ]]; do   ...

This produces:

'Requested FirstPage is greater than the number of pages in the file: 12
   No pages will be processed (FirstPage > LastPage).'

This is wrong because:
1) "FirstPage > LastPage" is wrong, as proven by specifying FirstPage=LastPage within the range of the document.
2) The error code is 0, which is also the error code for a successful split.

Proof, using the 10-page document:
# gs -sDEVICE=pdfwrite -dFirstPage=12 -dLastPage=12 -o test.pdf-12 test.pdf ; echo $? 
0
# gs -sDEVICE=pdfwrite -dFirstPage=9 -dLastPage=9 -o test.pdf-12 test.pdf ; echo $? 
0

The documentation has something to say about this; could someone clarify it?
# elinks /usr/share/doc/ghostscript-gpl-8.71-r6/html/Use.htm
...
   Note however that the one page per file feature is not supported by all     
   devices; in particular it does not work with document-oriented output       
   devices like pdfwrite and pswrite. See the -dFirstPage and -dLastPage       
   switches below for a way to extract single pdf pages. 
...

Comment 9 cryptopsy 2011-06-23 13:51:21 UTC


Comment 10 Ken Sharp 2011-06-23 13:56:51 UTC

(In reply to comment #9)

> and the ranges are managed with shell (for loops for specific ranges). For
> automated splitting, there is still a problem - the error code is reported
> incorrectly. 

 Which error code ?


> ...would produce;
> 'Requested FirstPage is greater than the number of pages in the file: 12
>    No pages will be processed (FirstPage > LastPage).'
> 
> This is wrong because
> 1)  FirstPage > LastPage is wrong, as proven by specifying FirstPage=LastPage
> in the range of the document.
> 2) The error code for a succesful split is also 0. 

There is a limit to how much granularity is available in error reporting. Just because this particular extraction failed does not mean that the whole job failed. 

> The documentation has something to say, could someone clarify this?
> # elinks /usr/share/doc/ghostscript-gpl-8.71-r6/html/Use.htm
> ...
>    Note however that the one page per file feature is not supported by all     
>    devices; in particular it does not work with document-oriented output       
>    devices like pdfwrite and pswrite. See the -dFirstPage and -dLastPage       
>    switches below for a way to extract single pdf pages. 
> ...

What clarification do you require ?

Comment 11 cryptopsy 2011-06-23 14:22:27 UTC

(In reply to comment #10)
> (In reply to comment #9)
> 
> > and the ranges are managed with shell (for loops for specific ranges). For
> > automated splitting, there is still a problem - the error code is reported
> > incorrectly. 
> 
>  Which error code ?
> 
> 
> > ...would produce;
> > 'Requested FirstPage is greater than the number of pages in the file: 12
> >    No pages will be processed (FirstPage > LastPage).'
> > 
> > This is wrong because
> > 1)  FirstPage > LastPage is wrong, as proven by specifying FirstPage=LastPage
> > in the range of the document.
> > 2) The error code for a succesful split is also 0. 
> 
> There is a limit to how much granularity is available in error reporting. Just
> because this particular extraction failed does not mean that the whole job
> failed. 
> 
> > The documentation has something to say, could someone clarify this?
> > # elinks /usr/share/doc/ghostscript-gpl-8.71-r6/html/Use.htm
> > ...
> >    Note however that the one page per file feature is not supported by all     
> >    devices; in particular it does not work with document-oriented output       
> >    devices like pdfwrite and pswrite. See the -dFirstPage and -dLastPage       
> >    switches below for a way to extract single pdf pages. 
> > ...
> 
> What clarification do you require ?

Spoke about it on IRC:
< cryptopsy> why not implement a CONTINUE or KEEPGOING feature if they don't want to fail the 100 page job?
< kens> cryptopsy : that's a lot of effort, especially to write something in PostScript.
< kens> But the PDF interpreter expert is alexcher, you can try and persuade him.

Comment 12 cryptopsy 2011-06-23 14:42:37 UTC

 gs -q -c "(11p1.pdf) (r) file runpdfbegin pdfpagecount = quit" is an acceptable way to do what I wanted; I retract my statements.
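A small sketch of how that page count could drive an automated split instead of the exit code (Python, illustration only). The command is the one quoted above; the helper names are hypothetical, and the assumption is that gs prints the count as a bare integer on its own line.

```python
def ps_string(text):
    """Quote text as a PostScript string literal, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def page_count_command(in_path):
    """Build the gs argument list that prints the number of pages in a PDF."""
    program = f"{ps_string(in_path)} (r) file runpdfbegin pdfpagecount = quit"
    return ["gs", "-q", "-c", program]


def parse_page_count(stdout):
    """Return the first line of gs output that is a bare non-negative integer."""
    for line in stdout.splitlines():
        line = line.strip()
        if line.isdigit():
            return int(line)
    raise ValueError("no page count found in Ghostscript output")


def pages_to_split(requested, page_count):
    """Keep only the requested pages that actually exist in the document."""
    return [p for p in requested if 1 <= p <= page_count]
```

The output of the command (for instance the stdout of subprocess.run with capture_output=True and text=True) goes to parse_page_count, and the split loop then runs only over pages_to_split(...), so a request for page 12 of a 10-page file never reaches gs.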

Comment 13 Ken Sharp 2013-06-11 09:43:07 UTC

I'm still unconvinced by this one.

My major quibble is that it is not possible to (easily) extract pages from some kinds of input, and users simply aren't going to understand that the nice -sPages="" command which worked so well with PDF input doesn't work at all with some random PostScript (or PCL).

While it's possible to use the various methods described here (EndPage, convert to PDF first, etc.) they are in my opinion either unsuitable for novice users, or too complex (and too much feature creep) for inclusion in standard Ghostscript.

The feature might well be incorporated into a GSView-like application where multiple passes and format conversions can be hidden from the user.

So I'm closing it as wontfix.

Comment 14 Donatas Olsevičius 2015-06-15 05:13:36 UTC

(In reply to Ray Johnston from comment #7)
>    [ 1 5 9 2 6 10 3 7 4 8 ] { dup dopdfpages } forall % arbitrary order

This wasn't easy for me to understand (I had never even touched gs before), but in case someone else has the same problem, here's a complete command:

  gs -sDEVICE=pdfwrite -o "output.pdf" -c "(input.pdf) (r) file runpdfbegin [ 1 3 5 ] { dup dopdfpages } forall runpdfend"

where "1 3 5" are page numbers.