I have successfully used GS to extract first-page images from the majority of my 9000+ PDF files. A small percentage fail with the following error:

--------------------------------------------------------------------------------
gswin32c.exe -dBATCH -dMaxBitmap=300000000 -dNOPAUSE -dSAFER -sDEVICE=jpeg -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dFirstPage=1 -dLastPage=1 -sOutputFile=00010103.jpg 00010103.pdf
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
*** C stack overflow. Quiting...
--------------------------------------------------------------------------------

I was hoping to attach "00010103.pdf" to this bug, but I'm not sure if that is possible?
Created attachment 4980 [details] Offending PDF File - 00010103.pdf
I thought this might have been a colour problem, because I had a lot of these types of problems when I reworked that area. However, it actually appears to be a JBIG2 decode problem. The PDF file is one where each page is a JBIG2 image, and text has been (presumably) OCR'ed and laid on top with a text rendering mode which draws nothing, resulting in apparently searchable text in an image document. There seems to be some kind of recursion going on in the stream handling, which goes out of control, leading to the C stack overflow. I'm afraid I'm not familiar enough with this code to say more. It's definitely a bug, though. My first thought was the strange DecodeParms: /DecodeParms<</__pdfnet_jbig2 true>> but this doesn't seem to be an issue; I tried removing them with no effect. Most likely it's some characteristic of the JBIG2 encoding which JasPer doesn't like. FWIW, the offending image is Im0; this is the first marking object on page 1. A breakpoint on s_jbig2decode_process works pretty well. Using the Luratech decoder instead of JasPer works as expected, so it does look pretty much like this is a JasPer problem. Assigning to Ralph as the owner.
The stack overflow bug is quite easy to fix. The function jbig2_build_huffman_table() allocates 256K on the stack, while Ghostscript allocates only 128K for the stack. Changing jbig2_build_huffman_table() as follows resolves the stack overflow. Production-quality code should, of course, use the Ghostscript heap instead of the C heap, and free the block:

    Jbig2HuffmanTable *
    jbig2_build_huffman_table (Jbig2Ctx *ctx, const Jbig2HuffmanParams *params)
    {
        int *LENCOUNT = malloc(1 << LOG_TABLE_SIZE_MAX);
        ...
    }

There is another issue with the file: some of the characters are placed in the wrong positions.
I forgot to multiply by sizeof(int):

    int *LENCOUNT = malloc((1 << LOG_TABLE_SIZE_MAX) * sizeof(int));

but this doesn't help with the misplaced characters.
Alex's analysis is correct. In fact, the histogram only needs 256 elements. This is fixed upstream. See
http://git.ghostscript.com/?p=jbig2dec;a=commitdiff;h=63e0436a711c59f7fae6cfd721b90428ae19a7b3
for the dynamic allocation fix, and
http://git.ghostscript.com/?p=jbig2dec;a=commitdiff;h=f1d00697525dd2d7a5f63f96e01ad0d99e673b13
for the size correction. We still don't decode the file correctly, but this at least corrects the stack overflow.
This appears to be a JBIG2 issue, hence assigning to Masaki.
I am not seeing this issue with the current code (svn: http://svn.ghostscript.com/ghostscript/trunk/gs).