Bug 699216

Summary: JPEG passthrough appears to truncate JPEGs in some cases
Product: Ghostscript Reporter: James R Barlow <jim>
Component: PDF Writer Assignee: Ken Sharp <ken.sharp>
Status: RESOLVED FIXED    
Severity: normal CC: piotr
Priority: P4    
Version: 9.23   
Hardware: Macintosh   
OS: MacOS X   
Customer: Word Size: ---
Attachments: in.pdf
Another test case -- pdf with JPEG image

Description James R Barlow 2018-04-12 15:01:34 UTC
Created attachment 15010
in.pdf

When the attached file in.pdf is passed through Ghostscript 9.23 with this command line:

gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite  -o out.pdf in.pdf

qpdf reports that the output file is damaged. (The input file has no issues.)

$ qpdf --check out.pdf
checking out.pdf
PDF Version: 1.5
File is not encrypted
File is not linearized
WARNING: out.pdf (offset 304198): error decoding stream data for object 12 0: invalid jpeg data reading from buffer
WARNING: out.pdf (offset 304198): stream will be re-processed without filtering to avoid data loss
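
For checking many files, the warnings above can be pulled out of the qpdf output mechanically. The following is a minimal sketch, not part of qpdf: the regular expression is inferred only from the two WARNING lines quoted here, and the helper names parse_qpdf_warnings and damaged_objects are hypothetical. Other qpdf versions may word their messages differently.

```python
import re

# Pattern inferred from the two WARNING lines quoted in this report;
# it is an assumption, not a documented qpdf output format.
WARNING_RE = re.compile(
    r"^WARNING: (?P<file>.+?) \(offset (?P<offset>\d+)\): (?P<message>.*)$"
)


def parse_qpdf_warnings(output):
    """Return one dict per WARNING line found in `qpdf --check` output."""
    warnings = []
    for line in output.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            warnings.append({
                "file": match.group("file"),
                "offset": int(match.group("offset")),
                "message": match.group("message"),
            })
    return warnings


def damaged_objects(warnings):
    """Collect the 'N G' object ids named in stream-decoding warnings."""
    ids = set()
    for warning in warnings:
        found = re.search(r"for object (\d+ \d+)", warning["message"])
        if found:
            ids.add(found.group(1))
    return sorted(ids)
```

Feeding the captured text of "qpdf --check out.pdf" to parse_qpdf_warnings and then to damaged_objects gives the object ids whose streams failed to decode.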

To confirm, I extracted the JPEG with pdfimages -j, and used jpeginfo -c to check the JPEG separately. jpeginfo reports:

_img-000.jpg 1232 x 1728 24bit JFIF  N   42469  Premature end of JPEG file  [WARNING]

I extracted the JPEG from in.pdf as well. jpeginfo reports no error and shows the file length as 42471 bytes instead of 42469.  It appears Ghostscript omitted two bytes from the end of the JPEG.
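
The length comparison can be reproduced without jpeginfo. The sketch below assumes the two extracted JPEGs are already loaded as bytes; the helper names are hypothetical. Whether the two missing bytes are the JPEG end-of-image marker (FF D9) is an assumption: the report only gives the lengths and the "Premature end of JPEG file" message, which is consistent with a missing marker but does not prove it.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_envelope_ok(data):
    """True if the data starts with SOI and ends with EOI.

    This is a strict check: files with trailing bytes after EOI
    are reported as not ok, although many decoders accept them.
    """
    return len(data) >= 4 and data.startswith(SOI) and data.endswith(EOI)


def missing_tail(original, copy):
    """Return the bytes dropped from the end of `original`, or None.

    A result is returned only when `copy` is a strict prefix of
    `original`, i.e. the copy was truncated and not otherwise altered.
    """
    if len(copy) < len(original) and original.startswith(copy):
        return original[len(copy):]
    return None
```

If missing_tail returns b"\xff\xd9" for the JPEG from in.pdf and the one from out.pdf, the truncation removed exactly the end-of-image marker.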

out.pdf still opens in PDF viewers and nothing is obviously wrong with it, compared to in.pdf.

in.pdf sets "/Filter [ /FlateDecode /DCTDecode ]", which is unusual and likely the cause of the issue. in.pdf was generated by an HP Officejet 8620 scanner.
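
For readers unfamiliar with chained filters: a PDF reader applies the filters in array order when decoding, so here the stored bytes are first inflated and the result is the JPEG data. A minimal sketch of undoing the outer layer, assuming the raw stream bytes have already been extracted and that no /DecodeParms predictor is in use; undo_flate_layer is a hypothetical helper, not a Ghostscript or qpdf function.

```python
import zlib


def undo_flate_layer(raw_stream, filters):
    """Strip leading /FlateDecode filters from a stream.

    `filters` is the /Filter array as a list of names, in the order
    given in the PDF. Returns the partially decoded bytes and the
    filters still to be applied (e.g. ["/DCTDecode"]).
    Assumes plain zlib data with no predictor.
    """
    remaining = list(filters)
    data = raw_stream
    while remaining and remaining[0] == "/FlateDecode":
        data = zlib.decompress(data)
        remaining.pop(0)
    return data, remaining
```

With filters ["/FlateDecode", "/DCTDecode"], the returned bytes are the JPEG itself, which is the data a JPEG passthrough would have to copy in full.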

"mupdf clean" cannot detect or correct the error.

This is a regression. Previous versions of Ghostscript processed this file without damaging it.
Comment 1 James R Barlow 2018-04-17 13:48:21 UTC
I found another example of this issue in a file I cannot share. It had /Filter /DCTDecode and a complex /ColorSpace /Separation with CMYK.

What the two files have in common is that both JPEGs are used with image/stencil masks.
Comment 2 Piotr Strzelczyk 2018-04-18 06:25:56 UTC
Created attachment 15045
Another test case -- pdf with JPEG image

I also spotted the same problem -- the attached PDF (with a DCT stream, /Length 301160), after processing by the new Ghostscript, results in a PDF with a truncated DCT stream, i.e. /Length 301158. Some viewers accept the generated PDF, but Adobe Reader fails.
Comment 3 James R Barlow 2018-04-18 16:01:14 UTC
When -dPassThroughJPEGImages=false is added to arguments, the issue does not occur.
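
For scripted use, the workaround can be wrapped as below. This is a sketch only: the options are the ones from the command line in the description plus -dPassThroughJPEGImages=false, and the function names and the assumption that the executable is called "gs" are hypothetical choices for illustration.

```python
import subprocess


def pdfwrite_command(src, dst, passthrough_jpeg=True, gs="gs"):
    """Build the pdfwrite command line used in this report.

    With passthrough_jpeg=False, -dPassThroughJPEGImages=false is
    added, so JPEG streams are not copied through unchanged.
    """
    args = [gs, "-dQUIET", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite"]
    if not passthrough_jpeg:
        args.append("-dPassThroughJPEGImages=false")
    args += ["-o", dst, src]
    return args


def run_pdfwrite(src, dst, passthrough_jpeg=True):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(pdfwrite_command(src, dst, passthrough_jpeg), check=True)
```

Calling run_pdfwrite("in.pdf", "out.pdf", passthrough_jpeg=False) corresponds to the workaround above. Disabling passthrough presumably means the images are decoded and re-encoded, so it is a workaround rather than a fix.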
Comment 4 Ken Sharp 2018-05-22 13:49:14 UTC
(In reply to Piotr Strzelczyk from comment #2)
> Created attachment 15045 [details]
> Another test case -- pdf with JPEG image
> 
> I also spotted the same problem -- attached PDF (with DCT stream, /Length
> 301160), after processing by new Ghostscript results in PDF with truncated
> DCT stream i.e. /Length 301158. Some viewers accept generated PDF, but Adobe
> Reader fails.

This must be some difference between Acrobat Pro and Reader, or some recent change. My version of Acrobat Pro is entirely happy with the original files.

In any event, fixed in commit b61071c9411c3f6aa0dd594da2c5a20ff4ecd914