Bug 694853 - PDF/A generation does not fix all errors
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: master
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2013-12-17 19:13 UTC by Florian Breitwieser
Modified: 2013-12-20 18:02 UTC

Attachments
original PDF file (not modified) (1.19 MB, application/pdf)
2013-12-17 19:13 UTC, Florian Breitwieser
File written by gs PDF/A conversion (1.24 MB, application/pdf)
2013-12-17 19:14 UTC, Florian Breitwieser
pdfbox preflight errors (txt format) (18.54 KB, text/plain)
2013-12-17 19:15 UTC, Florian Breitwieser
pdfbox preflight errors (XML format) (2.30 KB, text/xml)
2013-12-17 19:16 UTC, Florian Breitwieser
PDF/A definition ps used in gs process (864 bytes, application/postscript)
2013-12-17 19:18 UTC, Florian Breitwieser
File written by gs PDF/A conversion (updated) (1.23 MB, application/pdf)
2013-12-18 05:20 UTC, Florian Breitwieser
pdfbox preflight errors (txt format) (410 bytes, text/plain)
2013-12-18 05:21 UTC, Florian Breitwieser
pdfbox preflight errors (XML format) (921 bytes, text/xml)
2013-12-18 05:22 UTC, Florian Breitwieser
PDF/A definition ps used in gs process (1.15 KB, application/postscript)
2013-12-18 05:22 UTC, Florian Breitwieser
File written by gs PDF/A conversion w/ -dCompressPages=false (3.59 MB, application/pdf)
2013-12-18 11:48 UTC, Florian Breitwieser
gs PDF/A output PDF (really uncompressed) (3.59 MB, application/pdf)
2013-12-18 12:58 UTC, Florian Breitwieser

Description Florian Breitwieser 2013-12-17 19:13:22 UTC
Created attachment 10473 [details]
original PDF file (not modified)

Dear Ghostscript-Team,

I am trying to make two PDF files PDF/A-1b compliant using gs. The first file, created with Acrobat Distiller 8.1.0, has only 3 problems after the conversion, all concerning EOL/endobj:

1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword
1.2.2 : Body Syntax error, Expected 'endstream' keyword but found 'endobj'
1.0 : Syntax error, Object (325:0) at offset 908286 does not end with 'endobj'.

The second file, created with Acrobat Distiller 10.0.0, is really awful: it has a very strange structure to begin with, with an xref table at both the beginning and the end, and the trailer following the xref at the beginning. It produces 252 errors in 11 categories. Attached are the PDF files (file-before.pdf, file-after-gs.pdf) and the preflight error messages (in XML and txt format). Visually, file-after-gs.pdf looks fine in the PDF viewer _and_ the text viewer.

I would be happy if you can help me debug and fix this. If necessary, I can hack preflight to return better error messages with the object context.

Best,
Florian

------
Used version: gs 9.10
Command line: gs -dPDFA -dNOOUTERSAVE -dUseCIEColor -dCompatibilityLevel=1.4 -sDEVICE=pdfwrite -sProcessColorModel=DeviceRGB -sPDFACompatibilityPolicy=1 -o pdfa.pdf pdfa-def.ps nopdfa.pdf
Comment 1 Florian Breitwieser 2013-12-17 19:14:53 UTC
Created attachment 10474 [details]
File written by gs PDF/A conversion
Comment 2 Florian Breitwieser 2013-12-17 19:15:34 UTC
Created attachment 10475 [details]
pdfbox preflight errors (txt format)
Comment 3 Florian Breitwieser 2013-12-17 19:16:24 UTC
Created attachment 10476 [details]
pdfbox preflight errors (XML format)
Comment 4 Florian Breitwieser 2013-12-17 19:18:37 UTC
Created attachment 10477 [details]
PDF/A definition ps used in gs process
Comment 5 Ken Sharp 2013-12-17 23:21:12 UTC
Ghostscript does not 'fix' errors in PDF files. When you create a PDF file as output it is a completely new PDF file; it does not inherit any structure from the original file.

The original goal of pdfwrite is the same as Adobe Acrobat Distiller: to produce a PDF file from PostScript input, where the PDF file should have the same visual appearance as the input. We do now also produce PDF files from any input source, including PDF, but the mechanism is exactly the same; the input is interpreted to produce a series of marking operations, which are then processed to create the output, in this case a PDF file. Again the goal is the same visual appearance.

So any errors in the output PDF file have nothing to do with the input.

It's been a long time since any preflight tool complained about the pdfwrite output. What tool are you using for this ?

There is no object 325 with a generation number of 0 in the file, the xref only has 304 entries, there is no object in the xref at an offset of 908286. Looking at offset 0xDDBFE we are in the middle of a binary stream. I have no clue as to what this error is trying to tell me, it doesn't look valid.
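For anyone wanting to repeat this kind of check, here is a minimal sketch of looking at the bytes around a reported offset. The helper name `context_at` and the in-memory sample are assumptions made for illustration; with a real file you would read its bytes first. Preflight reports offsets in decimal, while hex editors usually show hexadecimal, hence the conversion.

```python
def context_at(data: bytes, offset: int, radius: int = 32) -> bytes:
    """Return the bytes surrounding `offset`, clamped to the buffer."""
    start = max(0, offset - radius)
    end = min(len(data), offset + radius)
    return data[start:end]


# The offset from the preflight message, converted for a hex editor.
print(hex(908286))  # 0xddbfe

# Tiny in-memory stand-in for a PDF file (hypothetical content),
# so that the sketch runs without any attachment from this report.
sample = b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
offset = sample.index(b"endstream")
print(context_at(sample, offset, radius=8))
```

If the bytes printed around the offset are not the start of an `N G obj` header, the validator is probably pointing into stream data rather than at an object.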

The EOL problem looks genuine and I'll fix that.
Comment 6 Ken Sharp 2013-12-17 23:43:08 UTC
I am unable to reproduce the file you have supplied; it does not appear to be a PDF/A file. I'm somewhat surprised that your preflight tool didn't mention this.

Checking the PDF/A output I create here, I see that the file *is* PDF/A and all the 'endstream' keywords are preceded by EOL. Note that for a file which is not a PDF/A file it is perfectly valid for the endstream not to be preceded by an EOL.
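A rough way to run the same check on an uncompressed file is sketched below. It is a heuristic, not a validator: it assumes the bytes `endstream` never occur inside stream data, which a real checker would rule out by honouring each stream's /Length. The function name is made up for this example.

```python
import re
from typing import List


def endstreams_without_eol(data: bytes) -> List[int]:
    """Return offsets of `endstream` keywords not directly preceded by CR or LF."""
    offsets = []
    for match in re.finditer(rb"endstream", data):
        pos = match.start()
        # Slicing keeps the comparison in bytes and is safe at pos == 0.
        if data[pos - 1:pos] not in (b"\n", b"\r"):
            offsets.append(pos)
    return offsets


with_eol = b"stream\nabc\nendstream\nendobj\n"
without_eol = b"stream\nabcendstream\nendobj\n"
print(endstreams_without_eol(with_eol))     # []
print(endstreams_without_eol(without_eol))  # [10]
```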

You have also said that the version is 'master' yet the file says it was created by Ghostscript 9.10, which is most certainly not the master version.
Comment 7 Florian Breitwieser 2013-12-18 05:18:29 UTC
Sorry, it seems it was too late last night. I should have checked the PDF itself. The main errors I made were using -sPDFACompatibilityPolicy=1 (instead of -dPDFACompatibilityPolicy=1), and then ignoring the warning 'not permitted in PDF/A, reverting to normal PDF output'. Thus the PDF was not rewritten as PDF/A.
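The difference matters because -d defines the name with a PostScript token as its value (here the integer 1), while -s defines it as a string. Below is a small sketch of a pre-flight check on the argument list that would have caught this. The set of integer-valued parameters is an assumption limited to the two switches discussed in this report, and the function name is hypothetical.

```python
# Parameters from this report that expect an integer, so they need -d, not -s.
# This set is illustrative only; it is not a complete list of such switches.
INTEGER_PARAMS = {"PDFA", "PDFACompatibilityPolicy"}


def misquoted_switches(args):
    """Return the switches that pass a known integer parameter with -s."""
    flagged = []
    for arg in args:
        if arg.startswith("-s") and "=" in arg:
            name = arg[2:].split("=", 1)[0]
            if name in INTEGER_PARAMS:
                flagged.append(arg)
    return flagged


original = ["gs", "-dPDFA", "-sDEVICE=pdfwrite",
            "-sPDFACompatibilityPolicy=1", "-o", "pdfa.pdf"]
print(misquoted_switches(original))  # ['-sPDFACompatibilityPolicy=1']
```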

I now retested w/ the correct command arguments and latest code from the git repo. I use Apache pdfbox preflight [1] for validation, and also pdf-tools.com [2]. Validates at pdf-tools.com! Great. pdfbox preflight still shows two errors (which are at the same position), which I think occur when it checks the xref table references. 

1.2.1 : Body Syntax error, Single space expected 
1.0 : Syntax error, Error: Expected a long type

I updated the files to reflect this update.

Best,
Florian

-----------------------------------------------
Related:

In the template PDFA_def.ps, could you change the line that determines /N? The current one

[{icc_PDFA} <</N systemdict /ProcessColorModel get /DeviceGray eq {1} {4} ifelse >> /PUT pdfmark

has the number 4, which has to be manually changed to 3 when using DeviceRGB (which is not documented). One possibility is [3]:

[{icc_PDFA} <</N systemdict /ProcessColorModel get /DeviceGray eq {1} {systemdict /ProcessColorModel get /DeviceRGB eq {3} {4} ifelse} ifelse >> /PUT pdfmark
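For readers less used to PostScript's postfix ifelse, the selection in the proposed line can be restated as the following sketch. The function name is made up; the mapping simply mirrors the PostScript above, where anything other than DeviceGray or DeviceRGB falls through to 4.

```python
def icc_component_count(process_color_model: str) -> int:
    """Mirror the nested ifelse: 1 for DeviceGray, 3 for DeviceRGB, otherwise 4."""
    if process_color_model == "DeviceGray":
        return 1
    if process_color_model == "DeviceRGB":
        return 3
    return 4


for model in ("DeviceGray", "DeviceRGB", "DeviceCMYK"):
    print(model, icc_component_count(model))
```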


[1] http://pdfbox.apache.org/
[2] http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx
[3] http://superuser.com/a/578605/118347
Comment 8 Florian Breitwieser 2013-12-18 05:20:49 UTC
Created attachment 10480 [details]
File written by gs PDF/A conversion (updated)
Comment 9 Florian Breitwieser 2013-12-18 05:21:29 UTC
Created attachment 10481 [details]
pdfbox preflight errors (txt format)
Comment 10 Florian Breitwieser 2013-12-18 05:22:01 UTC
Created attachment 10482 [details]
pdfbox preflight errors (XML format)
Comment 11 Florian Breitwieser 2013-12-18 05:22:30 UTC
Created attachment 10483 [details]
PDF/A definition ps used in gs process
Comment 12 Ken Sharp 2013-12-18 06:06:26 UTC
(In reply to comment #7)

> -sPDFACompatibilityPolicy=1 (instead of ->d<PDFACompatibilityPolicy), and
> then not regarding the warning 'not permitted in PDF/A, reverting to normal
> PDF output'. Thus the PDF was not rewritten as PDF/A.

Yep, that would explain it...


> pdf-tools.com [2]. Validates at pdf-tools.com! Great. pdfbox preflight still
> shows two errors (which are at the same position), which I think occur when
> it checks the xref table references. 
> 
> 1.2.1 : Body Syntax error, Single space expected 
> 1.0 : Syntax error, Error: Expected a long type

I have no clue what it's complaining about here. To the best of my knowledge extra white space is generally permitted in PDF files, and numbers are not typed, they are just numbers. I don't see any obvious problems with the xref table either.

That doesn't mean there aren't problems, but I can't see what they are, and the decompressed file is 5 MB. Even if I wanted to, I couldn't hand check that amount. Can you give us any further information ?

You might try using -dCompressPages=false which will write the file uncompressed. It may be easier to see what's wrong that way, if you can figure out a file offset where the error occurs.
Comment 13 Florian Breitwieser 2013-12-18 11:48:26 UTC
Created attachment 10486 [details]
File written by gs PDF/A conversion w/ -dCompressPages=false

> You might try using -dCompressPages=false which will write the file
> uncompressed, it may be easier to see what's wrong that way, if you can
> figure out a file offset where the error occurs.

That is a handy flag. It complains:

1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 3719169 but found '10'
1.2.2 : Body Syntax error, Expected 'endstream' keyword at offset 3719176 but found 'endobj'
1.0 : Syntax error, Object (185:0) at offset 3714247 does not end with 'endobj'.

That's around lines 131154 and 131180 in the again updated file (object 185 0). I cannot see anything special or wrong there, but my eyes are rather virgin at looking at PDF internals.
Comment 14 Ken Sharp 2013-12-18 12:33:26 UTC
(In reply to comment #13)


> That's around lines 131154 and 131180 in the again updated file (object 185
> 0). I cannot see anything special or wrong there, but my eyes are rather
> virgin at looking at PDF internals.

Can you attach the uncompressed file please ? So I can see what's at that file position :-)

I'll look at it in the morning, it's getting late now.
Comment 15 Florian Breitwieser 2013-12-18 12:58:42 UTC
Created attachment 10488 [details]
gs PDF/A output PDF (really uncompressed)

> Can you attach the uncompressed file please ? So I can see what's at that
> file position :-)

Oh, sorry. Now it is really uncompressed (options -dCompressEntireFile=false -dCompressPages=false -dCompressFonts=false), and the errors are:

1.2.1 : Body Syntax error, Single space expected [offset=3519216; key=3519216; line=/R167 cs; object=COSObject{169, 0}]
1.0 : Syntax error, Error: Expected a long type, actual='/R167'

It looks a bit different this time - the position is in the middle of an object (128461). The reason it looks there, I think, is that two entries in the xref table share the same offset:

0003519216 00000 n 
0003519216 00000 n
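Duplicate offsets like these can be found mechanically. The sketch below assumes a classic (non-stream) xref section whose entry lines have already been split out, and a single subsection starting at `first_object`. The function name and the object numbers in the sample are hypothetical; only the offset 3519216 comes from this report.

```python
import re
from collections import defaultdict

# One xref entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
XREF_ENTRY = re.compile(rb"^(\d{10}) (\d{5}) ([nf])\s*$")


def duplicate_offsets(xref_lines, first_object=0):
    """Map each offset shared by several in-use entries to their object numbers."""
    by_offset = defaultdict(list)
    for index, line in enumerate(xref_lines):
        match = XREF_ENTRY.match(line)
        if match and match.group(3) == b"n":
            by_offset[int(match.group(1))].append(first_object + index)
    return {off: objs for off, objs in by_offset.items() if len(objs) > 1}


entries = [
    b"0000000000 65535 f ",
    b"0000000015 00000 n ",
    b"0003519216 00000 n ",
    b"0003519216 00000 n ",
]
print(duplicate_offsets(entries))  # {3519216: [2, 3]}
```

Two in-use objects cannot start at the same byte, so any non-empty result points at a writer bug rather than at a validator quirk.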
Comment 16 Ken Sharp 2013-12-19 01:45:02 UTC
This seems to have been caused by the work to prevent duplicate shadings, so wouldn't have affected any released code (this is a new feature). The commit 81b246414c4624cf476793c2590201de408ea33a should resolve it.

Please note that you should not use -dUseCIEColor with the current pdfwrite or ps2write devices. It's no longer necessary, potentially produces larger, slower output and doesn't even produce better quality. Instead set ColorConversionStrategy (you also should not set ProcessColorModel in this case).

Also CompressEntireFile only has any effect with ps2write, not pdfwrite.
Comment 17 Florian Breitwieser 2013-12-19 06:14:26 UTC
Thanks a lot, this is the first time that I see

"The file test-uncompressed.pdf is a valid PDF/A-1b file."

from the pdfbox preflight. Feels good. 

However this did use -dUseCIEColor. With -dColorConversionStrategy=/sRGB, I get a lot of error messages in preflight on the generated PDF, as it seems the colors are still CMYK:

"2.4.2 : Invalid Color space, The operator "k" can't be used with RGB Profile"

I use eciRGB_v2.icc as ICCProfile. When I use /CMYK (just for testing), gs complains that ColorConversionStrategy is incompatible with ProcessColorModel.
Comment 18 Ken Sharp 2013-12-19 06:24:39 UTC
(In reply to comment #17)

> However this did use -dUseCIEColor. With -dColorConversionStrategy=/sRGB,

Use RGB if you want RGB output, not sRGB which is a calibrated ICC space.
Comment 19 Florian Breitwieser 2013-12-19 07:23:01 UTC
> Use RGB if you want RGB output, not sRGB which is a calibrated ICC space.

Also with /RGB I have color space issues:

2.3.2 : Unexpected key in Graphic object definition, The ColorSpace is unknown
2.4.2 : Invalid Color space, The operator "k" can't be used with RGB Profile
2.3.2 : Unexpected key in Graphic object definition, The ColorSpace is unknown
2.4.2 : Invalid Color space, The operator "k" can't be used with RGB Profile
...


Command line:

gs -dPDFA=1 -dPDFACompatibilityPolicy=1 \
  -dBATCH -dNOPAUSE -dNOOUTERSAVE \
  -dCompressPages=false -dCompressFonts=false \
  -dColorConversionStrategy=/RGB -dColorImageResolution=300  \
  -sDEVICE=pdfwrite -sOutputFile=$OUTPUTPDF $PDFA_DEF $INPUTPDF
Comment 20 Ken Sharp 2013-12-19 07:31:24 UTC
(In reply to comment #19)
> > Use RGB if you want RGB output, not sRGB which is a calibrated ICC space.
> 
> Also with /RGB I have color space issues:


That shouldn't happen. I suggest you open a new report though, as this is a different problem.
Comment 21 Florian Breitwieser 2013-12-20 18:02:18 UTC
> That shouldn't happen. I suggest you  open a new report though, as this is a
> different problem.

Ok, thanks so far for your much appreciated help here.