Bug 692588 - Some watermarked PDF files are rasterized when converting to PDF/A - #1524
Status: RESOLVED INVALID
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.04
Hardware: PC Windows 7
Importance: P4 minor
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-10-13 12:18 UTC by jritmeijer
Modified: 2011-10-13 16:09 UTC

See Also:
Customer:
Word Size: ---


Attachments
A simple PDF file that contains an image. The source of the image is a JPG file (84.94 KB, application/pdf)
2011-10-13 12:19 UTC, jritmeijer
The same file after converting to PDF/A. Note that it has been rasterized (220.71 KB, application/pdf)
2011-10-13 12:19 UTC, jritmeijer
A PDF file of a form before conversion. Very basic, no images (62.12 KB, application/pdf)
2011-10-13 12:20 UTC, jritmeijer
The same form after conversion to PDF/A. It is not clear to me why this is rasterized. (392.48 KB, application/pdf)
2011-10-13 12:20 UTC, jritmeijer

Description jritmeijer 2011-10-13 12:18:44 UTC
Perhaps not necessarily a bug, more a question.

I am working on a solution that uses Ghostscript to post-process PDF files in order to convert them to PDF/A. This works quite well, but occasionally the resulting PDF/A file is rasterized.

I assume this is because the source file contains some kind of content that cannot be expressed in PDF/A unless an image is used.

I have done some tests to get a feeling for what causes a page to be rasterized, but I'd like to get a definitive list so I can document it.

No rasterization
- Text Rotation
- Overlapped text

Rasterization
- Text (or any object) using transparency.
- The addition of any image, see attached file.

So my question is: What determines if a page is rasterized when converting to PDF/A?

Attached are the following files:
* SourcePDF-ImageWatermarked.pdf: A simple PDF file that contains an image. The source of the image is a JPG file.
* PDFA-ImageWatermarked.pdf: The same file after converting to PDF/A. Note that it has been rasterized.
* Source-Form.PDF: A PDF file of a form before conversion. Very basic, no images
* PDFA-Form.PDF: The same form after conversion to PDF/A. It is not clear to me why this is rasterized.

My PDF/A GS Command line options and definition file are as per bug #692587
Comment 1 jritmeijer 2011-10-13 12:19:34 UTC
Created attachment 7992
A simple PDF file that contains an image. The source of the image is a JPG file
Comment 2 jritmeijer 2011-10-13 12:19:57 UTC
Created attachment 7993
The same file after converting to PDF/A. Note that it has been rasterized
Comment 3 jritmeijer 2011-10-13 12:20:20 UTC
Created attachment 7994
A PDF file of a form before conversion. Very basic, no images
Comment 4 jritmeijer 2011-10-13 12:20:43 UTC
Created attachment 7995
The same form after conversion to PDF/A. It is not clear to me why this is rasterized.
Comment 5 Ken Sharp 2011-10-13 12:31:46 UTC
(In reply to comment #0)
> Rasterization
> - Text (or any object) using transparency.

PDF/A-1 does not support transparency *at all*. You can either accept the current approach, which produces an opaque representation, or use -DNOTRANSPARENCY, which will ignore all transparent operations, but obviously the output will be incorrect.


> - The addition of any image, see attached file.

This is certainly not the case. I have many examples from customers creating PDF/A-1 output which contain images and which do not fall back to rendering the entire document.

 
> So my question is: What determines if a page is rasterized when converting to
> PDF/A?

If it can't be represented in PDF/A-1 format. Modulo bugs of course, but in general we need to take very specific action in pdfwrite to render any content, so it's unlikely this is accidental.
Comment 6 jritmeijer 2011-10-13 13:00:11 UTC
(In reply to comment #5)

Thanks for getting back so quickly.

Other files that contain images do convert to PDF/A just fine. How can I find out why this one (see attachment) does not? Interestingly, when I specify "-DNOTRANSPARENCY" the problem goes away and the image is still visible.

This same switch solves the problem with the other file as well (Source-Form.pdf) and the resulting PDF/A file looks identical to the source file. 

So if there is no list readily available of elements that cause a page to be rasterised, is there some kind of diagnostics output I can enable so I can run tests on documents that don't behave as expected?

Thanks again for your assistance, this is very helpful.
Comment 7 Ken Sharp 2011-10-13 13:21:17 UTC
(In reply to comment #6)

> Other files that contain images do convert to PDF/A just fine. How can I find
> out why this one (see attachment) does not? Interestingly when I specify
> "-DNOTRANSPARENCY" the problem goes away and the image is still visible.

That means the input contains transparency. It may not do anything useful, but the code can't tell that. The file contains transparency, so we assume it will have some effect and treat it accordingly.

NB I haven't actually looked at the file, but this is what must be happening.
 

> This same switch solves the problem with the other file as well
> (Source-Form.pdf) and the resulting PDF/A file looks identical to the source
> file. 

Same problem then.

 
> So if there is no list readily available of elements that cause a page to be
> rasterised, 

There are no elements which will cause a page to be rasterised, but the presence of transparency will, because the specification doesn't permit transparency.

> is there some kind of diagnostics output I can enable so I can
> run tests on documents that don't behave as expected.

Hmm, I don't think so, no. Transparency in PDF documents is unfortunately complicated: the elements can appear in all sorts of places and there is no overriding 'this document contains transparency' flag in the PDF file.

Ah, the pdf_info.ps file supplied as part of Ghostscript will tell you if a given page uses transparency. This tool reports that both your files use transparency.
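
As a rough illustration of why transparency can hide "in all sorts of places", here is a minimal Python sketch that scans the raw bytes of a PDF for a few dictionary keys commonly associated with transparency (soft masks, non-opaque alpha, non-default blend modes, transparency groups). This is not how Ghostscript or pdf_info.ps detects transparency; the function name and the marker list are assumptions made for illustration. It cannot see anything stored inside compressed streams or object streams, so a clean result proves nothing, and pdf_info.ps remains the more reliable check.

```python
import re
from typing import List

# Hypothetical marker patterns; chosen for illustration, not taken from Ghostscript.
_SMASK = re.compile(rb"/SMask\s*(?!/None)[^\s]")      # soft mask other than /None
_ALPHA = re.compile(rb"/(?:CA|ca)\s+(\d*\.?\d+)")     # stroke / fill alpha constants
_BLEND = re.compile(rb"/BM\s*/(\w+)")                 # blend mode given as a name
_GROUP = re.compile(rb"/S\s*/Transparency")           # transparency group

_OPAQUE_BLEND_MODES = {b"Normal", b"Compatible"}


def find_transparency_hints(data: bytes) -> List[str]:
    """Return the kinds of transparency-related keys visible in uncompressed PDF bytes."""
    hints = []
    if _SMASK.search(data):
        hints.append("soft mask")
    # An alpha value of 1.0 is fully opaque, so only values below 1.0 count.
    if any(float(m.group(1)) < 1.0 for m in _ALPHA.finditer(data)):
        hints.append("alpha")
    if any(m.group(1) not in _OPAQUE_BLEND_MODES for m in _BLEND.finditer(data)):
        hints.append("blend mode")
    if _GROUP.search(data):
        hints.append("transparency group")
    return hints
```

A transparency group on its own is a typical case of transparency that "may not do anything useful": it is present in the file, so a converter has to assume it matters. To try the sketch on a real file, pass it `open(path, "rb").read()`.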
Comment 8 jritmeijer 2011-10-13 13:25:41 UTC
Awesome, thanks.

BTW, I will be reporting a few more issues in the next couple of days, I have been saving them up till the end of my project. I am fairly sure at least some of them are real bugs :-)
Comment 9 Ken Sharp 2011-10-13 13:28:09 UTC
(In reply to comment #8)
> Awesome, thanks.
> 
> BTW, I will be reporting a few more issues in the next couple of days, I have
> been saving them up till the end of my project. I am fairly sure at least some
> of them are real bugs :-)

May I ask what the purpose of your project is? Are you writing something for in-house conversion, or an academic exercise or something?
Comment 10 jritmeijer 2011-10-13 13:34:08 UTC
Nothing academic, I am afraid. I am evaluating PDF to PDF/A converters for a customer who needs to archive a bunch of forms.
Comment 11 Ray Johnston 2011-10-13 16:09:55 UTC
Note that Ghostscript has a little utility that will tell you if a PDF has
transparency.

gs -- toolbin/pdf_info.ps _____.pdf

where _____.pdf is the file you want information on.
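
For checking many documents, the command above could be wrapped in a small script. The following Python sketch only builds and runs the same command line shown in this comment; the function names, the default executable name `gs`, and the relative script path are assumptions, and the output is returned unparsed because the exact report format is not described in this thread.

```python
import subprocess
from typing import List


def pdf_info_command(pdf_path: str,
                     gs_exe: str = "gs",
                     script: str = "toolbin/pdf_info.ps") -> List[str]:
    """Build the argument list for: gs -- toolbin/pdf_info.ps <file>."""
    return [gs_exe, "--", script, pdf_path]


def run_pdf_info(pdf_path: str, **kwargs: str) -> str:
    """Run pdf_info.ps on one file and return whatever it printed to stdout."""
    result = subprocess.run(
        pdf_info_command(pdf_path, **kwargs),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as text
        check=False,          # leave error handling to the caller
    )
    return result.stdout
```

On Windows the console executable is usually named differently (for example `gswin32c`), so `gs_exe` would need to be adjusted, and `script` must point at wherever pdf_info.ps lives in the local Ghostscript installation.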