Bug 692773

Summary: Increase in size with PDF/A output
Product: Ghostscript Reporter: Marcos H. Woehrmann <marcos.woehrmann>
Component: PDF Writer Assignee: Ken Sharp <ken.sharp>
Status: NOTIFIED INVALID    
Severity: normal CC: dgrillo
Priority: P2    
Version: master   
Hardware: PC   
OS: All   
Customer: 670 Word Size: ---
Attachments: input PDF. Transparency removed from stamp and font embedded
Output pdfa. rotated 90 degrees

Description Marcos H. Woehrmann 2012-01-05 21:17:41 UTC
I haven't tested this, but the customer reports that the attached PDF file increases in size to 11 megs when converted to PDF/A by Ghostscript. Here are some details from the customer's email:

You’ll notice that the input pdf has a watermark on it. I believe there is something wrong with how this watermark is done and is causing the bad interaction with GhostScript.  When I configure the engine to leave out the watermark, the PDFA comes out acceptable.  Was wondering if there was any insight you could give. Thanks very Much!

Here's the command line they are using:

gswin32c -dPDFA -dUseCIEColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=2 -dNOPAUSE -r600 -dBATCH -sOutputFile=out.pdf ocred.pdf
Comment 2 Ray Johnston 2012-01-05 22:47:50 UTC
This is probably due to transparency being used to draw the watermark.

The PDF/A version we currently generate doesn't allow transparency, so the
entire page is "flattened" to an image.

I'm sure Ken will correct me if this is not the cause, and he can comment on
the support for the newer (not yet finalized, AFAIK) PDF/A standard that does
allow transparency (PDF 1.4)
Comment 3 Ken Sharp 2012-01-06 08:24:47 UTC
(In reply to comment #2)
> This is probably due to transparency being used to draw the watermark.

This is precisely the problem. The 'Demonstration licence' text is drawn using text rendering mode 2 (fill then stroke). The graphics state sets both CA and ca to 0.399; these are the 'constant alpha' values for stroke and fill operations respectively. Since both are less than 1, the text is not opaque.
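
As a rough illustration of that check (a sketch only, not anything pdfwrite does internally), the following Python scans the text of an ExtGState dictionary for /CA and /ca entries below 1. The function name and the sample dictionary are hypothetical, and it assumes the dictionary is available as plain, uncompressed text, which is often not the case in real files.

```python
import re

# Matches the stroking (/CA) and non-stroking (/ca) constant alpha entries
# of an ExtGState dictionary supplied as plain, uncompressed text.
ALPHA_RE = re.compile(r"/(CA|ca)\s+([0-9]*\.?[0-9]+)")


def non_opaque_alphas(extgstate_text):
    """Return {key: value} for every alpha entry whose value is below 1."""
    found = {}
    for key, value in ALPHA_RE.findall(extgstate_text):
        if float(value) < 1.0:
            found[key] = float(value)
    return found


# Hypothetical dictionary built from the values quoted in this comment.
sample = "<< /Type /ExtGState /CA 0.399 /ca 0.399 >>"
print(non_opaque_alphas(sample))
```

Any non-empty result means the marks drawn under that graphics state are partially transparent.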

 
> The PDF/A version we currently generate doesn't allow transparency, so the
> entire page is "flattened" to an image.

PDF/A-1 does not permit transparency. Because the watermark covers so much of the page, the whole page is indeed rasterised to an image. Obviously this will also render any text unsearchable, since it will no longer be text.

 
> I'm sure Ken will correct me if this is not the cause, and he can comment on
> the support for the newer (not yet finalized, AFAIK) PDF/A standard that does
> allow transparency (PDF 1.4)

I believe the PDF/A-2 specification is complete. I did try to purchase a copy before Christmas but had some hiccups with the web site; I'll try again now that people should be back at work.

PDF/A-2 permits transparency, but it may not be generally accepted yet, since it is so very new.
Comment 4 Marcos H. Woehrmann 2012-01-12 08:12:09 UTC
The customer has an additional comment/question:

This is a follow up question to bug 692773.   That’s an issue where transparency in a watermark of a pdf converted to PDFA comes out huge.  To get around that, we’re doing our stamping without the transparency.  But when we do that, the stamp seems to be double printed and misaligned after running it through GhostScript and converting it to PDFA. We’re using GS9.04 win32 but we have fixes applied for bugs 692717,  692569 and 692422.
 
 
The commandline we’re using is
 
gswin32c -dPDFA -dUseCIEColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dNOPAUSE -r600 -dBATCH -sOutputFile=out.pdf NYTimes_ocr.pdf
 
Note that when I run the above commandline gs gives many “PDFA Doesn’t allow images with interpolate true” warnings. I don’t know if this relates to the stamp issue we’re seeing. I assumed that this is just a warning telling us a feature was omitted for PDFA.
 
 
Attached is NYTimes_ocr.pdf which is the input and has our altered watermark without transparency.
Out.pdf is the output showing the watermark printed very oddly
Comment 6 Ken Sharp 2012-01-12 09:31:19 UTC
(In reply to comment #4)

> around that, we’re doing our stamping without the transparency.  But when we do
> that, the stamp seems to be double printed and misaligned after running it
> through GhostScript and converting it to PDFA. We’re using GS9.04 win32 but we
> have fixes applied for bugs 692717,  692569 and 692422.

The text is not drawn twice, it is drawn (in effect) 3 times. Two of these are performed by what one might reasonably consider to be the 'watermark', which is contained in a form XObject:

192 0 obj
<<
  /BBox [ -0.240005 -3.92409 180.876 13.2359 ]
  /Subtype /Form
  /Length 180
  /Matrix [ 1 0 0 1 0 0 ]
  /Resources <<
    /Font <<
      /TT0 198 0 R
    >>
    /ProcSet [ /PDF /Text ]
  >>
>>
stream
BT
1 G
0.48 w 
/TT0 1 Tf
0 Tc 0 Tw 0  Ts 100  Tz 1 Tr 12 0 0 12 0 0 Tm
(DEMONSTRATION LICENSE)Tj
ET
BT
0.39999 G
0.24001 w 
/TT0 1 Tf
12 0 0 12 0 0 Tm
(DEMONSTRATION LICENSE)Tj
ET
endstream
endobj
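
The two text objects in that stream can be pulled apart with a short script. This is a minimal sketch, assuming the content stream is already decompressed and uses only the operators seen above; `text_blocks` is a hypothetical helper, not part of any PDF library.

```python
import re


def text_blocks(stream):
    """Split a content stream into BT..ET blocks and summarise each one."""
    blocks = []
    for body in re.findall(r"\bBT\b(.*?)\bET\b", stream, flags=re.S):
        gray = re.search(r"([0-9.]+)\s+G\b", body)    # stroking gray level
        width = re.search(r"([0-9.]+)\s+w\b", body)   # line width
        shown = re.findall(r"\(([^)]*)\)\s*Tj", body)  # strings shown by Tj
        blocks.append({
            "stroke_gray": float(gray.group(1)) if gray else None,
            "line_width": float(width.group(1)) if width else None,
            "strings": shown,
        })
    return blocks


# The stream of object 192, copied from the listing above.
stream = """BT
1 G
0.48 w
/TT0 1 Tf
0 Tc 0 Tw 0  Ts 100  Tz 1 Tr 12 0 0 12 0 0 Tm
(DEMONSTRATION LICENSE)Tj
ET
BT
0.39999 G
0.24001 w
/TT0 1 Tf
12 0 0 12 0 0 Tm
(DEMONSTRATION LICENSE)Tj
ET
"""

for block in text_blocks(stream):
    print(block)
```

The summary shows the same string drawn twice at the same position: once with a white stroke (gray 1) and a 0.48 line width, then again with a gray stroke and a 0.24001 line width.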

The font used here is 'Arial-Black' which is *NOT* embedded in the input PDF file. As a result (and because PDF/A requires that fonts be embedded) pdfwrite embeds Helvetica-Bold as a substitute. 

This works perfectly well. However, the image underlying this form has been created with the same watermark in the same location, as part of the image. Obviously as this is part of the image, it is not text, so no font substitution is required, or takes place.

Because the underlying image text was rendered using the correct font, it does not match up with the watermark which uses a substituted font.

You can open the original file in Adobe Acrobat Professional, use the Touch-Up object tool to select and delete the watermark, and you will see that the watermark text remains underneath, but now the surrounding white area which makes the original watermark stand out has vanished (clearly this wasn't present when the original image was marked).

The underlying page is not a single image, so it's possible to select and delete portions of it. If you do this you will see that some of the watermark disappears each time, demonstrating that this watermark is in fact part of the image.

Embedding the Arial-Black font, or making it available in an embeddable format so that pdfwrite can embed it as required should eliminate this problem.
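
A quick way to tell whether a font is embedded is to look at its font descriptor: an embedded font program is referenced through a /FontFile, /FontFile2 or /FontFile3 entry. The sketch below assumes the descriptor is available as plain text; the function name, object numbers and sample descriptors are made up for illustration and are not taken from the attached file.

```python
import re


def is_embedded(font_descriptor_text):
    """True if the descriptor references an embedded font program."""
    return re.search(r"/FontFile[23]?\b", font_descriptor_text) is not None


# Hypothetical descriptors: the first has no font program, the second does.
not_embedded = "<< /Type /FontDescriptor /FontName /Arial-Black /Flags 32 >>"
embedded = ("<< /Type /FontDescriptor /FontName /Arial-Black /Flags 32 "
            "/FontFile2 200 0 R >>")

print(is_embedded(not_embedded))
print(is_embedded(embedded))
```

A descriptor of the first kind is the case where pdfwrite has to substitute a font when producing PDF/A.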
Comment 7 Dan G. 2012-01-13 16:18:30 UTC
Created attachment 8274 [details]
input PDF. Transparency removed from stamp and font embedded
Comment 8 Dan G. 2012-01-13 16:19:04 UTC
Created attachment 8275 [details]
Output pdfa. rotated 90 degrees
Comment 9 Dan G. 2012-01-13 16:21:05 UTC
We're so close. And I know I'm being a pain - I'm sorry.

We've removed the transparency from the watermark. Embedded the font we want as well.

Now when converting the pdf to pdfa with the commandline noted previously, the output gets rotated 90 degrees.

We notice that when we remove the rotation from the watermark, the output pdfa does not get rotated.

Should we just quit while we're ahead and not rotate the watermark or is there something else we could do to get GS to not rotate.
Comment 10 Ken Sharp 2012-01-13 17:48:43 UTC
The only text in the file now is the watermark.

By default pdfwrite will rotate the output page in order to make the majority of the text run left to right and top to bottom. The watermark text runs left to right, bottom to top so pdfwrite rotates the page.

Try setting "-dAutoRotatePages=/None"; see "/ghostpdl/gs/doc/Ps2pdf.htm#Orientation".
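
For anyone scripting the conversion, here is a sketch that assembles the command line from comment 4 with auto-rotation switched off. The documented values for AutoRotatePages are /None, /All and /PageByPage, so the sketch uses /None. The function name and the file names are placeholders, and the script only builds the argument list; it does not run Ghostscript.

```python
def pdfa_command(input_pdf, output_pdf, auto_rotate="/None"):
    """Build the pdfwrite PDF/A argument list with an AutoRotatePages setting."""
    return [
        "gswin32c",
        "-dPDFA",
        "-dUseCIEColor",
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",
        "-dNOPAUSE",
        "-r600",
        "-dBATCH",
        "-dAutoRotatePages=" + auto_rotate,  # stop pdfwrite rotating pages
        "-sOutputFile=" + output_pdf,
        input_pdf,
    ]


# Placeholder file names.
args = pdfa_command("in.pdf", "out.pdf")
print(" ".join(args))
```

The list can be handed to `subprocess.run(args, check=True)` on a machine where `gswin32c` is on the PATH.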
Comment 11 Dan G. 2012-01-13 17:56:18 UTC
Thanks for the explanation Ken. I appreciate it.