Bug 695786 - Add support to BMC/EMC and BDC when processing PDF2PDF
Summary: Add support to BMC/EMC and BDC when processing PDF2PDF
Status: RESOLVED DUPLICATE of bug 693691
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: master
Hardware: PC Windows 8
Importance: P4 enhancement
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-13 13:05 UTC by Rodrigo Terra
Modified: 2015-01-21 06:45 UTC

See Also:
Customer:
Word Size: ---


Attachments
PDF Interpreter support to BMC/BDC/EMC implementation (2.76 KB, patch)
2015-01-13 13:05 UTC, Rodrigo Terra
implements BDC references to page property dict (810 bytes, patch)
2015-01-14 10:18 UTC, Rodrigo Terra
Revert changes in opdfread from the first diff (1.44 KB, patch)
2015-01-14 10:21 UTC, Rodrigo Terra
PS,PDF and BATCH for BMC/BDC/EMC test case (19.15 KB, application/x-zip-compressed)
2015-01-14 11:46 UTC, Rodrigo Terra
Implements forward BDC information to no stream properties (3.76 KB, patch)
2015-01-21 06:39 UTC, Rodrigo Terra
Extra BDC/EMC test case (736.92 KB, application/pdf)
2015-01-21 06:45 UTC, Rodrigo Terra

Description Rodrigo Terra 2015-01-13 13:05:06 UTC
Created attachment 11415 [details]
PDF Interpreter support to BMC/BDC/EMC implementation

When Ghostscript reads a PDF file to generate a new PDF with pdfwrite, the marked-content information (BMC/EMC and BDC) in the original file is not preserved, even though pdfwrite supports the BMC/EMC and BDC pdfmarks.
Comment 1 Ken Sharp 2015-01-14 00:49:20 UTC
Hi Rodrigo, and thanks for the patch.

Do we have a contributor's agreement from you already ?

Could you attach an example PDF file which demonstrates the use of this please ? It's easier than trying to find one myself and I assume you tested this. I'd particularly like to see an example using BDC.

I notice you have modified opdfread.ps as well as opdfread.h. Since these are only used with the PostScript output device I'm doubtful about the utility of adding support for marked content here. We don't support any other pdfmarks in the PostScript output code, and we don't test in that code to see if the PostScript interpreter even supports pdfmark. If it doesn't then this would lead to a PostScript error.
Comment 2 Rodrigo Terra 2015-01-14 10:18:54 UTC
Created attachment 11416 [details]
implements BDC references to page property dict

This should be applied after the first patch, which implements BMC/BDC/EMC processing.
Comment 3 Rodrigo Terra 2015-01-14 10:21:05 UTC
Created attachment 11417 [details]
Revert changes in opdfread from the first diff

This should be applied as the third patch.
Comment 4 Rodrigo Terra 2015-01-14 11:46:08 UTC
Created attachment 11418 [details]
PS,PDF and BATCH for BMC/BDC/EMC test case
Comment 5 Rodrigo Terra 2015-01-14 12:17:34 UTC
(In reply to Ken Sharp from comment #1)
> Hi Rodrigo, and thanks for the patch.
> 
> Do we have a contributor's agreement from you already ?
> 
> Could you attach an example PDF file which demonstrates the use of this
> please ? Its easier than trying to find one myself and I assume you tested
> this. I'd particularly like to see an example using BDC.
> 
> I notice you have modified opdfread.ps as well as opdfread.h. Since these
> are only used with the PostScript output device I'm doubtful about the
> utility of adding support for marked content here. We don't support any
> other pdfmarks in the PostScript output code, and we don't test in that code
> to see if the PostScript interpreter even supports pdfmark. If it doesn't
> then this would lead to a PostScript error.


Hi Ken,

Thank you for calling my attention to the contributor's agreement. I sent it to Miles Jones, and I believe it is in order now.

Sorry I didn't explain the problem in detail before. If you have a PDF with marked content and use Ghostscript to process it in any way, for example to change the title metadata, the newly created PDF has all BMC information removed. I have attached some files as a test case, and you can try:

gswin32c -dNOBATCH -dNOPAUSE -sDEVICE=pdfwrite -o pdf.WithActualTextFromPDF.pdf pdf.test.WithActualText.pdf -c "[/Title (test BMC) /DOCINFO pdfmark"

pdf.test.WithActualText.pdf is generated from the attached PostScript file using the batch file that is also attached.

With Ghostscript 9.15 the resulting file, pdf.WithActualTextFromPDF.pdf, has no BMC marks inside. One way to check is to copy and paste the text: with the BMC marks the result should be "1. Oranges 2. Apples", and without them "(1) Oranges (2) Apples", depending on the fonts you have installed.

I have also added a PDF created with Distiller from the same PostScript file, with the font names adjusted to those installed on my machine. The reason is that Distiller creates the marked content in a slightly different way. In the page /Contents, Ghostscript writes /Span /R10 BDC, where /R10 is defined in the /Resources /Properties dictionary. Distiller, on the other hand, takes a more direct approach and writes /Span <</ActualText (2.)>> BDC, without the extra indirection. My new patch is adjusted to support both the Ghostscript and the Distiller forms.
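
The two BDC operand forms described above can be illustrated with a minimal Python sketch. This is not the code from the attached patch and not Ghostscript's data model: the function name `resolve_bdc_properties` and the representation of PDF names as strings and PDF dictionaries as Python dicts are assumptions made purely for illustration.

```python
# Hypothetical, simplified model: PDF names are Python strings starting
# with "/", and PDF dictionaries are Python dicts. Not Ghostscript's types.

def resolve_bdc_properties(operand, page_resources):
    """Return the property dictionary for the second operand of BDC.

    operand is either an inline dictionary (the Distiller form) or a name
    such as "/R10" that has to be looked up in the page's
    /Resources /Properties dictionary (the Ghostscript form).
    """
    if isinstance(operand, dict):
        # Inline form: /Span <</ActualText (2.)>> BDC
        return operand
    if isinstance(operand, str) and operand.startswith("/"):
        # Named form: /Span /R10 BDC
        properties = page_resources.get("/Properties", {})
        if operand not in properties:
            raise KeyError(f"{operand} not found in /Properties")
        return properties[operand]
    raise TypeError(f"unsupported BDC operand: {operand!r}")


# Both forms should resolve to the same property dictionary.
resources = {"/Properties": {"/R10": {"/ActualText": "2."}}}
inline = resolve_bdc_properties({"/ActualText": "2."}, resources)
named = resolve_bdc_properties("/R10", resources)
```

In a real interpreter the looked-up value may itself be an indirect object, so a further dereferencing step would be needed; the sketch leaves that out.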

About opdfread: I followed your advice and reverted all the changes; that is my third patch file.

Ken, please let me know whether this time I have managed to explain the reason for the suggested patch.

Thanks.
Best Regards Rodrigo.
Comment 6 Ken Sharp 2015-01-15 00:12:22 UTC
(In reply to Rodrigo Terra from comment #5)

> Thank you for call my attention about contributor's agreement, I sent to
> Mile Jones and it ok now I guess.

Yes, Miles confirmed that, we're good to go there, thanks for following up on that.

 
> Sorry I didn't explain in details the problem before.

Your explanation was absolutely fine, I understood the problem. There are any number of parts of a PDF file which the PDF interpreter doesn't push forward to the pdfwrite device, which results in certain kinds of metadata going missing.


> With Ghostscript 9.15 resulted pdf pdf.WithActualTextFromPDF.pdf has no BMC
> marks inside.

You don't happen to have a case with BDC marks do you ? Having read through the relevant portion of the specification, that was the area that concerned me most. Ah, I see that there *is* a BDC in the test file, that's great.

> I add pdf created with Distiller also from same PS change adjust font names
> for those installer in my machine and the reason is Distiller creates BMC in
> a slightly different way. In Page /Content Ghostscript does /Span /R10 BDC
> where /R10 is defined at /Resources /Properties dictionary. Distiller on the
> other hand has a more straight approach and add /Span <</ActualText (2.)>>
> BDC not using the double indirection. My new commit it is adjust to support
> both Ghostscript and Distiller ways.

Yes, both are legal, you have to be ready to cope with indirect objects at any time.

 
> About opdfread... I follow your advice and revert all changes and it is my
> third commit file.
> 
> Please Ken let me know if this time I was able to explain matter reason of
> suggested patch.

Looks good right now, I'll follow it up this morning.
Comment 7 Ken Sharp 2015-01-15 07:39:19 UTC
OK I'm afraid there are a number of problems here.

Firstly your patch drops the "/BMClevel BMClevel 1 add store" from the definition of BDC. That's not going to work because we use BMClevel in order to deal with dropping optional content groups which aren't enabled. If you don't keep the levels right then things will get very confused.

Secondly, you are only dealing with BDC which has a tag of '/Span'. That's too limited, BDC can have any tag, and I'm not keen on only supporting /Span. (We already use tag /OC for switching on/off optional content groups for rendering).

If you drop the 'BDC', you *must* also drop the corresponding 'EMC' or Acrobat will get cross when you open the PDF file. You don't do that in this patch, anything which doesn't have a tag of /Span is dropped, which *will* result in problems if you don't drop the EMC as well.
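
The pairing requirement can be sketched in Python as follows. This is only an illustration under assumptions, not the interpreter's PostScript code: the content stream is modelled as a list of (operator, operands) tuples, and `filter_marked_content` and its `keep` predicate are hypothetical names. The nesting level is tracked for every mark, kept or dropped, in the same spirit as BMClevel, so that each EMC is matched to the operator that opened it.

```python
def filter_marked_content(ops, keep):
    """Drop unwanted BMC/BDC operators together with their matching EMC.

    ops  -- list of (operator, operands) tuples in content-stream order
    keep -- predicate taking (operator, operands); True preserves the mark
    """
    out = []
    kept_stack = []  # one flag per currently open BMC/BDC
    for operator, operands in ops:
        if operator in ("BMC", "BDC"):
            kept = keep(operator, operands)
            kept_stack.append(kept)  # track the level even when dropping
            if kept:
                out.append((operator, operands))
        elif operator == "EMC":
            if not kept_stack:
                continue  # assumption: an unbalanced EMC is simply dropped
            if kept_stack.pop():
                out.append((operator, operands))
        else:
            out.append((operator, operands))  # ordinary content passes through
    return out


ops = [
    ("BDC", ("/Span", {"/ActualText": "1."})),
    ("Tj", ("(1)",)),
    ("BDC", ("/PlacedGraphic", "/MC0")),
    ("Tj", ("Oranges",)),
    ("EMC", ()),
    ("EMC", ()),
]
# Keep only /Span marks; the nested /PlacedGraphic pair is removed as a unit.
result = filter_marked_content(ops, lambda op, operands: operands[0] == "/Span")
```

Here the inner BDC and the first EMC are dropped together, while the outer /Span BDC keeps its own EMC, so the output stays balanced.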

As you noted we may need to pull the properties dictionary from the page resources Properties dictionary. There is, unfortunately, a serious problem here. The problem is that the Properties dictionary can include apparently anything. I have a number of PDF files here which have Properties dictionaries which reference stream objects.

I've struggled with this all day and at the moment I cannot see a way to add a stream object to a PDF file using a pdfmark. If I can't do that, then I can't preserve this kind of marked content (FYI this is a 'PlacedGraphic' tag whose Properties dictionary includes a /Metadata key, the value of that is a stream object).

Even if I could come up with a way to do that, it would take me too long to add this to the existing patch. I'd also have to consider how to cope with a Properties dictionary whose values can be (apparently) anything at all, and this is going to be hard to do in general.

I'd say this is several weeks work for me to implement.

So I'm sorry but I'm not able to adopt this patch as it stands. I'm instead going to close this as a duplicate of Bug #693691, hopefully the information here will be of use if I ever get the time to add that feature properly.

*** This bug has been marked as a duplicate of bug 693691 ***
Comment 8 Rodrigo Terra 2015-01-21 06:39:58 UTC
Created attachment 11426 [details]
Implements forward BDC information to no stream properties

The old patch has been updated to skip properties that reference stream objects, and it now supports all tags, not only the /Span tag.
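
The kind of check involved in skipping such properties might look like the following Python sketch. It is not taken from the attached patch: the `Stream` class and the function `references_stream` are hypothetical stand-ins, and PDF dictionaries and arrays are modelled as Python dicts and lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """Stand-in for a PDF stream object (hypothetical, not a Ghostscript type)."""
    data: bytes = b""


def references_stream(value):
    """Return True if value is, or contains at any depth, a stream object."""
    if isinstance(value, Stream):
        return True
    if isinstance(value, dict):
        return any(references_stream(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return any(references_stream(v) for v in value)
    return False  # names, strings, numbers and booleans are safe to forward


plain = {"/ActualText": "2."}
with_metadata = {"/Metadata": Stream(b"<x:xmpmeta/>")}
nested = {"/Kids": [{"/Metadata": Stream()}]}
```

A property dictionary for which `references_stream` returns False could be forwarded as a BDC pdfmark, while one that returns True would have its BDC and matching EMC dropped together.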
Comment 9 Rodrigo Terra 2015-01-21 06:45:24 UTC
Created attachment 11427 [details]
Extra BDC/EMC test case

This file tests nested BDC marks and /tag <property reference to stream object> BDC

It is not expected to preserve those BDC marks, but the PDF should be processed cleanly, stripping the BDC content when a stream object is involved.

Today all BDC information is stripped; with this patch all non-stream information will be preserved. Stream objects will still have to wait for future improvements, and it is quite possible that a little more implementation using the PUTSTREAM pdfmark could handle that case too. At least I have this feeling.