690066 – Create PDF metadata from PS metadata

Summary: Create PDF metadata from PS metadata

Status:	NOTIFIED WONTFIX

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	8.63
Hardware:	PC Windows XP

Importance:	P4 enhancement
Assignee:	Ken Sharp

URL:
Keywords:	bountiable

Depends on:
Blocks:

Reported:	2008-09-10 01:48 UTC by artifex
Modified:	2012-05-28 18:30 UTC
CC List:	2 users

See Also:
Customer:	870
Word Size:	---

Attachments
PostScript file that keeps Metadata (186.53 KB, application/pdf) 2008-09-10 01:50 UTC, artifex
PDF created from PS by Adobe Distiller (15.93 KB, application/pdf) 2010-11-12 09:07 UTC, artifex
Patch implementing basic metadata support (1.87 KB, patch) 2010-12-05 00:48 UTC, brian m. carlson

Description artifex 2008-09-10 01:48:44 UTC

The attached PostScript file is a sample of how Adobe writes metadata into a
PostScript file. It would be nice if Ghostscript could carry the metadata over
into the created PDF file.
Command line: gs -sDEVICE=pdfwrite -o output.pdf -f metadata_sample.ps

Comment 1 artifex 2008-09-10 01:50:04 UTC

Created attachment 4386
PostScript file that keeps Metadata

A sample PostScript file that keeps Metadata (search for "packet begin")
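
For readers without the attachment, the "packet begin" hint refers to the xpacket processing instructions that wrap an XMP packet. A minimal Python sketch of locating such packets in a byte stream follows. The function name and the sample bytes are hypothetical, and it assumes the packet is stored unencoded in the file, which may not hold for every PostScript producer.

```python
import re
from typing import List

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
XPACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)

def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return the raw bytes between each xpacket begin/end pair."""
    return [m.group(1) for m in XPACKET_RE.finditer(data)]

# Hypothetical stand-in for the attached PostScript file.
sample = (
    b"%!PS-Adobe-3.0\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>\n"
    b'<?xpacket end="w"?>\n'
    b"showpage\n"
)
packets = find_xmp_packets(sample)
```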

Comment 2 artifex 2008-09-10 01:57:31 UTC

It's not a bug. It's an enhancement wish.

Comment 3 Ray Johnston 2008-09-11 09:58:36 UTC

At least one other customer, #1, LIKES the fact that we throw away the MetaData
since this makes our PDF smaller.

If this gets done, it should be an option.

Comment 4 brian m. carlson 2010-11-11 18:16:00 UTC

I'm interested in working on this.  I've got some preliminary code so far.  Is there anything specific that you want an implementation of this to do?

Comment 5 brian m. carlson 2010-11-11 22:28:46 UTC

Also, the testcase is invalid.  XMP must be serialized as UTF-8 for PostScript, according to the XMP specification.  Nevertheless, the metadata contains a raw 0xae byte.  Since I presume my code is going to need to handle this, do you want to raise an error if the metadata is invalid, ignore the metadata altogether, remove the offending bytes, or something else altogether?
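
A check of this kind can be sketched in a few lines of Python. This is only an illustration of detecting the problem, not the code from the patch; the function name and sample bytes are hypothetical.

```python
from typing import List

def invalid_utf8_offsets(data: bytes) -> List[int]:
    """Return the offset of every byte at which UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so shift it by pos.
            offsets.append(pos + err.start)
            pos += err.start + 1
    return offsets

# A raw 0xAE byte, as in the test case, is not valid UTF-8 on its own.
bad = invalid_utf8_offsets(b"Word\xae 2003")
good = invalid_utf8_offsets("Word\u00ae 2003".encode("utf-8"))
```

An empty list means the packet is valid UTF-8. What to do with a non-empty list (error, drop, or repair) is the open question above.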

Comment 6 artifex 2010-11-12 09:07:35 UTC

Created attachment 6904
PDF created from PS by Adobe Distiller

The attached PDF file is what Adobe Distiller creates from that PostScript file. It detects the problem with the (R) sign in the metadata and corrects the raw byte to the UTF-8 sequence C2 AE. It would be nice if Ghostscript could do that too.
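
One way to get the same result is sketched below in Python. It assumes that any byte that is not valid UTF-8 was meant as ISO-8859-1, which maps 0xAE to U+00AE and therefore to C2 AE. How Distiller actually decides is not known from this report; the handler and function names are hypothetical.

```python
import codecs

def _latin1_fallback(err):
    """Decode only the offending bytes as ISO-8859-1 (an assumption)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end

codecs.register_error("latin1_fallback", _latin1_fallback)

def repair_to_utf8(data: bytes) -> bytes:
    """Keep valid UTF-8 as-is and re-encode stray bytes as UTF-8."""
    return data.decode("utf-8", errors="latin1_fallback").encode("utf-8")

repaired = repair_to_utf8(b"Microsoft\xae Office")
unchanged = repair_to_utf8(b"Microsoft\xc2\xae Office")
```

Input that is already valid UTF-8 passes through unchanged, so the repair is safe to apply to well-formed packets.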

Comment 7 brian m. carlson 2010-12-05 00:48:14 UTC

Created attachment 7003
Patch implementing basic metadata support

Here is a preliminary patch to implement the Metadata pdfmark.  I looked at trying to autoconvert non-UTF-8 but realized that character conversion is in itself a whole project (and in the case of XMP, it *should* be unnecessary).  It also does not modify the metadata in any way, because this requires at the very least an XML parser and more likely a specialized toolkit (like Exempi).  If there's a specific direction that you want me to go from here, some guidance would be great.

Comment 8 Ken Sharp 2010-12-05 09:45:20 UTC

(In reply to comment #7)

> Here is a preliminary patch to implement the Metadata pdfmark.  I looked at
> trying to autoconvert non-UTF-8 but realized that character conversion is in
> itself a whole project (and in the case of XMP, it *should* be unnecessary). 
> It also does not modify the metadata in any way, because this requires at the
> very least an XML parser and more likely a specialized toolkit (like Exempi). 
> If there's a specific direction that you want me to go from here, some guidance
> would be great.

I haven't properly reviewed the code yet (I will do so on Monday) but it looks to me like this simply dumps the XMP data into the PDF file. The problem with that is that the Info dictionary needs to be updated with any data in the XMP packet which overrides the data in the Info dictionary (eg Creator, Keywords, CreationDate etc).

The XMP and PDF data are required to be the same, and various validation tools will fail if they aren't, in particular PDF/A and PDF/X validation.

Comment 9 brian m. carlson 2010-12-05 17:32:39 UTC

(In reply to comment #8)
> I haven't properly reviewed the code yet (I will do so on Monday) but it looks
> to me like this simply dumps the XMP data into the PDF file. The problem with
> that is that the Info dictionary needs to be updated with any data in the XMP
> packet which overrides the data in the Info dictionary (eg Creator, Keywords,
> CreationDate etc).

It does exactly that.  I would like to use Exempi to pull information from the XMP data into the Info dictionary because it makes it trivially easy to do.  (My test program consists of 27 lines.)  Also, it handles the character set issue gracefully and converts the output into valid UTF-8.  Nevertheless, I don't want to add a dependency that y'all aren't okay with.

> The XMP and PDF data are required to be the same, and various validation tools
> will fail if they aren't, in particular PDF/A and PDF/X validation.

That does make sense.

Comment 10 Ken Sharp 2010-12-06 08:38:47 UTC

(In reply to comment #9)

> It does exactly that.  I would like to use Exempi to pull information from the
> XMP data into the Info dictionary because it makes it trivially easy to do. 
> (My test program consists of 27 lines.)  Also, it handles the character set
> issue gracefully and converts the output into valid UTF-8.  Nevertheless, I
> don't want to add a dependency that y'all aren't okay with.

I don't think we're going to be happy with adding another dependency, and I very much doubt that we'll accept Exempi. It's written in C++ (as is the Adobe XMP SDK), it apparently only builds on Unix systems, and it adds a fair chunk of code for not a lot of benefit.

If we were to do a lot of work with XMP it would be well worth adding this or something else based on the Adobe XMP SDK, but it's a lot to add just in order to support the Metadata pdfmark.

We normally use Expat for an XML parser, but we only build it in when XPS interpretation is supported. I guess we could add it as a dependency for pdfwrite as well, which would at least get you an XML parser, the alternative being to write some quick and dirty parsing to pick out the data from the XMP which needs to be reflected into the Info dictionary.
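
To give a sense of the size of the Expat route, here is a rough Python sketch using the standard library's Expat binding. It is not Ghostscript code. The XMP-to-Info mapping is an assumption based on the usual PDF/A crosswalk rather than anything stated in this report, the function name and sample packet are hypothetical, and dates are left out because XMP and Info use different date formats.

```python
import xml.parsers.expat

DC = "http://purl.org/dc/elements/1.1/"
XMP = "http://ns.adobe.com/xap/1.0/"
PDF = "http://ns.adobe.com/pdf/1.3/"

# Expat reports namespaced names as "<namespace URI> <local name>".
# Assumed mapping from XMP properties to Info dictionary keys.
XMP_TO_INFO = {
    DC + " title": "Title",
    DC + " creator": "Author",
    DC + " description": "Subject",
    PDF + " Keywords": "Keywords",
    PDF + " Producer": "Producer",
    XMP + " CreatorTool": "Creator",
}

def xmp_to_info(packet: bytes) -> dict:
    """Collect Info dictionary candidates from an XMP packet."""
    info = {}
    state = {"key": None, "depth": 0, "text": []}

    def start(name, attrs):
        if state["key"] is not None:
            state["depth"] += 1  # nested rdf:Alt / rdf:Seq / rdf:li
            return
        # Simple properties may be written as attributes.
        for attr_name, value in attrs.items():
            if attr_name in XMP_TO_INFO:
                info[XMP_TO_INFO[attr_name]] = value
        if name in XMP_TO_INFO:
            state.update(key=XMP_TO_INFO[name], depth=1, text=[])

    def chars(data):
        if state["key"] is not None:
            state["text"].append(data)

    def end(name):
        if state["key"] is None:
            return
        state["depth"] -= 1
        if state["depth"] == 0:
            text = "".join(state["text"]).strip()
            if text:
                info[state["key"]] = text
            state["key"] = None

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(packet, True)
    return info

# Hypothetical packet body, not the one from the attachment.
sample = b"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
    pdf:Producer="Example Producer">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Sample</rdf:li></rdf:Alt></dc:title>
   <xmp:CreatorTool>Example Tool</xmp:CreatorTool>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""
info = xmp_to_info(sample)
```

Two limits of the sketch: multi-valued properties such as several dc:creator entries are simply concatenated, and a packet containing invalid UTF-8, like the one in the test case, makes Expat raise an error instead of returning data.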

As for the invalid data, I'm inclined to leave that and treat the XMP packet as a black box. So garbage in, garbage out. I agree that it would be nice to fix this problem, but I really think the onus is on the creator to get it right.

Looks like the original Creator was Microsoft Office, no surprise that they can't manage to write valid XML. Interestingly, although Distiller apparently fixes the XML when converting PostScript to PDF, Acrobat doesn't seem to do so when embedding the XMP in a PostScript file.

It's actually a bit poor that Acrobat embedded the XMP Metadata pdfmark but didn't embed a /DOCINFO pdfmark to set the Info dictionary separately.

Hmm, according to the pdfmark reference, the Metadata pdfmark should only affect the entry in the Catalog, so theoretically we shouldn't alter the Info dictionary based on the content of the Metadata. However, Distiller clearly does.