705962 – Uncompressed XMP metadata

Summary: Uncompressed XMP metadata

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	unspecified
Hardware:	PC Windows 10

Importance:	P4 enhancement
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2022-10-07 13:10 UTC by Ulrike Fischer
Modified:	2022-10-07 18:15 UTC
CC List:	0 users

See Also:
Customer:
Word Size:	---

Attachments
test files (22.38 KB, application/x-zip-compressed) 2022-10-07 13:10 UTC, Ulrike Fischer

Description Ulrike Fischer 2022-10-07 13:10:50 UTC

Created attachment 23273
test files

With the hyperxmp package it is possible in LaTeX to embed extended XMP metadata into a PDF and to reference it in the catalog. This also works with LaTeX+dvips+Ghostscript: a current Ghostscript honors the new metadata stream. But unlike the default metadata, the new stream uses /Filter/FlateDecode and is compressed.

Is it possible to tell Ghostscript, from within the PostScript file, to leave the new metadata stream uncompressed too?

Attached is a zip file with the TeX, PS and PDF files.
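
The difference described above can be checked by looking at the stream dictionary of each metadata object in the resulting PDF. The following is a minimal sketch, not a PDF parser: it assumes the metadata stream dictionary contains no nested dictionaries and declares /Type /Metadata. The function name `metadata_stream_filters` is made up for this illustration.

```python
import re

# Matches the dictionary of a stream object that declares /Type /Metadata,
# up to the "stream" keyword. Assumption: no nested << >> inside it.
_METADATA_DICT = re.compile(
    rb"<<((?:(?!>>).)*?/Type\s*/Metadata\b(?:(?!>>).)*?)>>\s*stream",
    re.DOTALL,
)


def metadata_stream_filters(pdf_bytes):
    """Return one boolean per metadata stream: True if it has a /Filter."""
    return [b"/Filter" in m.group(1) for m in _METADATA_DICT.finditer(pdf_bytes)]


if __name__ == "__main__":
    sample = (
        b"5 0 obj\n<</Type/Metadata/Subtype/XML/Length 10>>stream\n"
        b"6 0 obj\n<</Filter/FlateDecode/Type/Metadata/Subtype/XML/Length 10>>\nstream\n"
    )
    print(metadata_stream_filters(sample))
```

To inspect a real file, read it in binary mode and pass the bytes to the function; a True entry means that metadata stream is filtered (compressed).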

Comment 1 Ken Sharp 2022-10-07 15:34:02 UTC

(In reply to Ulrike Fischer from comment #0)

> Is it possible to tell Ghostscript, from within the PostScript file, to leave
> the new metadata stream uncompressed too?

No, sorry, there is no pdfmark which says 'don't compress this stream'. Also, the stream is created in one place and attached to the Catalog in another; there is no way in the pdfwrite device to associate the two in advance so that it knows not to compress the stream because it is later going to be attached to the Catalog as Metadata.

You could use -dCompressStreams=false but that would write *all* streams uncompressed which I strongly suspect isn't what you would want.
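
For reference, a command line using that switch might be assembled as in the sketch below. The executable name `gs` and the file names are assumptions (on Windows the console executable usually has a different name), and `pdfwrite_command` is a hypothetical helper. Only the `-dCompressStreams=false` switch comes from this comment; `-sDEVICE=pdfwrite` and `-o` are the usual way to select the device and output file.

```python
def pdfwrite_command(ps_path, pdf_path, compress_streams=True, gs="gs"):
    """Build an argument list that converts a PostScript file with pdfwrite.

    gs is the Ghostscript executable name (an assumption; adjust per platform).
    """
    cmd = [gs, "-sDEVICE=pdfwrite", "-o", pdf_path]
    if not compress_streams:
        # Affects *all* streams in the output, not only the metadata stream.
        cmd.append("-dCompressStreams=false")
    cmd.append(ps_path)
    return cmd


if __name__ == "__main__":
    print(" ".join(pdfwrite_command("test.ps", "test.pdf", compress_streams=False)))
```

The returned list can be passed to `subprocess.run` if Ghostscript is installed.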


I'd also point out that the PDF file you are producing by doing this has problems; the XMP metadata and the Info dictionary are not consistent. For instance, your XMP Metadata has 

<pdf:Producer>dvips + Distiller</pdf:Producer>

whereas the Info dictionary has:

/Producer(GPL Ghostscript 10.00.0)

It also looks like the Metadata you are including states that the file is a PDF/A file, and I'm not convinced it is a valid PDF/A file.

Unless your command line tells the pdfwrite device to create a PDF/A file it's not a good idea to have the Metadata declare it as such. In particular a conforming PDF/A file which has both Metadata and an Info dictionary *must* have the properties be consistent between them.
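
As an illustration of the kind of mismatch described above, the sketch below compares the pdf:Producer value in an XMP packet with a Producer string taken from the Info dictionary. It is a simplified check of one property only, assuming the usual Adobe PDF namespace URI for the `pdf:` prefix; real PDF/A validation covers more properties and has its own comparison rules. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace conventionally bound to the "pdf:" prefix in XMP (assumption).
PDF_NS = "http://ns.adobe.com/pdf/1.3/"


def xmp_producer(xmp_xml):
    """Return the pdf:Producer value from an XMP packet, or None if absent."""
    key = "{%s}Producer" % PDF_NS
    for elem in ET.fromstring(xmp_xml).iter():
        if elem.tag == key:            # element form: <pdf:Producer>...</pdf:Producer>
            return (elem.text or "").strip()
        if key in elem.attrib:         # attribute form on rdf:Description
            return elem.attrib[key].strip()
    return None


def producers_consistent(xmp_xml, info_producer):
    """True if the XMP producer equals the Info dictionary /Producer string."""
    return xmp_producer(xmp_xml) == info_producer


if __name__ == "__main__":
    xmp = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
        '<rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">'
        "<pdf:Producer>dvips + Distiller</pdf:Producer>"
        "</rdf:Description></rdf:RDF>"
    )
    print(producers_consistent(xmp, "GPL Ghostscript 10.00.0"))
```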


Can I ask why you are doing this? The inconsistency between the Metadata and the Info dictionary contents is the reason why we don't use this approach for creating ZUGFeRD PDF files.

If all you want to do is add additional XML to the XMP Metadata then you can (with Ghostscript) use the non-standard pdfmark 'Ext_Metadata' key to add to the existing XML generated by the pdfwrite device.

Documented here:

https://ghostscript.readthedocs.io/en/latest/VectorDevices.html#pdfmark-extensions


This has the benefit that the XML and Info are synchronised, minimises the amount you have to write yourself and means that the resulting Metadata is not compressed. It also works with PDF/A output.
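
A small helper for generating such a pdfmark line could look like the sketch below. The operand layout (an /XML key holding a string, followed by /Ext_Metadata pdfmark) is an assumption based on a reading of the documentation linked above and should be checked against it; the helper name `ext_metadata_pdfmark` is hypothetical. The escaping of backslashes and parentheses follows the standard rules for PostScript string literals.

```python
def ext_metadata_pdfmark(xml_fragment):
    """Return a PostScript line passing an XML fragment to Ext_Metadata.

    Assumed form: [ /XML (...) /Ext_Metadata pdfmark
    (verify against the Ghostscript pdfmark-extensions documentation).
    """
    # Escape backslashes first, then parentheses, so the fragment is a
    # valid PostScript string literal.
    escaped = (
        xml_fragment.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "[ /XML (" + escaped + ") /Ext_Metadata pdfmark\n"


if __name__ == "__main__":
    print(ext_metadata_pdfmark("<!-- additional XMP (example) -->"), end="")
```

Because this pdfmark is Ghostscript-specific, a PostScript file using it is not portable to other distillers.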

Comment 2 Ulrike Fischer 2022-10-07 16:46:57 UTC

(In reply to Ken Sharp from comment #1)
> (In reply to Ulrike Fischer from comment #0)
> 
> > Is it possible to tell Ghostscript, from within the PostScript file, to
> > leave the new metadata stream uncompressed too?
> 
> No, sorry, there is no pdfmark which says 'don't compress this stream'. 

Pity ;-) I was hoping that one could use some trick, e.g. setting /Filter to some value which ghostscript would then honor. 

 
> You could use -dCompressStreams=false but that would write *all* streams
> uncompressed which I strongly suspect isn't what you would want.

No, that is easy. 
 
> 
> I'd also point out that the PDF file you are producing by doing this has
> problems; the XMP metadata and the Info dictionary are not consistent. For
> instance, your XMP Metadata has 

Sorry, I didn't mean to confuse the issue. I know that the file is not conformant; it was only meant as a short example showing the compressed metadata stream.


> 
> <pdf:Producer>dvips + Distiller</pdf:Producer>
> 
> whereas the Info dictionary has:
> 
> /Producer(GPL Ghostscript 10.00.0)
> 
> It also looks like the Metadata you are including states that the file is a
> PDF/A file, and I'm not convinced it is a valid PDF/A file.
> 
> Unless your command line tells the pdfwrite device to create a PDF/A file
> it's not a good idea to have the Metadata declare it as such. In particular
> a conforming PDF/A file which has both Metadata and an Info dictionary
> *must* have the properties be consistent between them.

Yes, I know. Does Ghostscript already have an option to suppress the Info dictionary altogether? With PDF/A-4 this will be required.


> If all you want to do is add additional XML to the XMP Metadata then you can
> (with Ghostscript) use the non-standard pdfmark 'Ext_Metadata' key to add to
> the existing XML generated by the pdfwrite device.
> 
> Documented here :
> 
> https://ghostscript.readthedocs.io/en/latest/VectorDevices.html#pdfmark-extensions
> 
> 
> This has the benefit that the XML and Info are synchronised, minimises the
> amount you have to write yourself and means that the resulting Metadata is
> not compressed. It also works with PDF/A output.

I need code that works (if possible) with all TeX engines, and I also need to take Distiller into account, so non-standard extensions are normally not really an option.

Ulrike

Comment 3 Ken Sharp 2022-10-07 17:28:44 UTC

(In reply to Ulrike Fischer from comment #2)

> Yes, I know. Does Ghostscript already have an option to suppress the Info
> dictionary altogether?

No. There are various aspects which can be controlled, these are documented. See:

https://ghostscript.readthedocs.io/en/latest/VectorDevices.html#pdf-file-output

-dOmitInfoDate
-dOmitID
-dOmitXMP

Note there's something screwy with the formatting there and the 'boolean' for the value type has been concatenated into the switch name. I'll get that fixed.


> With pdf/A-4 this will be required.

As far as I'm aware the PDF/A-4 spec still isn't ratified, but I haven't been keeping track. I imagine we will simply roll that into the PDF/A-4 'package' if we decide to support it, rather than requiring it to be set separately.

 
> I need code that works (if possible) with all TeX engines and also need to
> take distiller into account so non-standard extension are normally not
> really an option.

OK then you are stuck with what's in the pdfmark reference, and I cannot see any way to do what you want. Sorry.

Comment 4 Ulrike Fischer 2022-10-07 18:15:55 UTC

(In reply to Ken Sharp from comment #3)
> (In reply to Ulrike Fischer from comment #2)
> 
> > Does Ghostscript already have an option to suppress the Info
> > dictionary altogether?

> > With pdf/A-4 this will be required.
> 
> As far as I'm aware the PDF/A-4 spec still isn't ratified,

It was released two years ago:

https://www.iso.org/standard/71832.html

But I don't own it, so all I know about it is from hearsay.

> OK then you are stuck with what's in the pdfmark reference, and I cannot see
> any way to do what you want. Sorry.

No problem. Thanks for the information.

Ulrike