Bug 698208 - PDF contains current timestamps - Bad for reproducible builds
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: master
Hardware: PC All
Importance: P4 enhancement
Assignee: Ken Sharp
QA Contact: Bug traffic
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-06 03:55 UTC by Danny Milosavljevic
Modified: 2017-07-06 09:25 UTC (History)
0 users

See Also:
Customer:
Word Size: ---


Attachments
Make "/ID" optional. (3.16 KB, patch)
2017-07-06 03:55 UTC, Danny Milosavljevic
Don't write UUIDs (1.21 KB, patch)
2017-07-06 03:55 UTC, Danny Milosavljevic

Description Danny Milosavljevic 2017-07-06 03:55:03 UTC
Created attachment 13928
Make "/ID" optional.

PDFs generated by ghostscript contain the current time - which is bad for reproducible builds.

We patch this out ourselves in the GNU Guix distribution, but I thought I'd also bring it up here.

What we changed is:
- We emit "/ID" in the trailer only when encrypting.
- We don't emit the document UUID in the RDF header.
- We always set the instance UUID to empty (PDF/A already did that).

What would you think about carrying these?
Comment 1 Danny Milosavljevic 2017-07-06 03:55:44 UTC
Created attachment 13929
Don't write UUIDs
Comment 2 Ken Sharp 2017-07-06 04:09:59 UTC
Nope, not doing this.

(In reply to Danny Milosavljevic from comment #0)

> PDFs generated by ghostscript contain the current time - which is bad for
> reproducible builds.

A produced PDF file is hardly a build.

> What would you think about carrying these?

Nope, not implementing this.
Comment 3 Danny Milosavljevic 2017-07-06 04:24:51 UTC
>Nope, not doing this.

Fair enough.

>A produced PDF file is hardly a build.

Hmm... many packages we carry have documentation.  Some of that documentation is built into a PDF (among other formats).  We then install the documentation.

But because this PDF keeps changing every time the package is built, the package (for the same version) will never be done.

We think it's important for security that one actually knows that a binary package corresponds to a given source package.  That's why we have a "challenge" check which tries to build a binary package from source even though it already has a binary package - and then compares both binary packages.

That's how we found this problem, because groff kept failing the challenge.
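The kind of check described above can be sketched roughly as follows. This is only an illustration, not how Guix implements its check: the names `tree_digest` and `challenge` are made up here, and the sketch assumes both build outputs are plain directory trees on disk.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every regular file under root together with its relative path."""
    digest = hashlib.sha256()
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    for rel, path in files:
        # Separate path and content with NUL bytes so they cannot run together.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def challenge(existing: Path, rebuilt: Path) -> bool:
    """Return True when the two build outputs are byte-for-byte identical."""
    return tree_digest(existing) == tree_digest(rebuilt)
```

A PDF that embeds the current time differs between the two trees on every rebuild, so a comparison like this fails even though nothing in the source changed.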

Do you think groff should be changed instead?
Comment 4 Ken Sharp 2017-07-06 04:35:28 UTC
(In reply to Danny Milosavljevic from comment #3)

> We think it's important for security that one actually knows that a binary
> package corresponds to a given source package.  That's why we have a
> "challenge" check which tries to build a binary package from source even
> though it already has a binary package - and then compares both binary
> packages.
> 
> That's how we found this problem, because groff kept failing the challenge.
> 
> Do you think groff should be changed instead?

Frankly, not my problem.

You could exclude the documentation from the package, exclude the documentation from the check, distribute documentation in a format other than PDF, or use a different tool to create your PDF. There are probably other solutions; this is just off the top of my head.

There is nothing wrong with the PDF produced by Ghostscript. The information is valuable to other PDF consumers, even if you find it a problem, so I don't plan to remove it. You should also consider the possibility of future PDF revisions *requiring* a UUID or time stamp.
Comment 5 jsmeix 2017-07-06 04:37:41 UTC
FYI:

See also the issue
https://bugs.ghostscript.com/show_bug.cgi?id=696765
which is basically a duplicate of this one,
in particular regarding UUIDs and IDs in PDF see
https://bugs.ghostscript.com/show_bug.cgi?id=696765#c15
and Ken's reply
https://bugs.ghostscript.com/show_bug.cgi?id=696765#c16

Additionally regarding "reproducible builds"
see also the related issue
https://bugs.ghostscript.com/show_bug.cgi?id=697484
that explains a crucial distinction about what
is actually meant by "reproducible builds".

Finally regarding PDFs in some software packages
and "reproducible builds" of that software
have a look at the bottom of my
https://bugs.ghostscript.com/show_bug.cgi?id=696765#c15
and the comments starting at
https://bugs.ghostscript.com/show_bug.cgi?id=696765#c20
Comment 6 Ken Sharp 2017-07-06 04:40:32 UTC
(In reply to jsmeix from comment #5)
> FYI:
> 
> See also the issue
> https://bugs.ghostscript.com/show_bug.cgi?id=696765
> which is basically a duplicate of this one,

And is also resolved - wontfix


> Finally regarding PDFs in some software packages
> and "reproducible builds" of that software
> have a look at the bottom of my
> https://bugs.ghostscript.com/show_bug.cgi?id=696765#c15
> and the comments starting at
> https://bugs.ghostscript.com/show_bug.cgi?id=696765#c20

The answer is still 'no'.
Comment 7 Danny Milosavljevic 2017-07-06 08:36:20 UTC
jsmeix: Thank you for the links, the discussions were very illuminating! But it's interesting that the possibility of just leaving the UUID off wasn't brought up at all.

Ken Sharp: Thanks for the information and also for the frankness. I'll pass the matter on to the groff maintainers then.

As a distributor, we don't want to patch anyone's packages needlessly (neither yours, nor groff, nor any others), but there's too much at stake with people using random unverifiable binaries.  So the question is where the change should be made; it could also make sense to patch groff so that it post-processes the finished PDF files instead. I'll see what they say.
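To make the "patch the finished PDF files" idea concrete, here is a minimal sketch. It is not what groff or Guix actually does: `normalize_dates` is a hypothetical helper, and it assumes the dates sit in an uncompressed, unencrypted Info dictionary in the usual `(D:YYYYMMDDHHmmSS...)` form. It leaves the trailer `/ID`, the XMP metadata and any timezone suffix untouched, so it covers only part of the problem.

```python
import re

# Matches e.g. b"/CreationDate(D:20170706035503" and captures the 14 digits.
_DATE = re.compile(rb"(/(?:CreationDate|ModDate)\s*\(D:)(\d{14})")


def normalize_dates(pdf: bytes, fixed: bytes = b"19700101000000") -> bytes:
    """Overwrite Info-dictionary dates with a fixed value of the same length.

    Keeping the length unchanged means the byte offsets recorded in the
    cross-reference table stay valid.
    """
    if len(fixed) != 14 or not fixed.isdigit():
        raise ValueError("fixed must be exactly 14 ASCII digits")
    return _DATE.sub(lambda m: m.group(1) + fixed, pdf)
```

Because the replacement has the same length as the original digits, the file does not need its cross-reference table rewritten afterwards.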

>You should also consider the possibility of future PDF revisions *requiring* a UUID or time stamp.

We'll cross that bridge when we come to it. But thanks!

Right now I'm trying to give all the other packages a fighting chance of being reproducible - but I don't want to individually patch like 4000 packages to change how they do their documentation.
Comment 8 Ken Sharp 2017-07-06 08:54:58 UTC
(In reply to Danny Milosavljevic from comment #7)

> >You should also consider the possibility of future PDF revisions *requiring* a UUID or time stamp.
> 
> We'll cross that bridge when we come to it. But thanks!

Here's another thought, though I'm not certain if it's relevant: different versions of the pdfwrite device have different optimisations, and different defaults. It seems likely to me that, in order to get a 'reproducible' 'binary' (where binary includes the documentation), you would have to use a specific version of Ghostscript to produce the PDF file.

For example, a relatively recent optimisation emits rectangular subpaths for clips as 're' operators instead of a sequence of lines and moves. So it's easily possible that a PDF produced before this optimisation and one produced after it would be significantly different internally, though visually they would be identical. The 're' version of the PDF file is smaller, which is why it's an optimisation for some files.
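To illustrate the size difference, the sketch below writes the same rectangle both ways. The helper names are invented for this example, and the exact byte sequences pdfwrite emits are an assumption; only the equivalence between `re` and the move/line/close form comes from the PDF operator definitions.

```python
def rect_as_lines(x: int, y: int, w: int, h: int) -> str:
    """Rectangle as one move, three lines and a closepath (m, l, l, l, h)."""
    return f"{x} {y} m {x + w} {y} l {x + w} {y + h} l {x} {y + h} l h"


def rect_as_re(x: int, y: int, w: int, h: int) -> str:
    """The same rectangle as a single 're' operator."""
    return f"{x} {y} {w} {h} re"


long_form = rect_as_lines(0, 0, 100, 50)
short_form = rect_as_re(0, 0, 100, 50)
# Both describe the same closed subpath, but the byte streams differ.
print(long_form)
print(short_form)
print(len(long_form), len(short_form))
```

The two content streams render identically, yet any byte-wise comparison of the resulting PDF files reports a difference.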

Similarly the default subsampling filter has changed several times over the years, though that would likely produce visually different output as well.

I think it's also possible for the numbers emitted by the pdfwrite device to depend on the C math library (it's certainly true that the Windows and Linux ones differ, and the PDF files produced can be different too). Again, these are not visible differences, since they are in small decimal values, but they are there, and they mean the PDF files produced by two different Ghostscript executables could well be different.

The only way they will be definitely the same is if you use *exactly* the same Ghostscript source, built using exactly the same compiler (and possibly linker and other tools).
Comment 9 Danny Milosavljevic 2017-07-06 09:25:27 UTC
>different versions of the pdfwrite device have different optimisations, and different defaults. It seems likely to me that, in order to get a 'reproducible' 'binary' (where binary includes the documentation) you would have to use a specific version of Ghostscript to produce the PDF file.

>The only way they will be definitely the same is if you use *exactly* the same Ghostscript source, built using exactly the same compiler (and possibly linker and other tools).

Yes, and that is what we indeed do.

The metadata of every package [derivation] in Guix includes the content of all its dependencies (really their SHA hash values), so if anything the package depends on changes (including the compiler or libc), the package will be rebuilt and get a new installed name (the hash value). (It could be that the build ends up with the same result as before; in that case the hash value will be the same and all is well.)

When you ask for a package to be installed, it generates a hash of the package spec, the build scripts, all the dependent packages' hashes and the source code of the program, and uses this to look up the corresponding binary. If there's no binary with that hash, it will build what is needed, recursively.
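The lookup described above can be sketched like this. It is a simplified model rather than Guix's actual derivation format: `derivation_hash`, `lookup_or_build` and the in-memory `store` dictionary are all hypothetical names, and SHA-256 is used only as an example hash.

```python
import hashlib
from typing import Callable, Dict, Iterable


def derivation_hash(spec: str, build_script: str, source_hash: str,
                    dep_hashes: Iterable[str]) -> str:
    """Hash everything that can influence the build result."""
    digest = hashlib.sha256()
    for field in (spec, build_script, source_hash, *sorted(dep_hashes)):
        data = field.encode("utf-8")
        # Length-prefix each field so adjacent fields cannot be confused.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def lookup_or_build(key: str, store: Dict[str, str],
                    build: Callable[[], str]) -> str:
    """Return the stored binary for key, building it only when missing."""
    if key not in store:
        store[key] = build()
    return store[key]
```

Changing any dependency hash (for instance a patched libc) produces a different key, so the lookup misses and the package is rebuilt.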

There's a major downside of reproducible builds as we do them. For example when stack clash was announced, we patched glibc and that meant we had to rebuild everything (it's still ongoing).

Still, the peace of mind of being able to do *anything* without fear of breaking the system is worth it. Whatever I do, whatever I install or remove, the already-installed packages are immutable. An installed package is available as something like /gnu/store/32523532f32523532f (which is mounted read-only) where 32523532f32523532f is the hash of the entire contents, including all dependencies and transitive dependencies.

And when a program of that package is run, it will use *exactly* the dependencies it was built with, at all times (it will not pick up some random shared libraries).