Bug 696765

Summary: Support SOURCE_DATE_EPOCH for reproducible builds
Product: Ghostscript Reporter: infinity0
Component: General Assignee: Default assignee <ghostpdl-bugs>
Status: RESOLVED WONTFIX    
Severity: normal CC: chris.liddell, cloos, ghostscriptbmw, jsmeix, kaz, mehmetgelisin, stefan.bruens
Priority: P4    
Version: unspecified   
Hardware: All   
OS: All   
URL: https://wiki.debian.org/ReproducibleBuilds/TimestampsProposal
Customer: Word Size: ---
Attachments: Allow the build timestamp to be externally set
rebased patch to 9.52

Description infinity0 2016-05-10 02:51:19 UTC
Created attachment 12529
Allow the build timestamp to be externally set

Hi, we at the Reproducible Builds project have developed a standard for build tools to follow if they wish to support exact bitwise reproducible output. Bitwise reproducibility is essential for automatically verifying that multiple builders reached the same result, since (for example) it is impossible to develop a general algorithm to say that two different timestamps embedded in *arbitrary* code or data actually "mean" the same thing.

Attached is a patch to make ghostscript support the SOURCE_DATE_EPOCH environment variable. When set, all references to the "current" date/time in the build output will instead refer to this date, which is the number of seconds (excluding leap seconds) since the Unix epoch (1970-01-01 UTC in the Gregorian calendar). We have already been using this in Debian with success at making ghostscript generate bitwise reproducible output.
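
For readers unfamiliar with the convention, here is a minimal sketch of how a tool might honour SOURCE_DATE_EPOCH. It is not the attached patch; the function name build_timestamp is made up for illustration, and falling back to the real clock when the variable is unset or empty is an assumption about the desired behaviour.

```python
import os
import time


def build_timestamp(environ=os.environ):
    """Return the timestamp to embed, in seconds since the Unix epoch.

    Hypothetical helper: not part of Ghostscript or of the attached patch.
    """
    raw = environ.get("SOURCE_DATE_EPOCH", "")
    if raw == "":
        # Variable unset or empty: fall back to the real clock.
        return int(time.time())
    # A malformed value raises ValueError rather than being silently ignored.
    return int(raw)
```

With the variable set, two builds run at different times embed the same value; without it, behaviour is unchanged.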
Comment 1 Ken Sharp 2016-05-10 03:30:01 UTC
I think there is, at the least, some confusion over terminology here. You refer to the 'build timestamp', yet you are modifying the current time function.

We don't have a 'build timestamp', the closest thing in a Ghostscript build would be the release time stamp, which is compiled in a string.

So I'm going to conclude that you are referring to the Creation and Modification dates in PDF output exclusively, and have not considered the rendering of PostScript files. PostScript has functions to access the real time clock, and your patch would break that. Some standard PostScript test suites use the clock functions, and print the result on the output, and use two calls to the clock to determine elapsed time, which is also printed. At least one well-known PostScript test file fails (undefined result, i.e. divide by zero) if the two times are identical.
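
The failure mode can be sketched as follows. This is an illustration in Python rather than PostScript, and not the test file in question; pages_per_second and frozen_clock are hypothetical names, and the frozen value is arbitrary.

```python
def pages_per_second(pages, clock):
    """Throughput computed from two clock reads, as a timing test might."""
    start = clock()
    # ... the work being timed would happen here ...
    end = clock()
    # If both reads return the same value, this divides by zero.
    return pages / (end - start)


def frozen_clock():
    # Models a clock pinned to a single instant (arbitrary example value).
    return 1462848679
```

Calling pages_per_second(10, frozen_clock) raises ZeroDivisionError, whereas any clock that advances between the two reads yields a result.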

In addition, your patch only addresses Linux; we would need to concoct similar code for all the platform-specific files, or risk causing confusion by having different behaviour on different platforms.

So, assuming that your concern is solely the date and time stamps in the PDF files produced by the pdfwrite device, we are not happy about producing files which lie about the time. We are prepared to consider a command line option to prevent the inclusion of the CreationDate and ModDate, and the XML Metadata, as an enhancement if this is sufficient for your purposes.

Of course, it seems to us that this would be failing part of your objective, since you won't be testing the time functions in different builds in this way, but that is going to be true no matter what you do if you require that the creation dates be the same.
Comment 2 infinity0 2016-05-10 04:47:52 UTC
Whoops, sorry about the jargon. You are right, "build timestamp" refers to the context where ghostscript is used as part of a build process - i.e. to generate some output from some source code, which is meant to be consumed later by a reader.

I'm not familiar with PostScript and sorry that the patch omits things. I was taking over the work from someone else and didn't review it too closely. But we'd be happy to improve it to the standards that you need.

Regarding "PostScript has functions to access the real time clock", there are two things (from my side, being unfamiliar with the format) that this could mean:

1. When the reader reads the end result X.ps, this dynamically adds the time of reading into the displayed file.
2. When the builder builds X.ps from X.src, this dynamically embeds the time of build into X.ps, and future readers of X.ps see this as a static piece of information.

(1) is fine for Reproducible Builds as presumably this would be a call to a function, and this call itself would be represented as a static string of bytes in X.ps.

However, the whole point of SOURCE_DATE_EPOCH is that (2) is really not what people *actually mean* in practice, and people only historically used it because they didn't have the better alternative of SOURCE_DATE_EPOCH available. From this perspective, you would not be "[lying] about the time" - the effect would be roughly the same if the build machine had set its own system time clock back to that date. It is not ghostscript's job to override the intentions of the system administrator, and similarly it is not ghostscript's job to judge that "someone who sets SOURCE_DATE_EPOCH is lying about the time" and ignore it for that reason.

(The reason we have SOURCE_DATE_EPOCH is that in practice setting the system time breaks some other behaviours, and doesn't work in the case where e.g. your build process takes between 500 and 508 seconds and you generate X.ps in the last 90 percent of the build process.)

> We are prepared to consider a command line option to prevent the inclusion of the CreationDate and ModDate, and the XML Metadata, as an enhancement if this is sufficient for your purposes.

This would be a less-than-ideal alternative - the reason we came up with SOURCE_DATE_EPOCH is so that builders wouldn't need to hard-code tool-specific command line options everywhere. For example, GCC accepted our patches recently and GCC 7+ will honour SOURCE_DATE_EPOCH for the __TIME__ and __DATE__ macros. Other documentation generators like doxygen and sphinx have also accepted our patches.
Comment 3 Ken Sharp 2016-05-10 05:02:11 UTC
(In reply to infinity0 from comment #2)
> Whoops, sorry about the jargon. You are right, "build timestamp" refers to
> the context where ghostscript is used as part of a build process - i.e. to
> generate some output from some source code, which is meant to be consumed
> later by a reader.

So, not simply confined to PDF then, but any input to any output format?

> Regarding "PostScript has functions to access the real time clock", there
> are two things (from my side, being unfamiliar with the format) that this
> could mean:
> 
> 1. When the reader reads the end result X.ps, this dynamically adds the time
> of reading into the displayed file.
> 2. When the builder builds X.ps from X.src, this dynamically embeds the time
> of build into X.ps, and future readers of X.ps see this as a static piece of
> information.

Neither, the PostScript program requests the date/time and either manipulates it, printing some result which is determined by the date/time, or simply prints the date/time as part of the output.

This is common practice, for example, in the Quality Logic test suite.


> (1) is fine for Reproducible Builds as presumably this would be a call to a
> function, and this call itself would be represented as a static string of
> bytes in X.ps.

I'm not discussing creating a PostScript output file, I'm talking about executing a PostScript program. So the content of x.ps isn't really relevant.


> > We are prepared to consider a command line option to prevent the inclusion of the CreationDate and ModDate, and the XML Metadata, as an enhancement if this is sufficient for your purposes.
> 
> This would be a less-than-ideal alternative - the reason we came up with
> SOURCE_DATE_EPOCH is so that builders wouldn't need to hard-code
> tool-specific command line options everywhere. For example, GCC accepted our
> patches recently and GCC 7+ will honour SOURCE_DATE_EPOCH for the __TIME__
> and __DATE__ macros. Other documentation generators like doxygen and sphinx
> have also accepted our patches.

The 'builders' will need to hard code Ghostscript-specific command line options already, you won't get anything usable if you don't, and you would want to specify many options quite carefully or there is a significant likelihood that identical builds on different machines will produce different output (for example different values from libpaper).

Given that you need to specify options to Ghostscript already, it doesn't seem onerous to require a specific request to disable the production of timestamps in a PDF file.
Comment 4 infinity0 2016-05-10 08:53:29 UTC
(In reply to Ken Sharp from comment #3)
> (In reply to infinity0 from comment #2)
> > Whoops, sorry about the jargon. You are right, "build timestamp" refers to
> > the context where ghostscript is used as part of a build process - i.e. to
> > generate some output from some source code, which is meant to be consumed
> > later by a reader.
> 
> So, not simply confined to PDF then, but any input to any output format?
> 
> [..]
> 
> I'm not discussing creating a PostScript output file, I'm talking about
> executing a PostScript program. So the content of x.ps isn't really relevant.
> 

I've reviewed a bit more of our notes and hopefully understand the situation a bit better. Yes, it looks like our patch was only about PDF output [1] but in theory should apply to any output format that Ghostscript supports - let me know if we should extend this to things other than PDF.

Regarding PS, if I understand you correctly, then by "PostScript program" you mean {a .ps file which contains a static sequence of bytes that means "get the current date/time"}. If that's the case, then yes indeed this won't affect R-B and is outside of the scope of our discussion; sorry for the confusion.

But now I think I understand your point: the patch I attached will also affect this PS behaviour, and I agree that this is not correct; we'll fix it.

[1] https://wiki.debian.org/ReproducibleBuilds/PdfGeneratedByGhostscript

> > > We are prepared to consider a command line option to prevent the inclusion of the CreationDate and ModDate, and the XML Metadata, as an enhancement if this is sufficient for your purposes.
> > 
> > This would be a less-than-ideal alternative [..]
> 
> The 'builders' will need to hard code Ghostscript-specific command line
> options already, you won't get anything usable if you don't [..] it doesn't
> seem onerous to require a specific request to disable the production of
> timestamps in a PDF file.

I understand where you're coming from, and yes your suggestion would indeed be similar to mechanisms like CFLAGS etc. During the build process, an OS distribution like Debian could supply a default set of GHOSTSCRIPTFLAGS to disable timestamp creation, and the specific package would append their own flags as necessary.

However, the real-world situation is that most buildsystems do not have support for infrastructure like GHOSTSCRIPTFLAGS; we would have to add this everywhere and in all buildsystems, and do this once for each tool like Ghostscript, for a total cost of O(m*n) (# of buildsystems x # of build tools). But embedded timestamps are the biggest single issue that blocks Reproducible Builds today [2], and tools honouring SOURCE_DATE_EPOCH would greatly reduce the cost of achieving this, to O(n) (# of build tools).

(One alternative, to require every piece of software that uses ghostscript, to add this flag specifically, would be even higher cost. Most developers should not need to specifically think about Reproducible Builds, it should Just Work for them.)

Yes, from your point of view I can understand you don't want to support every random new environment variable coming along claiming to be special, but we do have data to back up this claim.

Anyway, if you are not convinced by this then sure, we'll have to change our patch to implement the command line option instead. Let me know what you prefer in the end.

[2] https://wiki.debian.org/ReproducibleBuilds/Howto#Files_in_data.tar_contain_timestamps
Comment 5 Ken Sharp 2016-05-10 09:13:09 UTC
(In reply to infinity0 from comment #4)

> I've reviewed a bit more of our notes and hopefully understand the situation
> a bit better. Yes, it looks like our patch was only about PDF output [1] but
> in theory should apply to any output format that Ghostscript supports - let
> me know if we should extend this to things other than PDF.

Your patch would affect the execution of PostScript programs, which is one reason we're against it.

 
> Regarding PS, if I understand you correctly, then by "PostScript program"
> you mean {a .ps file which contains a static sequence of bytes that means
> "get the current date/time"}.

PostScript is a programming language, Ghostscript is an interpreter for that programming language. The language includes means to interrogate the system clock. The program can then use that information for any purpose it sees fit, and it can easily be used to control the flow of execution in the program, resulting in different output.

At heart Ghostscript is intended to take PostScript as an input and produce raster as an output. PDF input is a recent extension, as is high level (vector) output such as PDF or PostScript. GhostPCL will take PCL and GhostXPS will take XPS as an input, and again these use the same graphics library as Ghostscript. Obviously we have to consider the impact on all input languages and output formats.


> > > This would be a less-than-ideal alternative [..]
> > 
> > The 'builders' will need to hard code Ghostscript-specific command line
> > options already, you won't get anything usable if you don't [..] it doesn't
> > seem onerous to require a specific request to disable the production of
> > timestamps in a PDF file.
> 
> I understand where you're coming from, and yes your suggestion would indeed
> be similar to mechanisms like CFLAGS etc.

I think we have crossed wires again I'm afraid. I'm not discussing any kind of build-time change, such as an alteration to CFLAGS. I'm prepared to implement a run-time flag, which would disable the part of the PDF output which is causing you a problem.

Whoever runs the executable in order to test it must supply a bunch of flags to Ghostscript in order to configure it, so it doesn't seem onerous to have the user add a flag which omits date/time output from the pdfwrite device's output, purely for the purpose of this testing.

The CreationDate and ModDate are optional in PDF, and we would prefer to omit them, rather than produce something which doesn't match the system time.


> Yes, from your point of view I can understand you don't want to support
> every random new environment variable coming along claiming to be special,
> but we do have data to back up this claim.
> 
> Anyway, if you are not convinced by this then sure, we'll have to change our
> patch to implement the command line option instead. Let me know what you
> prefer in the end.

As stated, our preference is to provide a command-line (run-time) option to omit the CreationDate and ModDate from being written to the output PDF file. I'm not asking you to write this, I'm offering it as a solution which we will implement.

I'm not a Linux user myself, but I have discussed this with the other developers, including our Linux build maintainer, and we are currently not inclined to take on any patch which interferes with the time operators in PostScript. For the purposes of producing PDF files which can be simplistically compared we will implement a control as described, if this is sufficient for you.
Comment 6 infinity0 2016-05-10 09:54:41 UTC
(In reply to Ken Sharp from comment #5)
> I think we have crossed wires again I'm afraid. I'm not discussing any kind
> of build-time change, such as an alteration to CFLAGS. I'm prepared to
> implement a run-time flag, which would disable the part of the PDF output
> which is causing you a problem.
> 
> Whoever runs the executable in order to test it must supply a bunch of flags
> to Ghostscript in order to configure it, so it doesn't seem onerous to have
> the user add a flag which omits date/time output from the pdfwrite device's
> output, purely for the purpose of this testing.
> 

OK, thanks for the explanation of Ghostscript; I'll try to explain reproducible builds a bit better:

When I say "build time", I mean when Ghostscript is invoked as part of the build process of some other project, to build (e.g.) some documentation. So I'm not talking about Ghostscript's own build process, but that of a project that uses Ghostscript.

We at the Reproducible Builds project represent many OS distributions, whose job it is to package up 10000s of these projects, and make sure that their build processes produce bit-for-bit identical results. Our goal is to make this "the default" of buildsystems, so that project developers don't have to specifically "opt-in" to this security property. "Opt-in" security is not really security, because people don't want to care about security, and won't actually "opt-in".

In other words, we would prefer the cost to be zero, rather than merely for it to be not "onerous". Minor non-onerous costs quickly add up, across all the 10000s of packages that we have to handle. (In fact we would probably just keep patching ghostscript instead of using this flag, since it's easier than patching the several dozen projects that use ghostscript.)

> As stated, our preference is to provide a command-line (run-time) option to
> omit the CreationDate and ModDate from being written to the output PDF file.
> I'm not asking you to write this, I'm offering it as a solution which we
> will implement.
> 
> I'm not a Linux user myself, but I have discussed this with the other
> developers, including our Linux build maintainer, and we are currently not
> inclined to take on any patch which interferes with the time operators in
> PostScript. For the purposes of producing PDF files which can be
> simplistically compared we will implement a control as described, if this is
> sufficient for you.

Would it be possible to omit CreationDate/ModDate when SOURCE_DATE_EPOCH is nonempty, *without* requiring an extra command-line flag?

Of course nothing should affect the time operators in PostScript (and it will probably not affect R-B) - but I'd like to point out that, it's certainly possible to decouple PDF CreationDate/ModDate from PS time operator interpretation, so that honouring SOURCE_DATE_EPOCH doesn't affect PostScript at all.
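
To illustrate the decoupling I mean, here is a rough sketch (Python for brevity, not Ghostscript code). Both function names are hypothetical, and I am assuming the date is written as a PDF date string of the form D:YYYYMMDDHHmmSS with a trailing Z for UTC. Only the metadata path consults SOURCE_DATE_EPOCH; the stand-in for the interpreter's clock never does.

```python
import os
import time
from datetime import datetime, timezone


def ps_realtime_ms():
    """Stand-in for the interpreter's clock: always reads the real clock."""
    return int(time.monotonic() * 1000)


def pdf_creation_date(environ=os.environ):
    """Date string for CreationDate/ModDate, honouring SOURCE_DATE_EPOCH."""
    raw = environ.get("SOURCE_DATE_EPOCH", "")
    # Unset or empty: use the real wall-clock time, as today.
    seconds = int(raw) if raw else int(time.time())
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Assumed output form: D:YYYYMMDDHHmmSS followed by Z for UTC.
    return stamp.strftime("D:%Y%m%d%H%M%SZ")
```

With SOURCE_DATE_EPOCH set, pdf_creation_date returns the same string on every run, while ps_realtime_ms keeps advancing.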
Comment 7 infinity0 2016-05-10 10:12:01 UTC
(In reply to infinity0 from comment #6)
> Of course nothing should affect the time operators in PostScript (and it
> will probably not affect R-B)

Hmm, actually I just reviewed some of our packages and I think I am wrong here. Some of them use ps2pdf to build pdfs, and (if I understand correctly) this will translate a dynamic "get current time" PostScript command, execute it, then embed it as a static date in the resulting PDF?

So for example, readers of the .ps will see different dates if they read it at different times, but readers of the .pdf will see the date at which ps2pdf was invoked? Is that correct?
Comment 8 Ken Sharp 2016-05-10 11:37:20 UTC
(In reply to infinity0 from comment #6)

> Would it be possible to omit CreationDate/ModDate when SOURCE_DATE_EPOCH is
> nonempty, *without* requiring an extra command-line flag?

Not easily, because the pdfwrite device is (or should be!) abstracted from the OS, so it doesn't use getenv, or have any way to access it. Which is why I suggest a command line parameter.


> Of course nothing should affect the time operators in PostScript (and it
> will probably not affect R-B) - but I'd like to point out that, it's
> certainly possible to decouple PDF CreationDate/ModDate from PS time
> operator interpretation, so that honouring SOURCE_DATE_EPOCH doesn't affect
> PostScript at all.

Yes it is possible, and in fact that *is* the way it's currently done, but wouldn't be after your patch :-) Currently the pdfwrite code isn't using the OS-abstracted time function, which it absolutely should be (I don't know how that got missed). Once that is fixed, the PostScript time operator and the pdfwrite CreationDate code will use the same code, so if you affect one, you affect both.

I certainly do intend to alter the way pdfwrite is currently getting the time, it should be using the abstracted functions.


(In reply to infinity0 from comment #7)
> > Of course nothing should affect the time operators in PostScript (and it
> > will probably not affect R-B)
> 
> Hmm, actually I just reviewed some of our packages and I think I am wrong
> here. Some of them use ps2pdf to build pdfs,

What else would they be using Ghostscript for ?

Note that Ghostscript's PDF interpreter is actually *written* in PostScript. So even if the input is PDF, you still are using the PostScript interpreter.


> and (if I understand correctly)
> this will translate a dynamic "get current time" PostScript command, execute
> it, then embed it as a static date in the resulting PDF?

Potentially yes, but it can be significantly more complex than that: you could (dumb example) choose to run a totally different set of routines in the afternoon to the ones in the morning. Of course that would still produce the same output on 2 machines with the same date/time. The time doesn't *have* to be written (or rendered) to the output, it can be used like any other input, to alter the behaviour of the program.

However, as I said, I've seen a widely used test file which fails if two consecutive calls to the PostScript time function return the same time.

Are you also aware of the PostScript rand operator? I've also seen a test file which uses that, so the output is comparatively non-deterministic (you would need to ensure that the pseudo random number generator was seeded the same way each time to get consistent results).

I'm aware that this isn't an issue for your purposes, but it is for us. The PostScript interpreter would not be performing as per the specification when your environment variable is set.


> So for example, readers of the .ps will see different dates if they read it
> at different times, but readers of the .pdf will see the date at which
> ps2pdf was invoked? Is that correct?

The PDF could contain different text representing a date or time (or indeed anything could change) but yes it will depend on the time when Ghostscript was executed. Each run of the PostScript program would result in different output, potentially.

This is a known problem for us with the Quality Logic test suite, where many of the tests use the time operators to print the date/time or to give an elapsed time, which is printed on the output.


Seems to me that your best bet is going to be to continue patching Ghostscript. I will discuss this again with the other developers but I don't think this is a route we want to take.
Comment 9 infinity0 2016-05-10 12:35:44 UTC
(In reply to Ken Sharp from comment #8)
> (In reply to infinity0 from comment #6)
> 
> > Would it be possible to omit CreationDate/ModDate when SOURCE_DATE_EPOCH is
> > nonempty, *without* requiring an extra command-line flag?
> 
> Not easily, because the pdfwrite device is (or should be!) abstracted from
> the OS, so it doesn't use getenv, or have any way to access it. Which is why
> I suggest a command line parameter.

The code that reads the command line parameter could read the environment variable instead? At least, I've never seen abstractions that separate these two things into separate layers.

> Are you also aware of the PostScript rand operator ? I've also seen a test
> file which uses that too, so the output is comparatively non-deterministic
> (you would need to ensure that the pseudo random number generator was seeded
> the same way each time to get consistent results).
> 

Yes, we're aware of other sources of non-determinism. However this timestamp issue is by far the largest issue (as a whole, not just ghostscript), and further typically when people use it they don't *really* mean "the build time".  So for cost efficiency reasons, we prefer SOURCE_DATE_EPOCH to get reproducible timestamps, but we're ok with specific patches for other sources of non-determinism.

> Seems to me that your best bet is going to be to continue patching
> Ghostscript. I will discuss this again with the other developers but I don't
> think this is a route we want to take.

I understand, no worries.

I've chatted with the rest of the team and have a few further suggestions though, perhaps they would be more acceptable:

> However, as I said, I've seen a widely used test file which fails if two consecutive
> calls to the PostScript time function return the same time.
> 
> [..]
> 
> I'm aware that this isn't an issue for your purposes, but it is for us. The
> PostScript interpreter would not be performing as per the specification when
> your environment variable is set.
> 

In terms of the PostScript specification:

> realtime
> – realtime int
> returns the value of a clock that counts in real time, independently of the
> execution of the PostScript interpreter. The clock’s starting value is arbitrary; it has
> no defined meaning in terms of calendar time. The unit of time represented by
> the realtime value is one millisecond. However, the rate at which it changes is
> implementation-dependent. As the time value becomes greater than the largest
> integer allowed in a particular implementation, it “wraps” to the smallest (most
> negative) integer.

So, this is quite generous, and could be made consistent with SOURCE_DATE_EPOCH. This definition does not say it has to be consistent with any external or "real" system clocks (and in fact many kernels offer multiple clocks such as monotonic wrappers around other clocks). There are a few options forward:

1. When S_D_E is set, then use this as the starting value of the clock. The definition above specifically allows this. This doesn't solve R-B if a particular invocation of ps2pdf has high variance in how long it runs, but see #2.

2. As per (1), but also simply increment the value by 1 each time realtime is called, as opposed to using the system clock to measure "milliseconds". This is a more generous interpretation of "millisecond" but the spec also says "rate at which it changes is implementation-dependent" so nobody should be relying on this value to actually represent real milliseconds.

Both of these would be a little complex, but we'd be happy to write this if you don't want to yourselves.
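
As a rough sketch of what option 2 could look like (in Python for brevity, not a proposed patch): the class name DeterministicRealtime is made up, and the signed 32-bit integer range is an assumption about the implementation's integer size.

```python
INT_MIN = -(2 ** 31)  # assumed: signed 32-bit PostScript integers
INT_MAX = 2 ** 31 - 1


class DeterministicRealtime:
    """Counter standing in for realtime when SOURCE_DATE_EPOCH is set."""

    def __init__(self, source_date_epoch):
        # Option 1: the starting value is taken from SOURCE_DATE_EPOCH.
        self._value = self._wrap(source_date_epoch)

    @staticmethod
    def _wrap(value):
        # Fold into the signed range, so that stepping past the largest
        # integer wraps to the most negative one, as the definition says.
        return (value - INT_MIN) % (2 ** 32) + INT_MIN

    def realtime(self):
        # Option 2: advance by one per call instead of reading the clock.
        current = self._value
        self._value = self._wrap(current + 1)
        return current
```

Two consecutive calls never return the same value, so a test that divides by the difference between two reads would still get a non-zero elapsed time, and the sequence is identical on every run.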
Comment 10 Ken Sharp 2016-05-11 00:10:05 UTC
(In reply to infinity0 from comment #9)

> The code that reads the command line parameter could read the environment
> variable instead? At least, I've never seen abstractions that separate these
> two things into separate layers.

The two are totally different: the command line parameters are parsed off into PostScript. This is not OS-dependent, so it's cross-platform. Environment variables are OS-specific, so this is all in the platform-specific code.


> So, this is quite generous, and could be made consistent with
> SOURCE_DATE_EPOCH. This definition does not say it has to be consistent with
> any external or "real" system clocks (and in fact many kernels offer
> multiple clocks such as monotonic wrappers around other clocks). There are a
> few options forward:
> 
> 1. When S_D_E is set, then use this as the starting value of the clock. The
> definition above specifically allows this. This doesn't solve R-B if a
> particular invocation of ps2pdf has high variance in how long it runs, but
> see #2.

The pdfwrite code is very variable, even on the same machine, in its timings. Of course loading on the machine also affects this.

 
> 2. As per (1), but also simply increment the value by 1 each time realtime
> is called, as opposed to using the system clock to measure "milliseconds".
> This is a more generous interpretation of "millisecond" but the spec also
> says "rate at which it changes is implementation-dependent" so nobody should
> be relying on this value to actually represent real milliseconds.
> 
> Both of these would be a little complex, but we'd be happy to write this if
> you don't want to yourselves.

I'm pretty confident we won't adopt this approach; again, it affects the operation of the time functions, and is still more complex. As I said, I'll put it up again for discussion amongst the other developers.
Comment 11 Chris Liddell (chrisl) 2016-05-11 00:58:33 UTC
Personally, I am wary of something that could easily be seen as enabling fraudulent information (metadata) to be embedded in a PDF file. I've seen, on more than one occasion, the CreationDate and ModDate cited as "evidence" (for example, for timely completion of forms etc).

Whilst it is true that some PDF internal knowledge makes it feasible to change the dates, thus not exactly reliable evidence, it still feels worrying to be seen to be condoning the faking of such meta-data.

Hence my suggestion to Ken that we offer to disable the writing of those dates instead - I would *much* rather see the information not being written than fake (potentially fraudulent) information being written.

(NOTE: there is precedent for this type of thing: for example, eexec encryption for Type1 fonts is almost trivial for most developers to implement, but we avoid making it easily accessible, since we do not want to be seen to be enabling theft of glyph outlines).

WRT specifying extra command line options when gs is used by another package (either by execing the executable, or calling to the .so library), you can use the environment variable "GS_OPTIONS" to pass options to any gs instance executed in that environment - documented here:
http://www.ghostscript.com/doc/9.19/Use.htm#Environment_variables
Comment 12 infinity0 2016-05-11 02:31:37 UTC
(In reply to Chris Liddell (chrisl) from comment #11)
> Personally, I am wary of something that could easily be seen as enabling
> fraudulent information (metadata) to be embedded in a PDF file. I've seen,
> on more than one occasion, the CreationDate and ModDate cited as "evidence"
> (for example, for timely completion of forms etc).
> 
> Whilst it is true that some PDF internal knowledge makes it feasible to
> change the dates, thus not exactly reliable evidence, it still feels
> worrying to be seen to be condoning the faking of such meta-data.
> 
> Hence my suggestion to Ken that we offer to disable the writing of those
> dates instead - I would *much* rather see the information not being written
> than fake (potentially fraudulent) information being written.
> 

My text editor does not prevent me from writing "I wrote this on 1901-01-01"; your reasoning here is the same as this. And as I said before, anyone running the build can set their clock arbitrarily for a similar effect.

Refusing to code software to write a certain pattern of bits *is not security*. Even if *you* don't write this code, someone with a reason to write this information - such as us, the R-B people - will write this code. It is not "fraudulent" and I'm a little offended by this association.

It is these sorts of "false security" arguments propagating that make non-technical people think software in general is more secure than it really is. Securely stating the time would require some sort of cryptographic ledger protocol to link events on a global scale. For example bitcoin can be thought of as providing this security property.

Plain standalone timestamps inherently are not protectable by any mechanism, and just because some court thought so in a particular scenario with extra constraints that we don't know about, does not mean that software developers can or should assume this is OK for all scenarios.

> WRT to specifying extra command line options when gs is used by another
> package (either by execing the executable, or calling to the .so library),
> you can use the environment variable "GS_OPTIONS" to pass options to any gs
> instance executed in that environment - documented here:
> http://www.ghostscript.com/doc/9.19/Use.htm#Environment_variables

The issue here is that then we would have to add GS-specific settings to get the same effect. The point of SOURCE_DATE_EPOCH is that people who want reproducible builds don't need to have intimate knowledge of all the 3rd-party tools that their software uses.

(In reply to Ken Sharp from comment #10)
> (In reply to infinity0 from comment #9)
> 
> > The code that reads the command line parameter could read the environment
> > variable instead? At least, I've never seen abstractions that separate these
> > two things into separate layers.
> 
> The two are totally different, the command line parameters are parsed off
> into PostScript. This is not OS-dependent, so it's cross-platform.
> Environment variables are OS-specific, so this is all in the
> platform-specific code.
> 

It looks like GS_OPTIONS is OS independent, so the code that reads GS_OPTIONS could also read SOURCE_DATE_EPOCH and prepend --no-output-timestamps (or whatever you decide) to GS_OPTIONS if S_D_E is non-empty?
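
A minimal sketch of that idea, written in Python rather than Ghostscript's own C purely for illustration. The switch name `--no-output-timestamps` is only the placeholder used in this thread, not an existing Ghostscript option, and `effective_gs_options` is a hypothetical helper:

```python
# Hypothetical switch name from this thread; not an existing Ghostscript option.
NO_TIMESTAMPS_FLAG = "--no-output-timestamps"


def effective_gs_options(environ):
    """Return the GS_OPTIONS string after applying the SOURCE_DATE_EPOCH rule."""
    options = environ.get("GS_OPTIONS", "")
    if environ.get("SOURCE_DATE_EPOCH", ""):  # variable is set and non-empty
        # Prepend the switch so any user-supplied options still follow it.
        options = (NO_TIMESTAMPS_FLAG + " " + options).strip()
    return options


print(effective_gs_options({"SOURCE_DATE_EPOCH": "1462847479", "GS_OPTIONS": "-dQUIET"}))
```

Taking the environment as a mapping argument (instead of reading `os.environ` directly) keeps the rule testable without touching the real process environment.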

> > So, this is quite generous, and could be made consistent with
> > SOURCE_DATE_EPOCH. This definition does not say it has to be consistent with
> > any external or "real" system clocks (and in fact many kernels offer
> > multiple clocks such as monotonic wrappers around other clocks). There are a
> > few options forward:
> > 
> > 1. When S_D_E is set, then use this as the starting value of the clock. The
> > definition above specifically allows this. This doesn't solve R-B if a
> > particular invocation of ps2pdf has high variance in how long it runs, but
> > see #2.
> 
> The pdfwrite code is very variable, even on the same machine, in its
> timings. Of course loading on the machine also affects this.
> 
>  
> > 2. As per (1), but also simply increment the value by 1 each time realtime
> > is called, as opposed to using the system clock to measure "milliseconds".
> > This is a more generous interpretation of "millisecond" but the spec also
> > says "rate at which it changes is implementation-dependent" so nobody should
> > be relying on this value to actually represent real milliseconds.
> > 
> > Both of these would be a little complex, but we'd be happy to write this if
> > you don't want to yourselves.
> 
> I'm pretty confident we won't adopt this approach, again it affects the
> operation of the time functions, and is still more complex. As I said I'll
> put it up again for discussion amongst the other developers.

OK, let me know how it goes. I was thinking you could just have a static variable inside the function and increment that, so it wouldn't take up too many lines. Yes it affects the operation of the function, but it is still within what the spec states.
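
A rough sketch of the counter idea, in Python for brevity rather than as a patch against the Ghostscript sources. `SteppingClock` is a hypothetical name, and seeding the counter with the raw SOURCE_DATE_EPOCH value (seconds) is an assumption about units, since the spec leaves the starting value and rate implementation-dependent:

```python
import time


class SteppingClock:
    """Counter-based stand-in for a realtime clock (illustrative, not Ghostscript code)."""

    def __init__(self, environ):
        epoch = environ.get("SOURCE_DATE_EPOCH", "")
        # None means "no override": behave like a normal clock.
        self._counter = int(epoch) if epoch else None

    def realtime(self):
        if self._counter is None:
            # Variable unset or empty: fall back to the system clock in milliseconds.
            return int(time.time() * 1000)
        # Deterministic mode: advance by one tick per call, ignoring wall time.
        self._counter += 1
        return self._counter


clock = SteppingClock({"SOURCE_DATE_EPOCH": "1462847479"})
first = clock.realtime()
second = clock.realtime()
print(second - first)
```

Because every call advances the counter, two successive readings are never equal, so a program that divides by the elapsed time would not hit a zero divisor in this mode.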
Comment 13 jsmeix 2016-05-11 03:37:24 UTC
I would like to share my personal opinion here:

Personally I am against the underlying idea behind
things like SOURCE_DATE_EPOCH (as far as I understand it).

In general I am against the idea that to achieve "whatever"
all software has to be changed.

From my point of view such an approach will never succeed
because there will always come up more new software that
does not care about "whatever" so that there is an endless
(and hopeless) fight to get "all software right".

Now you fix Ghostscript because that is currently used
by some other software at compile time to make documentation
(why the heck don't they provide their documentation also
in a final ready-to-read form in their sources?)
but some time later they do no longer use Ghostscript
because they switched to the new great "FancyDOC" tool
which makes your reproducible builds fail until
you got "FancyDOC" fixed and so on ad nauseam.
Why not fix how that other software makes its documentation?

Such kind of approach was tried several years ago in SUSE
(I think it was more than 10 years ago).
It never succeeded until it died out.

Bottom line:
From my point of view the idea to implement support
for SOURCE_DATE_EPOCH in all software is a dead concept.



In contrast I think the Ghostscript authors are right
that an appropriate Ghostscript command line option
to suppress time-related output or any random output
is the right way.

This way Ghostscript could be called with that option set
to achieve identical output from identical input
which is (as far as I understand it) what is
actually needed for reproducible builds.

But I think the right Ghostscript command line option for
reproducible builds should not be only "--no-output-timestamps"
but more generally it should be something
like "--no-runtime-dependant-output"
so that for the same input there is always the same output
regardless when (time, date, random number generator, ...)
or in what environment (operating system, architecture, ...)
Ghostscript was run.
Comment 14 infinity0 2016-05-11 05:54:34 UTC
(In reply to jsmeix from comment #13)
> In general I am against the idea that to achieve "whatever"
> all software has to be changed.
> 
> From my point of view such an approach will never succeed
> because there will always come up more new software that
> does not care about "whatever" so that there is an endless
> (and hopeless) fight to get "all software right".
> 

There's a misunderstanding here - with SOURCE_DATE_EPOCH, we're specifically *not* "changing all software" - we're only changing the software which is the root of each particular instance of the issue, i.e. the code that is actually generating timestamps.

> Now you fix Ghostscript because that is currently used
> by some other software at compile time to make documentation
> (why the heck don't they provide their documentation also
> in a final ready-to-read form in their sources?)
> but some time later they do no longer use Ghostscript
> because they switched to the new great "FancyDOC" tool
> which makes your reproducible builds fail until
> you got "FancyDOC" fixed and so on ad nauseam.
> Why not fix how that other software makes its documentation?
> 
> Such kind of approach was tried several years ago in SUSE
> (I think it was more than 10 years ago).
> It never succeeded until it died out.
> 
> Bottom line:
> From my point of view the idea to implement support
> for SOURCE_DATE_EPOCH in all software is a dead concept.
> 

Your argument can be generalised to argue that any ecosystem-wide change is a dead concept, anywhere. But we see ecosystem-wide changes all the time, so your argument must be incorrect.

The more realistic view is that all ecosystem-wide changes are made in the *hope* that others will follow that change. Indeed, the more likely scenario is that newer people writing software see this discussion, understand that "get current date" does not make sense during build processes, and support SOURCE_DATE_EPOCH instead.

GCC, doxygen, sphinx and several other projects are already supporting SOURCE_DATE_EPOCH, so we have some momentum.

> In contrast I think the Ghostscript authors are right
> that an appropriate Ghostscript command line option
> to suppress time-related output or any random output
> is the right way.
> 
> This way Ghostscript could be called with that option set
> to achieve identical output from identical input
> which is (as far as I understand it) what is
> actually needed for reproducible builds.
> 
> But I think the right Ghostscript command line option for
> reproducible builds should not be only "--no-output-timestamps"
> but more generally it should be something
> like "--no-runtime-dependant-output"
> so that for the same input there is always the same output
> regardless when (time, date, random number generator, ...)
> or in what environment (operating system, architecture, ...)
> Ghostscript was run.

If every tool chooses to implement its own specific method to implement *the same behaviour*, then of course your original assertion of "changing all software" becomes a self-fulfilling prophecy. *That is exactly why* we designed SOURCE_DATE_EPOCH in the first place.

In terms of "lying about the time", it is perfectly reasonable to take the position "if SOURCE_DATE_EPOCH is set then we will effectively treat this as the current time". The system administrator has made a specific choice to use SOURCE_DATE_EPOCH, they are giving you permission to do this. [1] It's not like SOURCE_DATE_EPOCH can be accidentally set for no reason. They *could* have set their own system clock instead, but SOURCE_DATE_EPOCH is technically more effective and more predictable, for reasons I mentioned earlier. 

I am sorry for replying so much, and I will accept any decision that the GhostScript developers make, but I just wanted to respond to arguments/points that I believe to be inaccurate or missing our point or misunderstanding our intentions.

[1] There are some corner cases, but I don't see that they apply to GhostScript. I'll go into them elsewhere; trying to keep this response short.
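
For illustration, "treat SOURCE_DATE_EPOCH as the current time" could look roughly like the following Python sketch, which formats a PDF-style date string in UTC. `pdf_date` is a hypothetical helper, and this is not how the attached patch is written:

```python
import time
from datetime import datetime, timezone


def pdf_date(environ):
    """Return a PDF date string (D:YYYYMMDDHHmmSSZ) for the build or current time."""
    epoch = environ.get("SOURCE_DATE_EPOCH", "")
    # Use the externally supplied timestamp if present, else the system clock.
    seconds = int(epoch) if epoch else int(time.time())
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("D:%Y%m%d%H%M%SZ")


print(pdf_date({"SOURCE_DATE_EPOCH": "0"}))  # prints D:19700101000000Z
```

Formatting in UTC rather than local time matters here: a local-time stamp would make the output depend on the builder's timezone again.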
Comment 15 jsmeix 2016-05-11 07:11:24 UTC
I fully agree with you that the crucial factor
whether or not an ecosystem-wide change succeeds
is whether or not more will follow that change
than those who will not follow that change.

It is only my personal opinion that I think it will
not succeed to get SOURCE_DATE_EPOCH support sufficiently
in all relevant software that is needed for reproducible builds.


Back to the actual problem:

I think you mean the following (on one of my machines):
-------------------------------------------------------------------------
# date ; echo -e '%!\n100 100 moveto 200 300 lineto stroke showpage' \
 | ps2pdf - line1.pdf
Wed 11 May 15:13:58 CEST 2016

# date ; echo -e '%!\n100 100 moveto 200 300 lineto stroke showpage' \
 | ps2pdf - line2.pdf
Wed 11 May 15:14:04 CEST 2016

# diff -q line1.pdf line2.pdf
Files line1.pdf and line2.pdf differ

# pdfinfo line1.pdf | head -n3
Producer:       GPL Ghostscript RELEASE CANDIDATE 1 9.19
CreationDate:   Wed May 11 15:13:58 2016
ModDate:        Wed May 11 15:13:58 2016

# pdfinfo line2.pdf | head -n3
Producer:       GPL Ghostscript RELEASE CANDIDATE 1 9.19
CreationDate:   Wed May 11 15:14:04 2016
ModDate:        Wed May 11 15:14:04 2016
-------------------------------------------------------------------------

I.e. for identical PostScript input
-------------------------------------------------------------------
%!
100 100 moveto 200 300 lineto stroke showpage
-------------------------------------------------------------------
Ghostscript (via its pdfwrite device) creates different output
because it creates PDF metadata with different timestamps
and with the currently used Ghostscript version.

Accordingly I think the resulting question is
how to let Ghostscript create PDF without metadata
that depends on usually unimportant runtime values.

Obviously, SOURCE_DATE_EPOCH support alone in Ghostscript
would still result in different Ghostscript PDF output when
any different Ghostscript version is used
(i.e. also for any minor version change).

I don't know if it is intended for reproducible builds
when any different Ghostscript version results in
a different PDF output that only differs in its metadata?

It is currently possible to specify hardcoded
PDF metadata in a file "pdfmeta" with content like:
-------------------------------------------------------------
[ /Title (none)
  /Author (none)
  /Subject (none)
  /Keywords (none)
  /ModDate (0)
  /CreationDate (0)
  /Creator (none)
  /Producer (none)
  /DOCINFO pdfmark
-------------------------------------------------------------

Then call Ghostscript with that as additional input like
------------------------------------------------------------------------------
# echo -e '%!\n100 100 moveto 200 300 lineto stroke showpage' \
 | gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=line3.pdf - pdfmeta


# pdfinfo line3.pdf
Title:          none
Subject:        none
Keywords:       none
Author:         none
Creator:        none
Producer:       none
CreationDate:   0
ModDate:        0
...

# echo -e '%!\n100 100 moveto 200 300 lineto stroke showpage' \
 | gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=line4.pdf - pdfmeta


# pdfinfo line4.pdf
Title:          none
Subject:        none
Keywords:       none
Author:         none
Creator:        none
Producer:       none
CreationDate:   0
ModDate:        0
...
------------------------------------------------------------------------------
cf. "Embedding PDFmarks" at
http://milan.kupcevic.net/ghostscript-ps-pdf/

But unfortunately that alone does not help:
------------------------------------------------------------------------------
# diff -q line3.pdf line4.pdf
Files line3.pdf and line4.pdf differ

# diff -aU0 line3.pdf line4.pdf | cut -b-76
--- line3.pdf   2016-05-11 15:49:20.990174454 +0200
+++ line4.pdf   2016-05-11 15:49:27.882294106 +0200
@@ -45 +45 @@
-<rdf:Description rdf:about='uuid:b5d92b71-4f9b-11f1-0000-9b0914259d8d' xmln
+<rdf:Description rdf:about='uuid:ba0548f1-4f9b-11f1-0000-9b0914259d8d' xmln
@@ -48 +48 @@
-<rdf:Description rdf:about='uuid:b5d92b71-4f9b-11f1-0000-9b0914259d8d' xmln
+<rdf:Description rdf:about='uuid:ba0548f1-4f9b-11f1-0000-9b0914259d8d' xmln
@@ -51,2 +51,2 @@
-<rdf:Description rdf:about='uuid:b5d92b71-4f9b-11f1-0000-9b0914259d8d' xmln
-<rdf:Description rdf:about='uuid:b5d92b71-4f9b-11f1-0000-9b0914259d8d' xmln
+<rdf:Description rdf:about='uuid:ba0548f1-4f9b-11f1-0000-9b0914259d8d' xmln
+<rdf:Description rdf:about='uuid:ba0548f1-4f9b-11f1-0000-9b0914259d8d' xmln
@@ -83 +83 @@
-/ID [<91D25DAA8329AF4695BA72BB2C411C1C><91D25DAA8329AF4695BA72BB2C411C1C>]
+/ID [<76DB5A97627D81D78B6B06E3229B9D37><76DB5A97627D81D78B6B06E3229B9D37>]
------------------------------------------------------------------------------
There are still some special UUIDs and IDs in the PDF :-(

I assume that SOURCE_DATE_EPOCH support in Ghostscript
will not be sufficient to get identical PDF as output
from identical PostScript input.

<sarcasm>
Welcome to the wonderful world of PDF!
</sarcasm>

Perhaps it is really easier to fix that other software
that creates its static PDF documentation anew each time
when it is compiled that they simply provide all their
static documentation also ready-for-use in their sources
in addition to their original sources like LaTeX sources
or whatever they use for their documentation.

Furthermore this usually saves tons of build resources
because one does no longer need the full stack of
documentation processing tools in the build system
and one does no longer need to run all those various
usually resource hungry documentation generating tools
each time when the software is compiled to only
generate again and again same static documentation.
Comment 16 Ken Sharp 2016-05-11 07:30:42 UTC
(In reply to jsmeix from comment #15)

> It is currently possible to specify hardcoded
> PDF metadata in a file "pdfmeta" with content like:
> -------------------------------------------------------------
> [ /Title (none)
>   /Author (none)
>   /Subject (none)
>   /Keywords (none)
>   /ModDate (0)
>   /CreationDate (0)
>   /Creator (none)
>   /Producer (none)
>   /DOCINFO pdfmark
> -------------------------------------------------------------
> 
> Then call Ghostscript with that as additional input like

This approach has already been rejected.


> There are still some special UUIDs and IDs in the PDF :-(

These are also generated from the time value, so if you hack the time, the values remain constant. Of course this is another good reason for us not to support hacking the system time, the UUIDs *should* be different.

I was investigating a compromise approach, but it would not address the UUIDs. The sticking point for us is hacking the clock, we aren't prepared to break the PostScript operators for this.

So my conclusion is that you should carry on patching Ghostscript, I don't see any way forward which will satisfy both the Ghostscript team and the Reproducible Builds team.
Comment 17 jsmeix 2016-05-11 07:47:41 UTC
FYI:

For openSUSE and even more for SUSE Linux Enterprise
I will continue to keep our ghostscript RPM packages
in full compliance with Ghostscript upstream
which means that
I will not accept patches for SUSE ghostscript RPM packages
that support hacking the system time.

As a consequence for reproducible builds at SUSE
each particular other software would have to be adapted
if it calls SUSE's upstream compliant Ghostscript to make PDFs.
The presumably best adaptation is when each software provides
all its static documentation as source files.

Alternatively of course someone else could maintain
any kind of "hacked Ghostscript" for SUSE as he likes ;-)
Comment 18 infinity0 2016-05-11 07:49:02 UTC
(In reply to jsmeix from comment #15)
> I don't know if it is intended for reproducible builds
> when any different Ghostscript version results in
> a different PDF output that only differs in its metadata?
> 

This is fine in general - e.g. different compiler versions will produce different output - although it's always best to avoid minor differences where possible. What we're building would add this metadata in a separate file so that it's not lost, but it's not part of the "installed artifact" that users directly consume, which can be compared between separate builders.

> There are still some special UUIDs and IDs in the PDF :-(
> 
> I assume that SOURCE_DATE_EPOCH support in Ghostscript
> will not be sufficient to get identical PDF as output
> from identical PostScript input.
> 

As Ken mentioned, one does get identical UUID output if one fixes the time values.

> Perhaps it is really easier to fix that other software
> that creates its static PDF documentation anew each time
> when it is compiled that they simply provide all their
> static documentation also ready-for-use in their sources
> in addition to their original sources like LaTeX sources
> or whatever they use for their documentation.
> 
> Furthermore this usually saves tons of build resources
> because one does no longer need the full stack of
> documentation processing tools in the build system
> and one does no longer need to run all those various
> usually resource hungry documentation generating tools
> each time when the software is compiled to only
> generate again and again same static documentation.

This wouldn't be acceptable from a FOSS point of view - we generally want even documentation in the "preferred form for modification" and PDFs are not that.

(In reply to Ken Sharp from comment #16)
> These are also generated from the time value, so if you hack the time, the
> values remain constant. Of course this is another good reason for us not to
> support hacking the system time, the UUIDs *should* be different.
> 
> I was investigating a compromise approach, but it would not address the
> UUIDs. The sticking point for us is hacking the clock, we aren't prepared to
> break the PostScript operators for this.
> 

Out of interest, what was the compromise approach? If the rest of the file is the same, do the UUIDs really need to be different?

> So my conclusion is that you should carry on patching Ghostscript, I don't
> see any way forward which will satisfy both the Ghostscript team and the
> Reproducible Builds team.

Alright, thanks for trying, and for the detailed discussion.
Comment 19 Ken Sharp 2016-05-11 08:06:37 UTC
(In reply to infinity0 from comment #18)

> > I was investigating a compromise approach, but it would not address the
> > UUIDs. The sticking point for us is hacking the clock, we aren't prepared to
> > break the PostScript operators for this.
> > 
> 
> Out of interest, what was the compromise approach? If the rest of the file
> is the same, do the UUIDs really need to be different?

Creating a command line switch to prevent emission of the dates, then using the (Ghostscript extension to the PostScript language) getenv operator to interrogate the system for the presence of the environment variable, and having that set the command line parameter. As I mentioned previously these are, in effect, translated into PostScript, so the command line parameters can be read and set by the PostScript interpreter.

This would have, in effect, converted the environment variable into a command line switch and prevented the pdfwrite device from emitting the dates when that environment variable was present. It's all cross-platform, wouldn't have resulted in incorrect creation times and wouldn't have affected the operation of the time operators.

But having to also squash the UUIDs is just too much. As noted, these should *not* be the same, it's really an error (though very minor I grant) to have them be the same, unique is supposed to mean unique.


> Alright, thanks for trying, and for the detailed discussion.

Not that it will affect you, but the patch as it stands will put the wrong timestamp on PDF files when built on Windows, even in the absence of the environment variable.
Comment 20 jsmeix 2016-05-11 08:07:00 UTC
Regarding "documentation in the preferred form for modification":

Intentionally I wrote "in addition to their original sources".

Would you also reject Makefile and Makefile.in to be
provided in the sources in addition to Makefile.am ?
Comment 21 infinity0 2016-05-11 08:22:18 UTC
(In reply to jsmeix from comment #20)
> Regarding "documentation in the preferred form for modification":
> 
> Intentionally I wrote "in addition to their original sources".
> 
> Would you also reject Makefile and Makefile.in to be
> provided in the sources in addition to Makefile.am ?

Sorry my bad, I skimmed over that. For FOSS purposes that is fine, yes. But for our R-B verification purposes this wouldn't be sufficient. Your suggestion might "tick the box" but it would basically be cheating, so we wouldn't want to pursue that option.
Comment 22 James Cloos 2016-05-11 14:57:56 UTC
One thought:

Other software generating pdfs have started (or already did) accepting options specifying exactly what metadata to add to the resulting file.

There should not be any problem w/ gs doing that, too.

(In fact, can't a bit of extra ps code do that already, anyway?  Ie a -c snippet before the src files?)

That would allow static creation et al dates to be put in the output pdf files.
Comment 23 Ken Sharp 2016-05-12 00:12:22 UTC
(In reply to James Cloos from comment #22)

> (In fact, can't a bit of extra ps code do that already, anyway?  Ie a -c
> snippet before the src files?)

You mean a pdfmark which sets the DocInfo metadata.
 
> That would allow static creation et al dates to be put in the output pdf
> files.

That idea was rejected as well. The requirement from Reproducible Builds is that the environment variable is the *only* control.
Comment 24 Ken Sharp 2016-05-12 00:18:37 UTC
(In reply to Ken Sharp from comment #23)

> > (In fact, can't a bit of extra ps code do that already, anyway?  Ie a -c
> > snippet before the src files?)
> 
> You mean a pdfmark which sets the DocInfo metadata.
>  
> > That would allow static creation et al dates to be put in the output pdf
> > files.
> 
> That idea was rejected as well. The requirement from Reproducible Builds is
> that the environment variable is the *only* control.

Pressed 'save changes' too quick....

I can eliminate the problem with the CreationDate and ModDate by various means, but this still leaves the problem of UUIDs, which are also generated from the time and which cannot be overridden with a pdfmark, and which would also be required to be identical.

Which is where I gave up.
Comment 25 infinity0 2016-05-12 02:10:04 UTC
(In reply to Ken Sharp from comment #24)
> (In reply to Ken Sharp from comment #23)
> 
> > > (In fact, can't a bit of extra ps code do that already, anyway?  Ie a -c
> > > snippet before the src files?)
> > 
> > You mean a pdfmark which sets the DocInfo metadata.
> >  
> > > That would allow static creation et al dates to be put in the output pdf
> > > files.
> > 
> > That idea was rejected as well. The requirement from Reproducible Builds is
> > that the environment variable is the *only* control.

A command-line option, although not saving R-B too much cost, would still be useful for those other projects that use ghostscript directly. If they want to think about reproducing their builds, this would become possible for them with an unpatched ghostscript. (They would have to avoid realtime in their PS input; also there remains the UUID issue.)

You could do that and ignore our preference for SOURCE_DATE_EPOCH. I was just making points on why the latter is preferred, i.e. it would not save much global cost if *everyone* chose to ignore it.

> I can eliminate the problem with the CreationDate and ModDate by various
> means, but this still leaves the problem of UUIDs, which are also generated
> from the time and which cannot be overridden with a pdfmark, and which would
> also be required to be identical.
> 
> Which is where I gave up.

I understand this direction. If you're interested though, we did think through these topics ourselves a year or so ago, and our conclusion is like this:

Yes, perhaps on a surface level making these things constant (timestamps and UUIDs) might seem like "lying" or breaking some intuitive semantics of how unique they should be. But if we step back a bit and ask, what really is the *purpose* of these pieces of information? For UUIDs it is meant to be an easy way to distinguish two documents that are different. But if A.pdf and B.pdf are otherwise identical *except* for the UUID, what is the point of them being different?

More abstractly: uniqueness/constantness is relative, it is always *given* something. If I'm running ghostscript in a VM and I clone the VM, I would get the same UUID in both cases. What we're saying is that UUIDs should be unique, *given* useful (less redundant) pieces of information. Instead of UUID = f ( ghostscript version, input.ps, timestamp ), we think it's better if UUID = f ( ghostscript version, input.ps ).
Comment 26 Stefan Brüns 2017-11-16 13:09:57 UTC
(In reply to infinity0 from comment #25)
> (In reply to Ken Sharp from comment #24)
> > (In reply to Ken Sharp from comment #23)
> > 
> > > > (In fact, can't a bit of extra ps code do that already, anyway?  Ie a -c
> > > > snippet before the src files?)
> > > 
> > > You mean a pdfmark which sets the DocInfo metadata.
> > >  
> > > > That would allow static creation et al dates to be put in the output pdf
> > > > files.
> > > 
> > > That idea was rejected as well. The requirement from Reproducible Builds is
> > > that the environment variable is the *only* control.
> 
> A command-line option, although not saving R-B too much cost, would still be
> useful for those other projects that use ghostscript directly. If they want
> to think about reproducing their builds, this would become possible for them
> with an unpatched ghostscript. (They would have to avoid realtime in their
> PS input; also there remains the UUID issue.)
> 
> You could do that and ignore our preference for SOURCE_DATE_EPOCH. I was
> just making points on why the latter is preferred, i.e. it would not save
> much global cost if *everyone* chose to ignore it.
> 
> > I can eliminate the problem with the CreationDate and ModDate by various
> > means, but this still leaves the problem of UUIDs, which are also generated
> > from the time and which cannot be overridden with a pdfmark, and which would
> > also be required to be identical.
> > 
> > Which is where I gave up.
> 
> I understand this direction. If you're interested though, we did think
> through these topics ourselves a year or so ago, and our conclusion is like
> this:
> 
> Yes, perhaps on a surface level making these things constant (timestamps and
> UUIDs) might seem like "lying" or breaking some intuitive semantics of how
> unique they should be. But if we step back a bit and ask, what really is the
> *purpose* of these pieces of information? For UUIDs it is meant to be an
> easy way to distinguish two documents that are different. But if A.pdf and
> B.pdf are otherwise identical *except* for the UUID, what is the point of
> them being different?
> 
> More abstractly: uniqueness/constantness is relative, it is always *given*
> something. If I'm running ghostscript in a VM and I clone the VM, I would
> get the same UUID in both cases. What we're saying is that UUIDs should be
> unique, *given* useful (less redundant) pieces of information. Instead of
> UUID = f ( ghostscript version, input.ps, timestamp ), we think it's better
> if UUID = f ( ghostscript version, input.ps ).

I don't know why it has not been mentioned yet:
https://www.ghostscript.com/doc/current/Ps2pdf.htm#Options

-sDocumentUUID=string
    Defines a DocumentID to be included into the document Metadata. [...]
    Note that Adobe XMP specification requires DocumentID must be same for all versions of a document. Since Ghostscript does not provide a maintenance of document versions, users are responsible to provide a correct UUID through this parameter. [...]

-sInstanceUUID=string
    Defines a instance ID to be included into the document Metadata. [...]
    Note that Adobe XMP specification requires instance ID must be unique for all versions of document. This parameter may be used to disable an unique ID generation for a debug purpose.

So the current way of generating the DocumentUUID from the timestamp is probably *ahem* suboptimal - it will return different UUIDs for the same document recreated at a later time or for a later version, and it may return the same UUID for multiple documents - the UUID is generated from gettimeofday (on UNIX), so "just" microsecond resolution.

So for DocumentUUID, probably hash(input path + project name) would be a better option, while InstanceUUID could either be derived from a version or from hash(contents).
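
As a sketch of what such derivations might look like (Python, standard library only; the namespace URL, the function names, and the choice of name-based version 5 UUIDs are all assumptions made for illustration, not anything Ghostscript does today):

```python
import hashlib
import uuid

# Arbitrary namespace for this illustration (an assumption, not a Ghostscript constant).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/pdf-ids")


def document_uuid(project_name, input_path):
    """Stable across rebuilds and versions: depends only on project and input path."""
    return uuid.uuid5(NAMESPACE, project_name + "\n" + input_path)


def instance_uuid(contents):
    """Changes whenever the document bytes change, and only then."""
    digest = hashlib.sha256(contents).hexdigest()
    return uuid.uuid5(NAMESPACE, digest)


print(document_uuid("demo-project", "doc/manual.ps"))
print(instance_uuid(b"%!\n100 100 moveto 200 300 lineto stroke showpage\n"))
```

Values computed this way could in principle be handed to the -sDocumentUUID and -sInstanceUUID parameters quoted above, though the exact string format those parameters expect is not checked here.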
Comment 27 Bernhard M. Wiedemann 2020-05-22 11:36:42 UTC
Created attachment 19241 [details]
rebased patch to 9.52


I found that openSUSE's transfig package 'sample-presentation.pdf' can be built reproducibly with this ghostscript patch.

I would like to know if upstream would re-consider merging some or all of this
since it does not alter default behaviour.
Or if meanwhile there is a better way to produce the same pdf twice.
Comment 28 Ken Sharp 2020-05-22 12:06:48 UTC
(In reply to Bernhard M. Wiedemann from comment #27)
> Created attachment 19241 [details]
> rebased patch to 9.52
> 
> 
> I found that openSUSE's transfig package 'sample-presentation.pdf' can be
> built reproducibly with this ghostscript patch.
> 
> I would like to know if upstream would re-consider merging some or all of
> this
> since it does not alter default behaviour.
> Or if meanwhile there is a better way to produce the same pdf twice.

If you read through the (admittedly lengthy) thread here, you'll see that we rejected this some time ago.

We haven't changed our mind.
Comment 29 Bernhard M. Wiedemann 2020-05-22 13:54:24 UTC
I read much of the discussion, but could not find any specific argument
that would prevent merging of the 3 time -> gp_get_realtime changes.

While that would not provide reproducible results itself, it would at least make it easier to maintain these reproducibility patches downstream.


For the other part, I think the “Lying about the time” section in
https://reproducible-builds.org/docs/source-date-epoch/ applies.

Users can already produce .pdf files using ghostscript with any date they like by setting the system clock.
And the argument against providing a 2nd way to do that, is that you do not want to make that even easier?

Or do we already have an option to omit influences from build time?


@jsmeix: over the last 4 years working on reproducible builds, I found that such toolchain patches are actually much nicer than having to patch every possible individual caller (including ones written in the future)
So far I did only 500 patches and we already have 11796 reproducible openSUSE packages.
Comment 30 Ken Sharp 2020-05-22 14:20:54 UTC
(In reply to Bernhard M. Wiedemann from comment #29)
> I read much of the discussion, but could not find any specific argument
> that would prevent merging of the 3 time -> gp_get_realtime changes.

The patch does considerably more than that.


> For the other part, I think the “Lying about the time” section in
> https://reproducible-builds.org/docs/source-date-epoch/ applies.
> 
> Users can already produce .pdf files using ghostscript with any date they
> like by setting the system clock.
> And the argument against providing a 2nd way to do that, is that you do not
> want to make that even easier?

This bug report is closed, the discussion is at an end, we do not intend to apply the reproducible builds patches. Please do not append any more items to this thread.
Comment 31 Kaz Kylheku 2021-07-20 21:19:44 UTC
This bug contains a very interesting exchange between downstream packagers and upstream application developers.

I'm an outside party having no stake in this issue at all. 

Both sides have great points.

Allow me to summarize and "referee".

1.

Firstly, the patch is a bad patch. It is changing a core function in the GhostScript interpreter, and then changing the PDF-writing code to call that function instead of the ISO C time function.  This is way more scope than necessary for achieving the objective of freezing the timestamp that is written into the PDF. I understand the need to have that SOURCE_DATE_EPOCH being handled in just one place, but the interpreter's real-time function that is visible to PostScript programs should not be hijacked for that purpose. A separate library function like gs_document_stamp_time or whatever should be introduced.

Of course Ken is pushing back on such a thing; I would too.

I mean, if changing core functions is acceptable, then why not just patch the GNU C Library that the toolchain is built against, right? Glibc's time function could itself react to SOURCE_DATE_EPOCH. The reason you don't do it that way is that it's a bad idea; not every access to time is done for the purpose of depositing something into a built object. If this is hacked in at too low a level, then build log time stamps would all show the same time. You wouldn't be able to look at the log of your reproducible build to see how long certain parts of it took.
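
To make the separation argued for in point 1 concrete, here is a minimal sketch. It is not taken from the attached patch or from the Ghostscript sources. Both function names are hypothetical: interp_realtime stands in for whatever clock the interpreter exposes to PostScript programs, and gs_document_stamp_time is the "or whatever" name suggested above. The only assumption about SOURCE_DATE_EPOCH is the one stated in the description of this bug: it holds a count of seconds since the Unix epoch.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical name: the clock visible to PostScript programs.
 * It always reports the real time and never looks at the environment. */
time_t interp_realtime(void)
{
    return time(NULL);
}

/* Hypothetical name: the time to be written into a document.
 * Returns SOURCE_DATE_EPOCH when it is set to a well-formed,
 * non-negative decimal integer; otherwise returns the real time. */
time_t gs_document_stamp_time(void)
{
    const char *s = getenv("SOURCE_DATE_EPOCH");
    char *end;
    long long v;

    if (s == NULL || *s == '\0')
        return time(NULL);      /* unset or empty: use the real clock */

    errno = 0;
    v = strtoll(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0)
        return time(NULL);      /* malformed value: ignore it */

    return (time_t)v;
}
```

Only the code that writes document metadata would call gs_document_stamp_time; elapsed-time measurements made by PostScript programs through interp_realtime would be unaffected.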

2.

I agree with jsmeix's apt observations that programs can just release pre-built PDF documentation that the distro can fetch, just like their source code. When an application does release documentation builds, they should be used, arguably. However, that works only for applications which release built documentation. You have to understand that "we will just get all upstream packages to release documentation binaries" is not a reasonable strategy for the distro.

3.

I'm puzzled by Ken's attachment to the UUIDs, like some sort of sacred cow. These UUIDs do not identify. An object's true ID is its image: the bits it is composed of. The second best thing is a cryptographic hash of that object, like a SHA-256.  A UUID that is simply generated out of thin air does not pertain in any way to the object it is attached to. The identity which a UUID represents is its own. It says, "I am these 128 bits".  When you generate a UUID, you are making a new, empty object identity. Then when you combine that with other objects, like documents, you are adding those objects as properties of that UUID. A document UUID is only useful if the users control it. For instance, you might want to generate a family of related PDF documents and link them together by giving them the same UUID. The UUID then refers to that family, and the documents are that family's members. How these kinds of assigned identifiers are used is a flexible matter.

When we are just randomly generating UUIDs every time we run the same tool on exactly the same inputs, it has no meaning. When UUIDs are used, they are important, and things are organized around them. They are not just a willy-nilly ornament.

Tools that write UUIDs into objects must be able to take those UUIDs as inputs for those UUIDs to function as a useful tool.

I do agree with pushing back on that other patch (in bug #698208) which removes the UUID writing entirely. That looks like a huge interoperability risk. Suppose some PDF viewer refuses to show the document, or malfunctions, due to not finding the UUID.

4.

I don't agree with Ken's proposition that command line arguments are OS-agnostic, whereas environment variables aren't. The getenv function was standardized in ANSI C 89, just like numerous other functions GhostScript uses.  There are differences in command line handling across platforms. Environment variables are more stable between Unix and Windows than command-line handling, since Windows leaves the delimiting of arguments to the application, whereas it provides API functions for retrieving environment variables (GetEnvironmentStrings, GetEnvironmentVariable) on which getenv can be more or less directly based: no hacky parsing is required. That said, the SOURCE_DATE_EPOCH mechanism does not absolutely have to work everywhere.  It's strictly for build machines. A build of GhostScript for some small embedded system that doesn't have environment variables is of no concern; nobody is doing reproducible document builds on that system. If that system has a conforming, hosted ISO C implementation for it, it will have a getenv function you can link to. That function will return NULL for "SOURCE_DATE_EPOCH", like it does for everything else, and that's that.

5.

It is puzzling that Ken is willing to implement a run-time switch which prevents the PDF writing code from adding a creation and modification date entirely, yet is opposed to a mechanism which lets the time stamp values be specified as inputs. I think that, like 4, this view is worth re-examining.  Like UUIDs. I could perhaps understand some product manager from Adobe pushing back on something like this. Time stamps being sacred cows in the PDF-writing back end of an open source PostScript engine makes no sense.
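
As an illustration of what "time stamp values specified as inputs" could look like, here is a small sketch that turns a caller-supplied time into a PDF-style date string in UTC. The helper name format_pdf_date_utc is hypothetical and does not come from the Ghostscript sources. The "D:YYYYMMDDHHmmSSZ" layout is the usual PDF date form with a "Z" suffix for UTC; whether the pdfwrite device emits exactly this form is an assumption.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format `stamp` as a PDF date string in UTC,
 * for example "D:20200522113642Z".
 * Returns the number of characters written (excluding the terminator),
 * or 0 if the time cannot be converted or the buffer is too small. */
size_t format_pdf_date_utc(time_t stamp, char *buf, size_t buflen)
{
    struct tm *utc = gmtime(&stamp);

    if (utc == NULL)
        return 0;
    return strftime(buf, buflen, "D:%Y%m%d%H%M%SZ", utc);
}
```

Because the time is a parameter rather than something read from the clock inside the function, the same input always yields the same string, which is all that a reproducible document build needs from this step.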

6.

I think jsmeix is too pessimistic on the adoption of SOURCE_DATE_EPOCH. Of course, if everyone takes the attitudes seen in this ticket, that then becomes a self-fulfilling prophecy.

The argument is something like, "well, all sorts of unknown applications will not jump on that bandwagon, neither should we".  But all those other applications, if they are bundled into a reproducible distro such as Debian, will be assiduously patched for reproducibility by their respective package maintainers, who will ask those respective projects whether they would please consider upstreaming those changes. So then the argument reduces to "others are probably refusing to upstream changes similar to this, so therefore I'm refusing" which appears needlessly defeatist.

In any case, the change is removable. If SOURCE_DATE_EPOCH turns out to be eclipsed by some other de facto (or actual) standard, the code can change. It's okay to do one thing now and a different thing later. It's not like some huge commitment for all eternity.
Comment 32 Ken Sharp 2021-07-21 07:06:07 UTC
(In reply to Kaz Kylheku from comment #31)
> This bug contains a very interesting exchange between downstream packagers
> and upstream application developers.
> 
> I'm an outside party having no stake in this issue at all. 
> 
> Both sides have great points.
> 
> Allow me to summarize and "referee".

This bug is **CLOSED**

As already requested in comment #28 (!!) please do not add any further comments.