Follow-up to my previous post - Reproducible Multi-Project Gradle builds
I got around to doing some testing on this using a simple Gradle project. I found out very quickly that this particular strategy was not going to be easy, and was potentially impossible.
Problem 1 - Setting Timestamp value on Jar Entries
The first problem I ran into was actually setting the timestamp value of a Jar Entry. My first attempt looked something like this:
Yup. That didn’t work. Jars are more or less read-only, so in order to change the timestamp I needed to write out a new Jar file with the modified entries. Basically: create a new Jar file, iterate over the entries from the first Jar file, clone each one, modify its timestamp, add the cloned entry to the new Jar file, and then binary copy the data from the first Jar file to the second. It looks something like this:
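As a rough sketch of that copy loop in plain Java (not the build script itself): the class name `JarTimestampCopier` and the method `copyWithFixedTime` are made up for illustration, and the clone step is simplified to copying only the entry name so that size and CRC are recomputed on write.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical helper, named here only for illustration.
class JarTimestampCopier {

    /** Copies every entry of source into target, stamping each one with fixedTime. */
    static void copyWithFixedTime(String source, String target, long fixedTime)
            throws IOException {
        try (JarFile in = new JarFile(source);
             JarOutputStream out = new JarOutputStream(new FileOutputStream(target))) {
            byte[] buffer = new byte[8192];
            Enumeration<JarEntry> entries = in.entries();
            while (entries.hasMoreElements()) {
                JarEntry original = entries.nextElement();
                // Simplification: carry over only the name, so size and CRC
                // are recomputed while the data is written.
                JarEntry copy = new JarEntry(original.getName());
                copy.setTime(fixedTime);
                out.putNextEntry(copy);
                // Binary copy of the entry data from the old Jar to the new one.
                try (InputStream data = in.getInputStream(original)) {
                    int read;
                    while ((read = data.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                out.closeEntry();
            }
        }
    }
}
```

Calling `JarTimestampCopier.copyWithFixedTime("in.jar", "out.jar", someFixedMillis)` would produce a second Jar whose entries all carry the same timestamp.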
Problem 2 - JAR “Magic” Extra Byte
The next issue I encountered was that my new Jar file always had an entry that didn’t match the original Jar in its ‘extra’ property.
JarEntry.getExtra() returns a byte[]. In the original file, this was null, but in my copied file it was set to some data. I finally found that it comes from the implementation of JarOutputStream in Java.
This JAR_MAGIC marker (0xCAFE) gets added to the extra data of the first entry in the Jar file. I haven’t been able to find any documentation on what it is for, but a friend thought it was likely for the Jar tool itself to determine if an archive file is actually a Jar.
Curiously, Jar files produced by the Gradle Jar task do NOT have this magic marker. Digging into their code, I found that they use a ZipOutputStream to write out the Jar file, which doesn’t have this magic byte code. Writing ZipEntry objects through a ZipOutputStream works just as well and avoids this, so we update the build to do the same:
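To make the difference concrete, here is a small sketch that writes a single entry through each stream type and reads back the first entry's extra data. The class `ExtraFieldDemo` and its method names are hypothetical; only the `java.util.jar` and `java.util.zip` classes are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.JarOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, named here only for illustration.
class ExtraFieldDemo {

    /** Writes one entry through a Jar or Zip stream and returns the archive bytes. */
    static byte[] writeSingleEntry(boolean useJarStream) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = useJarStream
                ? new JarOutputStream(bytes)
                : new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("hello.txt"));
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Reads the archive back and returns the 'extra' data of its first entry. */
    static byte[] firstEntryExtra(byte[] archive) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry first = in.getNextEntry();
            return first.getExtra();
        }
    }
}
```

With the Jar stream the first entry should come back with a four-byte extra field starting with the bytes FE CA, while the plain Zip stream should leave it null, which matches what the Gradle-produced Jars look like.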
Problem 3 - Zip timestamp spec
After all this, subsequent builds of the Jar still didn’t have matching checksums. I wrote a method that runs after the timestamp-adjusted Jar is produced: it iterates over all the entries and compares every field to the original, except the time field, which it compares to the expected timestamp. It looks like this:
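A minimal Java sketch of such a check, assuming the fields worth comparing are size, CRC, compression method, extra data and comment. The class `JarEntryVerifier` and its `verify` method are names invented for this example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.List;
import java.util.Objects;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, named here only for illustration.
class JarEntryVerifier {

    /** Returns one message per mismatch between the original and the adjusted archive. */
    static List<String> verify(String originalPath, String adjustedPath, long expectedTime)
            throws IOException {
        List<String> problems = new ArrayList<>();
        try (ZipFile original = new ZipFile(originalPath);
             ZipFile adjusted = new ZipFile(adjustedPath)) {
            Enumeration<? extends ZipEntry> entries = original.entries();
            while (entries.hasMoreElements()) {
                ZipEntry a = entries.nextElement();
                ZipEntry b = adjusted.getEntry(a.getName());
                String name = a.getName();
                if (b == null) {
                    problems.add(name + ": missing from adjusted archive");
                    continue;
                }
                if (a.getSize() != b.getSize()) problems.add(name + ": size differs");
                if (a.getCrc() != b.getCrc()) problems.add(name + ": CRC differs");
                if (a.getMethod() != b.getMethod()) problems.add(name + ": method differs");
                if (!Arrays.equals(a.getExtra(), b.getExtra())) problems.add(name + ": extra differs");
                if (!Objects.equals(a.getComment(), b.getComment())) problems.add(name + ": comment differs");
                // The time is compared to the expected value, not to the original.
                long drift = b.getTime() - expectedTime;
                if (drift != 0) problems.add(name + ": time off by " + drift + " ms");
            }
        }
        return problems;
    }
}
```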
This showed me that some entries still didn’t have the expected timestamp. Curiously, they were always off by 1 second (1000 milliseconds in the script). Digging back into the Java source code for ZipEntry, I found it was doing some sort of Unix-to-DOS time conversion and was losing resolution along the way.
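The loss is easy to reproduce outside the build. The sketch below writes an empty entry with a given timestamp into an in-memory Zip and reads the time back; `DosTimeRoundTrip` and `roundTrip` are hypothetical names, and the entry name is arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, named here only for illustration.
class DosTimeRoundTrip {

    /** Writes one entry stamped with timeMillis and returns the time read back. */
    static long roundTrip(long timeMillis) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry("probe.txt");
            entry.setTime(timeMillis);
            out.putNextEntry(entry);
            out.closeEntry();
        }
        // Reading the archive back goes through the stored DOS-format time.
        try (ZipInputStream in = new ZipInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.getNextEntry().getTime();
        }
    }
}
```

A timestamp that falls on an even second should survive the round trip unchanged, while one on an odd second should come back one second earlier, which is the 1000 millisecond drift described above.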
After a call out to the Twitterverse, a friend pointed out that Zip entries have a time resolution of 2 seconds. The Apache Ant documentation describes it like this:
The update parameter controls what happens if the ZIP file already exists. When set to yes, the ZIP file is updated with the files specified. (New files are added; old files are replaced with the new versions.) When set to no (the default) the ZIP file is overwritten if any of the files that would be added to the archive are newer than the entries inside the archive. Please note that ZIP files store file modification times with a granularity of two seconds. If a file is less than two seconds newer than the entry in the archive, Apache Ant will not consider it newer.
So, we need to mimic this behavior when producing our new Jar. Basically, we need to round our timestamp down to a resolution of 2 seconds. It looks like this:
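One way to do that rounding, sketched in Java. The helper name `toZipResolution` is made up for this example, and it assumes the local time zone offset is a whole number of minutes, so that rounding the epoch value down to an even second matches what the DOS conversion does.

```java
// Hypothetical helper, named here only for illustration.
class ZipTimeRounding {

    /** Zip entries store modification times in 2-second steps. */
    static final long ZIP_RESOLUTION_MS = 2000L;

    /** Rounds a millisecond timestamp down to the nearest 2-second boundary. */
    static long toZipResolution(long timeMillis) {
        return timeMillis - Math.floorMod(timeMillis, ZIP_RESOLUTION_MS);
    }
}
```

Passing the rounded value to `setTime` and using the same rounded value as the expected timestamp keeps the written time and the verified time in agreement.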
Problem 4 - Groovy _timestamp static field
This got my Jars closer, but they still weren’t checksumming the same. At this point, I started using a hex viewer to compare the two files. I used Hex Fiend because it can do a side-by-side diff of files and is free.
Using a hex viewer on a Jar file doesn’t do much since the data is compressed, but it can get you pointed in the right direction. In my case I could see some byte differences around what appeared to be some class declarations.
The next step was to explode the Jars and compare each of the files. This showed a pretty apparent difference.
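As an alternative to exploding the Jars by hand, the entries can be compared in place by their stored CRC values. This is a different technique from the manual diff described above, sketched with a hypothetical class `JarContentDiff`; it only reports which entries differ, not how.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, named here only for illustration.
class JarContentDiff {

    /** Returns the sorted names of entries whose CRC differs or that exist in only one archive. */
    static List<String> differingEntries(String firstPath, String secondPath)
            throws IOException {
        Set<String> differing = new TreeSet<>();
        try (ZipFile first = new ZipFile(firstPath);
             ZipFile second = new ZipFile(secondPath)) {
            for (ZipEntry a : Collections.list(first.entries())) {
                ZipEntry b = second.getEntry(a.getName());
                if (b == null || a.getCrc() != b.getCrc()) {
                    differing.add(a.getName());
                }
            }
            // Also catch entries that only the second archive contains.
            for (ZipEntry b : Collections.list(second.entries())) {
                if (first.getEntry(b.getName()) == null) {
                    differing.add(b.getName());
                }
            }
        }
        return new ArrayList<>(differing);
    }
}
```

The names it returns are the files worth extracting and opening in a hex viewer or decompiler.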
There was a binary difference in the class files produced by subsequent compilations. The difference was in the name of one field.
That’s an interesting field, because there’s nothing like that in my source file:
Opening the .class file with a Java decompiler (JD-GUI in this case), we see this for the class:
There is in fact a private variable being initialized. Opening the other copy of this class revealed that the name of this variable was changing:
Hmmm. That’s odd. A little more digging led me to GROOVY-6308, titled “Timestamps in bytecode prevents baselining of code”. Yup, this seems to be my problem.
It appears this field is injected as some sort of addition to serialVersionUID, though the conversations on that bug and the related bugs seem to indicate that it’s not really used for anything.
Unfortunately, the bugs are listed as being targeted for Groovy 3.0, which doesn’t help me in the near term.
At this point, since it’s a compiler function that’s stopping me, I’m not sure I have a path forward. This will probably get back-burnered for a while since it’s not critical and is more of a thought experiment than anything.