AlBlue’s Blog

Macs, Modularity and More

Pack200 reconstituted Jar

2006, harmony, java, pack200

Following on from my last success, I'm pleased to report that I've managed not only to decompress the data, but also reassemble it into a suitable archive and write it out. It handles any non-class file, and currently, .class files that are entirely empty interfaces. So, if you've got any oversized archives of types like java.io.Serializable, this could save you literally a hundred bytes or so ...

There's a lot more work to be done though, so don't hold your breath just yet. I can't decode methods, fields, byte-code, or any non-String type in the reconstituted class file; but that's more because I've not got around to it yet than any show-stopper. There's a bit of mangling that needs to be done for byte-codes, so I'm going to tackle decoding fields and constants first, and then work my way up through empty methods into ones with code.

Looping, branching etc. are going to throw a bit of a spanner in the works; there's a funky bit of 'BCI renumbering' that goes on in the spec. What that means is if you have a 20-byte block of byte-code, and the instructions/data are 4,5,3,2,4,2 bytes wide each, then instead of byte-code offsets being listed as [0,4,9,12,14,18], they get written as [0,1,2,3,4,5]. In fact, there's a mapping from the byte offsets [0..19] into the sequence [0,6,7,8,1,9,10,11,12,2,13,14,3,15,4,16,17,18,5,19]: the instruction starts take the numbers 0 to 5, and the bytes in between share out the numbers that are left over. Might seem a bit confusing at first, but the idea is that you only ever want to branch to the start of an op-code, so mapped to bytes, the numbers 0,1,2,3,4,5 are the only places that you can branch to. Since the integers in pack200 are variable width (like UTF-8; smaller numbers take up less space, on the grounds that they're more common) it makes sense to do this renumbering so that you only need to store the smallest integer offset from where you currently are. Of course, it puts a bit of extra work on the compressor/decompressor, but for the benefit of saving space in the file, which is after all what pack200 is all about.
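To make the renumbering a bit more concrete, here's a rough sketch in Java of how the byte-offset table could be built from a list of instruction widths. The class and method names (`BciRenumberSketch`, `renumber`) are made up for illustration, and the sketch assumes that the in-between bytes are simply numbered upwards from the first value not used by an instruction start, so that the result is a permutation of the byte offsets; the spec may differ in the finer details.

```java
import java.util.Arrays;

class BciRenumberSketch {
    /**
     * Builds a table from byte offset to renumbered index.
     * widths[i] is the size in bytes of instruction i (assumed to be at least 1).
     * Instruction starts get 0..n-1; the remaining bytes are numbered
     * from n upwards in byte order (an assumption of this sketch).
     */
    static int[] renumber(int[] widths) {
        int codeLength = 0;
        for (int w : widths) {
            codeLength += w;
        }
        int[] map = new int[codeLength];
        int next = widths.length; // first number not taken by an instruction start
        int offset = 0;
        for (int i = 0; i < widths.length; i++) {
            map[offset] = i; // start of instruction i
            for (int j = 1; j < widths[i]; j++) {
                map[offset + j] = next++; // byte in the middle of instruction i
            }
            offset += widths[i];
        }
        return map;
    }

    public static void main(String[] args) {
        int[] map = renumber(new int[] {4, 5, 3, 2, 4, 2});
        // prints [0, 6, 7, 8, 1, 9, 10, 11, 12, 2, 13, 14, 3, 15, 4, 16, 17, 18, 5, 19]
        System.out.println(Arrays.toString(map));
    }
}
```

A decompressor would want the table in the other direction as well (renumbered index back to byte offset), which is just the inverse of this array.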

Anyway, it means that to decode byte-codes, you need to understand each byte-code and how many bytes it takes up. There are also some remappings for 'wide' byte-codes as well as other special-case stuff; and lastly, there's a bit of weirdness in the way the constant pool is reconstituted based on which byte-codes access those entries. And all this so you can save about half the space when you download a file remotely ...
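As a taste of what "knowing how many bytes each byte-code takes up" involves, here's a small sketch that walks ordinary (already unpacked) JVM byte-code and collects the offsets at which instructions start. The opcode values and widths are the standard JVM ones, but the table is deliberately tiny, and the names (`OpcodeWidthSketch`, `width`, `instructionStarts`) are hypothetical. It ignores `wide`, `tableswitch` and `lookupswitch`, whose lengths aren't fixed, so treat it as an illustration rather than a decoder.

```java
import java.util.ArrayList;
import java.util.List;

class OpcodeWidthSketch {
    /** Width in bytes (opcode plus operands) for a handful of JVM opcodes. */
    static int width(int opcode) {
        switch (opcode) {
            case 0x00: return 1; // nop
            case 0x10: return 2; // bipush <byte>
            case 0x11: return 3; // sipush <short>
            case 0x15: return 2; // iload <index>
            case 0xa7: return 3; // goto <16-bit branch offset>
            case 0xb1: return 1; // return
            case 0xb6: return 3; // invokevirtual <16-bit constant pool index>
            case 0xb9: return 5; // invokeinterface <index16> <count> <0>
            case 0xc8: return 5; // goto_w <32-bit branch offset>
            default:
                throw new IllegalArgumentException(
                        "opcode not covered by this sketch: 0x" + Integer.toHexString(opcode));
        }
    }

    /** Returns the byte offset of every instruction in the code array. */
    static List<Integer> instructionStarts(byte[] code) {
        List<Integer> starts = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            starts.add(pc);
            pc += width(code[pc] & 0xff); // mask to treat the byte as unsigned
        }
        return starts;
    }

    public static void main(String[] args) {
        // bipush 5; sipush 256; goto +3; return
        byte[] code = {0x10, 5, 0x11, 1, 0, (byte) 0xa7, 0, 3, (byte) 0xb1};
        System.out.println(instructionStarts(code)); // prints [0, 2, 5, 8]
    }
}
```

The list of start offsets is exactly what the renumbering needs: the position of an offset in the list is its renumbered value.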

By the way, if you want to track the status of this, you can subscribe to the pack200 feed or just browse pack200-related posts.