AlBlue’s Blog

Macs, Modularity and More

Pack200 status update

Java 2006 Harmony Pack200

So, work on pack200 for Harmony continues; I'm now at a stage where I can decompress bands using any codec, including those dynamically specified. The pack200 spec is quite an interesting beastie, in that each band has an individual coding that translates a sequence of bytes into a sequence of integers (signed or unsigned). Codecs like BYTE1 are easy enough; they're just a passthrough of unsigned bytes. The other ones, like UNSIGNED5, represent small integers in one byte, medium integers in two bytes, large integers in three bytes and so on, up to five bytes. (In a BHSD codec, B is the maximum number of bytes per value and ranges over 1..5, which is enough to cover the full 32-bit range of [-2^31..2^31-1].) It's sort of like how UTF-8 encodes 'common' (i.e. small) characters in one byte, and less common (i.e. larger) characters in two or more bytes.
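To make the variable-length idea concrete, here is a minimal sketch of decoding UNSIGNED5, assuming it is the (B=5, H=64) coding from the spec: a byte below L = 256 - H = 192 ends the value, and byte i contributes b * H^i. The class and method names are made up for illustration and are not the Harmony code; the result is kept as a long and the 32-bit truncation is left out.

```java
final class Unsigned5Sketch {
    static final int B = 5;        // maximum bytes per value (assumed UNSIGNED5 parameter)
    static final int H = 64;       // radix for continuation bytes (assumed UNSIGNED5 parameter)
    static final int L = 256 - H;  // bytes below 192 terminate a value

    /** Decodes `count` unsigned values from the start of `data`. */
    static long[] decode(byte[] data, int count) {
        long[] out = new long[count];
        int pos = 0;
        for (int n = 0; n < count; n++) {
            long value = 0;
            long scale = 1;
            for (int i = 0; i < B; i++) {
                int b = data[pos++] & 0xFF;  // treat the byte as unsigned
                value += b * scale;
                scale *= H;
                if (b < L) {
                    break;                   // a "low" byte ends this value
                }
            }
            out[n] = value;
        }
        return out;
    }
}
```

With this sketch, the bytes [1, 200, 0, 236, 1] decode as three values: 1 on its own, then 200 + 0*64 = 200, then 236 + 1*64 = 300.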

As well as a band having a default codec, it's also possible for a band to switch to a completely different codec at the beginning. So, whilst a band might default to UNSIGNED5, if it holds a steadily increasing series of larger numbers then UDELTA5 might be more appropriate. (A byte sequence of [1,1,1,1,1] with an UNSIGNED5 encoding will decode as [1,1,1,1,1]; with UDELTA5 it will decode as [1,2,3,4,5]. Thus, values like 100,200,300,400,500 will take up more space when encoded with UNSIGNED5 than with UDELTA5, since the deltas are all 100 and fit in one byte each, whereas values of 192 and above need at least two bytes in UNSIGNED5.)
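The delta step and the size difference can be sketched as follows. This is an illustration rather than the real implementation: `undelta` is just a running sum over already-decoded values, and `encodedLength` assumes the UNSIGNED5 parameters (B=5, H=64) to count how many bytes one value would need. Both names are invented here.

```java
final class DeltaSketch {
    /** Turns decoded deltas into absolute values with a running sum. */
    static int[] undelta(int[] deltas) {
        int[] out = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            out[i] = running;
        }
        return out;
    }

    /** Bytes needed for one non-negative value, assuming B=5 and H=64. */
    static int encodedLength(long value) {
        final int h = 64;
        final int l = 256 - h;
        int length = 1;
        while (value >= l && length < 5) {
            value = (value - l) / h;  // what is left for the following bytes
            length++;
        }
        return length;
    }

    /** Total bytes needed for a whole sequence of values. */
    static int totalLength(int[] values) {
        int total = 0;
        for (int v : values) {
            total += encodedLength(v);
        }
        return total;
    }
}
```

Under those assumptions, `totalLength` gives 9 bytes for 100,200,300,400,500 stored directly and 5 bytes for the deltas 100,100,100,100,100, and `undelta` of [1,1,1,1,1] gives [1,2,3,4,5].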

A band switches encodings by writing out a special signature value which is converted into an integer [0..188]. This identifies (in an incredibly compact way) which subsequent encoding to use for the band. There are many predefined codecs to choose from, as well as a dual encoding (decode the first x values with codec1, then the remainder with codec2 ... which can then be arbitrarily nested) and a population-based encoding (for repeated values).
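The dual encoding is easier to see in code. The sketch below is a guess at the shape rather than the Harmony implementation: `BandCodec`, `unsigned` and `run` are hypothetical names, and the BYTE1 and UNSIGNED5 parameters, (B=1, H=256) and (B=5, H=64), are assumed from the spec. Because `run` returns another `BandCodec`, it can be nested as deeply as needed.

```java
import java.io.ByteArrayInputStream;

final class RunCodecSketch {
    /** Hypothetical interface: decode `count` integers from a shared stream. */
    interface BandCodec {
        long[] decode(ByteArrayInputStream in, int count);
    }

    /** Unsigned, non-delta codec with at most b bytes per value and radix h. */
    static BandCodec unsigned(int b, int h) {
        final int l = 256 - h;
        return (in, count) -> {
            long[] out = new long[count];
            for (int n = 0; n < count; n++) {
                long value = 0;
                long scale = 1;
                for (int i = 0; i < b; i++) {
                    int next = in.read();
                    if (next < 0) {
                        throw new IllegalStateException("band ended early");
                    }
                    value += next * scale;
                    scale *= h;
                    if (next < l) {
                        break;  // low byte terminates the value
                    }
                }
                out[n] = value;
            }
            return out;
        };
    }

    /** Decode the first k values with `first`, then the remainder with `rest`. */
    static BandCodec run(int k, BandCodec first, BandCodec rest) {
        return (in, count) -> {
            int head = Math.min(k, count);
            long[] a = first.decode(in, head);
            long[] b = rest.decode(in, count - head);
            long[] out = new long[count];
            System.arraycopy(a, 0, out, 0, head);
            System.arraycopy(b, 0, out, head, count - head);
            return out;
        };
    }

    /** Two values as BYTE1, then two as UNSIGNED5, from the same bytes. */
    static long[] demo() {
        BandCodec byte1 = unsigned(1, 256);
        BandCodec unsigned5 = unsigned(5, 64);
        byte[] data = {(byte) 200, (byte) 250, (byte) 200, 0, 7};
        return run(2, byte1, unsigned5).decode(new ByteArrayInputStream(data), 4);
    }
}
```

In `demo`, the byte 200 appears twice: under BYTE1 it is a complete value, whereas under UNSIGNED5 it is a continuation byte that pulls in the following 0, so the four decoded values are 200, 250, 200 and 7.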

So, whilst the previous stage of the implementation just used the defaults, it's now possible to use this dynamic switching mechanism. I have a feeling that it will come in handy, since I'm pretty sure that the bytecode encodings are going to use this kind of mechanism; at least, the population-based encoding. It's also likely that as the optimisation settings for pack200 are increased (e.g. --effort=9) there will be more of these types of codec switches. I'm in the process of refactoring the existing band decoding calls so that they will take advantage of the dynamic codec switch.

Next step ... get the bytecode definitions out of the way. Then we'll be able to start decompressing classes :-) Mind you, there are still a few hurdles; there's a lot of stuff about decompressing annotations (which should only be an issue for Java 5 compiled classes, so I'm ignoring those for now) and then how to reconstitute those various decoded bits into class files, and then into a Jar file. I'm still not sure of the best way to do this (whether to use a library like BCEL or something else that's in the Harmony codebase, or whether just to write out the data as binary).

By the way, I don't think there's any reason why anyone else couldn't use the codecs to decode streams of data. Obviously, I've not got any encoding going on at the moment, but it will be a necessary part for the packing at a later stage.