

Compression and MacZFS

Tags: 2010, ZFS, Mac

It's been a while since I last wrote anything about ZFS (well, OK, four months) but that doesn't mean it's out of mind. Unfortunately, other things have come up which have taken my attention away from doing much open-source work; recently, though, I managed to fix the networking in my office so that I can get up to speed again.

With that out of the way, I was able to make pretty good progress on some of the current issues in the project. In essence, these are necessary steps before we can tackle the real problem (getting the version of the zpool up to date). Here's what I've done:

The most visible of these will be the new icon, which is the snowflake icon that won the popular vote from before. This will show up on newly created pools; for existing pools, you may need to copy over /System/Library/Filesystems/zfs.fs/Contents/Resources/VolumeIcons.icns to /Volumes/Pool/.VolumeIcons.icns (and restart Finder) to get it showing up properly.
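For the copy-and-restart step, something along these lines should do it (a quick sketch; Pool is a placeholder for your own pool's name):

  cp /System/Library/Filesystems/zfs.fs/Contents/Resources/VolumeIcons.icns \
     /Volumes/Pool/.VolumeIcons.icns
  killall Finder    # Finder relaunches automatically and picks up the icon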

No less important, but perhaps slightly less visible, is the addition of GZip compression support. When the OSX implementation first shipped, it didn't have GZip support for compression (though it did have lzjb compression, which I've talked about previously, and which I'll come back to shortly).

One of the key problems, though, has been referring to the build of ZFS, not least because it's changing all the time. The common identifiers floating around are 'the 119 bits' (meaning the last publicly released version from Apple) or 'the 10a286 bits' (meaning the last privately released beta version). Both of these use an incrementing version number managed by Apple's ZFS build team, which no longer exists. And whilst ZFS itself has a version number (in fact, it has two: the zpool version number and the zfs version number), the current MacZFS implementation is still stuck at zpool version 8 and zfs version 2. Only two years behind :-)
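If you want to see what your own system reports, the stock tools will tell you (a quick sketch; run without arguments these just print version information rather than upgrading anything):

  zpool upgrade    # reports the on-disk format version of each pool
  zfs upgrade      # reports the filesystem version of each dataset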

In any case, although the versions of zpool and zfs move on, they change infrequently enough that it doesn't make sense to use them as an identifier. Instead, it's more interesting to track the tag of the Nevada release, in the form onnv_72. These get released periodically (we're still on 72; the most recent upstream one I have is 144). We can thus have some idea of where MacZFS is (and how far it's progressing) relative to upstream.

We also need to track how many changes there have been since the last merge point; fortunately, that's an easy number to come by thanks to git describe, which prints these things out for you. So as long as we don't diverge too far from a merge point, we can have a good idea what we mean.
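By way of example, using the identifier discussed below (you may need to pass --tags if the tags in your clone aren't annotated):

  $ git describe --tags
  maczfs_72-1-7-g383f329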

We can also stamp this number on the ZFS module, though OSX limits each of the major/minor/patch levels to two digits (max 99). As a result, I generate version identifiers of the form:

maczfs_72-1-7-g383f329: 72.1.7

where 72 corresponds to the onnv (Nevada) version, the 1 is a subsequent release version (with a corresponding maczfs_72-1 tag), and the 7 is the number of commits since the maczfs_72-1 tag. Finally, the g383f329 is an abbreviation of the 383f329f21... commit, which uniquely identifies a point in the repository. Fortunately, all this is automated – running support/release.sh builds an installer, with the right version, for 10.5 and 10.6 automatically. It also stamps the version number into both of the plists on the system (.../Extensions/zfs.kext/Contents/Info.plist and .../Filesystems/zfs.fs/Contents/Info.plist), and running kextstat | grep zfs will show you the same version. This should make tracking kernel panic logs and identifying issues somewhat easier; so if you do need to report an issue and can have that version to hand, that would be great.
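For the curious, the transformation from the git describe output to the OSX-friendly version is straightforward; here's a hypothetical bash sketch of the idea (not the actual support/release.sh):

  desc=$(git describe)    # e.g. maczfs_72-1-7-g383f329
  ver=${desc#maczfs_}     # strip the prefix:  72-1-7-g383f329
  ver=${ver%-g*}          # drop the commit:   72-1-7
  echo "${ver//-/.}"      # dots, not dashes:  72.1.7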

Anyway, I've made MacZFS-72.1.7.pkg available as an installer for both 10.5 and 10.6 systems. (The same installer can be used for both; it will select the right version based on the system you install it on.) At this stage, it's really a beta; I'm using it on a number of development machines (including my main laptop) but don't have it installed on my file server yet. If people are happy with it, I might tag it as maczfs_72-2 subsequently and re-release the installer with the new name.

What I will conclude with is some stats on the new compression. Undoubtedly, GZip compression does a better job but takes more processing power to get there. Options can be set on a filesystem-by-filesystem basis, using either on or lzjb for the faster, less effective compression, or gzip (or gzip-1 through gzip-9; gzip is a synonym for gzip-6):

  • zfs set compression=off Pool/FS
  • zfs set compression=on Pool/FS
  • zfs set compression=lzjb Pool/FS
  • zfs set compression=gzip-1 Pool/FS
  • zfs set compression=gzip Pool/FS
  • zfs set compression=gzip-6 Pool/FS
  • zfs set compression=gzip-9 Pool/FS
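Having set one of these, you can check what it's buying you: the compressratio property reports the ratio achieved on data written since compression was enabled (Pool/FS being a placeholder, as above):

  zfs get compression,compressratio Pool/FS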

Roughly, as you go down the list, the longer it takes and the more saving you get. Taking the MacZFS source tree (under usr) and dumping it to a filesystem at the different levels showed that whilst lzjb is practically free, the gzip algorithms compress better at a cost of longer writes. From a reading perspective, there is no significant difference between them – in fact, reading compressed data can be slightly faster than loading an uncompressed data set, if it reduces the data coming off the disk. Here's a graphic showing roughly how they compare:

[Chart: relative write times and compression ratios for the different compression settings]

To put some real numbers on there: on disk, the src tree took up 15.5Mb of the filesystem (though that includes file system overhead, of course). With lzjb, that dropped to 8.8Mb (giving a compression ratio of 1.76x), whilst gzip-6 shrunk it to 4.9Mb (3.20x). The best- and worst-case gzip compression was 4.9Mb (3.20x) and 5.8Mb (2.68x), but achieving that took anywhere from 0.75s to 1.25s (uncompressed took 0.68s and lzjb took 0.70s). So there's not much in gzip-1 over lzjb in terms of time, but substantially better compression; by the time you hit gzip-9, though, the time is much more noticeable. However, if you're compressing documents that you are unlikely to write to again (say, the J2SE reference documentation or an Xcode docset) then a one-off hit for writing seems a good price to pay to effectively double your hard drive space.
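If you want to run a similar (equally unscientific) comparison yourself, a sketch along these lines would do it; Pool/test is a hypothetical dataset name, and the paths assume the pool mounts under /Volumes:

  zfs create -o compression=gzip-6 Pool/test
  time cp -R usr /Volumes/Pool/test/    # wall-clock write time
  zfs get compressratio Pool/test       # ratio once the copy completes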

A few clarification points are in order (in advance of mails on the subject):

  • This wasn't very scientific. I quit all my open programs, but I only did one run and didn't average anything. The values may thus be inaccurate and should be taken as an overall impression, not fact.
  • Source code is highly compressible. Most things are not: MP3s, H.264 videos and ZIP files are all pretty much incompressible already. Search for Shannon information theory if you want to know more. However, a compression test on incompressible data would have just returned a compression ratio of 1.00x for everything and been uninteresting.
  • Since you asked: my laptop drive has compression enabled and I get 1.43x on my laptop, and 1.02x on my networked drive (though it has some big ipa installs on it at the moment). My media collection gets a little over 1.00x.
  • Compression only takes effect on newly written files. You're not going to get wodges of space back just by turning compression on; it only applies to files written afterwards. Of course, if you re-copy everything, it will be re-written and hence recompressed. (You may need to actually write the file's data, not just touch the timestamp, though.) So it's a good one to get right first of all. Creating a /Developer/Documentation filesystem with gzip-9 and then copying in from the old directory sounds like a winning approach; see the sketch after this list. I have a 1.56x compression ratio on my /Developer filesystem, and a 1.77x on my /Library/Documentation filesystem.
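That copy-in approach might look something like this (a hypothetical sketch; the dataset name and paths are placeholders):

  zfs create -o compression=gzip-9 Pool/Documentation
  cp -Rp /Developer/Documentation/ /Volumes/Pool/Documentation/
  zfs get compressratio Pool/Documentation    # see what you got back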

The net effect of all this is my 120G SSD in my laptop (of which the ZFS partition is 100G) has an overall compression ratio of 1.45x – in other words, my HFS+ partition is effectively free, and I've got an extra 20G for the price of a bit of processing power. Not too shabby.