Thursday, July 22, 2010

New blog style

After having used Blogger's default rounded corner template for a while, I thought I'd try something different. And having worked around the pathetic you-must-use-an-exact-pixel-count-for-the-widths restriction that Blogger enforces on you, my writing now naturally flows over the page and down the side.

Let me know what you think in the comments.

Eclipse and Java 6u21 issue rolled back

As I noted in my InfoQ post on the subject, the issue regarding the vendor name in the Java DLL has been resolved. However, until a new build of 6u21 is available, you'll have to work around it by following the FAQ to add the permgen flags yourself, or downgrade to 6u20 in the interim.

Monday, July 19, 2010

Compression and MacZFS

It's been a while since I last wrote anything about ZFS (well, OK, four months) but that doesn't mean it's out of mind. Unfortunately, other things have come up recently which have taken my attention away from doing much open-source work; but recently, I managed to fix the networking in my office so that I can get up to speed again.

With that out of the way, I was able to make pretty good progress on some of the current issues in the project. In essence, these are necessary steps to do before we can tackle the real problem (of getting the version of the zpool up to date). Here's what I've done:

The most visible of these will be the new icon, which is the snowflake icon that won the popular vote from before. This will show up on new pools created; for existing pools, you may need to copy over /System/Library/Filesystems/zfs.fs/Contents/Resources/VolumeIcons.icns to /Volumes/Pool/.VolumeIcons.icns (and restart Finder) to get them showing up properly.

No less important, but perhaps slightly less visible, is the addition of GZip compression support. When the OSX implementation first shipped, it didn't have GZip support for compression (though it did have lzjb compression, which I've talked about previously and will talk about more shortly).

One of the key problems, though, has been referring to the build of ZFS, not least because it's changing all the time. The common identifiers floating around are 'the 119 bits' (meaning the last publicly released version from Apple) or 'the 10a286 bits' (meaning the last privately released beta version). Both of these numbers use an incrementing version number, managed by Apple's ZFS build team, which no longer exists. ZFS itself does have a version number (in fact, it has two: the zpool version number and the zfs version number). The current MacZFS implementation is still stuck at zpool version 8 and zfs version 2. Only two years behind :-)

In any case, although the versions of zpool and zfs move on, they tend to happen infrequently enough that it doesn't make sense to use them as an identifier. Instead, it's more interesting to track the tag for the nevada release; in the form of onnv_72. These get released periodically (we're still on 72, the most recent upstream one I have is 144). We can thus have some idea of where MacZFS is (and how far it's progressing) relative to upstream.

We also need to track how many changes there have been since the last mergepoint; fortunately, that's an easy number to come by thanks to git describe which prints these things out for you. So as long as we don't diverge too far from a merge point, we can have a good idea what we mean.

We can also stamp this number on the ZFS module, though OSX has a limit of 2 digits (max 99) for each of the major/minor/patch levels. As a result, I generate version identifiers of the form:

maczfs_72-1-7-g383f329: 72.1.7

where 72 corresponds to the onnv (nevada) version, the 1 is a subsequent release version (and the corresponding maczfs_72-1 tag), and the 7 is the number of commits since the maczfs_72-1 tag. Finally, the g383f329 is an abbreviation of the 383f329f21... commit, which will uniquely identify a point in the repository. Fortunately, all this is automated – running support/release.sh builds an installer, with the right version, for 10.5 and 10.6 automatically. It also stamps on the version number in both of the plists on the system (.../Extensions/zfs.kext/Contents/Info.plist and .../Filesystems/zfs.fs/Contents/Info.plist) and when you run kextstat | grep zfs it'll show you the same version. It should make for tracking kernel panic logs and identifying issues somewhat easier; so if you do need to report an issue, if you can have that to hand, that would be great.
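The mapping from the git describe output to the dotted version can be sketched in a few lines of shell. This is only an illustration of the naming scheme described above, not the contents of support/release.sh; describe_to_version is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn `git describe` output such as
# maczfs_72-1-7-g383f329 into the dotted form 72.1.7.
describe_to_version() {
    # 1. strip the "maczfs_" tag prefix
    # 2. strip the trailing "-g<abbreviated commit>" if present
    # 3. turn the remaining dashes into dots
    printf '%s\n' "$1" |
        sed -e 's/^maczfs_//' -e 's/-g[0-9a-f]*$//' -e 's/-/./g'
}

describe_to_version "maczfs_72-1-7-g383f329"   # prints 72.1.7
```

When the checkout sits exactly on a tag, git describe prints just the tag name (for example maczfs_72-1), which this sketch turns into 72.1.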

Anyway, I've made MacZFS-72.1.7.pkg available as an installer for both 10.5 and 10.6 systems. (The same installer can be used for both; it will select the right version based on what you install on.) At this stage, it's really a beta; I'm using it on a number of development machines (including my main laptop) but don't have it installed on my file server yet. If people are happy with this, I might tag it as maczfs_72-2 subsequently and re-release the installer with the new name.

What I will conclude with is some stats on the new compression. Undoubtedly, GZip compression does a better job but takes more processing power to get there. Options can be set on a filesystem-by-filesystem basis using either on or lzjb for the faster, less effective compression, or gzip (or gzip-1 through gzip-9; gzip is a synonym for gzip-6).

  • zfs set compression=off Pool/FS
  • zfs set compression=on Pool/FS
  • zfs set compression=lzjb Pool/FS
  • zfs set compression=gzip-1 Pool/FS
  • zfs set compression=gzip Pool/FS
  • zfs set compression=gzip-6 Pool/FS
  • zfs set compression=gzip-9 Pool/FS

Roughly, as you go down the list, writes take longer and you save more space. Taking the MacZFS source tree (under usr) and dumping it to a filesystem with the different levels showed that whilst lzjb is practically free, the gzip algorithms compress better at the cost of longer writes. From a reading perspective, there is no significant difference between them; in fact, reading compressed data can be slightly faster than loading an uncompressed data set, if it reduces the data coming off the disk. Here's a graphic showing roughly how they compare:

Google Chart

To put some real numbers on there: on disk, the src tree took up 15.5MB of the filesystem (though that includes file system overhead, of course). With lzjb, that dropped to 8.8MB (giving a compression ratio of 1.76x) whilst gzip-6 shrunk it to 4.9MB (3.20x). The best and worst case gzip compression was 4.9MB (3.20x) and 5.8MB (2.68x), but achieving that took anywhere from 0.75s to 1.25s (uncompressed took 0.68s and lzjb took 0.70s). So there's not much between gzip-1 and lzjb in terms of time, but gzip-1 gives substantially better compression; by the time you hit gzip-9, the extra time is much more noticeable. However, if you're compressing documents that you are unlikely to write to again (say, the J2SE reference documentation or an Xcode docset) then a one-off hit for writing seems a good price to pay to get effectively double your hard drive space.
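For anyone wanting to repeat the arithmetic, the ratio is simply the uncompressed size divided by the compressed size. A minimal sketch follows; compress_ratio is a hypothetical helper, and because the sizes quoted above are rounded to one decimal place, feeding them in will not exactly reproduce every ratio quoted (those presumably came from unrounded sizes).

```shell
#!/bin/sh
# Hypothetical helper: compression ratio = uncompressed size / compressed size.
# Both arguments must be in the same units (here, megabytes).
compress_ratio() {
    awk -v u="$1" -v c="$2" 'BEGIN { printf "%.2f\n", u / c }'
}

compress_ratio 15.5 8.8   # the lzjb figures above; prints 1.76
```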

A couple of clarification points are in order (in advance of mails on the subject):

  • This wasn't very scientific. I quit all my open programs, but I only did one run rather than averaging several. The values may thus be inaccurate and should be taken as an overall impression, not fact.
  • Source code is highly compressible. Most things are not. MP3s, H.264s, ZIP files are all pretty incompressible. Search for Shannon information theory if you want to know more. However, a compression test on incompressible data would have just returned a compression ratio of 1.00x for all of them and been uninteresting.
  • Since you asked, my laptop drive is compression enabled and I get 1.43x on my laptop, and 1.02x on my networked drive (though it has some big ipa installs at the moment). My media collection gets a little over 1.00x.
  • Compression only takes effect on newly written files. You're not going to get wodges of space back if you just turn compression on; it only applies to data written from then on. Of course, if you touch everything, that will have the effect of re-writing it and it will be recompressed. (You may need to actually write the file, not just touch the timestamp, though.) So it's a good one to get right first of all. Firing up a /Developer/Documentation with gzip-9 and then copying in from the old directory sounds like a winning solution. I have a 1.56x compression ratio on my /Developer filesystem, and a 1.77x on my /Library/Documentation filesystem.
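The re-writing idea from the last point can be sketched in shell. This is a rough sketch rather than anything shipped with MacZFS: rewrite_file is a hypothetical helper, and copying a file this way may drop hard links, extended attributes or resource forks, so try it on a scratch copy first.

```shell
#!/bin/sh
# Hypothetical helper: force a file's data to be written again so that the
# filesystem's current compression setting is applied to it.
rewrite_file() {
    tmp="$1.rewrite.$$"
    # Copy to a temporary name alongside the original (preserving mode and
    # timestamps), then rename the copy over the original.
    cp -p "$1" "$tmp" && mv "$tmp" "$1"
}

# Typical sequence, shown as comments because it needs a real pool
# (Pool/FS is the placeholder name used in the list of commands above):
#   zfs set compression=gzip-9 Pool/FS
#   find /Volumes/Pool/FS -type f | while IFS= read -r f; do rewrite_file "$f"; done
#   zfs get compressratio Pool/FS
```

Copying into a fresh filesystem, as suggested above, achieves the same thing and is probably the safer route for anything you care about.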

The net effect of all this is my 120G SSD in my laptop (of which the ZFS partition is 100G) has an overall compression ratio of 1.45x – in other words, my HFS+ partition is effectively free, and I've got an extra 20G for the price of a bit of processing power. Not too shabby.

Wednesday, July 14, 2010

Eclipse CVS checkout behind a firewall and proxy

I discovered today that it is now possible to do anonymous checkouts of Eclipse projects behind a firewall. There has been (for a long time) a CVS server running on proxy.eclipse.org on port 80 (for anonymous access) and port 443 (for committer access). However, the anonymous access only works if you have only a firewall (i.e. which lets any traffic through on port 80).

Most large organisations however don't employ only a firewall - they also have an HTTP proxy sitting in the middle. The problem is that the HTTP proxy only knows how to speak HTTP, and gets confused when network clients start declaring their affections.

HTTPS isn't so encumbered, however; what happens is that the client sends an HTTP CONNECT request, and thereafter all the bits are opaque to the proxy. But most HTTP proxies will only permit CONNECT calls through to port 443, which is why the committer proxy works.
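To make the tunnelling step concrete, here is roughly what the client sends to the proxy before the opaque traffic starts. It is a sketch for illustration only: connect_request is a hypothetical helper that just prints the request text, and a real client would also need to read the proxy's response before sending anything else.

```shell
#!/bin/sh
# Hypothetical helper: print the HTTP CONNECT request a client would send
# to a proxy in order to open a tunnel to host $1 on port $2.
connect_request() {
    printf 'CONNECT %s:%s HTTP/1.1\r\nHost: %s:%s\r\n\r\n' "$1" "$2" "$1" "$2"
}

connect_request proxy.eclipse.org 443
```

If the proxy accepts the request it answers with a 2xx status line, and from then on it simply relays bytes in both directions without interpreting them.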

Fortunately, those nice people at Eclipse have set up a second box, pebbles.eclipse.org, which runs an anonymous CVS server on port 443. This means that if you use "pebbles.eclipse.org" as the host, /cvsroot/eclipse (etc) as the CVSRoot, pserver as the method, anonymous as the user, and change the port to 443, then Bam! Bam! You can check out code from behind the firewall.
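For command-line users, those settings collapse into a single CVSROOT string. The sketch below only assembles that string; cvsroot_for is a hypothetical helper, and the module to check out is left as a placeholder because none is named here. Getting through the HTTP proxy itself is a separate client setting that this does not cover.

```shell
#!/bin/sh
# Hypothetical helper: build an anonymous pserver CVSROOT from
# host ($1), port ($2) and repository path ($3).
cvsroot_for() {
    printf ':pserver:anonymous@%s:%s%s\n' "$1" "$2" "$3"
}

CVSROOT=$(cvsroot_for pebbles.eclipse.org 443 /cvsroot/eclipse)
echo "$CVSROOT"   # prints :pserver:anonymous@pebbles.eclipse.org:443/cvsroot/eclipse

# Then, with a module name of your choosing:
#   cvs -d "$CVSROOT" checkout <module>
```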

I don't know how long it's been in place - I know I've wanted it for a while - but thanks to the Eclipse Webmasters I can now checkout from CVS. At least I'll get some use out of it before everyone migrates to Git ...

Tuesday, July 13, 2010

Thoughts on LLVM and Clang

I've been a fan of LLVM for a while, and of Clang specifically. However, I didn't have the chance to investigate either of these in depth until fairly recently, and I thought it's worth repeating what I've been doing.

LLVM is a lot more than just a simple framework for compilers; it's more like a generic assembly language which maintains strong typing and logical variables throughout the code path, and which is only turned into hardware-specific machine code at the end. It's also used to dynamically build up such programs on the fly and have them executed; Apple uses LLVM to optimise OpenGL effects (falling back to interpreted code where hardware acceleration isn't available).

Clang is a compiler, built on top of LLVM (and thus able to take advantage of all of the low-level performance optimisations, as well as its own), that happens to compile C, Objective-C and (with 2.0) C++ programs. It's embeddable, in that IDEs can host the runtime in-situ (instead of having to call externally and parsing results) which means that it can interact with an IDE in a much more pro-active way than before (or at least, without having to re-implement the parser multiple times).

One of the things this buys you is free analysis of the source code. Given that you have full information of the source, including its defines and include paths, it is possible to find call sites that are questionable. This includes the simple lint checks like doing an assignment inside an if block; but it can do quite a lot more complicated analysis as well, such as determining loops and states of variables (e.g. on the first iteration of the loop you nullify a variable; on the second, it may go down a different callpath).

There's a blog entry on the LLVM blog about amazing feats of clang error recovery, and it's well worth a read. Not only that, but it works better than this blog entry gives you hope for; it really is solid stuff.

Apple have taken it further with the integration into Xcode; not only does it parse the Clang-driven error messages, but when it's explaining how it arrived at a conclusion, Xcode will narrate the code path with call arrows in order to indicate the potential problem. I was able to use this to generate a whole heap of fixes for the Mac ZFS port, pretty much all driven from the results of running the Clang static analyzer (they've got a good screenshot of the Xcode integration there; interestingly, you can also view the results through a browser).

The other aspect of LLVM is that it's fast; faster than GCC, anyway. You can use LLVM in two modes: as a backend to GCC (so it presents itself to the tools as GCC, but uses the LLVM internals) or through the standalone clang compiler. I was able to build the entire Mac ZFS codebase with the standalone compiler in 22s, with the LLVM+GCC and GCC combinations taking 29s. Not too shabby, considering that it's only a single command line switch away.

There's also an upcoming LLVM 2.0, which will have a revamped debugger called LLDB. This uses the same parser as the Clang compiler does; so when the debugger starts up, it's able to provide you information based on your specific object types, and evaluate watch expressions based on C code, rather than the subset that is supported by GCC.

At this point, the GCC compiler is considered strictly legacy on the Mac, and FreeBSD is compiling with LLVM already. And with the new libcxx library, it won't be long before it's handling all the C++ code as well. The only question that remains: how long will it take others to migrate to LLVM?