
AlBlue’s Blog

Macs, Modularity and More

ZFS under the hood and inside the block

2007, zfs

I recently wrote about an upcoming presentation on ZFS at the London OpenSolaris User Group, and having attended it (a couple of weeks ago now), I have been meaning to write it up for some time.

The talk was given by Jarod, one of the Sun engineers on the project, and was a very deep dive into the internals of ZFS. In fact, he started the presentation with a disclaimer: “Don't ask me about performance or administration of ZFS, because I won't answer them”. OK, so it's good to be honest, but I somewhat suspect that people in the audience did have questions in this area that they'd have liked answered; it's just that he wasn't the right person to ask (or didn't want to be asked).

The ‘under the hood’ title didn't do the talk justice; this was about ripping off the bonnet and taking the engine block apart with close-ups of the piston rings. Most of it went over my head, and judging by the roughly 10% of the audience who responded to the opening question, “Right, who knows about DTrace?”, it quite likely went over many other people's heads too.

Alas, I was unable to find out about the details of Raid-Z (which allegedly is a whole lot better than Raid-5) or when you'd want to use it over and above normal mirroring. Nor was there much in the way of practical demonstrations; it was just a walk-through of the presentation material (which you can no doubt find on the Losug site). The presentation was videotaped, but because the presenter refused to wear the microphone and kept pacing backwards and forwards, the recording is likely to be inaudible and unlikely to be put up on the web.

(It strikes me that the presentation was ironically representative of Sun at a much larger level. Lots of good technology and people who are passionate about it, but crap at relaying that to the outside world and absolutely flabbergasted when people want to ask beginners' questions or give feedback about the way it's presented.)

So, what did I take away from it? Well, I picked up a few useful nuggets. If you've got a ZFS drive, it stores its root metadata at the front and end of the device. There's a demo on the ZFS site that shows a ZFS pool being corrupted by wanton destruction of the first x MB of data in the file, and lo, it can recover the filing system (if not the data). Well, try zapping the beginning and the end, and you'll lose it all. Still, it seemed reasonable, and the copy-on-write semantics mean that you always have a consistent on-disk representation, because live blocks are never overwritten in place. The root block is written in four places (two at the front and two at the end), so damage to any individual block or area wouldn't cause catastrophic failure. Also, the root block isn't at the very beginning of the drive but (something like) 16k in, so any EFI headers or partitioning information at the front isn't destroyed. Oh, and it's mentioned in the system administration manual, but if you give ZFS an entire disk, it will use the disk's write cache; if you partition the disk into smaller areas (even if they're all ZFS), it won't. Could be an expensive thing to get wrong.
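To make the "two at the front, two at the end" layout concrete, here is a rough Python sketch of where the four copies would sit and which of them survive a given bit of vandalism. The 256 KiB label size comes from my reading of the published ZFS on-disk format rather than from the talk, and the function names are made up for illustration, so treat it as a back-of-the-envelope model and not as what the ZFS code actually does.

```python
# Assumption: each label copy is 256 KiB, per the published on-disk format.
LABEL_SIZE = 256 * 1024
MIB = 1024 * 1024


def label_offsets(device_size):
    """Byte offsets of the four label copies: two at the front, two at the end."""
    # Assumption: the usable size is rounded down to a whole number of labels.
    usable = (device_size // LABEL_SIZE) * LABEL_SIZE
    return [0, LABEL_SIZE, usable - 2 * LABEL_SIZE, usable - LABEL_SIZE]


def surviving_labels(device_size, damaged_ranges):
    """Offsets of labels that overlap none of the damaged (start, stop) byte ranges."""
    survivors = []
    for offset in label_offsets(device_size):
        end = offset + LABEL_SIZE
        hit = any(start < end and offset < stop for start, stop in damaged_ranges)
        if not hit:
            survivors.append(offset)
    return survivors


size = 1024 * MIB  # a hypothetical 1 GiB device

# Zap only the first 10 MiB, as in the demo.
front_zapped = surviving_labels(size, [(0, 10 * MIB)])

# Zap the first and the last 10 MiB.
both_zapped = surviving_labels(size, [(0, 10 * MIB), (size - 10 * MIB, size)])

print(front_zapped)
print(both_zapped)
```

With only the front damaged, the two copies at the end are still there to recover from; damage both ends and the list comes back empty, which is the "you'll lose it all" case.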

There's also a load of caching that the filing system does, stored in the ARC. I'm not sure exactly what ARC stands for, but it's something like Adaptive Replacement Cache. Roughly speaking, it uses both a recently-used and a frequently-used cache to store data, so that a walk of the filing system won't unduly affect someone working on a file repeatedly. The downside of the ARC is that it might take up a whole load of memory; in fact, it uses 1G or 3/4 of the available memory for cache purposes alone (and that's more than I have in my proposed server). But it's supposed to be adaptive, and if the OS needs more memory it's supposed to release it, though it'll be interesting to see what effect that has on a real system. The goal is to have a self-tuning system so that you don't store too much, and what you do store is relevant. The ratio between the recently-used and frequently-used space is variable (as is the total cache size itself) and adapts under different loads.
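The recently-used versus frequently-used idea is easier to see in a toy. The Python below is a deliberately simplified sketch: it keeps two fixed-size lists and promotes an entry on its second access. It is not the real ARC, which also varies the split between the two lists under load (none of that adaptation is modelled here), and the class and method names are my own invention.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy cache with a 'recent' list and a 'frequent' list.

    Simplification: the split between the lists is fixed, whereas the
    real ARC adjusts it as the workload changes.
    """

    def __init__(self, recent_size, frequent_size):
        self.recent_size = recent_size
        self.frequent_size = frequent_size
        self.recent = OrderedDict()    # keys seen once, oldest first
        self.frequent = OrderedDict()  # keys seen at least twice, oldest first

    def access(self, key, value=None):
        """Record an access to key; return True on a cache hit."""
        if key in self.frequent:
            self.frequent.move_to_end(key)  # mark as most recently used
            return True
        if key in self.recent:
            # Second access: promote from 'recent' to 'frequent'.
            self.frequent[key] = self.recent.pop(key)
            if len(self.frequent) > self.frequent_size:
                self.frequent.popitem(last=False)  # evict oldest frequent entry
            return True
        # Miss: insert into 'recent', evicting its oldest entry if full.
        self.recent[key] = value
        if len(self.recent) > self.recent_size:
            self.recent.popitem(last=False)
        return False


cache = TwoListCache(recent_size=4, frequent_size=4)
cache.access("report.doc")
cache.access("report.doc")      # second touch promotes it to 'frequent'

for i in range(1000):           # a one-off walk over lots of other files
    cache.access(f"file{i}")

still_cached = cache.access("report.doc")
print(still_cached)
```

The walk churns through the 'recent' list a thousand times over, but the file being worked on repeatedly sits in the 'frequent' list and is still a hit afterwards.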

Compression is available at the moment on a per-block basis, and with luck encryption will be as well. (In fact, encryption isn't a problem to integrate per se; the difficult bit is deciding which encryption algorithms to use and how key management is done.)
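As an illustration of what "per-block" buys you, here is a small Python sketch that compresses each block independently and falls back to storing the raw block when compression doesn't help. It uses zlib purely as a stand-in for whatever algorithm ZFS applies, and the 128 KiB block size and helper names are assumptions of mine, so this shows the general idea and not the ZFS implementation.

```python
import os
import zlib

BLOCK_SIZE = 128 * 1024  # assumed block size for this sketch


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each block on its own; keep the raw block if that is smaller."""
    stored = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        packed = zlib.compress(block)
        if len(packed) < len(block):
            stored.append((True, packed))   # flag records that it is compressed
        else:
            stored.append((False, block))   # incompressible: store as-is
    return stored


def read_blocks(stored):
    """Reassemble the original data, decompressing only the flagged blocks."""
    return b"".join(zlib.decompress(p) if packed else p for packed, p in stored)


# One highly compressible block followed by one block of random bytes.
data = b"A" * BLOCK_SIZE + os.urandom(BLOCK_SIZE)
stored = compress_blocks(data)
flags = [packed for packed, _ in stored]

print(flags)
print(read_blocks(stored) == data)
```

Because every block carries its own flag, a file can mix compressed and uncompressed blocks, and reading one block never requires decompressing its neighbours.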

ZFS stores a bunch of properties with each file system, such as where it's supposed to be mounted. These properties are stored at the root of the FS, so when a ZFS drive is plugged in, it can be automatically queried and mounted. There's a fairly naff 'CSI' type video on YouTube (search for 'zfs csi') which shows a couple of German Sun engineers creating a ZFS pool on a bunch of USB keyring memory sticks, then unplugging them all and plugging them back into a different combination of hubs. I can't see people doing that much; the cost per megabyte is pretty high (though the cost per device is pretty low).

Anyway, if anyone can tell me the difference between Raid-Z and Raid-5, and whether I'd want to use that in place of just standard mirroring, then get back to me. I didn't find it out at this presentation.