ZFS on Mac - mirroring and scrubbing

In my last post, I discussed creating file systems under zpools. By now, you should have a basic idea of how the pieces fit together and what they can do. I'm now going to go back to basics and talk about data integrity.

When you create a zpool, you define how its devices are used. When several devices are combined, some variant of RAID is usually involved, so before discussing what ZFS can do, let's review the standard RAID levels:

  • Striping (RAID 0): data is split into blocks that are written alternately to one disk or the other, so both disks must be present in order to read it back. The performance of a striped pair is usually higher than that of a single disk, because reads and writes are shared across the physical disks. A slight variation of this is called Spanning or JBOD (Just a Bunch of Disks), which treats two physical disks as one large one by writing to the first disk until it fills up and then moving on to the second. Performance of JBOD is equivalent to that of a single disk; in essence, it's just a way of creating a bigger virtual disk.
  • Mirroring (RAID 1): two equally sized disks are used, and every write goes to both. If one disk fails, the other holds a complete copy of the data and can be used standalone. Writes are equivalent to those of a single disk (the data has to be written to both places), while reads can be serviced by either disk.
  • RAID 5: three or more disks are used, with the equivalent of one disk's worth of space given over to parity data. If a disk fails, its contents can be reconstructed from the data and parity held on the remaining disks. (Each of these layouts maps onto a zpool layout, as sketched after this list.)
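
To make that concrete, here's a minimal sketch of how the first two layouts map onto file-backed pools; the file and pool names (/tmp/da, /tmp/db, striped, mirrored) are placeholders of my choosing, and the RAID 5 equivalent, RAID Z, is covered towards the end of this post:

apple[~] mkfile 64m /tmp/da
apple[~] mkfile 64m /tmp/db
apple[~] zpool create striped /tmp/da /tmp/db           # a plain list of devices gives striping/spanning
apple[~] zpool destroy striped
apple[~] zpool create mirrored mirror /tmp/da /tmp/db   # add -f if zpool objects to re-using the files
apple[~] zpool destroy mirrored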

In the examples I've shown so far, the virtual disks have been concatenated into a RAID 0 style stripe. That works well for increasing the space available, but does nothing for data reliability. What we need instead is to duplicate the data, so that if a read fails we can reconstruct the missing data.

However, when data is important, you don't want to rely on a single copy of it. If something happens to that copy, you've lost your work. That matters when it's your ripped CD collection, but it matters even more when it's irreplaceable photos of family members.

The first approach to keeping multiple copies of your data is setting the copies property:


apple[~] mkfile 64m /tmp/disk1
apple[~] zpool create sample /tmp/disk1
apple[~] zfs set copies=2 sample
apple[~] echo "Hello World" > /Volumes/sample/HelloWorld.txt

Now we've got two copies of our HelloWorld text file on disk. We can verify this by running a grep over the disk itself:


apple[~] strings /tmp/disk1 | grep Hello
Hello World
HelloWorld.txt
Hello World
HelloWorld.txt
HelloWorld.txt

Of course, this isn't true data security. If something happens to the disk itself, we lose both copies, so reliability isn't guaranteed. But it does help when only part of the disk is damaged - say a few bad sectors - and you've only got one disk available: for example, on a single-disk laptop, for data you want stored in duplicate for safety's sake. Combined with compression, it may even turn out that two copies take up less space than a single uncompressed copy.
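
As a minimal sketch, and assuming your ZFS build supports the compression property, you can turn compression on alongside copies=2 on the sample pool from above, then watch how well it's doing via the read-only compressratio property. Note that both settings only affect data written after they are set:

apple[~] zfs set compression=on sample
apple[~] zfs get copies,compression,compressratio sample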

What we'd ideally like to do is create a mirrored (or better) set of disks so that writes are sent to separate physical devices. If you've got more than one physical disk, you can create a zpool mirror quite easily:


apple[~] mkfile 64m /tmp/disk1
apple[~] mkfile 64m /tmp/disk2
apple[~] zpool create safe mirror /tmp/disk1 /tmp/disk2
apple[~] zpool upgrade safe
apple[~] zpool status safe
  pool: safe
 state: ONLINE
 scrub: none requested
config:
    NAME            STATE     READ WRITE CKSUM
    safe            ONLINE       0     0     0
      mirror        ONLINE       0     0     0
        /tmp/disk1  ONLINE       0     0     0
        /tmp/disk2  ONLINE       0     0     0
errors: No known data errors

Whenever we write data to this pool, the write is duplicated to the underlying devices (in this case, /tmp/disk1 and /tmp/disk2). Even better, when we read from the pool, the reads are striped over both devices. And since ZFS keeps a checksum for every block it writes, any disk error can be detected and repaired on the fly without intervention.
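
If you want to watch the reads being spread across the mirror, zpool iostat can print per-device statistics (the five-second interval below is just an example). Copy a reasonably large file out of /Volumes/safe in another terminal, and you should see read operations against both /tmp/disk1 and /tmp/disk2:

apple[~] zpool iostat -v safe 5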

Let's see what a problem looks like. We're going to use dd to write some random data over the start of the first disk to simulate a failure, then use a scrub to surface the problem, followed by a disk replacement:


apple[~] echo Hello World > /Volumes/safe/HelloWorld.txt
apple[~] dd if=/dev/random of=/tmp/disk1 bs=1024 count=1024
apple[~] cat /Volumes/safe/HelloWorld.txt
Hello World
apple[~] zpool status safe
  pool: safe
 state: ONLINE
 scrub: none requested
config:
    NAME            STATE     READ WRITE CKSUM
    safe            ONLINE       0     0     0
      mirror        ONLINE       0     0     0
        /tmp/disk1  ONLINE       0     0     0
        /tmp/disk2  ONLINE       0     0     0
errors: No known data errors
apple[~] zpool scrub safe
apple[~] zpool status safe
  pool: safe
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed with 0 errors on Sun Apr  6 22:55:40 2008
config:
    NAME            STATE     READ WRITE CKSUM
    safe            DEGRADED     0     0     0
      mirror        DEGRADED     0     0     0
        /tmp/disk1  UNAVAIL      0     0     0  corrupted data
        /tmp/disk2  ONLINE       0     0     0
errors: No known data errors

ZFS has noticed that we've destroyed a big chunk of one of the disks, and it's showing the array as degraded. However, we can still use the pool, since we've got a valid copy of the data. Let's create a new disk and replace the damaged one:


apple[~] mkfile 64m /tmp/disk3
apple[~] zpool replace safe /tmp/disk1 /tmp/disk3
apple[~] zpool status safe
  pool: safe
 state: DEGRADED
 scrub: resilver completed with 0 errors on Sun Apr  6 23:01:32 2008
config:
    NAME              STATE     READ WRITE CKSUM
    safe              DEGRADED     0     0     0
      mirror          DEGRADED     0     0     0
        replacing     DEGRADED     0     0     0
          /tmp/disk1  UNAVAIL      0     0     0  corrupted data
          /tmp/disk3  ONLINE       0     0     0
        /tmp/disk2    ONLINE       0     0     0
errors: No known data errors
apple[~] zpool status safe
  pool: safe
 state: ONLINE
 scrub: resilver completed with 0 errors on Sun Apr  6 23:01:32 2008
config:
    NAME            STATE     READ WRITE CKSUM
    safe            ONLINE       0     0     0
      mirror        ONLINE       0     0     0
        /tmp/disk3  ONLINE       0     0     0
        /tmp/disk2  ONLINE       0     0     0
errors: No known data errors

We're back in business. The replacement started (as shown by the first status) and had finished by the time we ran it a second time. What's more, unlike hardware RAID 1 solutions (and indeed some software RAID 1 solutions), the resilvering took time proportional to the amount of data in use rather than the size of the disk. With the average disk now measured in TB rather than GB, simply copying every block from one disk to another takes a very long time; if you know which subset of blocks is actually in use, and you're not using all the available space, copying just that subset is much quicker.
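
You can see the same effect without destroying a disk by taking one half of the mirror offline and bringing it back; ZFS only resilvers the blocks written in the meantime. (SomeLargeFile below is just a placeholder for any sizeable file you have to hand.)

apple[~] zpool offline safe /tmp/disk2
apple[~] cp SomeLargeFile /Volumes/safe/
apple[~] zpool online safe /tmp/disk2
apple[~] zpool status safe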

For the average home user, mirroring is a good way of achieving data integrity with the minimum of hassle. You can even buy devices that do hardware mirroring across two physical disks, but using a software-based solution like ZFS is a superior choice, for the following reasons.

A RAID 1 hardware device just mirrors the disk sectors, bit for bit. Any disk errors get propagated between the devices (especially in a hot-swap setup), and the OS has no idea whether the disk blocks are valid or not. Furthermore, the hardware device is file-system agnostic, so it has no idea how much (or how little) of the disk is in use; to rebuild a replacement, it has to run through and mirror the entire disk.

ZFS' mirroring is much more intelligent. Not only does it know what to copy during a resilver (which can make replacing a disk much faster), it also keeps checksums on disk and so knows when a block is bad. Such blocks can be repaired transparently by the file-system layer without the user ever being aware there was a problem. The information isn't hidden completely, though - a look at 'zpool status' will confirm what has happened.
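
For example, after ZFS has repaired a bad block from the good half of the mirror, the CKSUM column of 'zpool status' shows a non-zero count against the offending device, and the -v flag lists any files that couldn't be repaired. Once you've taken note, 'zpool clear' resets the counters:

apple[~] zpool status -v safe
apple[~] zpool clear safe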

A final word on RAID 5. Mirroring is one of the safest ways for small users to protect their data against a single disk failure. However, larger organisations may want higher utilisation of their disk space: a mirrored approach gives N/2 usable space, whereas the higher RAID variants give N-1 (or N-2), which for large N is a significant difference. Unless you're looking at three or more disks in your system, though, mirroring is the way to go.

ZFS provides something called RAID Z, which is similar to RAID 5 in principle, if not in practice. The idea is to spread writes across several disks such that if any one disk fails, the array can continue to be used. (RAID Z2 allows two disks to fail simultaneously.) The main technical difference is that RAID Z doesn't suffer from the RAID 5 'write hole', which you can search for if you want the details. Creating a RAID Z array is just as easy as creating a mirror, except that we need at least three devices, or four if we want a hot spare:


apple[~] mkfile 64m /tmp/disk4
apple[~] mkfile 64m /tmp/disk5
apple[~] mkfile 64m /tmp/disk6
apple[~] mkfile 64m /tmp/disk7
apple[~] zpool create safer raidz /tmp/disk4 /tmp/disk5 /tmp/disk6 spare /tmp/disk7
apple[~] zpool upgrade safer
apple[~] zpool status safer
  pool: safer
 state: ONLINE
 scrub: none requested
config:
    NAME            STATE     READ WRITE CKSUM
    safer           ONLINE       0     0     0
      raidz1        ONLINE       0     0     0
        /tmp/disk4  ONLINE       0     0     0
        /tmp/disk5  ONLINE       0     0     0
        /tmp/disk6  ONLINE       0     0     0
    spares
      /tmp/disk7    AVAIL
errors: No known data errors
apple[~] zfs get available safer
NAME   PROPERTY   VALUE  SOURCE
safer  available  86.2M  -

The space available is roughly that of two disks (less ZFS' own overhead). If anything happens to one of disk{4,5,6}, then disk7 will be brought online automatically.
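
If your build doesn't pull the spare in automatically, you can promote it by hand with zpool replace (disk5 below is purely an example of a failed device). Once the failed device itself has been repaired or replaced, detaching the spare returns it to the spares list:

apple[~] zpool replace safer /tmp/disk5 /tmp/disk7
apple[~] zpool status safer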

Periodic checks of the file system (with zpool scrub) walk every block and verify its checksum, picking up any read, write or checksum errors on the disks. Since the data can be reconstructed from the redundant copies, problems can be repaired automatically, provided no more devices fail than the redundancy allows (one for mirrored or RAID Z pools, two for RAID Z2), and a hot spare can be brought in should the need arise.
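
Scrubs don't happen by themselves, so it's worth scheduling one. A root cron entry is the simplest approach on Mac OS X (a launchd job would be more idiomatic, and the path to zpool may differ on your install), and 'zpool status -x' gives a one-line health summary in between:

apple[~] sudo crontab -e
# scrub both pools early every Sunday morning
30 3 * * 0 /usr/sbin/zpool scrub safe
40 3 * * 0 /usr/sbin/zpool scrub safer
apple[~] zpool status -x
all pools are healthy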

That wraps it up for this lengthy post. Next time, we'll revisit snapshots and discover how we can send them to remote ZFS pools for off-site backup purposes.