In my last post, I discussed creating file systems under zpools. By now, you should have a basic idea of how the system works and what it can do. I'm now going to go back to basics and talk about data integrity.
When you create a zpool, you can define how multiple devices are used. When multiple devices are used together, some variant of RAID is often used. Before discussing what ZFS can do, let's review the standard RAID levels:
- Striping (RAID 0): each block of data is written onto either one disk or another, so both disks must be present in order to read the data back. The performance of a striped set is usually higher than that of a single disk, because writes and reads can be shared across the physical disks. A slight variation of this is called Spanning or JBOD (Just a Bunch of Disks), which treats two physical disks as one large one by writing to one disk until it fills up and then using the second. Performance of JBOD is equivalent to that of a single disk; in essence, it's just a way of creating a bigger virtual disk.
- Mirroring (RAID 1): two equal sized disks are used. Writes are sent to both disks; if one disk fails, the other disk holds the complete image and so can be used standalone. Write performance is equivalent to that of a single disk, since the data has to be written to both places. Simple implementations read at single-disk speed too, although either disk can in principle serve a read.
- RAID 5: three or more disks are used, with parity data written alongside the data and spread across the disks. If a disk read failure occurs, the missing data can be reconstructed from the surviving data and parity on the other disks.
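To make the parity idea concrete, here's a toy sketch in plain shell arithmetic. It has nothing to do with how any real RAID controller (or ZFS) lays out data; the three byte values and variable names are made up purely for illustration. It just shows that XOR parity lets you rebuild any one missing value from the survivors.

```shell
#!/bin/sh
# Toy illustration of single-parity reconstruction (not real RAID code).
# Three hypothetical "data disks" each hold one byte.
d1=72    # 'H'
d2=105   # 'i'
d3=33    # '!'

# Parity is the XOR of all the data values in the stripe.
parity=$(( d1 ^ d2 ^ d3 ))

# Pretend the disk holding d2 has failed: XOR the survivors with the parity.
rebuilt=$(( d1 ^ d3 ^ parity ))

echo "original=$d2 rebuilt=$rebuilt"
```

Lose two values at once, though, and a single parity value can't recover them, which is why single-parity schemes only protect against one disk failure.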
In the examples I've shown so far, the virtual disks have been combined in a RAID 0 style stripe. This works well for increasing the data space available, but doesn't do anything for data reliability. What we need to do instead is duplicate the data so that if a data read fails, we can reconstruct the missing data.
However, when data is important, you don't just want to rely on a single copy of the data. If something happens to it, then you've lost your work. That is important when it's your ripped CD collection, but it's even more important when you have irreplaceable photos of family members.
The first approach to ensuring that multiple copies of data are created is to set the copies property:
    apple[~] mkfile 64m /tmp/disk1
    apple[~] zpool create sample /tmp/disk1
    apple[~] zfs set copies=2 sample
    apple[~] echo "Hello World" > /Volumes/sample/HelloWorld.txt
Now we've got two copies of our HelloWorld text file on disk. We can verify this by running a grep on the disk itself:
    apple[~] strings /tmp/disk1 | grep Hello
    Hello World
    HelloWorld.txt
    Hello World
    HelloWorld.txt
    HelloWorld.txt
Of course, this isn't true data security. If something happens to the disk, we lose both copies and thus it's not guaranteed reliability. But it does help in the case there's a disk head crash and you've only got one disk available - for example, in a single-disk laptop for some data that you might want to store in duplicate for safety's sake. Combined with compression, it might be the case that two copies take up less space than a single (uncompressed) copy.
What we'd ideally like to do is create a mirrored (or better) set of disks so that writes are sent to separate physical devices. If you've got more than one physical disk, you can create a zpool mirror quite easily:
    apple[~] mkfile 64m /tmp/disk1
    apple[~] mkfile 64m /tmp/disk2
    apple[~] zpool create safe mirror /tmp/disk1 /tmp/disk2
    apple[~] zpool upgrade safe
    apple[~] zpool status safe
      pool: safe
     state: ONLINE
     scrub: none requested
    config:
            NAME            STATE     READ WRITE CKSUM
            safe            ONLINE       0     0     0
              mirror        ONLINE       0     0     0
                /tmp/disk1  ONLINE       0     0     0
                /tmp/disk2  ONLINE       0     0     0
    errors: No known data errors
Whenever we write data to this pool, the write is duplicated through to the underlying devices (in this case, /tmp/disk1 and /tmp/disk2 respectively). Even better, when we read from the system, the reads will be striped over both devices. Since each device holds its own checksum for the data on disk, any disk error can be detected automatically and repaired on the fly from the good copy, without intervention.
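The detect-and-repair idea can be sketched with ordinary files standing in for the two sides of a mirror. This is only a simulation under my own assumptions (the side_a/side_b file names and the healed_read function are hypothetical, and cksum is standing in for whatever checksum ZFS really uses); it isn't how ZFS is implemented.

```shell
#!/bin/sh
# Toy simulation of a self-healing mirrored read using ordinary files.
work=$(mktemp -d)
printf 'Hello World\n' > "$work/side_a"
cp "$work/side_a" "$work/side_b"

# Checksum recorded at write time.
expected=$(cksum < "$work/side_a")

# Simulate silent corruption on one side of the mirror.
printf 'Jello World\n' > "$work/side_a"

# Return the first copy whose checksum matches, and overwrite any
# copy that does not match with the known-good data.
healed_read() {
    good=""
    for side in "$work/side_a" "$work/side_b"; do
        if [ "$(cksum < "$side")" = "$expected" ]; then
            good=$side
            break
        fi
    done
    [ -n "$good" ] || return 1    # no valid copy left
    for side in "$work/side_a" "$work/side_b"; do
        [ "$(cksum < "$side")" = "$expected" ] || cp "$good" "$side"
    done
    cat "$good"
}

result=$(healed_read)
echo "$result"
```

The reader gets the right answer and the bad copy is quietly rewritten. A plain block-level mirror has no checksum to consult, so it can't tell which of two differing copies is the good one.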
Let's see what a problem looks like. We're going to use dd to copy some random data over the first part of one disk to simulate a disk failure, then use a scrub to inform us of the problem, followed by a disk replacement:
    apple[~] echo Hello World > /Volumes/safe/HelloWorld.txt
    apple[~] dd if=/dev/random of=/tmp/disk1 bs=1024 count=1024
    apple[~] cat /Volumes/safe/HelloWorld.txt
    Hello World
    apple[~] zpool status safe
      pool: safe
     state: ONLINE
     scrub: none requested
    config:
            NAME            STATE     READ WRITE CKSUM
            safe            ONLINE       0     0     0
              mirror        ONLINE       0     0     0
                /tmp/disk1  ONLINE       0     0     0
                /tmp/disk2  ONLINE       0     0     0
    errors: No known data errors
    apple[~] zpool scrub safe
    apple[~] zpool status safe
      pool: safe
     state: DEGRADED
    status: One or more devices could not be used because the label is missing
            or invalid. Sufficient replicas exist for the pool to continue
            functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-4J
     scrub: scrub completed with 0 errors on Sun Apr 6 22:55:40 2008
    config:
            NAME            STATE     READ WRITE CKSUM
            safe            DEGRADED     0     0     0
              mirror        DEGRADED     0     0     0
                /tmp/disk1  UNAVAIL      0     0     0  corrupted data
                /tmp/disk2  ONLINE       0     0     0
    errors: No known data errors
It's noticed that we've taken a big chunk of data out of one of the disks, and it's showing the array as degraded. However, we can still continue to use the array since we've got a valid data copy. Let's create a new disk, and replace it:
    apple[~] mkfile 64m /tmp/disk3
    apple[~] zpool replace safe /tmp/disk1 /tmp/disk3
    apple[~] zpool status safe
      pool: safe
     state: DEGRADED
     scrub: resilver completed with 0 errors on Sun Apr 6 23:01:32 2008
    config:
            NAME              STATE     READ WRITE CKSUM
            safe              DEGRADED     0     0     0
              mirror          DEGRADED     0     0     0
                replacing     DEGRADED     0     0     0
                  /tmp/disk1  UNAVAIL      0     0     0  corrupted data
                  /tmp/disk3  ONLINE       0     0     0
                /tmp/disk2    ONLINE       0     0     0
    errors: No known data errors
    apple[~] zpool status safe
      pool: safe
     state: ONLINE
     scrub: resilver completed with 0 errors on Sun Apr 6 23:01:32 2008
    config:
            NAME            STATE     READ WRITE CKSUM
            safe            ONLINE       0     0     0
              mirror        ONLINE       0     0     0
                /tmp/disk3  ONLINE       0     0     0
                /tmp/disk2  ONLINE       0     0     0
    errors: No known data errors
We're back in business. The replacement was still in progress the first time we ran the status command, and had finished by the time we ran it a second time. What's more, unlike hardware RAID 1 solutions (or indeed, some software RAID 1 solutions), the resilvering took a time proportional to the amount of data in use as opposed to the size of the disk. Given that the average disk is now measured in TB instead of GB, simply copying all blocks from one disk to another takes a long time. If you know which subset of blocks to copy, and you're not using all the space available, then copying just that subset can be much quicker.
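A back-of-the-envelope calculation shows how big the difference can be. Every figure here is an assumption I've picked for illustration (a 1 TB disk, 100 GB in use, a 50 MB/s copy rate); it's not a measurement of any real system.

```shell
#!/bin/sh
# Rough resilver estimate. All figures are made-up assumptions.
disk_size_mb=1000000     # a 1 TB disk, in MB
data_in_use_mb=100000    # 100 GB actually in use
throughput_mb_s=50       # assumed sustained copy rate

# A block-level mirror has to copy every sector of the disk.
full_copy_s=$(( disk_size_mb / throughput_mb_s ))

# A resilver that only copies blocks in use has far less to do.
resilver_s=$(( data_in_use_mb / throughput_mb_s ))

echo "full copy: ${full_copy_s}s, blocks in use only: ${resilver_s}s"
```

With those numbers, that's about five and a half hours against a little over half an hour.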
For the average home user, mirroring is a good way of achieving data integrity with the minimum of hassle. You can even buy devices that do hardware mirroring across two physical disks, but using a software-based solution like ZFS is a superior choice, for the following reasons.
A RAID 1 hardware device just mirrors the disk sectors, bit for bit. Any disk errors get propagated between the devices (which is especially true in a hot-swap type device), and the OS has no clue whether the disk blocks are valid or not. Furthermore, the hardware device will be file-system agnostic, and so won't know how much (or how little) of the disk is in use. In order to fix this, it will run through and mirror the entire disk.
ZFS' mirroring is much more intelligent. Not only does it know what to copy during a resilver (which can make it much faster for replacing disks), it also has checksums on disk and so knows when a block is bad. The file system layer can then fix the bad block transparently, without the user being aware that there's a problem. This information isn't hidden completely though: a look at 'zpool status' will confirm what has happened.
A final word on RAID 5. Mirroring is one of the safest ways for small users to ensure that their data is protected against a single disk failure. However, larger organisations might want to get a higher utilisation of disk space; with N disks, a mirrored approach gives the space of N/2 disks. With a higher variant of RAID, the space is N-1 (or N-2) disks, which for large N can be a significant difference. However, unless you're looking at 3+ disks in your system, mirroring is the way to go.
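Those formulas are easy to play with. Here's a small sketch, where usable_disks is a hypothetical helper of my own (not part of any ZFS tool) and "mirror" is assumed to mean two-way mirrored pairs. It counts whole disks' worth of space and ignores metadata overhead.

```shell
#!/bin/sh
# Usable capacity, counted in disks, for a layout built from n equal disks.
# usable_disks is a hypothetical helper, not a ZFS command.
usable_disks() {
    layout=$1
    n=$2
    case $layout in
        stripe) echo "$n" ;;
        mirror) echo $(( n / 2 )) ;;    # two-way mirrored pairs
        raidz)  echo $(( n - 1 )) ;;    # one disk's worth of parity
        raidz2) echo $(( n - 2 )) ;;    # two disks' worth of parity
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

for layout in stripe mirror raidz raidz2; do
    echo "$layout with 8 disks: $(usable_disks "$layout" 8) usable"
done
```

With 8 disks, mirroring leaves 4 disks of space while the parity layouts leave 7 or 6; with only 2 or 3 disks the gap is too small to matter much.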
ZFS provides something called RAID-Z, which is similar to RAID 5 in principle, if not in practice. The idea is to split the writes across many disks such that if any one disk fails, the array can continue to be used. (RAID-Z2 allows two disks to fail simultaneously.) The technical difference between RAID 5 and RAID-Z is that the latter doesn't suffer from the RAID write hole, which you can search for if you want more information. Creating a RAID-Z array is just as easy as creating a mirror, except that we need to use at least three devices, or four if we want a hot spare:
    apple[~] mkfile 64m /tmp/disk4
    apple[~] mkfile 64m /tmp/disk5
    apple[~] mkfile 64m /tmp/disk6
    apple[~] mkfile 64m /tmp/disk7
    apple[~] zpool create safer raidz /tmp/disk4 /tmp/disk5 /tmp/disk6 spare /tmp/disk7
    apple[~] zpool upgrade safer
    apple[~] zpool status safer
      pool: safer
     state: ONLINE
     scrub: none requested
    config:
            NAME            STATE     READ WRITE CKSUM
            safer           ONLINE       0     0     0
              raidz1        ONLINE       0     0     0
                /tmp/disk4  ONLINE       0     0     0
                /tmp/disk5  ONLINE       0     0     0
                /tmp/disk6  ONLINE       0     0     0
            spares
              /tmp/disk7    AVAIL
    errors: No known data errors
    apple[~] zfs get available safer
    NAME   PROPERTY   VALUE  SOURCE
    safer  available  86.2M  -
The amount of space we have available is equivalent to two disks. If anything happens to one of disk{4,5,6}, then disk7 will be automatically brought online.
Periodic checks of the file system (with zpool scrub) will perform the block-checking to determine if there are any read or write errors on the disk. Given that the data can be recreated using the redundant data on the other disks, problems can be automatically rectified as long as the number of failed disks doesn't exceed what the layout tolerates (one for a two-way mirror or RAID-Z, two for RAID-Z2), and a hot spare can be brought in should the need arise.
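If you want to automate the periodic check, something along these lines might be a starting point. It's a sketch under assumptions: pool_state and check_pool are hypothetical helper names of mine, and the parsing relies only on the "state:" line looking the way it does in the zpool status output shown in this post.

```shell
#!/bin/sh
# Extract the value of the "state:" line from `zpool status` output on stdin.
pool_state() {
    sed -n 's/^ *state: *//p' | head -n 1
}

# Hypothetical wrapper: start a scrub, then warn if the pool isn't ONLINE.
# A scrub runs in the background, so on a large pool the state may only
# change once the scrub has had time to find a problem.
check_pool() {
    pool=$1
    zpool scrub "$pool"
    state=$(zpool status "$pool" | pool_state)
    [ "$state" = "ONLINE" ] || echo "WARNING: pool $pool is $state" >&2
}
```

You could then call check_pool safe from whatever scheduler you prefer, and have the warning mailed or logged somewhere you'll actually see it.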
That wraps it up for this lengthy post. Next time, we'll revisit snapshots and discover how we can send them to remote ZFS pools for off-site backup purposes.