If btrfs interested you, start your next-gen trip with a step-by-step guide to ZFS.
In my last article on next-gen filesystems, we did something in between a generic high altitude overview of next-gen filesystems and a walkthrough of some of btrfs’ features and usage. This time, we’re going to specifically look at what ZFS brings to the table, walking through getting it installed and using it on one of the more popular Linux distributions: Precise Pangolin. That’s the most current Long Term Support (LTS) Ubuntu release.
With that said… if Ubuntu’s not your cup of tea, don’t worry! There are lots of options for running ZFS, and very little of this walkthrough will really depend on your use of Ubuntu in particular or even Linux in general. You can always visit http://zfsonlinux.org directly for help with the initial installation if you prefer RHEL or Fedora or Arch or what have you—and if you’re a BSD fan, ZFS is available from the base installer in PC-BSD and in the latest 10.0 release of FreeBSD itself.
In the interest of brevity, I’m going to assume you’re already familiar with most of the generic terms and features associated with next-gen filesystems: atomic snapshots, asynchronous incremental replication, self-healing arrays, per-block checksumming, etc. If you aren’t already familiar with those concepts, you might want to brush up on the last article to catch up.
Prerequisites and installation
You’ll need a 64-bit PC or virtual machine with a recommended minimum of 8GB of RAM (you may be able to squeak by with less—possibly much less—but you’re more likely to encounter performance degradation or odd behavior if you do) and several hard drives or partitions available to use with ZFS. We’re going to use and cover Ubuntu Precise here, specifically. If you have a different distribution of Linux, you’ll need to look at a guide for installation on your distribution at http://zfsonlinux.org. If you’re using (a new version of) FreeBSD, PC-BSD, or one of the Solaris variants, you should have ZFS support built-in already.
Assuming that you’ve got a 64-bit PC running Ubuntu Precise just like I do here, actual installation is pretty mind-numbingly simple: first, we need to add the PPA (Personal Package Archive) for ZFSonLinux, then install the package itself:
you@box:~$ sudo apt-get install python-software-properties
you@box:~$ sudo apt-add-repository ppa:zfs-native/stable
you@box:~$ sudo apt-get update
you@box:~$ sudo apt-get install ubuntu-zfs
In the first command above, we gave ourselves access to the apt-add-repository command, which makes it much simpler to safely add PPAs to our repository list. Then we added the PPA, updated our source list to reflect that, and installed the package itself. Couldn’t be (much) easier. One note: the final step isn’t just slapping a binary in place. This command automatically compiles the module for your particular kernel live at the time of installation, so you should expect it to take a minute or five.
Limiting the ARC
Another note: this tuning step is for Linux only! If you’re using a BSD or Solaris variant, this isn’t necessary, and you may choose to skip ahead to the next section.
One of the weaknesses in the ZFS implementation on Linux is that the ARC—which is ZFS’ Adaptive Replacement Cache, a smarter type of cache than the typical “first in, first out” cache used by most filesystems—is unfortunately too slow to release RAM back into the system. In theory, it should behave as a normal filesystem cache does: when your system needs more RAM, the ARC should release RAM to it as necessary. In practice, large memory allocation requests—such as starting up a virtual machine or a large database—are likely to simply fail if they need RAM currently allocated to the ARC.
What this means for you is that you should manually limit the ARC to an appropriate maximum value for your system. A pure fileserver or NAS might want almost all RAM available for the ARC, or a virtual machine host might want as much RAM as possible available for the system itself. If you’re in any doubt at all, half the system RAM is a great starting point, and you can adjust later if you need to. We’ll set this value in /etc/modprobe.d/zfs.conf:
# /etc/modprobe.d/zfs.conf
#
# yes you really DO have to specify zfs_arc_max IN BYTES ONLY!
# 16GB=17179869184, 8GB=8589934592, 4GB=4294967296,
# 2GB=2147483648, 1GB=1073741824, 500MB=536870912, 250MB=268435456
#
options zfs zfs_arc_max=4294967296
Note that just creating this file doesn’t actually apply the change. The file is only read when the kernel module is initially loaded, so you’ll have to either load and unload the module or reboot the system to make it take effect. If you’re adventurous, and you have a really new version of ZFS on Linux (newer than the version I am using today), you may also be able to just echo the value directly to the module for immediate effect:
you@box:~$ sudo -s
root@box:~# echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
Be aware that changes echoed directly to the module as shown above take place immediately but do not persist across reboots. Extra-confusingly, you need to look at a different value if you want to check what your current in-use value of zfs_arc_max is:
you@box:~$ sudo grep c_max /proc/spl/kstat/zfs/arcstats
c_max                           4    4294967296
Once you’re satisfied that you’ve set zfs_arc_max to a value appropriate for your system (remember: half the available RAM is a good rule of thumb), you’re ready to move on.
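Before you commit a number to zfs.conf, it’s easy to sanity-check the arithmetic. Here’s a quick sketch assuming a hypothetical 8GB machine (on a real Linux box you could read MemTotal out of /proc/meminfo rather than hard-coding the total):

```shell
# Hypothetical 8GB machine: halve total RAM to get zfs_arc_max in bytes.
total_bytes=$((8 * 1024 * 1024 * 1024))   # 8GB expressed in bytes
arc_max=$((total_bytes / 2))              # rule of thumb: half of RAM
echo "options zfs zfs_arc_max=$arc_max"
```

That prints options zfs zfs_arc_max=4294967296—exactly the 4GB line used in the example zfs.conf above.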
Learning ZFS lingo
Before we go nuts on the command line, let’s briefly discuss some important ZFS terminology—that is, how we refer to the hierarchy that the actual physical disks will fit into. In the broadest terms, any data stored on ZFS is stored on one or more vdevs, which may populate one or more higher-level vdevs, which populate a single zpool (though you can have multiple zpools in one system). Confused yet? Let’s start from the smallest term and move our way up.
A vdev is a Virtual DEVice. This can be a single physical drive or partition, or it can be a higher-level vdev consisting of multiple… uh… “lower-level” vdevs. To make everybody’s life easier, from here on out I’m just going to refer to simple drives or partitions as devices, and when I use the term vdev, I will be referring to higher-level vdevs.
A “higher-level vdev” is a special ZFS raid array of lower-level vdevs, typically individual physical drives or partitions. These devices may be arranged in the vdev as a simple mirror or as a striped array with parity. Parity may be one, two, or three blocks per stripe—corresponding to RAID5, RAID6, or a mythical nonexistent RAID7 (referred to as raidz1, raidz2, and raidz3 respectively). Note that it’s not possible to create a higher-level vdev out of other higher-level vdevs. However, if you have multiple higher-level vdevs in a single pool, the pool functions as a variable stripe width raid0 across them—so a pool filled with mirror vdevs is largely similar to a raid10 array, a pool filled with raidz1 vdevs is largely similar to a raid50 array, and so on.
One final note about vdevs: vdevs are immutable. This means that once you’ve created yourself a RAIDZ1 vdev with three drives in it, if you buy a fourth drive, you cannot add it to your existing vdev (sorry, Charlie). You can always add more vdevs to a zpool, but you cannot grow or shrink a vdev once created. There is one (and only one) exception: if you remove and then replace each drive in an existing vdev with a larger drive, one by painful one, once every single drive has been replaced with a larger one, you can resilver your vdev, and it will then be higher capacity. This could, of course, require weeks in a large and heavily populated vdev… but that’s just how ZFS rolls. (If this is a show-stopper for you, btrfs—which is usable, but not yet production-ready—may be more your cup of tea.)
A zpool is a named device which consists of one or more vdevs in what is effectively a variable-stripe-width RAID0 configuration. Whew. What this actually means is that you can add a bunch of arbitrary devices or higher-level vdevs into a zpool, and the system will do its best to fill them all up at the same time, even if they are of different sizes. For example, say you create a new zpool named ars with two vdevs: a two-drive 1TB mirror vdev, and a two-drive 2TB mirror vdev. It will look something like this:
you@box:~$ sudo zpool status ars
  pool: ars
 state: ONLINE
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        ars                         ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50014ee2080259c8  ONLINE       0     0     0
            wwn-0x50014ee2080268b2  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x50014ee25d4cdecd  ONLINE       0     0     0
            wwn-0x50014ee25d4ce711  ONLINE       0     0     0

errors: no known data errors
OK, so that’s relatively easy to visualize—a zpool with two mirror vdevs in it. But what’s this about “variable stripe width RAID0?” Basically, the zpool will do its best to fill all vdevs up at the same rate relative to their capacity. So in our example above, if mirror-0 is two 1TB drives, and mirror-1 is two 2TB drives, for every 3GB of data we write, 1GB of it goes on mirror-0, and the other 2GB goes on mirror-1. Each vdev will have the same percentage of space free at any given time, all the way up to when the pool’s completely full… and mirror-1 is going to be receiving twice as much of the data as mirror-0 is, which will correspondingly hurt your overall performance.
But what if we add a new vdev to the pool later? (Yes, we can do that.) Let’s assume that the zpool is 50 percent full, and we add a third 2TB mirror vdev. In terms of free space, that means we now have a vdev with 500GB free (the half-full mirror-0), a vdev with 1TB free (the half-full mirror-1), and a vdev with 2TB free (the new, empty mirror-2). If we write another 1400GB of data to the zpool now, 200GB goes on mirror-0, 400GB goes on mirror-1, and 800GB goes on mirror-2—so, again, the elements of our zpool are filling their remaining empty space at the same rate.
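Those figures fall straight out of proportional allocation: each vdev receives a share of the write proportional to its free space. A quick sketch of the math, using the example numbers above:

```shell
# Split a 1400GB write across three vdevs in proportion to free space.
write_gb=1400
free0=500; free1=1000; free2=2000      # per-vdev free space, in GB
total=$((free0 + free1 + free2))       # 3500GB free across the pool
echo "mirror-0 gets $((write_gb * free0 / total))GB"   # 200GB
echo "mirror-1 gets $((write_gb * free1 / total))GB"   # 400GB
echo "mirror-2 gets $((write_gb * free2 / total))GB"   # 800GB
```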
Yes, this means that now you’re working the drives in mirror-2 much harder than the drives in mirror-1, and the drives in mirror-1 much harder than the drives in mirror-0. So this zpool will perform significantly slower than it would have if all of the vdevs had been in it to begin with and if all of the vdevs had been the same size. Such is life under ZFS; you may very well choose not to use mismatched vdevs—or not to add new vdevs to an existing zpool—for exactly this reason.
For the purposes of this article, I’m going to be sloppy and refer to both mirror vdevs and raidz vdevs as vdevs with parity, meaning that if a single device in the vdev fails or returns corrupt data, the data can be reconstructed from other devices in the vdev. Note that a vdev without parity (such as a single drive or partition) has no way of reconstructing corrupt or missing data. More on that next.
Hating your data
Hating your data is a highly technical term referring to the addition of a single device to an existing zpool. I’m being tongue-in-cheek here, but I’m hoping it grabs your attention. The nature of a zpool means you can get yourself badly in trouble if you aren’t careful when you decide to add new vdevs to it later. What happens if, instead of adding another 2TB mirror vdev, we had added a single 2TB drive to our “ars” pool from the zpool example above?
you@box:~$ sudo zpool status ars
  pool: ars
 state: ONLINE
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        ars                         ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50014ee2080259c8  ONLINE       0     0     0
            wwn-0x50014ee2080268b2  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x50014ee25d4cdecd  ONLINE       0     0     0
            wwn-0x50014ee25d4ce711  ONLINE       0     0     0
          wwn-0x50014ee25d4cr2d2    ONLINE       0     0     0

errors: no known data errors
Keep in mind, the zpool stripes all data, including filesystem metadata, across all the top-level vdevs—and now, one of our top-level vdevs has no parity. This means that any failure on that single disk will bring the whole array down. All that space you devoted to parity on the first two vdevs won’t do you any good at all if you lose a sector on the singleton!
If the failure odds aren’t enough by themselves to make you unhappy, let’s revisit the last example: nearly sixty percent of the 1.4TB of data we wrote after adding that third vdev went to that third vdev, which in this case is a single 2TB drive. This means that our performance just tanked hard, both on writes and on reads of the new data that’s mostly stored on a single disk.
By the way, if you’re thinking “well at least that first 1.5TB of data is safe,” sorry, it isn’t. Remember, all data, including filesystem metadata, is striped across all top-level vdevs with each new write. You lose that single disk after you added it, and your zpool will be left in an inconsistent state and will refuse to mount at all, no matter how prettily you cry.
Like I said, adding a single disk to a zpool is called hating your data. Don’t do it. Please.
Baby’s first zpool
OK, now that your eyes have glazed over with terminology, let’s get down to business and actually set up a zpool. Believe it or not, it’s pretty easy. We’re going to create a relatively simple zpool here, consisting of a single raidz1 vdev with three drives in it. We’re going to devote the entire drives to ZFS, so we don’t need to partition them first. First step is positively identifying the drives, using /dev/disk/by-id:
me@box:~$ ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Aug 1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01065572 -> ../../sda
lrwxrwxrwx 1 root root 10 Aug 1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01065572-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Aug 1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01065731 -> ../../sdb
lrwxrwxrwx 1 root root  9 Aug 1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01127972 -> ../../sdc
lrwxrwxrwx 1 root root  9 Aug 1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01133538 -> ../../sdd
lrwxrwxrwx 1 root root  9 Aug 1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01065572 -> ../../sda
lrwxrwxrwx 1 root root 10 Aug 1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01065572-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Aug 1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01065731 -> ../../sdb
lrwxrwxrwx 1 root root  9 Aug 1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01127972 -> ../../sdc
lrwxrwxrwx 1 root root  9 Aug 1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01133538 -> ../../sdd
lrwxrwxrwx 1 root root  9 Aug 1 22:09 wwn-0x50014ee2080259c8 -> ../../sdd
lrwxrwxrwx 1 root root  9 Aug 1 22:09 wwn-0x50014ee2080268b2 -> ../../sdc
lrwxrwxrwx 1 root root  9 Aug 1 22:09 wwn-0x50014ee25d4cdecd -> ../../sda
lrwxrwxrwx 1 root root 10 Aug 1 22:09 wwn-0x50014ee25d4cdecd-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Aug 1 22:09 wwn-0x50014ee25d4ce711 -> ../../sdb
Wheee! How do we interpret this? Well, we have four physical drives, beginning with /dev/sda and ending with /dev/sdd. We can see that /dev/sda has a partition table on it, which confirms what we’d already assume—this is our existing system drive. The other three drives are bare, and we’ll use them for our raidz1 vdev.
We can see each drive listed multiple times because they can be referred to multiple ways: by their wwn ID, by their model and serial number as connected to the ATA bus, or by their model and serial number as connected to the (virtual, in this case) SCSI bus. Which one should you pick? Well, any of them will work, including the super simple devicename (like /dev/sdb) itself, but you want to pick one that you can also see on the label on the physical drive. This is so you can be absolutely certain that you pull and replace the correct drive later, if one fails.
If your drive doesn’t show the wwn ID on the drive label, use the scsi-model-serial listing. If your drive does have the wwn ID printed visibly, I’d use that instead (just because it’s shorter). In this example, my drives do have the wwn printed visibly, so I’m going to use that.
me@box:~$ sudo zpool create -o ashift=12 ars raidz1 \
    /dev/disk/by-id/wwn-0x50014ee25d4ce711 \
    /dev/disk/by-id/wwn-0x50014ee2080268b2 \
    /dev/disk/by-id/wwn-0x50014ee2080259c8
Whew. Dissecting the pieces: -o ashift=12 means “use 4K blocksizes instead of the default 512-byte blocksizes,” which is appropriate on almost all modern drives; ars is the name of our new zpool; raidz1 means we want a striped array with a single block of parity per stripe; and, finally, we have the device identifiers themselves. Voila:
me@box:~$ sudo zpool status
  pool: ars
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        ars                         ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50014ee25d4ce711  ONLINE       0     0     0
            wwn-0x50014ee2080268b2  ONLINE       0     0     0
            wwn-0x50014ee2080259c8  ONLINE       0     0     0

errors: No known data errors

me@box:~$ sudo zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
ars   2.98T  1008K  2.98T     0%  1.00x  ONLINE  -
Yay! Our first zpool! The first thing we should notice here is that we’re seeing the full 3TB of raw capacity even though we asked for raidz1, which devotes 1TB of those 3TB to parity. That’s because the zpool command shows us raw capacity, not usable capacity. We can see the usable capacity by using the zfs command, which queries filesystems rather than zpools (or by using standard system tools like df):
me@box:~$ sudo zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
ars    709K  1.96T   181K  /ars

me@box:~$ df -h /ars
Filesystem      Size  Used Avail Use% Mounted on
ars             2.0T  128K  2.0T   1% /ars
There we go. Now we are seeing the 2TB of usable capacity we’d expect out of a single-parity array with three 1TB drives.
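The general rule is worth spelling out: a raidz vdev built from n equally sized drives with p parity blocks per stripe leaves you (n - p) drives’ worth of usable space. Sketching that arithmetic for our pool:

```shell
# Usable capacity of a raidz vdev: (drives - parity) * drive size.
drives=3; parity=1; drive_tb=1                 # three 1TB drives, raidz1
usable_tb=$(( (drives - parity) * drive_tb ))
echo "${usable_tb}TB usable"                   # 2TB, matching zfs list
```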
More ZFS lingo
Now that we’ve created our first zpool, let’s look at the purely logical constructs we can (and should) create underneath them: filesystems, zvols, snapshots, and clones.
The first of those terms, filesystem, doesn’t mean what you probably think it means. Under ZFS, a filesystem is sort of like a partition that comes already formatted for you, only it’s really easy to create filesystems, modify them, resize them, and otherwise play around with them instantly and whenever you’d like. Let’s create a couple now.
me@box:~$ sudo zfs create ars/textfiles
me@box:~$ sudo zfs create ars/jpegs
me@box:~$ sudo zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
ars            1.18M  1.96T   192K  /ars
ars/jpegs       181K  1.96T   181K  /ars/jpegs
ars/textfiles   181K  1.96T   181K  /ars/textfiles

me@box:~$ ls -l /ars
total 22
drwxr-xr-x 2 me me 2 Jan 23 15:10 jpegs
drwxr-xr-x 2 me me 2 Jan 23 15:10 textfiles
Snazzy! But why would we want to create filesystems instead of just making folders, since they look just like folders? Lots of reasons. You can take snapshots of a filesystem (but not of a folder), and you can also set properties on a filesystem. Since one of these folders is for Lee’s awesome ANSI art, we know it’ll be highly compressible. Let’s go ahead and set compression on it. And just to make sure our jpeg hoarding problem won’t consume our entire storage pool, let’s go ahead and set a quota of 200G on it:
me@box:~$ sudo zfs set compression=on ars/textfiles
me@box:~$ sudo zfs set quota=200G ars/jpegs
me@box:~$ sudo zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
ars            1.19M  1.96T   202K  /ars
ars/jpegs       181K   200G   181K  /ars/jpegs
ars/textfiles   181K  1.96T   181K  /ars/textfiles
Nice! We can see now that /ars/jpegs only shows 200G of space available. We can just as easily set the quota from 200G to 1T now or shrink it from 200G down to, say, 50M. It’s all instantaneous and easy—no scary partition sizing, no weird “filesystem grow/shrink operations” afterward, just (re)set the quota and you’re done. But what about the compression we set on /ars/textfiles? I don’t actually have a significant amount of Lee’s ANSI art on hand, but we can check it out by writing a whole bunch of 00s into a big file:
me@box:~$ dd if=/dev/zero bs=128M count=128 of=/ars/textfiles/zeroes.bin
128+0 records in
128+0 records out
17179869184 bytes (17 GB) copied, 12.1335 s, 1.4 GB/s
OK, that’s 16GB worth of 00s written to disk. Wait a minute, did that say that it wrote at 1.4 GBps? Sure did—highly compressible data can be compressed in memory faster than it can be written to disk, so in some cases (like textfiles, or even more so, ridiculously large numbers of zeroes) having compression on can be a huge performance win. (Compression will slow performance down on already-compressed or otherwise incompressible data, like most images, movies, executables, etc.)
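You can demonstrate the same principle outside ZFS with any general-purpose compressor. This sketch uses gzip rather than the algorithm ZFS applies when you set compression=on (lzjb, at the time of writing), but the idea is identical—compare a megabyte of zeroes against a megabyte of random data:

```shell
# Highly compressible vs. incompressible data, through gzip.
zeroes=$(head -c 1048576 /dev/zero    | gzip -c | wc -c)
random=$(head -c 1048576 /dev/urandom | gzip -c | wc -c)
echo "1MB of zeroes compressed to $zeroes bytes"    # around a kilobyte
echo "1MB of random data stayed at $random bytes"   # slightly over 1MB
```

The zeroes shrink by roughly a thousandfold, while the random data actually grows slightly—the same reason compression hurts rather than helps on jpegs and movies.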
Now let’s look at our textfiles filesystem. We saw a giant performance win, will we see a corresponding storage win?
me@box:~$ sudo zfs list ars/textfiles
NAME            USED  AVAIL  REFER  MOUNTPOINT
ars/textfiles   181K  1.96T   181K  /ars/textfiles

me@box:~$ ls -lh /ars/textfiles
total 512
-rw-r--r-- 1 me me 16G Jan 23 15:24 zeroes.bin
Looks too good to be true; there isn’t any more storage space taken up by our 16GB of zeroes! In reality, this is just a very extreme case. Infinite zeroes are nearly infinitely compressible; normal text (or Lee’s ANSI art) would still be very compressible—frequently up to 90 percent—but not near infinite. And as we can see, a simple ls shows that yes, all 16GB of our zeroes are safely stored.
There are lots more properties that can be played with on ZFS filesystems, but we can’t possibly cover them all today.
A zvol is basically a ZFS filesystem “without the filesystem.” Logically, it’s presented to the system as a raw block device, directly accessible through an entry in /dev. Why would you want a zvol? Well, honestly, you probably don’t. If you did want one, it would be so that you could format it with another filesystem entirely while still being able to use ZFS features like snapshots, compression, dynamic resizing, and replication on it. You might not want one even then, since zvols can be a little quirky with how they handle snapshots, but that’s beyond the scope of what we’re trying to do today, which is just get a good beginner’s handle on the basic care and feeding of ZFS.
A snapshot is an instantaneously created copy of every single block of data in a filesystem at the exact point in time the snapshot was created. Once you have a snapshot, you can mount it, you can look through its folders and files and what have you just like you could in the original filesystem, you can copy bits and pieces out of the snapshot and into the “real world,” and you can even roll the entire filesystem itself back to the snapshot. Let’s play:
me@box:~$ echo lolz > /ars/textfiles/lolz.txt
me@box:~$ sudo zfs snapshot ars/textfiles@snapshot1
me@box:~$ sudo zfs list -rt snapshot ars/textfiles
NAME                      USED  AVAIL  REFER  MOUNTPOINT
ars/textfiles@snapshot1   133K      -   186K  -
OK, we’ve added a new file to /ars/textfiles. I felt a terminal case of the stupids coming on, so I took a snapshot of the filesystem and there it is, ars/textfiles@snapshot1. Notice how sometimes I use a leading slash and sometimes I don’t? To the filesystem, everything is relative to root, so everything has a leading slash. To ZFS, though, “ars” is the actual pool, so when we use the zfs command, we don’t put a leading slash in front of “ars.” (It’s a little confusing at first, but you get used to it.)
me@box:~$ rm /ars/textfiles/lolz.txt
me@box:~$ ls /ars/textfiles
zeroes.bin
Oh no! I knew I felt a case of the dumb coming on. My incredibly valuable lolz.txt file is gone! No worries, though, I took a snapshot… let me go ahead and mount ars/textfiles@snapshot1 and see if my missing file is there:
me@box:~$ mkdir /tmp/textfiles@snapshot1
me@box:~$ sudo mount -t zfs ars/textfiles@snapshot1 /tmp/textfiles@snapshot1
me@box:~$ ls /tmp/textfiles@snapshot1
lolz.txt  zeroes.bin
Whew! lolz.txt is safe and sound in my snapshot, which I’ve mounted under /tmp. I could just copy the file out of the mounted snapshot and put it back where I want it. But what if I’d made lots and lots of changes, and I wasn’t sure what had or hadn’t been changed? I could still put everything back the way it was by rolling back to my former snapshot.
me@box:~$ sudo umount /tmp/textfiles@snapshot1
me@box:~$ sudo zfs rollback ars/textfiles@snapshot1
me@box:~$ ls /ars/textfiles/
lolz.txt  zeroes.bin
Super easy. Everything’s just like it was. No fuss, no muss.
A clone is a copy of a filesystem (actually, a copy of a snapshot of a filesystem) that initially doesn’t take up any more space on disk. As the clone diverges from its parent, it uses actual space to store the blocks that differ. There are a few interesting use cases for clones. For example, if you want to do something experimental but really don’t want to commit to it happening in your “real” filesystem, you can instead create a clone, perform your experiments there, and then destroy the clone when you’re done.
I personally find clones most valuable when using virtual machines. You can clone an older snapshot of a VM, boot it up, and then look for files, data, or programs in it without disturbing the “real” VM. Or, you can clone a fresh snapshot and try something risky on it. Want to see what happens when you do an in-place upgrade of that old, creaky Windows Small Business Server? Clone it and test-upgrade away.
At this point, you have a zpool. That zpool has at least one nice, redundant, self-healing vdev with parity in it. You know how to take snapshots, so now let’s look at how to replicate those snapshots to another machine which is also running ZFS.
Set up SSH keys
This isn’t strictly a ZFS step, but you’ll need it in order to handle replication the easy way, so we’ll go ahead and cover it here. Let’s assume you have box1 and box2; your data is on box1 and you want to back it up to box2. Further, let’s assume you want to push the backups from box1 to box2, rather than pulling them the other way around. First, generate yourself a root SSH key on box1:
me@box:~$ sudo ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
8f:67:61:ab:4d:be:99:9f:b9:4f:68:25:37:e5:82:ed root@box1
The key's randomart image is:
+--[ DSA 1024]----+
|                 |
|                 |
|                .|
|       o o       |
|      S o o * .  |
|       + o * o   |
|      . * o E    |
|       B + +     |
|     . *o=o.     |
+-----------------+
You’ll be asked if you want to save your key to the default location /root/.ssh (you do) and if you want to use a passkey (for this example, you don’t). Once you’re done, it’s time to copy your new public key off to box2:
me@box1:~$ sudo scp /root/.ssh/id_dsa.pub me@box2:/tmp/
me@box2's password:
id_dsa.pub                              100%  602     0.6KB/s   00:00
Now it’s time to add box1’s public key to the root authorized keys file on box2 and make sure that we allow the use of keys on box2.
me@box2:~$ sudo -s
root@box2:~# cat /tmp/id_dsa.pub >> /root/.ssh/authorized_keys
root@box2:~# echo AuthorizedKeysFile %h/.ssh/authorized_keys >> /etc/ssh/sshd_config
Now we’ll be able to SSH as root with no password from box1 to box2, which is necessary for our next step.
As of right now, box2 has its own zpool (which we named technica, and which does not have to be composed of the same number, type, or arrangement of vdevs as our original zpool on box1) but has no actual filesystems on it. I now have a gigabyte of data on ars/jpegs, and I want to replicate that data to box2. Keep in mind that we don’t replicate the filesystem itself, we replicate snapshots. Let’s take a snapshot:
me@box1:~$ sudo -s
root@box1:~# zfs snapshot ars/jpegs@1
root@box1:~# zfs list -rt all ars/jpegs
NAME          USED  AVAIL  REFER  MOUNTPOINT
ars/jpegs    1024M   199G  1024M  /ars/jpegs
ars/jpegs@1      0      -  1024M  -
Now let’s replicate it:
root@box1:~# zfs send ars/jpegs@1 | ssh box2 zfs receive -d technica
It’s that easy. After the 1GB of data gets done moving across the network, you now have a replicated copy of ars/jpegs on box1 at technica/jpegs on box2:
root@box2:~# zfs list -rt all technica/jpegs
NAME               USED  AVAIL  REFER  MOUNTPOINT
technica/jpegs    1024M  1.95T  1024M  /technica/jpegs
technica/jpegs@1      0      -  1024M  -
What about the next time we replicate? Well, as long as we haven’t gotten rid of snapshot ars/jpegs@1 on box1, we can use it as a parent snapshot and do incremental replication the next time, which will go much quicker. Let’s make a silly little file, take another snapshot, and replicate incrementally:
root@box1:~# echo lolz > /ars/jpegs/lolz.txt
root@box1:~# ls -l /ars/jpegs
total 1048251
-rw-r--r-- 1 root root 1073741824 Jan 23 16:23 1G.bin
-rw-r--r-- 1 root root          5 Jan 23 16:32 lolz.txt
root@box1:~# zfs snapshot ars/jpegs@2
root@box1:~# zfs send -i ars/jpegs@1 ars/jpegs@2 | ssh box2 zfs receive technica/jpegs
Notice that this time, we used the -i argument and specified both snapshots. We also used the full path to the existing filesystem technica/jpegs in our receive command, since we’re receiving an incremental into an existing filesystem, not a full replication that creates a new one. This replication happened pretty much instantaneously—lolz.txt is just a silly little file, after all—and ZFS already knows what has or hasn’t changed from snapshot @1 to snapshot @2. Since it doesn’t have to grovel over the disk looking for changes, it can just immediately start sending them when asked.
Does everything look as we’d expect it to, over on box2?
root@box2:~# zfs list -rt all technica/jpegs
NAME               USED  AVAIL  REFER  MOUNTPOINT
technica/jpegs    1024M  1.95T  1024M  /technica/jpegs
technica/jpegs@1   117K      -  1024M  -
technica/jpegs@2      0      -  1024M  -

root@box2:~# ls -lh /technica/jpegs
total 1.0G
-rw-r--r-- 1 root root 1.0G Jan 23 16:23 1G.bin
-rw-r--r-- 1 root root    5 Jan 23 16:32 lolz.txt
Exactly as we’d expect: a full copy of the original filesystem, along with all of its snapshots, replicated over intact.
At this point, you can safely get rid of snapshot ars/jpegs@1 on box1 if you’d like to. The next time you replicate to box2, you’ll use @2 as a parent for whatever your next snapshot is and so on. This allows you to do some pretty cool stuff, like make a “main” server with expensive, fast storage (but not much of it) and a “backup” server with cheap, slow storage (with plenty of it). You can even keep lots of snapshots on your backup server, while destroying them pretty quickly from your “main” server. Pretty powerful stuff.
A word about dedup
I have to be honest: the only reason I’m even mentioning dedup is that I know there’ll be a furor in the comments if I don’t. There may be one anyway, because the next thing I have to tell you is something you don’t want to hear:
You probably don’t want to use dedup. Full stop.
Deduplication sounds exciting. Stop caring when your users blindly make a copy of a folder with 15G of stuff in it! Don’t write more stuff than you have to! Keep more stuff on the same drive! Reap some performance benefits, sometimes, depending! But the problem is, the way ZFS implements dedup, it takes up a lot of RAM; unless you have a very specialized machine and a very specialized workload, almost certainly more RAM than you’ll be willing to feed it.
The bottom line: for every 1TB of deduplicated storage, you’re going to need roughly 5GB of RAM. And that’s for dedup tablespace alone; it doesn’t count ZFS’ normal memory consumption. I’ve tested this personally. After copying about 6TB of data to a ZFS filesystem with dedup turned on, my RAM consumption went up by roughly 32GB. This was a special server with 128GB of RAM, so luckily it could handle it. Even so, I disabled dedup immediately after the test because I wasn’t happy with the result.
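You can sanity-check that anecdote against the rule of thumb with trivial arithmetic:

```shell
# Roughly 5GB of dedup table per 1TB of deduplicated data.
tb_stored=6                  # the ~6TB copied in the test above
ram_gb=$((tb_stored * 5))
echo "~${ram_gb}GB of RAM for dedup tables alone"   # in line with the ~32GB observed
```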
In most cases, for most users… it’s just not worth it. Sorry.
The final takeaway
We’ve still really only scratched the surface of what ZFS can do. But hopefully, you’ve seen enough to get you half as interested in ZFS as I am. I’ve been using ZFS professionally and in production for over five years, and I can honestly say that it’s changed the course of both my career and my business. I wouldn’t dream of going back to the way I did things before ZFS.
For you Windows and Mac users out there (or any Linux users who are allergic to the command line), don’t despair and stay tuned! Next in this series, I’ll be covering FreeNAS, which is essentially “ZFS on easy mode.” It’s a ready-to-download, ready-to-use distribution that lets you set up, manage, and configure your own ZFS-powered Network Attached Storage device out of a generic PC with a bunch of hard drives; no command line required.
Jim Salter (@jrssnet) is an author, public speaker, small business owner, mercenary sysadmin, and father of three—not necessarily in that order. He got his first real taste of open source by running Apache on his very own dedicated FreeBSD 3.1 server back in 1999, and he’s been a fierce advocate of FOSS ever since. He also created and maintains http://freebsdwiki.net and http://ubuntuwiki.net.
This article was originally posted on ArsTechnica.