Using the ZFS next-gen filesystem on Linux

If btrfs interested you, start your next-gen trip with a step-by-step guide to ZFS.

If you’re not an expert on armored anteaters, that’s a pangolin.
Aurich Lawson / ThinkStock

In my last article on next-gen filesystems, we did something in between a generic high-altitude overview of next-gen filesystems and a walkthrough of some of btrfs’ features and usage. This time, we’re going to look specifically at what ZFS brings to the table, walking through getting it installed and using it on one of the more popular Linux distributions: Precise Pangolin. That’s the most current Long Term Support (LTS) Ubuntu release.

With that said… if Ubuntu’s not your cup of tea, don’t worry! There are lots of options for running ZFS, and very little of this walkthrough will really depend on your use of Ubuntu in particular or even Linux in general. You can always visit http://zfsonlinux.org directly for help with the initial installation if you prefer RHEL or Fedora or Arch or what have you—and if you’re a BSD fan, ZFS is available from the base installer in either PC-BSD or in the latest 10.0 release of FreeBSD itself.

In the interest of brevity, I’m going to assume you’re already familiar with most of the generic terms and features associated with next-gen filesystems: atomic snapshots, asynchronous incremental replication, self-healing arrays, per-block checksumming, etc. If you aren’t already familiar with those concepts, you might want to brush up on the last article to catch up.

Prerequisites and installation

You’ll need a 64-bit PC or virtual machine with a recommended minimum of 8GB of RAM (you may be able to squeak by with less—possibly much less—but you’re more likely to encounter performance degradation or odd behavior if you do) and several hard drives or partitions available to use with ZFS. We’re going to use and cover Ubuntu Precise here, specifically. If you have a different distribution of Linux, you’ll need to look at a guide for installation on your distribution at http://zfsonlinux.org. If you’re using (a new version of) FreeBSD, PC-BSD, or one of the Solaris variants, you should have ZFS support built-in already.

Assuming that you’ve got a 64-bit PC running Ubuntu Precise just like I do here, actual installation is pretty mind-numbingly simple: first, we need to add the PPA (Personal Package Archive) for ZFSonLinux, then install the package itself:

    you@box:~$ sudo apt-get install python-software-properties
    you@box:~$ sudo apt-add-repository ppa:zfs-native/stable
    you@box:~$ sudo apt-get update
    you@box:~$ sudo apt-get install ubuntu-zfs

In the first command above, we gave ourselves access to the apt-add-repository command, which makes it much simpler to safely add PPAs to our repository list. Then we added the PPA, updated our source list to reflect that, and installed the package itself. Couldn’t be (much) easier. One note: the final step isn’t just slapping a binary in place. This command automatically compiles the module for your particular kernel live at the time of installation, so you should expect it to take a minute or five.

Initial tuning

Another note: this tuning step is for Linux only! If you’re using a BSD or Solaris variant, this isn’t necessary, and you may choose to skip ahead to the next section.

One of the weaknesses in the ZFS implementation on Linux is that the ARC—ZFS’ Adaptive Replacement Cache, a smarter type of cache than the typical “first in first out” used in most filesystems—is unfortunately too slow to release RAM back into the system. In theory, it should behave as a normal filesystem cache does. When your system needs more RAM, the ARC should release RAM to it as necessary. In practice, large memory allocation requests—such as starting up a virtual machine or a large database—are likely to simply fail if they need RAM currently allocated to the ARC.

What this means for you is that you should manually limit the ARC to an appropriate maximum value for your system. A pure fileserver or NAS might want almost all RAM available for the ARC, while a virtual machine host might want as much RAM as possible available for the system itself. If you’re in any doubt at all, half the system RAM is a great starting point, and you can adjust later if you need to. We’ll set this value in /etc/modprobe.d/zfs.conf:

    # /etc/modprobe.d/zfs.conf
    #
    # yes you really DO have to specify zfs_arc_max IN BYTES ONLY!
    # 16GB=17179869184, 8GB=8589934592, 4GB=4294967296, 2GB=2147483648, 1GB=1073741824, 500MB=536870912, 250MB=268435456
    #
    options zfs zfs_arc_max=4294967296

Note that just creating this file doesn’t actually apply the change. The file is only read when the kernel module is initially loaded, so you’ll have to either load and unload the module or reboot the system to make it take effect. If you’re adventurous, and you have a really new version of ZFS on Linux (newer than the version I am using today), you may also be able to just echo the value directly to the module for immediate effect:

    you@box:~$ sudo -s
    root@box:~# echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

Be aware that changes echoed directly to the module as shown above take place immediately but do not persist across reboots. Extra-confusingly, you need to look at a different value if you want to check what your current in-use value of zfs_arc_max is:

    you@box:~$ sudo grep c_max /proc/spl/kstat/zfs/arcstats
    c_max                           4    4294967296
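
The c_max line is the ceiling you configured. If you also want to watch how much RAM the ARC is actually using at any given moment, the same kstat file reports that as well. On the ZFS on Linux versions I’ve used, it’s the line that begins with “size” (again, in bytes):

    you@box:~$ sudo grep '^size' /proc/spl/kstat/zfs/arcstats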

Once you’re satisfied that you’ve set zfs_arc_max to a value appropriate for your system (remember: half the available RAM is a good rule of thumb), you’re ready to move on.
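
One more option before we do: if your module turns out to be too old for the echo trick above, you can still pick up the new /etc/modprobe.d/zfs.conf value without rebooting by unloading and reloading the kernel module. A minimal sketch, which assumes the module is currently loaded and that you have no zpools imported yet:

    you@box:~$ sudo modprobe -r zfs
    you@box:~$ sudo modprobe zfs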

Learning ZFS lingo

Before we go nuts on the command line, let’s briefly discuss some important ZFS terminology—that is, how we refer to the hierarchy that the actual physical disks will fit into. In the broadest terms, any data stored on ZFS is stored on one or more vdevs, which may populate one or more higher-level vdevs, which populate a single zpool (though you can have multiple zpools in one system). Confused yet? Let’s start from the smallest term and move our way up.

vdev

A vdev is a Virtual DEVice. This can be a single physical drive or partition, or it can be a higher-level vdev consisting of multiple… uh… “lower-level” vdevs. To make everybody’s life easier, from here on out I’m just going to refer to simple drives or partitions as devices, and when I use the term vdev, I will be referring to higher-level vdevs.

higher-level vdev

A “higher-level vdev” is a special ZFS raid array of lower-level vdevs, typically meaning individual physical drives or partitions. These devices may be arranged in the vdev as a simple mirror or as a striped array with parity. Parity may be one, two, or three blocks per stripe—corresponding to RAID5, RAID6, or a mythical nonexistent RAID7 (referred to as raidz1, raidz2, and raidz3 respectively). Note that it’s not possible to create a higher-level vdev out of other higher-level vdevs. However, if you have multiple higher-level vdevs in a single pool, the pool functions as a variable stripe width raid0 across them—so a pool filled with mirror vdevs is largely similar to a raid10 array, a pool filled with raidz1 vdevs is largely similar to a raid50 array, and so on.
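
For reference, here’s roughly what the creation syntax looks like for a mirror vdev versus a raidz2 vdev. Treat this purely as a sketch: the pool name and device paths are hypothetical placeholders, and these are two separate alternatives, not commands to run back to back.

    # a pool whose single vdev is a two-device mirror
    you@box:~$ sudo zpool create mypool mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB

    # a pool whose single vdev is a four-device raidz2 (two blocks of parity per stripe)
    you@box:~$ sudo zpool create mypool raidz2 /dev/disk/by-id/diskA /dev/disk/by-id/diskB /dev/disk/by-id/diskC /dev/disk/by-id/diskD

We’ll do this for real (with a raidz1 vdev and actual device IDs) a little further down.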

One final note about vdevs: vdevs are immutable. This means that once you’ve created yourself a RAIDZ1 vdev with three drives in it, if you buy a fourth drive, you cannot add it to your existing vdev (sorry, Charlie). You can always add more vdevs to a zpool, but you cannot grow or shrink a vdev once created. There is one (and only one) exception: if you remove and then replace each drive in an existing vdev with a larger drive, one by painful one, once every single drive has been replaced with a larger one, you can resilver your vdev, and it will then be higher capacity. This could, of course, require weeks in a large and heavily populated vdev… but that’s just how ZFS rolls. (If this is a show-stopper for you, btrfs—which is usable, but not yet production-ready—may be more your cup of tea.)
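
If you do go the replace-every-drive route, each swap is a zpool replace followed by a resilver, and setting the pool’s autoexpand property ahead of time should let the extra capacity appear on its own once the final resilver completes. A rough sketch, with a hypothetical pool name and device IDs:

    you@box:~$ sudo zpool set autoexpand=on mypool
    you@box:~$ sudo zpool replace mypool wwn-0xOLDDISK wwn-0xNEWDISK
    you@box:~$ sudo zpool status mypool

Wait for zpool status to show that the resilver finished before moving on to the next drive.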

zpool

A zpool is a named pool of storage, consisting of one or more vdevs in what is effectively a variable-stripe-width RAID0 configuration. Whew. What this actually means is that you can add a bunch of arbitrary devices or higher-level vdevs into a zpool, and the system will do its best to fill them all up at the same time, even if they are of different sizes. For example, say you create a new zpool named ars with two vdevs: a two-drive 1TB mirror vdev, and a two-drive 2TB mirror vdev. It will look something like this:

    you@box:~$ sudo zpool status ars
      pool: ars
     state: ONLINE
     scrub: none requested
    config:

            NAME                        STATE     READ WRITE CKSUM
            ars                         ONLINE       0     0     0
              mirror-0                  ONLINE       0     0     0
                wwn-0x50014ee2080259c8  ONLINE       0     0     0
                wwn-0x50014ee2080268b2  ONLINE       0     0     0
              mirror-1                  ONLINE       0     0     0
                wwn-0x50014ee25d4cdecd  ONLINE       0     0     0
                wwn-0x50014ee25d4ce711  ONLINE       0     0     0

    errors: no known data errors

OK, so that’s relatively easy to visualize—a zpool with two mirror vdevs in it. But what’s this about “variable stripe width RAID0?” Basically, the zpool will do its best to fill all vdevs up at the same rate relative to their capacity. So in our example above, if mirror-0 is two 1TB drives, and mirror-1 is two 2TB drives, for every 3GB of data we write, 1GB of it goes on mirror-0, and the other 2GB goes on mirror-1. Each vdev will have the same percentage of space free at any given time, all the way up to when the pool’s completely full… and mirror-1 is going to be receiving twice as much of the data as mirror-0 is, which will correspondingly hurt your overall performance.

But what if we add a new vdev to the pool later? (Yes, we can do that.) Let’s assume that the zpool is 50 percent full, and we add a third 2TB mirror vdev. In terms of free space, that means we now have a vdev with 500GB free (the half-full mirror-0), a vdev with 1TB free (the half-full mirror-1), and a vdev with 2TB free (the new, empty mirror-2). If we write another 1400GB of data to the zpool now, 200GB goes on mirror-0, 400GB goes on mirror-1, and 800GB goes on mirror-2—so, again, the elements of our zpool are filling their remaining empty space at the same rate.

Yes, this means that now you’re working the drives in mirror-2 much harder than the drives in mirror-1, and the drives in mirror-1 much harder than the drives in mirror-0. So this zpool will perform significantly slower than it would have if all of the vdevs had been in it to begin with and if all of the vdevs had been the same size. Such is life under ZFS, and you may very well choose not to use mismatched vdevs—or add new vdevs to an existing zpool—for exactly this reason.
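
If you ever want to see how a pool of your own is spreading data and I/O across its vdevs, zpool can break things out per vdev for you. On the versions I’ve used, both of these work (using our example pool name):

    you@box:~$ sudo zpool list -v ars
    you@box:~$ sudo zpool iostat -v ars 5

The first shows allocated and free space for each vdev; the second shows ongoing per-vdev I/O, refreshing every five seconds until you Ctrl-C out of it.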

Parity

For the purposes of this article, I’m going to be sloppy and refer to both mirror vdevs and raidz vdevs as vdevs with parity, meaning that if a single device in the vdev fails or returns corrupt data, the data can be reconstructed from other devices in the vdev. Note that a vdev without parity (such as a single drive or partition) has no way of reconstructing corrupt or missing data. More on that next.

Hating your data

Hating your data is a highly technical term referring to the addition of a single device to an existing zpool. I’m being tongue-in-cheek here, but I’m hoping it grabs your attention. The nature of a zpool means you can get yourself badly in trouble if you aren’t careful when you decide to add new vdevs to it later. What happens if, instead of adding another 2TB mirror vdev, we had added a single 2TB drive to our “ars” pool from the zpool example above?

    you@box:~$ sudo zpool status ars
      pool: ars
     state: ONLINE
     scrub: none requested
    config:

            NAME                        STATE     READ WRITE CKSUM
            ars                         ONLINE       0     0     0
              mirror-0                  ONLINE       0     0     0
                wwn-0x50014ee2080259c8  ONLINE       0     0     0
                wwn-0x50014ee2080268b2  ONLINE       0     0     0
              mirror-1                  ONLINE       0     0     0
                wwn-0x50014ee25d4cdecd  ONLINE       0     0     0
                wwn-0x50014ee25d4ce711  ONLINE       0     0     0
             wwn-0x50014ee25d4cr2d2     ONLINE       0     0     0

    errors: no known data errors

Keep in mind, the zpool stripes all data, including filesystem metadata, across all the top-level vdevs—and now, one of our top-level vdevs has no parity. This means that any failure on that single disk will bring the whole array down. All that space you devoted to parity on the first two vdevs won’t do you any good at all if you lose a sector on the singleton!

If the failure odds aren’t enough by themselves to make you unhappy, let’s revisit the last example: nearly sixty percent of the 1.4TB of data we wrote after adding that third vdev went to that third vdev, which in this case is a single 2TB drive. This means that our performance just tanked hard, both on writes and on reads of the new data that’s mostly stored on a single disk.

 

By the way, if you’re thinking “well at least that first 1.5TB of data is safe,” sorry, it isn’t. Remember, all data, including filesystem metadata, is striped across all top-level vdevs with each new write. You lose that single disk after you added it, and your zpool will be left in an inconsistent state and will refuse to mount at all, no matter how prettily you cry.

Like I said, adding a single disk to a zpool is called hating your data. Don’t do it. Please.
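
One small consolation: ZFS will at least try to stop you. If you ask zpool add to attach a lone disk to a pool built out of redundant vdevs, it should refuse and warn you about a mismatched replication level unless you override it with -f. The device name below is hypothetical; the point is simply that if you ever see that warning, take the hint and step away from the -f flag.

    you@box:~$ sudo zpool add ars /dev/disk/by-id/wwn-0xLONESINGLEDISK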

Baby’s first zpool

OK, now that your eyes have glazed over with terminology, let’s get down to business and actually set up a zpool. Believe it or not, it’s pretty easy. We’re going to create a relatively simple zpool here, consisting of a single raidz1 vdev with three drives in it. We’re going to devote the entire drives to ZFS, so we don’t need to partition them first. First step is positively identifying the drives, using /dev/disk/by-id:

    me@box:~$ ls -l /dev/disk/by-id
    total 0
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01065572 -> ../../sda
    lrwxrwxrwx 1 root root 10 Aug  1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01065572-part1 -> ../../sda1
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01065731 -> ../../sdb
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01127972 -> ../../sdc
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 ata-WDC_WD2002FAEX-007BA0_WD-WCAY01133538 -> ../../sdd
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01065572 -> ../../sda
    lrwxrwxrwx 1 root root 10 Aug  1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01065572-part1 -> ../../sda1
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01065731 -> ../../sdb
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01127972 -> ../../sdc
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 scsi-SATA_WDC_WD2002FAEX-_WD-WCAY01133538 -> ../../sdd
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 wwn-0x50014ee2080259c8 -> ../../sdd
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 wwn-0x50014ee2080268b2 -> ../../sdc
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 wwn-0x50014ee25d4cdecd -> ../../sda
    lrwxrwxrwx 1 root root 10 Aug  1 22:09 wwn-0x50014ee25d4cdecd-part1 -> ../../sda1
    lrwxrwxrwx 1 root root  9 Aug  1 22:09 wwn-0x50014ee25d4ce711 -> ../../sdb

Wheee! How do we interpret this? Well, we have four physical drives, beginning with /dev/sda and ending with /dev/sdd. We can see that /dev/sda has a partition table on it, which confirms what we’d already assume—this is our existing system drive. The other three drives are bare, and we’ll use them for our raidz1 vdev.

We can see each drive listed multiple times because they can be referred to multiple ways: by their wwn ID, by their model and serial number as connected to the ATA bus, or by their model and serial number as connected to the (virtual, in this case) SCSI bus. Which one should you pick? Well, any of them will work, including the super simple devicename (like /dev/sdb) itself, but you want to pick one that you can also see on the label on the physical drive. This is so you can be absolutely certain that you pull and replace the correct drive later, if one fails.

If your drive doesn’t show the wwn ID on the drive label, use the scsi-model-serial listing. If your drive does have the wwn ID printed visibly, I’d use that instead (just because it’s shorter). In this example, my drives do have the wwn printed visibly, so I’m going to use that.

    me@box:~$ sudo zpool create -o ashift=12 ars raidz1 /dev/disk/by-id/wwn-0x50014ee25d4ce711 /dev/disk/by-id/wwn-0x50014ee2080268b2 /dev/disk/by-id/wwn-0x50014ee2080259c8

Whew. Dissecting the pieces: -o ashift=12 means “use 4K blocksizes instead of the default 512-byte blocksizes,” which is appropriate on almost all modern drives; ars is the name of our new zpool; raidz1 means we want a striped array with a single block of parity per stripe; and, finally, we have the device identifiers themselves. Voila:

    me@box:~$ sudo zpool status
      pool: ars
     state: ONLINE
      scan: none requested
    config:

            NAME                          STATE     READ WRITE CKSUM
            ars                           ONLINE       0     0     0
              raidz1-0                    ONLINE       0     0     0
                wwn-0x50014ee25d4ce711    ONLINE       0     0     0
                wwn-0x50014ee2080268b2    ONLINE       0     0     0
                wwn-0x50014ee2080259c8    ONLINE       0     0     0

    errors: No known data errors

    me@box:~$ sudo zpool list
    NAME  SIZE   ALLOC  FREE     CAP  DEDUP  HEALTH  ALTROOT
    ars   2.98T  1008K  2.98T     0%  1.00x  ONLINE  -

Yay! Our first zpool! The first thing we should notice here is that we’re seeing the full 3T capacity even though we said we wanted raidz1, which means that we’re using 1TB of those 3TB for parity. That’s because the zpool command shows us raw capacity, not usable capacity. We can see the usable capacity by using the zfs command, which queries filesystems rather than zpools (or with standard system tools like df):

    me@box:~$ sudo zfs list
    NAME   USED  AVAIL  REFER  MOUNTPOINT
    ars    709K  1.96T   181K  /ars

    me@box:~$ df -h /ars 
    Filesystem      Size  Used Avail Use% Mounted on
    ars             2.0T  128K  2.0T   1% /ars

There we go. Now we are seeing the 2TB of usable capacity we’d expect out of a single-parity array with three 1TB drives.

More ZFS lingo

Now that we’ve created our first zpool, let’s look at the purely logical constructs we can (and should) create underneath them: filesystems, zvols, snapshots, and clones.

Filesystems

This doesn’t mean what you probably think it means. Under ZFS, a filesystem is sort of like a partition that comes already formatted for you, only filesystems are really easy to create, modify, resize, and otherwise play around with instantly, whenever you’d like. Let’s create a couple now.

    me@box:~$ sudo zfs create ars/textfiles
    me@box:~$ sudo zfs create ars/jpegs

    me@box:~$ sudo zfs list
    NAME            USED  AVAIL  REFER  MOUNTPOINT
    ars            1.18M  1.96T   192K  /ars
    ars/jpegs       181K  1.96T   181K  /ars/jpegs
    ars/textfiles   181K  1.96T   181K  /ars/textfiles

    me@box:~$ ls -l /ars
    total 22
    drwxr-xr-x 2 me me 2 Jan 23 15:10 jpegs
    drwxr-xr-x 2 me me 2 Jan 23 15:10 textfiles

Snazzy! But why would we want to create filesystems instead of just making folders, since they look just like folders? Lots of reasons. You can take snapshots of a filesystem (not of a folder), and you can also set properties on a filesystem. Since one of these folders is for Lee’s awesome ANSI art, we know it’ll be highly compressible. Let’s go ahead and set compression on it. And just to make sure our jpeg hoarding problem won’t consume our entire storage pool, let’s go ahead and set a quota of 200G on it:

    me@box:~$ sudo zfs set compression=on ars/textfiles
    me@box:~$ sudo zfs set quota=200G ars/jpegs
    me@box:~$ sudo zfs list
    NAME            USED  AVAIL  REFER  MOUNTPOINT
    ars            1.19M  1.96T   202K  /ars
    ars/jpegs       181K   200G   181K  /ars/jpegs
    ars/textfiles   181K  1.96T   181K  /ars/textfiles

Nice! We can see now that /ars/jpegs only shows 200G of space available. We can just as easily set the quota from 200G to 1T now or shrink it from 200G down to, say, 50M. It’s all instantaneous and easy—no scary partition sizing, no weird “filesystem grow/shrink operations” afterward, just (re)set the quota and you’re done. But what about the compression we set on /ars/textfiles? I don’t actually have a significant amount of Lee’s ANSI art on hand, but we can check it out by writing a whole bunch of 00s into a big file:

    me@box:~$ dd if=/dev/zero bs=128M count=128 of=/ars/textfiles/zeroes.bin
    128+0 records in
    128+0 records out
    17179869184 bytes (17 GB) copied, 12.1335 s, 1.4 GB/s

OK, that’s 16GB worth of 00s written to disk. Wait a minute, did that say that it wrote at 1.4 GBps? Sure did—highly compressible data can be compressed in memory faster than it can be written to disk, so in some cases (like textfiles, or even more so, ridiculously large numbers of zeroes) having compression on can be a huge performance win. (Compression will slow performance down on already-compressed or otherwise incompressible data, like most images, movies, executables, etc.)

Now let’s look at our textfiles filesystem. We saw a giant performance win; will we see a corresponding storage win?

    me@box:~$ sudo zfs list ars/textfiles
    NAME            USED  AVAIL  REFER  MOUNTPOINT
    ars/textfiles   181K  1.96T   181K  /ars/textfiles

    me@box:~$ ls -lh /ars/textfiles
    total 512
    -rw-r--r-- 1 me me 16G Jan 23 15:24 zeroes.bin

Looks too good to be true; there isn’t any more storage space taken up by our 16GB of zeroes! In reality, this is just a very extreme case. Infinite zeroes are nearly infinitely compressible; normal text (or Lee’s ANSI art) would still be very compressible—frequently up to 90 percent—but not nearly infinitely so. And as we can see, a simple ls shows that yes, all 16GB of our zeroes are safely stored.

There are lots more properties that can be played with on ZFS filesystems, but we can’t possibly cover them all today.
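
If you want to poke around, zfs get will show you what’s there, including a few read-only goodies like compressratio, which reports how much the compression we just enabled actually saved us:

    me@box:~$ sudo zfs get compressratio ars/textfiles
    me@box:~$ sudo zfs get all ars/textfiles | less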

zvols

A zvol is basically a ZFS filesystem “without the filesystem.” Logically, it’s presented to the system as a raw block device, directly accessible through an entry in /dev. Why would you want a zvol? Well, honestly, you probably don’t. If you did want one, it would be in order to format it with another filesystem entirely while still being able to use ZFS features like snapshots, compression, dynamic resizing, and replication underneath it. You might not want one even then, since zvols can be a little quirky with how they handle snapshots, but that’s beyond the scope of what we’re trying to do today, which is just get a good beginner’s handle on the basic care and feeding of ZFS.
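
For the curious, creating one is a one-liner: the -V flag plus a size is what makes it a zvol rather than a filesystem, and on ZFS on Linux it shows up as a block device under /dev/zvol that you could then format with whatever you like. The names here are just examples:

    me@box:~$ sudo zfs create -V 10G ars/myzvol
    me@box:~$ ls -l /dev/zvol/ars/myzvol
    me@box:~$ sudo mkfs.ext4 /dev/zvol/ars/myzvol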

Snapshots

A snapshot is an instantaneously created copy of every single block of data in a filesystem at the exact point in time the snapshot was created. Once you have a snapshot, you can mount it, you can look through its folders and files and what have you just like you could in the original filesystem, you can copy bits and pieces out of the snapshot and into the “real world,” and you can even roll the entire filesystem itself back to the snapshot. Let’s play:

    me@box:~$ echo lolz > /ars/textfiles/lolz.txt
    me@box:~$ sudo zfs snapshot ars/textfiles@snapshot1
    me@box:~$ sudo zfs list -rt snapshot ars/textfiles
    NAME                      USED  AVAIL  REFER  MOUNTPOINT
    ars/textfiles@snapshot1   133K      -   186K  -

OK, we’ve added a new file to /ars/textfiles. I felt a terminal case of the stupids coming on, so I took a snapshot of the filesystem and there it is, ars/textfiles@snapshot1. Notice how sometimes I use a leading slash and sometimes I don’t? To the filesystem, everything is relative to root, so everything has a leading slash. To ZFS, though, “ars” is the actual pool. When we use the zfs command, we don’t put a leading slash in front of “ars.” (It’s a little confusing at first, but you get used to it.)

    me@box:~$ rm /ars/textfiles/lolz.txt
    me@box:~$ ls /ars/textfiles
    zeroes.bin

Oh no! I knew I felt a case of the dumb coming on. My incredibly valuable lolz.txt file is gone! No worries, though, I took a snapshot… let me go ahead and mount ars/textfiles@snapshot1 and see if my missing file is there:

    me@box:~$ mkdir /tmp/textfiles@snapshot1
    me@box:~$ sudo mount -t zfs ars/textfiles@snapshot1 /tmp/textfiles@snapshot1
    me@box:~$ ls /tmp/textfiles@snapshot1
    lolz.txt  zeroes.bin

Whew! lolz.txt is safe and sound in my snapshot, which I’ve mounted under /tmp. I could just copy the file out of the mounted snapshot and put it back where I want it. But what if I’d made lots and lots of changes, and I wasn’t sure what had or hadn’t been changed? I could still put everything back the way it was by rolling back to my former snapshot.

    me@box:~$ sudo umount /tmp/textfiles@snapshot1
    me@box:~$ sudo zfs rollback ars/textfiles@snapshot1
    me@box:~$ ls /ars/textfiles/
    lolz.txt  zeroes.bin

Super easy. Everything’s just like it was. No fuss, no muss.
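
One more trick worth knowing: depending on your ZFS version, you may not even need to mount a snapshot by hand to peek inside it. Every ZFS filesystem has a hidden .zfs/snapshot directory at its root. It won’t show up in a normal ls unless you set the filesystem’s snapdir property to visible, but you can cd or ls straight into it:

    me@box:~$ ls /ars/textfiles/.zfs/snapshot/snapshot1/
    lolz.txt  zeroes.bin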

Clones

A clone is a copy of a filesystem (actually, a copy of a snapshot of a filesystem) that initially doesn’t take up any more space on disk. As the clone diverges from its parent, it uses actual space to store the blocks that differ. There are a few interesting use cases for clones. For example, if you want to do something experimental but really don’t want to commit to it happening in your “real” filesystem, you can instead create a clone, perform your experiments there, and then destroy the clone when you’re done.

I personally find clones most valuable when using virtual machines. You can clone an older snapshot of a VM, boot it up, and then look for files, data, or programs in it without disturbing the “real” VM. Or, you can clone a fresh snapshot and try something risky on it. Want to see what happens when you do an in-place upgrade of that old, creaky Windows Small Business Server? Clone it and test-upgrade away.
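
Mechanically, a clone is created from a snapshot and then behaves like any other filesystem until you destroy it. A minimal sketch using the snapshot we already took (the clone’s name is arbitrary):

    me@box:~$ sudo zfs clone ars/textfiles@snapshot1 ars/scratch
    me@box:~$ ls /ars/scratch
    lolz.txt  zeroes.bin
    me@box:~$ sudo zfs destroy ars/scratch

One caveat: as long as a clone exists, ZFS won’t let you destroy the snapshot it was created from (zfs promote exists for exactly that situation, but that’s a topic for another day).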

Replication

At this point, you have a zpool. That zpool has at least one nice, redundant, self-healing vdev with parity in it. You know how to take snapshots, so now let’s look at how to replicate those snapshots to another machine which is also running ZFS.

Set up SSH keys

This isn’t strictly a ZFS step, but you’ll need it in order to handle replication the easy way, so we’ll go ahead and cover it here. Let’s assume you have box1 and box2; your data is on box1 and you want to back it up to box2. Further, let’s assume you want to push the backups from box1 to box2, rather than pulling them the other way around. First, generate yourself a root SSH key on box1:

    me@box:~$ sudo ssh-keygen -t dsa
    Generating public/private dsa key pair.
    Enter file in which to save the key (/root/.ssh/id_dsa): 
    Created directory '/root/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /root/.ssh/id_dsa.
    Your public key has been saved in /root/.ssh/id_dsa.pub.
    The key fingerprint is:
    8f:67:61:ab:4d:be:99:9f:b9:4f:68:25:37:e5:82:ed root@box1
    The key's randomart image is:
    +--[ DSA 1024]----+
    |                 |
    |                 |
    |                .|
    |             o o |
    |        S o o * .|
    |         + o * o |
    |        . * o E  |
    |         B + +   |
    |        . *o=o.  |
    +-----------------+

You’ll be asked if you want to save your key to the default location /root/.ssh (you do) and if you want to use a passphrase (for this example, you don’t). Once you’re done, it’s time to copy your new public key off to box2:

    me@box1:~$ sudo scp /root/.ssh/id_dsa.pub me@box2:/tmp/
    me@box2's password: 
    id_dsa.pub                                                                                                                    100%  602     0.6KB/s   00:00

Now it’s time to add box1’s public key to the root authorized keys file on box2 and make sure that we allow the use of keys on box2.

    me@box2:~$ sudo -s
    root@box2:~# cat /tmp/id_dsa.pub >> /root/.ssh/authorized_keys
    root@box2:~# echo AuthorizedKeysFile %h/.ssh/authorized_keys >> /etc/ssh/sshd_config

Now we’ll be able to SSH as root with no password from box1 to box2, which is necessary for our next step.
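
It’s worth a quick sanity check before moving on: running a remote command like this from box1 should just work, with no password prompt (the very first connection will ask you to accept box2’s host key, which is expected):

    me@box1:~$ sudo ssh box2 zfs list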

Full replication

As of right now, box2 has its own zpool (which we named technica, and which does not have to be composed of the same number, type, or arrangement of vdevs as our original zpool on box1) but has no actual filesystems on it. I now have a gigabyte of data on ars/jpegs, and I want to replicate that data to box2. Keep in mind that we don’t replicate the filesystem itself, we replicate snapshots. Let’s take a snapshot:

    me@box1:~$ sudo -s
    root@box1:~# zfs snapshot ars/jpegs@1
    root@box1:~# zfs list -rt all ars/jpegs
    NAME          USED  AVAIL  REFER  MOUNTPOINT
    ars/jpegs    1024M   199G  1024M  /ars/jpegs
    ars/jpegs@1      0      -  1024M  -

Now let’s replicate it:

    root@box1:~# zfs send ars/jpegs@1 | ssh box2 zfs receive technica/jpegs

It’s that easy. After the 1GB of data gets done moving across the network, you now have a replicated copy of ars/jpegs on box1 at technica/jpegs on box2:

    root@box2:~# zfs list -rt all technica/jpegs
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    technica/jpegs    1024M  1.95T  1024M  /technica/jpegs
    technica/jpegs@1      0      -  1024M  -

Incremental replication

What about the next time we replicate? Well, as long as we haven’t gotten rid of snapshot ars/jpegs@1 on box1, we can use it as a parent snapshot and do incremental replication the next time, which will go much quicker. Let’s make a silly little file, take another snapshot, and replicate incrementally:

    root@box1:~# echo lolz > /ars/jpegs/lolz.txt
    root@box1:~# ls -l /ars/jpegs
    total 1048251
    -rw-r--r-- 1 root root 1073741824 Jan 23 16:23 1G.bin
    -rw-r--r-- 1 root root          5 Jan 23 16:32 lolz.txt

    root@box1:~# zfs snapshot ars/jpegs@2
    root@box1:~# zfs send -i ars/jpegs@1 ars/jpegs@2 | ssh box2 zfs receive technica/jpegs

Notice that this time, we used the -i argument and specified both snapshots. We’re still receiving into technica/jpegs, but since that filesystem already exists, ZFS applies the incremental stream to it rather than creating a new filesystem. This replication happened pretty much instantaneously—lolz.txt is just a silly little file, after all—and ZFS already knows what has or hasn’t changed from snapshot @1 to snapshot @2. Since it doesn’t have to grovel over the disk looking for changes, it can just immediately start sending them when asked.

Does everything look as we’d expect it to, over on box2?

    root@box2:~# zfs list -rt all technica/jpegs
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    technica/jpegs    1024M  1.95T  1024M  /technica/jpegs
    technica/jpegs@1   117K      -  1024M  -
    technica/jpegs@2      0      -  1024M  -

    root@box2:~# ls -lh /technica/jpegs
    total 1.0G
    -rw-r--r-- 1 root root 1.0G Jan 23 16:23 1G.bin
    -rw-r--r-- 1 root root    5 Jan 23 16:32 lolz.txt

Exactly as we’d expect: not only a full copy of the original filesystem, but all of its snapshots as well, replicated right along with it.

At this point, you can safely get rid of snapshot ars/jpegs@1 on box1 if you’d like to. The next time you replicate to box2, you’ll use @2 as a parent for whatever your next snapshot is and so on. This allows you to do some pretty cool stuff, like make a “main” server with expensive, fast storage (but not much of it) and a “backup” server with cheap, slow storage (with plenty of it). You can even keep lots of snapshots on your backup server, while destroying them pretty quickly from your “main” server. Pretty powerful stuff.
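
Cleaning up an older parent snapshot on the “main” side once you no longer need it as an incremental base is a one-liner, and the copy of that snapshot on box2 stays right where it is:

    root@box1:~# zfs destroy ars/jpegs@1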

Deduplication

I have to be honest, the only reason I’m even mentioning dedup is I know there’ll be a furor in the comments if I don’t. There may be one anyway, because the next thing I have to tell you is something you don’t want to hear:

You probably don’t want to use dedup. Full stop.

Deduplication sounds exciting. Stop caring when your users blindly make a copy of a folder with 15G of stuff in it! Don’t write more stuff than you have to! Keep more stuff on the same drive! Reap some performance benefits, sometimes, depending! But the problem is, the way ZFS implements dedup, it takes up a lot of RAM; unless you have a very specialized machine and a very specialized workload, almost certainly more RAM than you’ll be willing to feed it.

The bottom line: for every 1TB of deduplicated storage, you’re going to need roughly 5GB of RAM. And that’s for dedup tablespace alone. That doesn’t count ZFS’ normal memory consumption. I’ve tested this personally. After copying about 6TB of data to a ZFS filesystem with dedup turned on, my RAM consumption went up by roughly 32GB. This was a special server that has 128GB of RAM, so luckily it could handle it. Even so, I disabled dedup immediately after the test because I wasn’t happy with the result.
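
If you’re still tempted, you can at least estimate what dedup would buy you on data you already have before flipping the switch: zdb has a simulation mode that walks a pool, builds a would-be dedup table, and prints the projected ratio. Fair warning: on a big pool this takes a long time and eats a fair amount of RAM all by itself.

    me@box:~$ sudo zdb -S ars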

In most cases, for most users… it’s just not worth it. Sorry.

The final takeaway

We’ve still really only scratched the surface of what ZFS can do. But hopefully, you’ve seen enough to get you half as interested in ZFS as I am. I’ve been using ZFS professionally and in production for over five years, and I can honestly say that it’s changed the course of both my career and my business. I wouldn’t dream of going back to the way I did things before ZFS.

For you Windows and Mac users out there (or any Linux users who are allergic to the command line), don’t despair and stay tuned! Next in this series, I’ll be covering FreeNAS, which is essentially “ZFS on easy mode.” It’s a ready-to-download, ready-to-use distribution that lets you set up, manage, and configure your own ZFS-powered Network Attached Storage device out of a generic PC with a bunch of hard drives; no command line required.

 

Jim Salter (@jrssnet) is an author, public speaker, small business owner, mercenary sysadmin, and father of three—not necessarily in that order. He got his first real taste of open source by running Apache on his very own dedicated FreeBSD 3.1 server back in 1999, and he’s been a fierce advocate of FOSS ever since. He also created and maintains http://freebsdwiki.net and http://ubuntuwiki.net.

This article was originally posted on Ars Technica.