At one particular customer, increasing storage needs require that we add a large file server, which should be as fast as possible. In that context, I'm currently evaluating Debian GNU/kFreeBSD ("Debian," from now on) on ZFS. Not being hampered by a large degree of knowledge on the subject, I figured I'd write a blog post with the things I've learned so far.
For context, the system on which we'll be installing will have 64 gigs of RAM, about two dozen SAS disks connected to a RAID controller which I'm reasonably (though not 100%) sure supports JBOD mode, and a PCIe-attached solid state storage device of several terabytes. If it turns out the RAID controller doesn't support JBOD mode, I'll probably create as many RAID0 devices as possible and hand those over to ZFS as "hard disks" instead; this wouldn't be an ideal solution, but unfortunately I don't have much of a choice in how the hard disks are attached.
That being said, the system on which I'm testing right now is the latest incarnation of samba, an HP ProLiant ML115 server. A system with far more modest hardware specifications, it has 2G of RAM (which isn't really enough for ZFS), a single 160G SATA hard disk, and a dual-core AMD processor from about half a decade ago. And oh yes, no SSD. So for evaluating ZFS performance it's not ideal, but for evaluating Debian and ZFS features, it will do.
Anyway, what I've found so far on samba:
partman-zfs (that is, the ZFS support in the installer) works, but could use some improvements.
- It allows me to create a system that boots off a ZFS filesystem, but does not allow me to create nested ZFS filesystems; i.e., if I create a ZFS filesystem for / and one for /home, the /home filesystem won't be a child of the / filesystem, which is how you'd usually set this up. Workaround: "zfs rename" (see the sketch below).
- I can't set arbitrary ZFS properties at filesystem creation time. I would like to be able to do things like set the "compression" and/or the "dedup" property for certain filesystems, but that isn't possible from the installer. Workaround: create the filesystem layout in partman, and when choosing the "finish partitioning and continue" option, switch to a shell and issue "zfs set" commands to change things (also shown in the sketch below). The first few files written will miss out on those properties, but that shouldn't matter all that much.
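For the record, both workarounds look roughly like this; the pool and dataset names here are hypothetical, and whatever partman-zfs actually created on your system will differ:

    # from a shell (e.g. on tty2), during or right after the install:
    zfs rename rpool/home rpool/debian/home    # make /home a child of the root filesystem
                                               # (check afterwards that its mountpoint is still /home)
    zfs set compression=on rpool/debian/home   # properties the installer can't set for you
    zfs set dedup=on rpool/srv
    zfs get compression,dedup rpool/debian/home rpool/srv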
Given that I've only got one hard disk in this machine, I can't test whether partman-zfs supports many options for creating vdevs. I'll have to see whether that's the case when the actual server arrives.
- ZFS has a "sharenfs" property that's supposed to transparently and
automatically share filesystems via NFS. However, this isn't supported
in Debian, with the zfs binary issueing a warning to that extent.
Digging a bit deeper in how things work, people
have told me that setting the "sharenfs" property should simply
produce a file "/etc/zfs/exports" that I can pass on to mountd, which
should be restarted after every change to that file. I'm not sure why
the zfs binary doesn't just do that; after all, it's perfectly
possible to hand two
exports
configuration files to the FreeBSDmountd
binary... - Somewhat disappointingly, NFSv4 is not supported on Debian (it is on
vanilla FreeBSD, but a necessary component for NFSv4,
nfsuserd
, isn't packaged). For my environment, this isn't a showstopper, but it would've been nice had this worked. - The "sharesmb" property, which also exists, seems to be not supported at all. But that's okay, the environment in which we're working doesn't need SMB, only NFS.
- ZFS seems to have no soft quotas, only hard quotas. Workaround, valid for our environment (but not necessarily for everyone's): use reservations (plus daily warning mails) in place of soft quotas, and set the ZFS quota at the level of what would've been the hard quota on ext4 (which might be a little higher than what we use on the current file server). This, too, is sketched below.
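To illustrate the sharenfs point: the manual equivalent I have in mind looks roughly like this, assuming FreeBSD's exports(5) syntax and a mountd that behaves like the FreeBSD one (dataset and client names are made up):

    echo '/tank/home -maproot=root client.example.com' >> /etc/zfs/exports
    mountd /etc/exports /etc/zfs/exports     # hand both files to mountd
    kill -HUP $(cat /var/run/mountd.pid)     # have mountd re-read the files after a later change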
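And the quota workaround, again with made-up names and sizes:

    zfs set reservation=8G tank/home/alice   # the "soft" part: guaranteed space, watched by a daily cron mail
    zfs set quota=10G tank/home/alice        # the hard limit, roughly where the ext4 hard quota would have been
    zfs get reservation,quota tank/home/alice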
We'll see what else I learn.
Some ideas here sound reasonable to fix/implement for jessie.
"PCIe-attached solid state storage device" sounds scarily like it might require a non-free kernel driver, which could be awkward with either Linux or FreeBSD - I guess we won't know until you have it. Also if used as a sole ZIL device it would be a single point of failure with potential for some data loss (e.g. cached writes from the last 10-60 seconds not yet flushed to disk). Using it as a L2 (read) cache device to supplement main memory would be perfectly safe though.
Also I suggest it would be good practice with ZFS to keep the root pool (the OS root, /var, /usr, etc., or at the very least /boot) simple (mirrored vdevs, no dedup, no ZIL) and totally separate if possible from the big zpool of user data. Just in case something unexpectedly bad happened to the big pool, you'd still have a bootable system from which to recover more easily, and/or more easily import the user data zpool on another system. (An example would be if a ZIL failed and you had to forcibly remove it).
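Roughly what I mean, with made-up device names:

    zpool create rpool mirror da0p3 da1p3                             # small, simple root pool
    zpool create tank mirror da2 da3 mirror da4 da5 mirror da6 da7    # separate pool for the user data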
And finally, if you can't configure the RAID controller as JBOD, I'd consider maybe using it as intended - perhaps based on RAID-10s, or multiple RAID-6 groups if capacity is more important. Then run ZFS on that without adding more redundancy.
Because I'm thinking that if you had two RAID-0 groups, for example, a single disk failure takes out its entire group (including the remaining good disks). That impacts performance more than necessary, and if you had ZFS mirror the pool across both groups, it further heightens the risk of a failure (in the surviving group, I mean) while recovering data from the intact RAID-0.
(This could perhaps be imagined as RAID 1+0, striping across many mirrored pairs, which is good; vs. RAID 0+1, mirroring two unreliable RAID-0's, which is bad).
If you fancy doing detailed reliability engineering calculations for different designs, this is a brilliant article: http://web.archive.org/web/20090307061506/http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs
Actually I meant this article, sorry: http://web.archive.org/web/20090209015655/http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl
@Steven - when I say "as many RAID0 devices as possible", I mean to create RAID-0 devices of no more than two or maybe three physical disks per device; it will depend on hardware limitations in the RAID controller.
The idea being that if we can't do "one disk per device handed to ZFS", at least we get as close as possible. I agree that doing two RAID0 devices is dumb, and I'm not going to do that.
If you can/must group into two-disk pairs, then I'd prefer RAID-1's (where ZFS handles striping) instead of RAID-0's (where ZFS handles mirroring).
If ZFS handles mirroring it must send two copies of written data to the RAID controller; otherwise if the RAID controller handles that it needs only one copy of the writes which it then distributes to two disks. The former would doubly utilise any battery-backed write cache the controller has, double the needed write throughput from the OS to the controller, and potentially double latency in a worst-case scenario.
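In other words, something like this (made-up LUN names; mfid* is roughly what an LSI controller's logical drives would show up as on FreeBSD):

    zpool create tank mfid0 mfid1 mfid2    # hardware RAID-1 pairs; ZFS stripes, one copy of each write
    zpool create tank mirror mfid0 mfid1   # hardware RAID-0 pairs; ZFS mirrors, every write sent twice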
Either way, by not having JBOD you've already lost ZFS's ability to continue using good blocks on a failing disk. The RAID controller decides when to eject a disk from a RAID, and if it does that, I imagine both disks in a RAID-0 pair would drop out at once, which is sub-optimal for both performance and reliability while running degraded and during recovery. (ZFS has to recover 2 disks' worth of data from 2, instead of the RAID controller rebuilding 1 disk from 1.)
By having RAID-1 pairs of disks, you can also add new disks (or swap out with larger capacity) 2 at a time to increase the zpool capacity. With a ZFS mirror of two-disk RAID-0's that could only be done 4 disks at a time.
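i.e. growing the pool would look something like (again with made-up names):

    zpool add tank mfid3                # RAID-1 LUNs: two new disks become one new vdev
    zpool add tank mirror mfid4 mfid5   # mirror of RAID-0s: a whole new mirror vdev, four disks at once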
Ah, yes, hadn't thought of that... good point, thanks.
Like I said, hopefully it will support JBOD mode, in which case it's all moot anyway. But if it doesn't, and I have to group disks, you do have a point that multiple RAID1 arrays make more sense than multiple RAID0s.
Thanks,