Going to try and consolidate some discussion on next-gen filesystems. This comes up in various subforums (other operating systems, business and enterprise, storage), but I think it's a good idea to bring it all into one place. I'm going to start with a quick list of features of the two I'm most familiar with: ZFS and BtrFS. I'd love for someone to throw in something comprehensive on ReFS (I've never used it in anger, and only know what I've read).

First up: what is a next-gen filesystem?

Filesystems up until now have relied on the underlying hardware being fairly reliable. That was pretty much true until we hit multi-terabyte workloads, and with petabyte workloads becoming common in large business, it's a real concern. But why?

https://en.wikipedia.org/wiki/RAID#URE

Hard disks have a URE (Unrecoverable Read Error) rate of about 1 in 10^14 bits (~12TB) for consumer disks (IDE, SATA) and about 1 in 10^15 bits (~125TB) for enterprise disks (SAS, SCSI). This differs between vendors, but at the data rates we use today these numbers are becoming less of an unlikely chance and more of a guarantee. We call this failure "bit rot".

For large storage systems in the past, many of us relied on RAID controllers to offer not only data sets that span beyond the limits of a single drive, but also some sort of redundancy should one of the physical drives in the set malfunction. We're seeing patterns today where these RAID controllers miss occurrences of bit rot, and assume that data sets are clean when they are not. RAID controllers also need to verify entire, large disk systems, and are often unaware of the file systems on top of them. This leads to problems where a single drive loss can take days to rebuild in large arrays. Combine that with the bit rot issue, and our data isn't as guaranteed as we'd like.

Previously it was considered "good practice" to abstract the layers of storage: the RAID system should be agnostic to the volume management, and the volume management should be agnostic to the file system. This allowed high flexibility for people to choose whatever file system they liked and combine it with whatever RAID system they liked.

Next-gen filesystems are a realisation that at current data rates, we need disk management systems that are aware of all the layers at once. They need to understand everything from the logical file system that handles data and metadata, right down to where a byte is physically placed on a disk, and whether that byte has been reliably written or read.

As a result, next-gen filesystems do away with RAID controllers entirely. That firmware layer merely hides the actual success or failure of a write from the filesystem. Likewise with volume management: these filesystems need to "do it all", so that they can be 100% sure that data is written and read correctly, without physical or logical fault at any layer. The upside is that you can use any disk controller that allows direct access to a raw disk (often called "Initiator Target mode", or "IT mode"). You'll often see guides for users wanting to flash their SATA and SAS controllers to remove the vendor-provided RAID functionality and offer this direct IT-mode access. The other upside is that even very cheap SATA controllers on motherboards become useful for quality storage, as no special hardware is needed.

These filesystems keep checksums (mathematical fingerprints) of every block of data written, and compare them on every read.
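To make that concrete, here's a minimal sketch of inspecting the checksumming and kicking off an on-demand verification pass on each filesystem. The pool name "tank" and the mount point "/mnt/pool" are hypothetical examples:

```
# ZFS checksums every block (fletcher4 by default); stronger algorithms
# can be set per pool or dataset:
zfs get checksum tank
zfs set checksum=sha256 tank

# Re-read and verify every block in the pool while it stays online:
zpool scrub tank
zpool status tank        # shows scrub progress and any repaired errors

# BtrFS equivalent for a mounted filesystem:
btrfs scrub start /mnt/pool
btrfs scrub status /mnt/pool
```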
Checksumming sacrifices a small amount of performance, although with modern CPU and RAM speeds it goes largely unnoticed (I'm typing this from a single-core Pentium M 1.5GHz with a 30GB IDE hard disk running BtrFS with single data and duplicate metadata, and file system performance is within 2-3% of ext4). It means that data is verified constantly on all reads and writes, and if there are problems, it is rebuilt in the background.

One caveat to all of this is that your system RAM *must* be reliable. If corruption occurs in memory, then whatever is written to disk can never be guaranteed. As such, ECC RAM is recommended wherever next-gen file systems are used, to prevent memory bit rot. It isn't mandatory for these file systems to work, but it's recommended, particularly for dedicated storage arrays.

Next-gen file systems also tend to be COW (Copy On Write). This means they never update data in place the way legacy file systems tended to. If data in a block is modified, the file system makes a copy of that block, modifies the copy, and writes it to a different part of the disk. This guarantees consistency of data even in the event of a crash (the old block is still there, valid as of pre-crash), as well as allowing trivial addition of snapshotting.

Snapshots themselves are a handy byproduct of COW. Instead of marking the old location of the data as deleted after a COW operation, the file system can track these changes and build a virtual "snapshot" of the old data versus the new. The benefit is that only the changed data eats up actual disk space, so it appears as if you have multiple copies of very large data sets from points in time, when in reality you're only storing the changes.
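As a small sketch of how cheap this makes point-in-time copies (the dataset name "tank/home" and the snapshot paths here are hypothetical, and the BtrFS line assumes /home is already a subvolume):

```
# ZFS: snapshot a dataset before a risky change; only blocks modified
# afterwards consume new space.
zfs snapshot tank/home@before-update
zfs list -t snapshot

# Roll the whole dataset back if the change goes badly:
zfs rollback tank/home@before-update

# BtrFS does the same per subvolume (-r makes the snapshot read-only):
btrfs subvolume snapshot -r /home /home/.snapshots/before-update
```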
Two popular next-gen filesystems on offer today are ZFS and BtrFS.

ZFS was initially written by Sun Microsystems for their Solaris operating system, and ultimately ended up with Oracle when they acquired Sun. BtrFS was, ironically, started by Oracle as a Linux-based competitor to ZFS, so after the acquisition they found themselves with two competing products under one roof. Both are open source; however, ZFS is licensed under the CDDL, which is incompatible with the GPL that covers Linux and BtrFS. This made distribution of ZFS on Linux tricky for a while, which is why most ZFS appliances and systems run either Solaris or BSD. After some time, open source ZFS development moved to a group called OpenZFS, who worked on letting people compile their own ZFS easily under multiple operating systems, including Linux and MacOSX, as well as continuing work on the BSD and Solaris versions. Recently Ubuntu announced they would ship OpenZFS natively. That's still under the legal microscope, but it's out there now in the most recent release, 16.04 LTS (along with BtrFS, which is native under most modern Linux kernels).

BtrFS is also open source, licensed under the same GPL that covers the Linux kernel. BtrFS development is active, and contributors come in from a number of big companies around the world. The current lead developer works for Facebook (previously Oracle), and unverified word is that Facebook themselves use BtrFS heavily in their environment.

Some general "pros and cons" of each:

ZFS:

ZFS is rock solid, thanks to a heck of a lot of people using it in commercial environments for some time now. It offers a heap of cool features (there's a short worked sketch after the cons list below), such as:

* RAIDZ - an updated take on RAID5 which avoids the "RAID5 write hole": a problem where a power failure at the moment of a RAID5 parity write can leave data in an inconsistent state that the RAID controller can't detect. This problem is generally reduced with a UPS, battery backup, or NAND flash on the RAID controller, but even then it isn't guaranteed. RAIDZ fixes the issue, and as long as ECC RAM is in place, guarantees consistency even in a hard crash mid-write.
* RAIDZ2 and RAIDZ3 - extra parity (RAIDZ2 is similar to RAID6, and RAIDZ3 offers a third disk's worth of redundancy for really large arrays).
* Great volume management with many advanced features - some of them excellent for users of virtualisation and container software who want to clone/create/destroy data sets quickly.
* The ability to export volumes as raw block devices, which can be used as iSCSI LUNs, FC LUNs, swap volumes, and other places where raw blocks are needed.
* Built-in realtime compression (various algorithms and levels, including LZJB, LZ4, GZIP and others depending on the vendor). If your CPU is fast enough, this can actually improve performance in some cases, as less data is written back to the storage. It can also lengthen the life of SSDs!
* Built-in encryption.
* "Infinite" read-write or read-only snapshots.
* The ability to duplicate data even on a single disk.
* Online "scrub" (perform a file system check while the system is running and in use).
* The ability to use SSDs as read and write cache in front of spindle disks, as a way to add some performance to slower storage.

Some cons of ZFS:

* ZFS is RAM hungry. Most setups recommend a minimum of 8GB of RAM in the system running ZFS, and the more the better. Where I work, all of our ZFS appliances have 512GB of ECC RAM in them to really maximise performance.
* Unless you're running Solaris, ZFS is difficult (sometimes impossible) to boot from. Linux and Mac users will struggle here especially; on those platforms ZFS is better suited to your storage area, separate from your OS.
* As the first major "next-gen" file system, ZFS is now slowing in development. It's assumed that other file systems will eventually surpass it. That's not really a "con", just the natural evolution of things.
* You can't resize an existing RAID set by adding new disks to it (you can only add whole new RAID sets to a pool).
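Before moving on to BtrFS, here's a minimal sketch that ties several of the ZFS features above together. The pool name "tank", the dataset name, and the short device names are hypothetical examples (in practice you'd use stable /dev/disk/by-id/ paths):

```
# Six whole disks in a RAIDZ2 (double parity) set:
zpool create tank raidz2 sda sdb sdc sdd sde sdf

# Inline LZ4 compression for everything in the pool:
zfs set compression=lz4 tank

# An SSD as read cache, and another as a dedicated write log:
zpool add tank cache nvme0n1
zpool add tank log nvme1n1

# A dataset that stores two copies of every block, for extra paranoia:
zfs create -o copies=2 tank/photos
```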
BtrFS:

BtrFS is much younger than ZFS, and as such hasn't had time to get all the features implemented. Starting with a few cons compared to ZFS:

* RAID5/6 isn't fully production-ready. RAID56 is considered "mostly OK" as of Linux kernel 4.12 (make sure you're using a btrfs-tools/btrfs-progs package of the same version number), but it hasn't yet solved the RAID5 write hole the way ZFS's RAIDZ has. There is work happening to solve that eventually, but until then invest in a good UPS. [late 2020 RAID56 update] - details here: https://forums.overclockers.com.au/posts/18700617/
* No SSD caching yet.
* No built-in encryption yet.
* No advanced raw disk export (it can't hold swap).
* Poor performance for databases (you can disable COW at the file/directory level - e.g. with chattr +C - to get around this if required).
* Not available for BSD, Solaris or MacOSX yet. It's native to most Linux distros, and there's currently a driver for Windows and ReactOS (bonus points to anyone using ReactOS in the real world).

Pros of BtrFS:

* Great volume management, and on some Linux distros the ability to auto-snapshot your entire OS every time you update, add or remove packages via your package manager, giving you easy rollback of a bad update.
* "Infinite" read-write or read-only snapshots.
* Online "scrub" (perform a file system check while the system is running and in use).
* Very light on memory usage. The BtrFS-based laptop I'm sitting on now runs LXDE and the Midori browser, and the entire system is consuming 430MB of RAM (with 320MB of that going to the 6 browser tabs I have open).
* GRUB2-compatible: usable as a bootable/system drive for Linux (even if you want ZFS on Linux, you'll still need to put your boot/root volume somewhere).
* Built-in compression, often resulting in faster reads and writes on slower storage, and improving SSD life.
* SSD optimisations (including an "ssd" mount option that tunes allocation behaviour).
* Easily add disks to an existing RAID set and run a "rebalance" to use all the space. You can do this while the RAID set is mounted and online.
* The ability to mix different data and metadata RAID levels within a volume.
* Production-ready data levels: single, dup (write data or metadata twice, even to a single disk), RAID0, RAID1 (a special case for BtrFS, see below) and RAID10.
* BtrFS's RAID1 is special in that it makes sure each block of data is written to two separate devices. This is particularly useful if you have a number of mismatched drives. For example, you can have 4x 1TB and 1x 2TB disks totalling 6TB of raw space, and BtrFS will ensure that under RAID1 you get a usable 3TB of storage. Other RAID systems would give far less, depending on their definition of RAID1, and typically limit the usable size of every disk to the size of the smallest drive. (Some would just offer 1TB of data mirrored across all drives, and some would allow 2TB with an uneven number of drives, with 1TB as the maximum usable space on any disk due to that being the smallest drive size.)

So, which of these should you use? That largely depends on your use case.

If you absolutely require RAID5/6-type volumes over many disks, ZFS is going to win that battle here and now (at least until BtrFS sorts out its RAID5/6 code).

For smaller home file servers, BtrFS is great. I use a RAID1 setup on 4x 1TB and 1x 2TB drives, and can happily replace my smaller disks with larger ones individually over time without wasting space. A sketch of that sort of setup is below.
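As a rough example of that kind of mixed-drive home NAS (the device names and the /mnt/nas mount point are hypothetical):

```
# RAID1 data and metadata across five mismatched drives:
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
mount /dev/sdb /mnt/nas

# See how much usable space the mix of drive sizes gives you:
btrfs filesystem usage /mnt/nas

# Grow the array online: add a disk, then rebalance across all members:
btrfs device add /dev/sdg /mnt/nas
btrfs balance start /mnt/nas

# Or swap a small drive for a bigger one in place, while mounted:
btrfs replace start /dev/sdc /dev/sdh /mnt/nas
```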
If you want to speed up a lot of spindles, ZFS's SSD caching is fantastic. If you're already on an all-SSD array, that matters less.

BtrFS's low memory usage makes it great for smaller systems. Raspberry Pi users are unlikely to be able to use ZFS at all, but for micro-NAS setups or build-your-own-Time-Machine users, an RPi with Netatalk and BtrFS makes for an excellent "Time Capsule" type device for Apple MacOSX users to automatically back up over WiFi (or NFS/SMB for backup of anything else to your micro-NAS).

MacOSX users: HFS+ is simply the second worst file system in existence today: https://blog.barthe.ph/2014/06/10/hfs-plus-bit-rot/

For you guys, getting OpenZFS on Mac onto any volume that needs data reliability is a very good option, particularly as Apple are slowly dropping support for software RAID, and hardware RAID options for Mac users are terrible (as a commercial Mac user, there is not a single decent commercial RAID array for Mac out there today). I see a lot of photographers corrupt data regularly thanks to HFS+'s terribleness, and ZFS-on-Mac means that precious data is just that little bit safer.

As I said right at the top, I'd love for someone to add some details about the Windows equivalent, ReFS, and particularly where it's headed for Windows in Server 2016 and beyond.