Next gen filesystems (ZFS, BtrFS, ReFS, APFS, etc)

Discussion in 'Storage & Backup' started by elvis, May 20, 2016.

  1. Diode

    Diode Member

    Joined:
    Jun 17, 2011
    Messages:
    1,736
    Location:
    Melbourne
    Fantastic... out of 29,000 photos, I found 257 with mismatched hashes. Fun times ahead going through and manually opening each photo to check which copies aren't damaged.
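
    For anyone wanting to do the same comparison without opening files by hand, here's roughly the sort of script I mean (the paths and the choice of SHA-256 are just placeholders for illustration):

    Code:
    # Hash every file under two copies of the same photo library and report
    # mismatches. Paths and the SHA-256 choice are placeholders for illustration.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    copy_a = Path("/mnt/photos_copy_a")  # hypothetical first copy
    copy_b = Path("/mnt/photos_copy_b")  # hypothetical second copy

    for file_a in copy_a.rglob("*"):
        if not file_a.is_file():
            continue
        file_b = copy_b / file_a.relative_to(copy_a)
        if not file_b.exists():
            print(f"MISSING   {file_b}")
        elif sha256_of(file_a) != sha256_of(file_b):
            print(f"MISMATCH  {file_a.relative_to(copy_a)}")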
     
  2. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    Well that is good news.

    Cynically though, I've been burned too many times by Microsoft telling me something is best practice this year, only to tell me that it's deprecated the year after (and then best practice the next year again).

    But hopefully ReFS sticks. NTFS is rapidly becoming outdated, and it needs to go.

    ReFS is missing a lot, from what I read: no dedup, no block-level compression (NTFS has file-level compression, but it hurts performance quite a bit), and a bunch of NTFS features like alternate data streams aren't there yet.

    Still, the fact that larger applications are recommending it is a good sign for the future. I also hope Microsoft realise there's a requirement for it on desktops and workstations too, and knock off this silly idea of making sensible file systems a "server only" feature.

    Yup, precisely why we need these new file systems.

    As our storage gets bigger, we're holding more precious information per device. Physical and logical errors are going to affect us in these sorts of ways without sensible file systems to aid us.

    I'm even finding BtrFS on USB drives to be useful, as I can tell it to write two copies of everything transparently. I set both the data and metadata levels to "dup", and then write to a single USB drive. It halves the effective space, but when I'm buying 1 and 2 TB 2.5" USB spindle drives for backing up home systems on the cheap, I've still got heaps of space.
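
    For anyone wanting to do the same, the mkfs step is roughly this (device name and label are placeholders; double-check with lsblk first, because it wipes the drive):

    Code:
    # Format a single USB drive with both the data and metadata profiles set to
    # "dup", so btrfs writes two copies of every block. Device and label are
    # placeholders.
    import subprocess

    device = "/dev/sdX"  # hypothetical; confirm with lsblk, this destroys the drive
    subprocess.run(
        ["mkfs.btrfs", "-f", "-L", "usb-backup", "-d", "dup", "-m", "dup", device],
        check=True,
    )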
     
    Last edited: May 21, 2016
  3. digizone

    digizone Member

    Joined:
    Jun 3, 2003
    Messages:
    339
    Location:
    Voyger1 is chasing me
    BtrFS is running flawlessly on unRAID. Amazing stuff: double parity and any mix of HDDs.
     
    Last edited: May 21, 2016
  4. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    Unfortunately unRAID suffers all the problems of traditional RAID, including the write hole problem as well as other silent corruptions.

    Running BtrFS on top of this is better than other file systems; however, there's a fundamental flaw in not giving BtrFS access right down to the bare device, because unRAID presents a virtual device to BtrFS instead.

    Like other parity-based RAID systems, unRAID users are recommended to run a UPS on their storage to prevent silent errors, and even still this won't protect against all possible issues (hard system crash or motherboard fault which could cause silent data corruption even with a UPS).

    It's important that people understand what ZFS and BtrFS aim to achieve, and why they need "bare metal" access to hard drives, without another RAID system in between the file system and the hard disks.

    I'm not picking on unRAID here either. Linux MDRAID, LVM, hardware RAID devices and other RAID systems all suffer the same problem, and layering BtrFS on top of those won't solve that.

    If you want an easy to use NAS with BtrFS, look at RockStor instead (community edition download is open source and dollar-free).
     
    Last edited: May 21, 2016
  5. fad

    fad Member

    Joined:
    Jun 26, 2001
    Messages:
    2,459
    Location:
    City, Canberra, Australia
    Dedupe is in the Storage Spaces layer, not the file system.

    The only thing is, it isn't validated for production loads for anything except VDI.

    It's also after-the-fact dedupe, unlike ZFS. I think they were trying to avoid it needing a lot of memory. What's the ZFS formula? 5GB of RAM per TB of data?

    https://blogs.technet.microsoft.com/ausoemteam/2015/03/31/storage-spaces-data-deduplication/
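
    On that ZFS rule of thumb, the back-of-envelope maths looks something like this (the ~320 bytes per dedup-table entry and 64K average block size are just commonly quoted ballpark figures, not exact numbers for any particular pool):

    Code:
    # Back-of-envelope estimate of ZFS dedup table (DDT) RAM. The ~320 bytes per
    # entry and 64K average block size are rough rule-of-thumb figures only.
    DDT_ENTRY_BYTES = 320
    AVG_BLOCK_BYTES = 64 * 1024

    def ddt_ram_gib(pool_tib: float) -> float:
        blocks = pool_tib * 1024**4 / AVG_BLOCK_BYTES  # unique blocks in the pool
        return blocks * DDT_ENTRY_BYTES / 1024**3      # DDT size in GiB

    print(f"{ddt_ram_gib(1):.1f} GiB of RAM per TiB of deduped data")  # ~5 GiB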

    The biggest issue with either ZFS or ReFS is getting hardware support. Most of the main vendors ship RAID cards built around the old storage methods, and they'll need to be dragged over to the next-gen filesystems kicking and screaming. Look at the validated servers for VMware vSAN or MS SOFS: both have kit lists from the major vendors with very high costs, when really they just need to change the HBA over.

    A Dell MD1200 direct-attached disk unit isn't validated for connection to anything other than a Dell H810, an external LSI-based IR-mode card.

    I'm a big fan of both (ReFS and ZFS). I run both at home for storage.
     
  6. fad

    fad Member

    Joined:
    Jun 26, 2001
    Messages:
    2,459
    Location:
    City, Canberra, Australia
    Yeah, this time around, with 2012 R2, it was SOFS with shared direct-attached dual-port SAS hardware arrays, which made the cost of shared SSDs expensive.

    As soon as 2016 hits, it will be non-shared hardware with local SATA storage only, and a shared network layer doing the copies.

    I found the difference to be quite big. However, I am allocating 4x 128GB SSDs entirely to the write-back cache (WBC). My understanding is that all you're doing is deferring writes until there's idle time to service the required IO; if you never have idle time, you won't see much difference. I found copying array to array to be really painful, but now the data is there, the VDI performance is good (8x 4TB 7.2k plus 4x 128GB SSD, 200-300MB/s at 75-100k IOPS, 4k random, 25% write / 75% read - the standard IO profile from IOmeter).
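
    For anyone wanting to relate those two figures, IOPS at a 4k IO size converts to raw MB/s with simple arithmetic (illustrative only; real mixed read/write numbers won't line up exactly):

    Code:
    # Convert an IOPS figure at a 4 KiB IO size into MB/s of raw transfer.
    IO_SIZE = 4 * 1024  # bytes per IO

    def throughput_mb_s(iops: int) -> float:
        return iops * IO_SIZE / 1e6

    for iops in (50_000, 75_000, 100_000):
        print(f"{iops} IOPS x 4k = {throughput_mb_s(iops):.0f} MB/s")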
     
    Last edited: May 21, 2016
  7. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    Yes, realtime/online deduplication needs to keep a list of block checksums in memory so that the system can dedupe in realtime. Obviously this gets VERY big over time, and I'm not a fan of it at all.

    I'd much rather have a background service that constantly crawls the file system, looking for duplicate blocks in a random pattern up to a fixed memory budget and replacing them with reflinks. It won't be nearly as efficient, but it would use far less memory than either realtime dedup or a one-shot crawl of the complete file system.
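
    A very crude, file-level sketch of that idea (real tools like duperemove do this properly at the extent level; the path here is a placeholder, and hashing whole files is nowhere near as granular as duplicate blocks):

    Code:
    # Crude offline dedup sketch for btrfs: hash files and, when two files have
    # identical content, rewrite the duplicate as a reflink of the first copy so
    # they share extents. Real tools (e.g. duperemove) work at the extent level.
    import hashlib
    import subprocess
    from pathlib import Path

    def file_hash(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    root = Path("/mnt/btrfs-volume")  # hypothetical mount point
    seen = {}  # content hash -> first path seen with that content

    for path in root.rglob("*"):
        if not path.is_file() or path.is_symlink():
            continue
        digest = file_hash(path)
        if digest in seen:
            # Replace the duplicate with a reflinked copy of the original.
            subprocess.run(["cp", "--reflink=always", str(seen[digest]), str(path)], check=True)
        else:
            seen[digest] = path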

    I'm not so sure about this. I'm already seeing a lot of vendors very happy to change over to IT-mode storage. We buy a lot of Supermicro gear, and pretty much every single NAS unit Supermicro ship can now be purchased in either RAID mode or IT mode. They've figured out they can ship a lot of systems and spindles, and at a decent price (as it's all ECC spec gear with decent drives), make a good profit and still come out well under the price of some of the big-name storage vendors.

    For home/SOHO users, there's also some interesting possibilities with hardware like this:

    http://addonics.com/

    Using simple eSATA-III connectivity and port-multiplier compatible cards, you can attach a pretty decent volume of disks to an existing system without a heck of a lot of complex hardware or cost. The end result is some great scalable storage hardware for home users who don't need 10Gbit/s+ speeds.

    I've currently just got a crappy old tower system full of drives as my storage system, but really like the look of some of Addonics 6G "RAID towers":

    http://addonics.com/category/6grt.php

    Plug them straight into a compatible eSATA port, and you see all the drives direct over the one cable, ready for BtrFS and ZFS to use direct.
     
  8. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,243
    Location:
    Canberra
    Yeah, look I buy primarily from the same people as Elvis - but I have no issues getting hardware from Dell either.
     
  9. Smokin Whale

    Smokin Whale Member

    Joined:
    Nov 29, 2006
    Messages:
    5,183
    Location:
    Pacific Ocean off SC
    It's a shame that native eSATA 3 is going the way of the dodo. Very few systems have it nowadays. I'm the same though, I just have an old tower at home, does the trick :)

    One thing I do find interesting though, is that you mentioned you used BTRFS as a boot drive on your laptop. Would I be right in saying that this filesystem is a worthy replacement for LVM or EXT4? It sounds like it's the way to go for SSD boot drives.
     
  10. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    That Addonics gear is merely a SATA to eSATA connector. Nothing fancy there. They also supply eSATA 3Gbit, 6Gbit and mini-SAS SFF-8088 connectors to convert between these standards for different bandwidth requirements.

    Similarly, they sell a number of cards with controllers on board that support standard SATA port multipliers, which means several SATA connections can be shared down one cable (up to the maximum aggregate bandwidth of a single connection: either 3Gbit/s or 6Gbit/s for eSATA, or 12/24Gbit/s for the mini-SAS).

    All of that is pretty standard. PCIe and M.2 are certainly picking up speed, but SATA and SAS will be around for a while yet. Addonics are pretty clever at doing smart things with low-end gear, which gives you a lot of options if your software is smart enough. These guys were very popular with the Linux+MDADM+LVM crowd before ZFS and BtrFS came along, and now with these next-gen file systems they're a great option for hardware on the cheap.

    One of BtrFS's huge downfalls right now is that it doesn't support swap. That's coming (patches exist, but haven't made it to the mainline kernel yet), but for now you still have to partition up your first disk, with a dedicated swap partition and BtrFS managing volumes inside the other partition.

    You still get the benefits of BtrFS inside the partition boundary (unlike if it were inside some sort of device mapper setup that can move extents around underneath BtrFS without its knowledge), but it's not as good as BtrFS on a raw disk.

    Once BtrFS can properly deal with the equivalent of ZFS's "zvols", things will be much better on that front. Still, it's not terrible, and you get the important stuff like better SSD tuning, realtime block level compression, block checksumming and scrubbing, snapshots, etc.

    Alternatively, you can run without swap (not recommended), put swap on a different device, or create a loopback file marked NOCOW, mount it as a device and put swap on that (a bit ugly and slow, but it works if you REALLY want an all-BtrFS disk).
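
    That last workaround looks roughly like this (path, size and the loop device handling are illustrative only; run as root, and note the NOCOW attribute has to go on before the file has any data in it):

    Code:
    # Sketch of the "swap on a NOCOW loopback file" workaround. Path and size
    # are placeholders; this needs root.
    import subprocess

    swapfile = "/var/swapfile"  # hypothetical location on the btrfs volume

    subprocess.run(["truncate", "-s", "0", swapfile], check=True)    # file must be empty...
    subprocess.run(["chattr", "+C", swapfile], check=True)           # ...so NOCOW applies
    subprocess.run(["fallocate", "-l", "4G", swapfile], check=True)  # allocate the space

    # Expose the file as a block device via a free loop device, then swap on it.
    loopdev = subprocess.run(
        ["losetup", "--find", "--show", swapfile],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["mkswap", loopdev], check=True)
    subprocess.run(["swapon", loopdev], check=True)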
     
  11. Smokin Whale

    Smokin Whale Member

    Joined:
    Nov 29, 2006
    Messages:
    5,183
    Location:
    Pacific Ocean off SC
    Gotcha, makes sense now. I gather that if you want swap, it works like a standard EXT4 install at the moment: a main primary partition, with swap attached at the end of the disk in an extended partition of sorts and some UUID wizardry in /etc/fstab? Of course, like you mention, Btrfs excels where it's the only partition on a disk, and swap is kinda important for me, so I'll probably pass until it's fully supported.

    What are your thoughts on the implementation of OpenZFS on Ubuntu 16.04? I know ZoL was frowned upon when it was first released, but I'm thinking of taking another look at it now.

    PS: This thread is great, btw. The storage gods are smiling upon ye, Elvis :)
     
    Last edited: May 21, 2016
  12. Annihilator69

    Annihilator69 Member

    Joined:
    Feb 17, 2003
    Messages:
    6,082
    Location:
    Perth
    I've been using ZFS for a while now, and what I've noticed is that over time you get fragmentation and thus slower write speeds.

    This is directly related to the amount of free space on the pool: the more free space, the faster the write speeds, as the average seek distance to find a free block is lower.

    I guess with all-SSD pools it will become less of an issue.
    At the moment I try to keep my arrays under 70% utilised;
    at around 80%, write performance drops off a cliff.
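
    A quick way to keep an eye on that threshold would be something like this (pool names are whatever yours are; the fragmentation column needs a reasonably recent ZFS):

    Code:
    # Warn when a pool crosses the utilisation thresholds mentioned above.
    # "zpool list -H -o ..." prints tab-separated values, one pool per line.
    import subprocess

    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity,fragmentation"],
        check=True, capture_output=True, text=True,
    ).stdout

    for line in out.splitlines():
        name, capacity, frag = line.split("\t")
        used = int(capacity.rstrip("%"))
        if used >= 80:
            print(f"{name}: {used}% full ({frag} frag) - expect writes to fall off a cliff")
        elif used >= 70:
            print(f"{name}: {used}% full ({frag} frag) - time to think about more space")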
     
  13. fad

    fad Member

    Joined:
    Jun 26, 2001
    Messages:
    2,459
    Location:
    City, Canberra, Australia
    Inline dedupe is good for some things. Depends on the load and IO profile.

    I have 24x 1TB spindles with an SSD L2ARC, running ZFS and presenting 8Gb FC targets, providing storage to a VMware cluster. It is very fast and cheap. I had to replace the HBA with a third-party LSI card.

    I also have archive files on storage spaces with dedupe on. With some pretty amazing ratios.


    I would like to have another look at Ubuntu for ZFS. The last time I used it, with the FUSE implementation, the performance was really bad (20-30MB/s over 16 spindles).
    The feature I would like to see from any of them would be OPAL hardware encryption.


    Do you have SSD caches? Or just spinning disks?
     
    Last edited: May 21, 2016
  14. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    Under Ubuntu, if you select your root partition to be BtrFS and don't specify anything for /home, it will automatically create a subvolume called "@" for / and "@home" for /home.

    This is the expected minimum volume layout for APT's automatic snapshot mode (make a snapshot on any change that APT makes to the system, whether it's an add, remove or upgrade), which is a pretty cool feature for desktop and server systems alike.
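
    You can see the layout for yourself once it's installed; something like this (needs root, and assumes btrfs-progs is present) just lists the subvolumes on the root filesystem:

    Code:
    # List the subvolumes on an Ubuntu btrfs root to confirm the @ / @home
    # layout. Needs root and btrfs-progs.
    import subprocess

    out = subprocess.run(
        ["btrfs", "subvolume", "list", "/"],
        check=True, capture_output=True, text=True,
    ).stdout

    for line in out.splitlines():
        # each line ends with "path <name>", e.g. "path @" or "path @home"
        print(line.rsplit("path ", 1)[-1])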

    I think BtrFS as your primary file system on Linux is a very good idea, even if you're not using it to its fullest extent (no pun intended).

    I am not a lawyer. However, IMHO the CDDL and GPL are incompatible, and I think that Canonical bundling binary ZFS in-kernel with Ubuntu is against the terms of both licenses.

    RMS and the FSF both agree with me. Canonical disagree. Most people don't understand or don't care.

    It should be dealt with via DKMS, just like other GPL-incompatible software. For example, DKMS is the way that Nvidia drivers, VirtualBox, and a bunch of other stuff deal with license incompatibilities. There should be no difference with ZFS, and it isn't that hard for end users to deal with (for most things it's completely transparent, and happens automatically on installation via APT).

    Important to note that I'm not against OpenZFS/ZoL. That's fine, and we have tools out there to deal with making that work easily. I just think that the GPL and CDDL should both be adhered to by everyone, ESPECIALLY members of the open source community.
     
    Last edited: May 21, 2016
  15. Smokin Whale

    Smokin Whale Member

    Joined:
    Nov 29, 2006
    Messages:
    5,183
    Location:
    Pacific Ocean off SC
    Yeah, so I was asking more in relation to performance, reliability, manageability, etc. However I do agree: it's definitely a grey area, and surprising that Canonical went ahead with it. I really don't know enough about the licensing models, so I figured they just found a loophole and ran with it. :Paranoid: I guess that puts me in the "don't care" category, and it won't stop me from using it for now. If there's a genuine problem with the licensing, I'm sure we'll hear about it well before I'm ready to put it into any sort of business environment.
     
  16. Diode

    Diode Member

    Joined:
    Jun 17, 2011
    Messages:
    1,736
    Location:
    Melbourne
    My find has made me a little more paranoid about how I'm handling my photos. The good news is that I had already taken some half-decent measures. I have 3 copies to compare against, and after comparing the hashes of 2 copies it seems one drive (my WD Green) was much worse than the other (my WD Black). Fortunately I haven't come across a case of double corruption, with the same file damaged on both sides. I'll repair the files on the Black, then create a new checksum file and compare the repaired Black against my other offline USB backup as an extra measure. Hopefully between all the copies I can clean it up as best I can.

    Moving forward, it might be time to embrace cloud storage to keep a golden copy. I've been putting it off since uploading via ADSL will suck, but maybe I'll just make use of work's gigabit internet. Cloud storage is going to have better file system integrity checking for a fraction of the price of anything I can implement at home. Even if the cloud solution didn't use ZFS in the back end, it's still a step up.

    Through all the years of storing and copying these files between drives I really haven't experienced corruption at this scale, so take heed of the warnings! :Paranoid:


    Edit: Considering going down the route of building a FreeNAS box.
     
    Last edited: May 22, 2016
  17. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    ZoL hasn't implemented all of the features of native ZFS on Solaris yet, but what has been implemented is pretty stable.

    Ditto for ZFS on Mac, which I'm going to be pushing into production workloads soon. I'm really thankful for that - let me once again say just how terrible HFS+ is, and how sorry I feel for media/photography users who rely on OSX and HFS+ for their livelihoods.

    How are you verifying that?
     
  18. Smokin Whale

    Smokin Whale Member

    Joined:
    Nov 29, 2006
    Messages:
    5,183
    Location:
    Pacific Ocean off SC
    Would I be incorrect in saying that most, if not all, reputable cloud providers use some form of next-gen filesystem on their underlying hardware with parity checking? People's data is kinda their bread and butter; you'd think that if they have to handle petabytes of the stuff, they'd want to cover their ass over something like bit rot. I wouldn't blame someone for making assumptions about it. I did a little googling and couldn't find much evidence of bit rot corrupting client data on a few providers like Google Drive, Dropbox, etc (it all seems to be caused by bit rot on the local disk syncing to the cloud, which is another reason why a sync is not a true backup).
     
  19. Diode

    Diode Member

    Joined:
    Jun 17, 2011
    Messages:
    1,736
    Location:
    Melbourne
    You could say it's a bit of a generalisation, but generally speaking the underlying infrastructure for most cloud services would have more error checking and correction on the storage layer than a standalone HDD. That's not to say it's foolproof.
     
  20. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,454
    Location:
    Brisbane
    I would certainly hope so, but as above, how do you verify this?

    Given that most popular vendor-provided storage solutions pre-ZFS (NetApp WAFL, VMFS, etc) didn't do block-level checksumming, and relied instead on standard SAS/FC technology reliability, there are a lot of vendors out there that still don't have access to this level of data protection.

    Probably not so much of an issue when we had 1TB drive densities. Now that 8TB is pretty common, it is certainly becoming a bigger problem.

    Where I work, we still have a lot of legacy NAS units in production on hardware LSI RAID controllers with battery backup and RAID6. They do automatic weekly scrubs, which can take hours or days depending on the data volumes. The problem is that performance takes a beating, and there's the potential for actively read data to be in a bad state for a week or more before you notice. Likewise, this is all handled by vendor-provided "black box" proprietary firmware, with a wink and a smile from the vendor saying "trust us" (which I don't). That's pretty much how most vendor storage has worked for a long time (even scaling right up to large SANs, which are more or less the same idea at bigger scale with more IO paths).

    New generation clustered filesystems are dealing with bitrot in their own ways. Ceph and Gluster both have checksum/scrub features. And as above, I'd *hope* that folks at AWS/Azure/Google levels would have their own checksum/scrub type features, but that data isn't available to us.
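
    For the filesystems this thread is about, the equivalent check is a scrub you can run and inspect yourself; roughly this (pool name and mount point are placeholders, and both need root):

    Code:
    # Kick off scrubs on a ZFS pool and a btrfs mount, then check their status.
    # Pool name and mount point are placeholders.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # ZFS: scrub the pool, then report progress/errors.
    run(["zpool", "scrub", "tank"])
    run(["zpool", "status", "tank"])

    # BtrFS: scrub the filesystem mounted at /data, then report.
    run(["btrfs", "scrub", "start", "/data"])
    run(["btrfs", "scrub", "status", "/data"])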

    It's all a bit super-paranoid, of course. In this day and age of the Internet, we'd hear about data corruptions happening very quickly. But just yelling "to the cloud" isn't always an answer for data reliability.
     
    Last edited: May 22, 2016
