Next gen filesystems (ZFS, BtrFS, ReFS, APFS, etc)

Discussion in 'Storage & Backup' started by elvis, May 20, 2016.

  1. MUTMAN

    MUTMAN Member

    Joined:
    Jun 27, 2001
    Messages:
    8,803
    Location:
    4109
    Is it easy enough to set up in Ubuntu? Platform would be a Gen8 MicroServer, 8GB ECC, with 4x 3TB HDDs and a 250GB SSD for a cache.

    Thanks, I'll go do some reading and think :thumbup:
     
  2. Jazper

    Jazper Member

    Joined:
    Jul 28, 2001
    Messages:
    2,668
    Location:
    Melbourne Australia
    Don't bother with the cache drive

    https://wiki.ubuntu.com/Kernel/Reference/ZFS
     
    Last edited: Aug 14, 2016
  3. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,506
    Location:
    Brisbane
    I would think so. You can at least do things like set your metadata profile to "dup" to get a little extra resilience out of it.
     
  4. Jazper

    Jazper Member

    Joined:
    Jul 28, 2001
    Messages:
    2,668
    Location:
    Melbourne Australia
    Ext4 can do checksumming of metadata - is there a benefit to duplicating it that I don't know about?
     
  5. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,254
    Location:
    Canberra
    Agreed. L2ARCs aren't really required at all if all you want to do is flood a 1Gb link with random, large data files (e.g. movies) for a handful of clients.
     
  6. MUTMAN

    MUTMAN Member

    Joined:
    Jun 27, 2001
    Messages:
    8,803
    Location:
    4109
    I thought it might take some of the head thrashing out of seeding lots of blocks?
     
  7. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,254
    Location:
    Canberra
    Not really. L2ARC is just going to cache stuff that doesn't land in RAM. It's *read* only - and so if the data isn't hot, you're not going to notice it.

    On a VM host, I see about a 5% hit rate on it, vs a 60% hit rate on my ARC.
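
    If you want to check your own numbers, here's a rough sketch that reads the OpenZFS kstats on Linux (assuming /proc/spl/kstat/zfs/arcstats exists on your build) and works out the ARC and L2ARC hit rates:

    Code:
    # Rough ARC / L2ARC hit-rate check from the OpenZFS kstats on Linux.
    # Assumes /proc/spl/kstat/zfs/arcstats is present (ZFS on Linux / OpenZFS).

    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        with open(path) as f:
            for line in f.readlines()[2:]:      # skip the two kstat header lines
                parts = line.split()
                if len(parts) == 3:
                    name, _kind, value = parts
                    stats[name] = int(value)
        return stats

    def hit_rate(hits, misses):
        total = hits + misses
        return 100.0 * hits / total if total else 0.0

    s = read_arcstats()
    print("ARC   hit rate: %.1f%%" % hit_rate(s["hits"], s["misses"]))
    print("L2ARC hit rate: %.1f%%" % hit_rate(s.get("l2_hits", 0), s.get("l2_misses", 0)))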
     
  8. Jazper

    Jazper Member

    Joined:
    Jul 28, 2001
    Messages:
    2,668
    Location:
    Melbourne Australia
    For completeness
    (From Ars Technica)
     
  9. yanman

    yanman Member

    Joined:
    Jun 4, 2002
    Messages:
    6,609
    Location:
    Hobart
    Interesting, but certainly seems to make sense.

    I wonder if it would be feasible to develop a tool that tested the dedupe performance, and/or other performance metrics, to give some indication of whether the RAM allocated is undersized, and if so by how much?

    Say you had a zpool of 15TB that contained a lot of VMs. There's going to be a lot of dupes there, but perhaps only a handful of VMs will be live. I'm guessing cache would be very important to performance tuning in that scenario?
     
    Last edited: Aug 14, 2016
  10. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,506
    Location:
    Brisbane
    BtrFS checksums everything (data and metadata).

    Checksumming something doesn't let you fix it when it's broken. It merely lets you identify a fault. Keeping a second copy allows the filesystem to recover from faults.

    BtrFS has two modes of duplicating things. "Dup" allows a second copy to live on the same volume (useful for mdadm users) and "RAID1" mandates the second copy must live on a separate volume.

    You can tell BtrFS to treat data and metadata differently. Telling your data to dup will use a lot of disk space, but telling metadata to dup doesn't waste a whole lot of space, and gives you a bit of extra safety.

    On top of that, BtrFS offers LZO compression, which is likely to save whatever space you lost to metadata dup for negligible CPU overhead.
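
    If anyone wants to try it, a rough sketch of the commands involved (device and mount point are placeholders; I've just wrapped the stock btrfs-progs tools in a bit of Python):

    Code:
    # Sketch only: single-device BtrFS with duplicated metadata and LZO compression.
    # /dev/sdX and /mnt/tank are placeholders - adjust to suit.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # "-m dup" keeps two copies of metadata; "-d single" leaves data as one copy.
    # On multiple devices you'd use "-m raid1" to force the copies onto different disks.
    run(["mkfs.btrfs", "-m", "dup", "-d", "single", "/dev/sdX"])

    # Mount with LZO compression for newly written files.
    run(["mount", "-o", "compress=lzo", "/dev/sdX", "/mnt/tank"])

    # An existing filesystem can have its metadata profile converted in place:
    # run(["btrfs", "balance", "start", "-mconvert=dup", "/mnt/tank"])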
     
    Last edited: Aug 14, 2016
  11. Jazper

    Jazper Member

    Joined:
    Jul 28, 2001
    Messages:
    2,668
    Location:
    Melbourne Australia
    Thanks for that!
     
  12. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,254
    Location:
    Canberra
    DO NOT EVER USE DEDUPE ON HIGH-IO REQUIREMENT SITUATIONS

    De-dupe is, at its most basic level, a trade-off of IOPS for space. VMs, by definition, have high-IOPS requirements (compared to just shifting files).

    ZFS-specific de-dupe is a pretty average implementation - and comes with some pretty crazy RAM requirements. That's on top of the fact that serving VMs effectively from ZFS already requires some pretty specific pool design and RAM sizing - which means it's almost certainly a no-no for that implementation, unless you have literally 256GB+ RAM, and maybe an all-flash array.
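
    To put a very rough number on the RAM side of it, here's a back-of-envelope sketch using the commonly quoted figure of around 320 bytes of core per dedup table (DDT) entry - the real cost varies with pool layout, so treat it as an estimate only:

    Code:
    # Back-of-envelope ZFS dedup table (DDT) RAM estimate.
    # Assumes ~320 bytes per unique block held in core, which is the commonly
    # quoted rule of thumb; the real figure depends on the pool.

    def ddt_ram_gib(pool_used_tib, avg_block_kib=64, bytes_per_entry=320):
        unique_blocks = (pool_used_tib * 1024**4) / (avg_block_kib * 1024)
        return unique_blocks * bytes_per_entry / 1024**3

    # e.g. a 15 TiB pool of VM images with a 64 KiB average block size
    print("%.0f GiB of DDT" % ddt_ram_gib(15))   # roughly 75 GiB before the ARC gets a look in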
     
    Last edited: Aug 15, 2016
  13. fad

    fad Member

    Joined:
    Jun 26, 2001
    Messages:
    2,460
    Location:
    City, Canberra, Australia
    Thanks for that information. Now to tend to my bleeding eyeballs.

    Is this the same for all dedupe? What about VSAN/EMC etc?
     
  14. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,506
    Location:
    Brisbane
    I can't speak for others, but BtrFS dedup is done via a manual "one shot" command.

    You can tell it to scan the filesystem and store the hash values either in RAM (default) or in an SQLite3 DB via an option. My crappy old home fileserver with a mere 8GB of RAM can happily dedup with minimal RAM usage.

    As for high IOPS setups, BtrFS is a bit crappy for that at the moment anyway, so the point is kind of moot. It's much better suited as a generic file store.
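
    If anyone wants to try the same thing, the usual tool is duperemove (assuming that's the one-shot command in question) - a rough sketch, with the paths as placeholders:

    Code:
    # Sketch: out-of-band BtrFS dedup with duperemove. Paths are placeholders.
    import subprocess

    # Default behaviour keeps the block hashes in RAM:
    subprocess.run(["duperemove", "-dr", "/mnt/tank"], check=True)

    # --hashfile stores them in an on-disk (SQLite-backed) database instead,
    # which keeps RAM usage low and lets later runs reuse the earlier scan:
    subprocess.run(["duperemove", "-dr", "--hashfile=/var/tmp/dupehashes.db",
                    "/mnt/tank"], check=True)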
     
  15. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,254
    Location:
    Canberra
    Fundamentally - it *has* to be the same.

    Think of the difference between dedupe and compression. Lossless compression has all the information to reassemble the file right there - realistically it's a function of CPU time, which, with CPUs as ludicrous as they are (and given that file access and decompression are easily parallelised), is a non-issue.

    The *very* nature of dedupe means that:

    1. There needs to be a table somewhere mapping where all the data is for each reference point. (yes, ram - or at least flash).
    2. This table merely provides a point of reference to rebuild that data

    In a way, de-dupe is "compression" for an entire volume, whereas compression is for a single object.

    The quickest way to increase space savings is to shrink your de-dupe block size - which makes your de-dupe table larger, but also puts more potential load (in some circumstances, such as VMs) onto a particular point on a disk.

    If you have 500 Windows VMs (or Linux, who cares) all referencing the same blocks - but all needing to be re-assembled from a dedupe table - and you hit a boot storm, watch your NAS/SAN crumble.

    Microsoft binned Single Instance Storage in Ex2007 for a reason - Disk is cheap. IOPS are expensive.

    Dedupe makes the most sense for things like backups - so long as you can guarantee your dedupe table (remember, without it you're fucked) - because it's infrequently accessed data. File servers too (though you're less likely to see savings as great), because realistically, file sharing is typically fairly boring and easy - especially if you have hot data caching on a faster tier.
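
    To make the trade-off concrete, here's a toy sketch (not any real filesystem's on-disk format) of what a block-level dedupe table is doing. Every read of a deduped file has to come back through that table, which is why losing it is fatal and why smaller blocks mean a bigger table:

    Code:
    # Toy block-level dedupe, purely to illustrate the trade-off.
    # Real filesystems use far more elaborate structures; this is not ZFS's DDT.
    import hashlib

    BLOCK = 4096            # smaller blocks = better savings, bigger table
    table = {}              # hash -> stored block ("the dedupe table")
    files = {}              # filename -> list of block hashes (references only)

    def write(name, data):
        refs = []
        for i in range(0, len(data), BLOCK):
            chunk = data[i:i + BLOCK]
            h = hashlib.sha256(chunk).hexdigest()
            table.setdefault(h, chunk)      # store each unique block once, ever
            refs.append(h)
        files[name] = refs

    def read(name):
        # every read is reassembled via the table - lose the table and the
        # references are worthless
        return b"".join(table[h] for h in files[name])

    write("vm1.img", b"A" * 8192 + b"B" * 4096)
    write("vm2.img", b"A" * 8192 + b"C" * 4096)   # shares two blocks with vm1
    print(len(table), "unique blocks stored,",
          sum(len(r) for r in files.values()), "block references")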
     
  16. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,506
    Location:
    Brisbane
    I'm not sure how BtrFS does it, to be honest. It uses COW+Reflink, and I don't think it maintains a dedup table once the deduplication is done.

    https://lwn.net/Articles/331808/
    http://www.pixelbeat.org/docs/unix_links.html

    As I understand it, a reflink is similar to a hard link, but (ab)uses COW to allow future changes to maintain separate metadata for that block. I have a feeling that data doesn't need a table of sorts, and rather just uses the native inode semantics to keep track of things.
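
    For illustration only: reflink cloning is exposed on Linux through the FICLONE ioctl (generic VFS plumbing rather than anything BtrFS-specific). A rough sketch, with placeholder filenames:

    Code:
    # Rough sketch: clone a file via the Linux FICLONE ioctl (what
    # "cp --reflink=always" uses under the hood). Only works on filesystems
    # that support reflinks (BtrFS, XFS with reflink, etc). Filenames are placeholders.
    import fcntl

    FICLONE = 0x40049409   # _IOW(0x94, 9, int)

    with open("source.img", "rb") as src, open("clone.img", "wb") as dst:
        # The clone shares the source's extents; blocks are only copied later,
        # when either file is modified (copy-on-write).
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())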

    Again, BtrFS is terrible for hosting large disk images and databases, and can't do the equivalent of "zvol" volumes, so testing VM performance is kind of moot. I've not noticed any IOPS loss when deduping the BtrFS filesystems I've been using, but again they're all general purpose file stores, so I'm not entirely sure.

    I could be 100% wrong of course. I've not read deeply into it at all, specific to BtrFS.
     
  17. fad

    fad Member

    Joined:
    Jun 26, 2001
    Messages:
    2,460
    Location:
    City, Canberra, Australia
    Microsoft Exchange went to a Google-like approach: no underlying RAID system, just lots of spindles and lots of copies. I think the internal install at MS has 16 copies across their 16-server DAG groups, split between their three data centres. Lots of JBOD. The software-defined approach to filesystems is removing the need for expensive servers and RAID cards.
     
  18. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,254
    Location:
    Canberra
    I'd be surprised if it wasn't an ultra-bleeding-edge Exchange Online build, backed on Hyper-V 2016 and SOFS on Storage Spaces (but yeah, without RAID cards) with DAGs (well, 365 will all be based on this tech - they will just be on a much newer iteration of it).

    MS has a long history of eating their own dogfood - even to their own detriment (there are various leaks about the Windows 8 and 10 productivity losses through the alpha/beta programs).
     
  19. Daemon

    Daemon Member

    Joined:
    Jun 27, 2001
    Messages:
    5,471
    Location:
    qld.au
    They're actually different. ZFS is in-band only, meaning it runs in realtime.

    BTRFS currently only supports out-of-band dedup, so it needs to be manually run (like a defrag). They're working on an inline solution (I think it was marked experimental a few months back) but it'll be like ZFS and eat RAM to do it. Does anyone know how Windows does it?

    Reality is, spindles are cheap. If you need more storage space with ZFS / BTRFS, rack and stack. That's the whole point, you're not paying the SAN / brandname premiums of the big products and have more flexibility.

    Out-of-band dedup may be a reasonable compromise, but I'll be waiting a bit longer before trying it :)
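
    For the two approaches side by side (pool, dataset and path names are placeholders):

    Code:
    # In-band vs out-of-band dedup, side by side. Names are placeholders.
    import subprocess

    # ZFS: in-band - set it on the dataset and every subsequent write is
    # checked against the dedup table in realtime (and the RAM cost follows):
    subprocess.run(["zfs", "set", "dedup=on", "tank/vmstore"], check=True)

    # BtrFS: out-of-band - nothing happens at write time; you run a scan
    # after the fact, like a defrag (duperemove, as mentioned earlier):
    subprocess.run(["duperemove", "-dr", "/mnt/tank"], check=True)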
     
  20. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,506
    Location:
    Brisbane
    I keep getting told this, and then I look at my 2PB of production storage sitting at 90% utilisation, and groan at the idea of asking management for more storage budget.
     
