Next gen filesystems (ZFS, BtrFS, ReFS, APFS, etc)

Discussion in 'Storage & Backup' started by elvis, May 20, 2016.

  1. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    30,823
    Location:
    Brisbane
    No, he didn't say that at all. He said, and I quote directly from your link:

    "ZFS can mitigate this risk to some degree".

    That does not equal "you don't need ECC RAM". Risk mitigation does not mean removing risk altogether. ("Mitigate" means "to make something less severe" - as in you can mitigate the loss of a leg with a prosthetic, but you still don't have your leg.) He then went on to say (and again I quote):

    "if you love your data, use ECC RAM"

    That's pretty black and white. His comparison to other file systems is that their integrity checking is no better without ECC than ZFS's is, which is a bit of a "well, duh" statement. As has been repeated in this thread, ECC is mandatory if you care about strict data integrity on any next-gen file system. That's not a ZFS-specific statement; it's just a fact of current computer architecture, independent of what's on your hard disk.
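
    If you're not sure whether the box you're running actually has working ECC, you can check from a live Linux system. A rough sketch (these are the standard dmidecode and EDAC paths; adjust for your distro):

    Code:
    # Reported error-correction type from the SMBIOS tables ("Multi-bit ECC" vs "None")
    dmidecode -t memory | grep -i 'error correction'

    # If the EDAC drivers are loaded, corrected-error counters per memory controller live here
    grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null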
     
  2. Biel_Tann

    Biel_Tann Member

    Joined:
    Nov 8, 2004
    Messages:
    217
    I'm just going to post the quote to get rid of any ambiguity in what I'm saying, as I don't seem to be communicating it well...

     
  3. Doc-of-FC

    Doc-of-FC Member

    Joined:
    Aug 30, 2001
    Messages:
    2,898
    Location:
    Canberra
    Fuck RAM - people are still unknowingly passing data to non-BBU write caches on disks. With 128MB caches on today's large drives, that's insanity right there.

    SLOG to your heart's content with SSDs that don't do power-loss protection.

    It's why I've got an Intel DC-series SSD as my bcache caching device, with write caches disabled on the backing devices, all layered with btrfs on LUKS.

    I invested in an E3 Xeon a few years back and built an all-in-one.
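
    For anyone curious, a rough sketch of that kind of layering. Device names, the cache-set UUID and the "cryptpool" mapping name are placeholders, and the exact ordering of bcache vs LUKS is a matter of taste; this puts bcache underneath:

    Code:
    # Turn off the volatile write cache on the spinning backing disk (placeholder /dev/sdb)
    hdparm -W 0 /dev/sdb

    # Create the bcache backing device on the HDD and the cache device on the PLP SSD (placeholder /dev/sdc)
    make-bcache -B /dev/sdb
    make-bcache -C /dev/sdc

    # Attach the cache set to the backing device (UUID comes from bcache-super-show /dev/sdc)
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

    # Layer LUKS on top of the bcache device, then btrfs on top of that
    cryptsetup luksFormat /dev/bcache0
    cryptsetup open /dev/bcache0 cryptpool
    mkfs.btrfs /dev/mapper/cryptpool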
     
  4. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    16,500
    Location:
    Canberra
    Yes.

    Use LSI/Avago HBAs + Storage Spaces + ReFS (with Integrity Streams turned on) + Cluster Shared Volumes (CSV).

    If you want host based expansion, look into Scale Out File Server.

    Note: 2012 R2 SOFS is not great (read: don't). 2016 only.

    Make sure your backup applications are ReFS and CSV aware.

    No, you can't have a cluster that's weeks out of sync heal itself. You'd have to have it online most of the time.

    Also replication is *not* a backup. It is an availability feature.

    110%. No PLP (power-loss protection) on write-caching devices = fuck my data.
     
    Last edited: Jan 12, 2017
  5. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    30,823
    Location:
    Brisbane
    Agreed. Having said that, the whole point of COW is that you're never left in an inconsistent state, even on instant failure.

    Of course, that means that on instant failure you potentially haven't committed many MB (maybe even GB across many disks) worth of changes to disk. But at least you'll reboot into a consistent, if somewhat out-of-date, state. :)
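
    To put a rough number on that window: on OpenZFS (on Linux, at least) async writes are committed in transaction groups on a timer, and you can trade throughput for durability per dataset. A sketch, with "tank" as a placeholder pool name:

    Code:
    # Async writes are committed at least every zfs_txg_timeout seconds (default 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout

    # If losing even a few seconds of writes is unacceptable, treat every write as synchronous
    zfs set sync=always tank
    zfs get sync tank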
     
  6. ae00711

    ae00711 Member

    Joined:
    Apr 9, 2013
    Messages:
    1,123
    Where do I check this?
    I just made my first storage space/pool a couple of days ago.
    Do I have to have CSV enabled to combat bitrot? (I thought it was just non-RAID HBA + Storage Spaces + ReFS?)
     
  7. wwwww

    wwwww Member

    Joined:
    Aug 22, 2005
    Messages:
    4,434
    Location:
    Melbourne
    A cheaper option would be to just get a Samsung Pro and disable the disk write cache; they perform remarkably well without it.
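
    For reference, a quick sketch of turning the volatile write cache off from the OS side (device name is a placeholder; note the hdparm setting may not survive a power cycle on every drive):

    Code:
    # ATA level: disable the drive's volatile write cache
    hdparm -W 0 /dev/sdX

    # SCSI/SAT level alternative: clear the WCE bit, then verify it
    sdparm --clear=WCE /dev/sdX
    sdparm --get=WCE /dev/sdX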
     
  8. Doc-of-FC

    Doc-of-FC Member

    Joined:
    Aug 30, 2001
    Messages:
    2,898
    Location:
    Canberra
    The Intel DC S3500 was a remnant of a ZFS build, so its cost was $0. It has on-PCB capacitors for power-loss protection, hence its use. A Samsung 850 Pro with the write cache disabled probably wouldn't hit as hard as the Intel, and can't guarantee writes survive an unexpected power loss: http://www.storagereview.com/images/StorageReview-Intel-DC-S3500-SSD-Circuit-Board-Top.jpg

    The reason for it is as follows: I host several KVM virtual machines on my [AIO] computer - pfSense, Windows (many editions) and lab servers. These compete for IO, especially during cold boots / restarts (a Linux kernel update, for example), and things get messy with limited IO even on WD Blacks; with the write caches on the disks disabled, things would grind to a halt.

    So, armed with an SSD with power-loss protection that safely writes its cache contents back to flash, I've got a fast disk cache (using bcache writeback for the VMs) on top of sync writes to the backing disks; once those writes complete, bcache clears the blocks from the SSD.
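
    A minimal sketch of that writeback arrangement, assuming the device registered as bcache0:

    Code:
    # Switch the cache mode from the default writethrough to writeback
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # Watch how much dirty data is still waiting to be flushed to the backing disk
    cat /sys/block/bcache0/bcache/dirty_data

    # Overall cache state (clean/dirty)
    cat /sys/block/bcache0/bcache/state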

    Now, this setup isn't immune to every scenario under the sun, but it's a marked upgrade over the standard on-disk cache approach. I get massively reduced boot times, because the VMs' frequently used random blocks are served from the SSD (bcache's LRU policy) even with a cold block cache in Linux memory, and writeback gobbles up data from SQL quite nicely. SQL isn't ACID compliant if the storage underneath it is flimsy.

    Now, if I wanted paranoid-level storage, I'd use ZFS on FreeBSD with mirrored SLOGs - which is where the DC S3500 came from.
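
    For what it's worth, adding a mirrored SLOG to an existing pool is a one-liner (pool name "tank" and the FreeBSD device names are placeholders):

    Code:
    # Add a mirrored log vdev (SLOG) built from two power-loss-protected SSDs
    zpool add tank log mirror /dev/ada1 /dev/ada2

    # Confirm the log shows up as a mirror under the pool
    zpool status tank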
     
  9. wwwww

    wwwww Member

    Joined:
    Aug 22, 2005
    Messages:
    4,434
    Location:
    Melbourne
    With the write cache disabled it won't report data as written until it has actually hit the non-volatile flash, so it guarantees data retention as well as any capacitor/battery/flash-backed unit does. The DC probably has more memory dedicated to parity (the 850 is a consumer drive), so it has a lower bit error rate - but the whole point of this thread is filesystems that account for drive bit errors, isn't it?

    We use 850 Pros with the disk cache disabled in a production environment (though with separate flash-backed DRAM) hosting many VM servers. We put the boot drives on mechanical disks; the SSDs are just for caches and databases, and they perform remarkably well.

    Of course the DC with its cache is superior, but we're talking an order of magnitude more in cost per gigabyte. If you got it for free, though, that's all good.
     
  10. CirCit

    CirCit Member

    Joined:
    Apr 4, 2002
    Messages:
    126
    I thought the whole idea of Storage Spaces was to get away from vendor tie-ins?

    I'm sure I'll have to work out in VMs what clustering and Scale-Out File Server do differently, as I thought they did essentially the same thing.

    Scrapping Storage Spaces replication across servers: does ReFS interact with SMB3 at all? Does the checksum make it to the other end so the data can be verified on arrival? If not, robocopy would probably still be my best bet.
     
  11. ae00711

    ae00711 Member

    Joined:
    Apr 9, 2013
    Messages:
    1,123
    I'm quite certain NSanity just used LSI as an example, as LSI is by far the most popular non-RAID ('IT' pass-thru) HBA out there, particularly for home server enthusiasts.
     
  12. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    16,500
    Location:
    Canberra
    HAITCH BEEE AYYYY

    Not a raid card.

    LSI's shit works. Those rebadged, cross-flashed IBM cards all the kids pick up on eBay for $15?

    LSI.
     
  13. OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    30,823
    Location:
    Brisbane
    Being the pedantic dick I am for just a moment: LSI's shit started working when they bought out 3ware and stole all their tech. Prior to that, everything LSI was a big proprietary mess, and Solaris/BSD/Linux support was next to zero (which is why we *nix admins all bought only 3ware cards back in the day). But I digress...

    Avago have since bought LSI, and now I see Broadcom have bought Avago. Round and round we go. :)
     
  14. Doc-of-FC

    Doc-of-FC Member

    Joined:
    Aug 30, 2001
    Messages:
    2,898
    Location:
    Canberra
    Not really - I consider the SSD $0 because it's already served its purpose. Admittedly it wasn't cheap at the time: the 160GB DC S3500 was about $1.50 a GB, whereas my 256GB 850 Pro was about $1.00 a GB.

    Out of interest, I grabbed my spare 160GB S3500 (which took a thrashing as an SLOG in its past life and hasn't been trimmed) along with my frequently trimmed 256GB Windows 850 Pro, and ran them both through dd and flashbench on the same HBA.

    results:
    Tests in order: dd on the 850 Pro, then the DC S3500; sdparm write-cache status for both drives; then flashbench on the 850 Pro and the DC S3500.
    Code:
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sde bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00882489 s, 46.4 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sde bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.0111258 s, 36.8 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sde bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.0104273 s, 39.3 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sde bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.011564 s, 35.4 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sde bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.0111702 s, 36.7 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sde bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.0148941 s, 27.5 MB/s
    
    
    Code:
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00313566 s, 131 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00897152 s, 45.7 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00332635 s, 123 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00638372 s, 64.2 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.0076655 s, 53.4 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00888064 s, 46.1 MB/s
    P9D-M flashbench-dev # dd if=/dev/urandom of=/dev/sda bs=4k count=100 conv=fdatasync
    100+0 records in
    100+0 records out
    409600 bytes (410 kB, 400 KiB) copied, 0.00399582 s, 103 MB/s
    
    Code:
    P9D-M nas # sdparm -i /dev/sde && sdparm --get=WCE /dev/sde
        /dev/sde: ATA       Samsung SSD 850   1B6Q
    Device identification VPD page:
      Addressed logical unit:
        designator type: NAA,  code set: Binary
          0x50025388a06dfbe6
        /dev/sde: ATA       Samsung SSD 850   1B6Q
    WCE         1  [cha: y]
    P9D-M nas # sdparm -i /dev/sda && sdparm --get=WCE /dev/sda
        /dev/sda: ATA       INTEL SSDSC2BB16  2370
    Device identification VPD page:
      Addressed logical unit:
        designator type: vendor specific [0x0],  code set: ASCII
          vendor specific: BTWL342306Q6160MGN  
        designator type: T10 vendor identification,  code set: ASCII
          vendor id: ATA     
          vendor specific: INTEL SSDSC2BB160G4                     BTWL342306Q6160MGN  
        designator type: NAA,  code set: Binary
          0x55cd2e404b4f23b7
        /dev/sda: ATA       INTEL SSDSC2BB16  2370
    WCE         1  [cha: y, def:  1]
    
    
    850 PRO 4k flash bench
    Code:
    P9D-M flashbench-dev # ./flashbench -a --blocksize=4096 /dev/sde
    align 68719476736	pre 88.7µs	on 90.5µs	post 84µs	diff 4.13µs
    align 34359738368	pre 95.5µs	on 101µs	post 95.9µs	diff 5.33µs
    align 17179869184	pre 93.9µs	on 101µs	post 96.4µs	diff 6.32µs
    align 8589934592	pre 97.6µs	on 100µs	post 94.4µs	diff 4.51µs
    align 4294967296	pre 83.9µs	on 89µs	post 84µs	diff 5.06µs
    align 2147483648	pre 99.3µs	on 109µs	post 107µs	diff 6.53µs
    align 1073741824	pre 92.2µs	on 99.8µs	post 96.3µs	diff 5.58µs
    align 536870912	pre 93.5µs	on 101µs	post 96.2µs	diff 5.76µs
    align 268435456	pre 92.3µs	on 98.7µs	post 95.8µs	diff 4.68µs
    align 134217728	pre 96.9µs	on 101µs	post 95.4µs	diff 5.12µs
    align 67108864	pre 85.2µs	on 89.8µs	post 85.2µs	diff 4.6µs
    align 33554432	pre 85.3µs	on 89.6µs	post 84.5µs	diff 4.74µs
    align 16777216	pre 77.5µs	on 80.6µs	post 74.1µs	diff 4.79µs
    align 8388608	pre 83.3µs	on 89.5µs	post 86.4µs	diff 4.59µs
    align 4194304	pre 86.7µs	on 91.6µs	post 83.9µs	diff 6.25µs
    align 2097152	pre 85.9µs	on 90.9µs	post 85.1µs	diff 5.42µs
    align 1048576	pre 86.5µs	on 88.4µs	post 82.6µs	diff 3.86µs
    align 524288	pre 85.8µs	on 89.6µs	post 85.2µs	diff 4.11µs
    align 262144	pre 85.9µs	on 92.3µs	post 87.2µs	diff 5.79µs
    align 131072	pre 84.7µs	on 90.2µs	post 85.1µs	diff 5.25µs
    align 65536	pre 85.9µs	on 89.5µs	post 84.6µs	diff 4.33µs
    align 32768	pre 82.6µs	on 88.9µs	post 85.8µs	diff 4.7µs
    align 16384	pre 83.9µs	on 88.9µs	post 84.6µs	diff 4.64µs
    align 8192	pre 85.3µs	on 89µs	post 84.5µs	diff 4.05µs
    
    DC S3500 4k flash bench
    Code:
    P9D-M flashbench-dev # ./flashbench -a --blocksize=4096 /dev/sda
    align 34359738368	pre 32.5µs	on 33.6µs	post 33.3µs	diff 718ns
    align 17179869184	pre 33.8µs	on 32.9µs	post 33.1µs	diff -568ns
    align 8589934592	pre 33.2µs	on 33.9µs	post 32.9µs	diff 866ns
    align 4294967296	pre 43.9µs	on 42.4µs	post 34.3µs	diff 3.32µs
    align 2147483648	pre 42.1µs	on 39.7µs	post 40.6µs	diff -1684ns
    align 1073741824	pre 27.7µs	on 28.7µs	post 28.1µs	diff 797ns
    align 536870912	pre 29.1µs	on 29.6µs	post 29.8µs	diff 94ns
    align 268435456	pre 42.4µs	on 41.1µs	post 42.7µs	diff -1495ns
    align 134217728	pre 40.5µs	on 40.7µs	post 42.3µs	diff -708ns
    align 67108864	pre 41.7µs	on 40.4µs	post 41.6µs	diff -1233ns
    align 33554432	pre 42.2µs	on 40.1µs	post 41.7µs	diff -1876ns
    align 16777216	pre 43.1µs	on 41µs	post 40.8µs	diff -966ns
    align 8388608	pre 40.2µs	on 39.9µs	post 40µs	diff -244ns
    align 4194304	pre 41µs	on 44.3µs	post 44.9µs	diff 1.36µs
    align 2097152	pre 46.2µs	on 43.9µs	post 41.8µs	diff -72ns
    align 1048576	pre 45µs	on 44.1µs	post 42.3µs	diff 404ns
    align 524288	pre 40.9µs	on 42.3µs	post 42.4µs	diff 592ns
    align 262144	pre 39.7µs	on 37.3µs	post 38.7µs	diff -1812ns
    align 131072	pre 40µs	on 37.8µs	post 38.6µs	diff -1528ns
    align 65536	pre 44.4µs	on 43.8µs	post 40µs	diff 1.62µs
    align 32768	pre 39.8µs	on 37.7µs	post 40.6µs	diff -2508ns
    align 16384	pre 39.7µs	on 38.9µs	post 40.3µs	diff -1120ns
    align 8192	pre 39.5µs	on 37.3µs	post 40.4µs	diff -2694ns
    

    What's interesting is that the 4k write throughput is quite volatile on the DC S3500 - more so than I expected. The 850 Pro, by comparison, was quite stable on that metric through this quick-and-dirty 4k test.

    The flashbench test shows the real value of the S3500's architecture: using a lookup index for blocks rather than an internal B-tree, it delivers data at roughly half the latency of the 850 Pro.
     
  15. Perko

    Perko Member

    Joined:
    Aug 12, 2011
    Messages:
    1,849
    Location:
    NW Tasmania
    Aitch*

    I remember the first server that I got to re-purpose, running an old version of SuSE with one of these, or something very similar in it. It had been powered down for six months, and when I fired it up, four out of the twelve old SCSI drives were clicking like champions, and the old AT full tower gave me a tingle just to say hi. Fun times.
     
  16. rainwulf

    rainwulf Member

    Joined:
    Jan 20, 2002
    Messages:
    3,976
    Location:
    bris.qld.aus
    This is the main reason I went to ZFS. NTFS bitrot started hitting me back when I started using 2TB disks on hardware RAID 5.
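
    For anyone new to it, this is the part ZFS automates: checksums are verified on every read, and a scrub walks the whole pool and repairs anything it can from redundancy. A quick sketch (pool name "tank" is a placeholder):

    Code:
    # Verify every block in the pool against its checksum
    zpool scrub tank

    # The CKSUM column and "errors:" section show anything detected (and repaired, if redundancy exists)
    zpool status -v tank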
     
  17. rainwulf

    rainwulf Member

    Joined:
    Jan 20, 2002
    Messages:
    3,976
    Location:
    bris.qld.aus
    As someone with a 100TB+ file server using 8TB disks for just movies and such, the L2ARC is an utter waste of time. It doesn't make any difference at all for media serving, and that's with 8TB archive SMR drives, not even standard drives.

    The archive drives still read as fast as a normal hard drive, so sixteen 8TB drives have plenty of performance to saturate a 1Gb connection without wasting an SSD.
     
  18. rainwulf

    rainwulf Member

    Joined:
    Jan 20, 2002
    Messages:
    3,976
    Location:
    bris.qld.aus
    When i first saw that card i needed some time to myself....

    ahem *cough*

    That was back in the day when I was JBODing 200GB disks with Silicon Image PATA cards.

    BTW, you posted a PATA card; a SCSI RAID wouldn't have used something like that. It would have been a single card, relatively long due to the RAM cache.
     
  19. MUTMAN

    MUTMAN Member

    Joined:
    Jun 27, 2001
    Messages:
    4,640
    Location:
    4109
    Yep, it's been noted by others too that for just serving up media it's a waste.
    But I have the SSD here doing nothing anyway.
    And if I can spin down a mechanical drive and let the torrent seeding hit a low-power SSD, then that's a win for me.
    Heat and power are higher on my priority list than most others', I'd say.
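
    A rough sketch of the spin-down side, for what it's worth (device name is a placeholder, and not every drive honours the timer):

    Code:
    # Set the standby (spin-down) timeout: 241 = 30 minutes, 242 = 1 hour
    hdparm -S 241 /dev/sdX

    # Or drop the drive into standby immediately once seeding has moved to the SSD
    hdparm -y /dev/sdX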
     
  20. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    16,500
    Location:
    Canberra
    Not really sure of the use case for L2ARC.

    My L2ARC hit rate on a virtual host was ~4% (ARC was ~60%).

    An awful lot of $ for fuck-all effectiveness.
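
    For anyone wanting to check their own numbers on ZFS on Linux, the hit rates can be pulled straight from the kstats (arc_summary reports the same thing, if it's installed). A rough sketch:

    Code:
    # ARC and L2ARC hit rates from the raw kstat counters
    awk '$1=="hits"||$1=="misses"||$1=="l2_hits"||$1=="l2_misses" {v[$1]=$3}
         END {
           printf "ARC hit rate:   %.1f%%\n", 100*v["hits"]/(v["hits"]+v["misses"])
           if (v["l2_hits"]+v["l2_misses"] > 0)
             printf "L2ARC hit rate: %.1f%%\n", 100*v["l2_hits"]/(v["l2_hits"]+v["l2_misses"])
         }' /proc/spl/kstat/zfs/arcstats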
     
