Next gen filesystems (ZFS, BtrFS, ReFS, APFS, etc)

Discussion in 'Storage & Backup' started by elvis, May 20, 2016.

  1. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    To be expected.

    TB of space / MB/s write speed = write time.

    Do the same for read time, add them together for a single all-sector read/write test.
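
    As a rough worked example (hypothetical drive figures, purely to illustrate the arithmetic):

    Code:
    # hypothetical figures: a 4TB drive sustaining ~150MB/s
    # 4,000,000 MB / 150 MB/s ≈ 26,667 s ≈ 7.4 hours per full write pass
    # a similar read pass roughly doubles it: ~15 hours for the full read/write test
    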

    I generally don't bother with new drives. Second hand, maybe run a couple of xxhash checksums over the drive and make sure they return the same value, but even then I just get impatient. Next gen filesystems will tell you really quickly if reads or scrubs hit even a single bad bit (acknowledging you aren't touching every sector on a non-full drive), and you should have good backups where it counts.
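
    For the second-hand check, it's something along these lines (assuming the xxhsum tool from the xxHash package; /dev/sdX is a placeholder for the drive under test):

    Code:
    # read the entire device twice; the two hashes should be identical
    xxhsum /dev/sdX
    xxhsum /dev/sdX
    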
     
    ae00711 likes this.
  2. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    Does ZFS not use internal UUIDs?

    With BtrFS you can point to the drive any way you like. All subsequent operations are done by the "blkid" UUID and sub-UUID information after that; drive order is totally irrelevant. mdadm is the same.
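
    For illustration (values are placeholders; the point is that the filesystem is addressed by UUID, not by device path):

    Code:
    blkid /dev/sdb                        # prints TYPE="btrfs", UUID="<filesystem uuid>", UUID_SUB="<per-device uuid>"
    mount UUID=<filesystem-uuid> /data    # works no matter which sdX the drives come up as
    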

    Seems positively archaic to do it any other way. I would have assumed ZoL would be better than that.
     
    Last edited: Apr 25, 2020
  3. cvidler

    cvidler Member

    Joined:
    Jun 29, 2001
    Messages:
    14,610
    Location:
    Canberra
    dunno where you got that idea? Of course it uses its own internal ID system.

    you can load a bunch of ZFS disks into a new system, and run 'zpool import -a' and it'll find and import the pool. no need to specify any device paths.

    Now, what ae00711 is probably experiencing is that ZFS does 'cache' imported pool device paths to speed up boot. You can clear that cache (delete a file) and do a 'zpool import -a' to reimport the pool.

    And if you create the array with either the WWNs, or preferably the /dev/disk/by-id/ names based on the serial numbers of the disks, you get a pool that's not confused by sdX naming changes. (You could also write a custom udev rule to keep the sdX names consistent based on serial number or WWN.)


    However, as usual, RTFM helps here: zpool import uses /dev/disk for its search path (and picks the first match). If you want to be specific with your device naming, a -d /dev/disk/by-uuid or /dev/disk/by-id (what I use) may be useful.
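
    A rough sketch of the cache-clearing dance (assuming the usual ZoL cache file location and a pool named "tank"; adjust to suit):

    Code:
    zpool export tank                      # if the pool is currently imported
    rm /etc/zfs/zpool.cache                # forget the cached device paths
    zpool import -d /dev/disk/by-id -a     # rediscover and import using stable by-id names
    zpool set cachefile=none tank          # optionally skip the cache for this pool entirely
    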
     
  4. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    It was a question. I don't use ZoL (only BSD and Solaris versions).

    Good to know they do it reasonably sensibly, although the "cached pool" issue still seems a bit annoying. I'd rather be a few seconds slower at boot with guaranteed array discovery than the alternative.
     
  5. cvidler

    cvidler Member

    Joined:
    Jun 29, 2001
    Messages:
    14,610
    Location:
    Canberra
    that's optional too. and a per-pool option at that.
     
    elvis likes this.
  6. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    1,007
    I don't see this as much of an issue if it can't find all the drives, because if you want to import the array you can just enumerate the devices individually, something like "zpool import tank" pointed at sda, sdb, sdc, sdd and so on (check the exact syntax to use), and then it has no problem opening the pool.
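
    A hedged sketch of what that looks like (check your zpool(8) man page; older ZFS on Linux only accepts directories with -d, while newer OpenZFS also accepts individual device paths):

    Code:
    # point the search at a directory of stable names
    zpool import -d /dev/disk/by-id tank
    # or, on newer OpenZFS, list devices explicitly
    zpool import -d /dev/sda -d /dev/sdb -d /dev/sdc -d /dev/sdd tank
    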
     
  7. cvidler

    cvidler Member

    Joined:
    Jun 29, 2001
    Messages:
    14,610
    Location:
    Canberra
    found a stack of disks (8) in a box, loaded them up to see what was on them.

    noticed they had the tell-tale ZFS partition layout, always shows as having two partitions, 1 and 9.
    Code:
    lrwxrwxrwx. 1 root root  9 Apr 30 20:59 wwn-0x600062b2004f1b40263d69d9564b574d -> ../../sdt
    lrwxrwxrwx. 1 root root 10 Apr 30 20:59 wwn-0x600062b2004f1b40263d69d9564b574d-part1 -> ../../sdt1
    lrwxrwxrwx. 1 root root 10 Apr 30 20:59 wwn-0x600062b2004f1b40263d69d9564b574d-part9 -> ../../sdt9
    lrwxrwxrwx. 1 root root  9 Apr 30 20:59 wwn-0x600062b2004f1b40263d69e456f99b26 -> ../../sdu
    lrwxrwxrwx. 1 root root 10 Apr 30 20:59 wwn-0x600062b2004f1b40263d69e456f99b26-part1 -> ../../sdu1
    lrwxrwxrwx. 1 root root 10 Apr 30 20:59 wwn-0x600062b2004f1b40263d69e456f99b26-part9 -> ../../sdu9
    lrwxrwxrwx. 1 root root  9 Apr 30 20:59 wwn-0x600062b2004f1b40263d69f958339fb4 -> ../../sdv
    lrwxrwxrwx. 1 root root 10 Apr 30 20:59 wwn-0x600062b2004f1b40263d69f958339fb4-part1 -> ../../sdv1
    lrwxrwxrwx. 1 root root 10 Apr 30 20:59 wwn-0x600062b2004f1b40263d69f958339fb4-part9 -> ../../sdv9
    lrwxrwxrwx. 1 root root  9 Apr 30 20:59 wwn-0x600062b2004f1b40263d6a0458e11812 -> ../../sdw
    lrwxrwxrwx. 1 root root 10 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a0458e11812-part1 -> ../../sdw1
    lrwxrwxrwx. 1 root root 10 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a0458e11812-part9 -> ../../sdw9
    lrwxrwxrwx. 1 root root  9 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a1059988a92 -> ../../sdx
    lrwxrwxrwx. 1 root root 10 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a1059988a92-part1 -> ../../sdx1
    lrwxrwxrwx. 1 root root 10 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a1059988a92-part9 -> ../../sdx9
    lrwxrwxrwx. 1 root root  9 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a1c5a4a686c -> ../../sdy
    lrwxrwxrwx. 1 root root 10 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a1c5a4a686c-part1 -> ../../sdy1
    lrwxrwxrwx. 1 root root 10 Apr 30 21:00 wwn-0x600062b2004f1b40263d6a1c5a4a686c-part9 -> ../../sdy9
    lrwxrwxrwx. 1 root root  9 Apr 30 20:51 wwn-0x600062b2004f1b40263d6a315b8bc9c6 -> ../../sdz
    lrwxrwxrwx. 1 root root 10 Apr 30 20:51 wwn-0x600062b2004f1b40263d6a3c5c37ba7e -> ../../sdaa
    
    so ran 'zpool import'
    Code:
    # zpool import
       pool: zfs-storage
         id: 13421130254309412809
      state: DEGRADED
     status: One or more devices contains corrupted data.
     action: The pool can be imported despite missing or damaged devices.  The
       fault tolerance of the pool may be compromised if imported.
       see: http://zfsonlinux.org/msg/ZFS-8000-4J
     config:
    
       zfs-storage                             DEGRADED
         raidz2-0                              DEGRADED
           scsi-SATA_ST32000542AS_5XW0FMNZ     UNAVAIL
           scsi-SATA_ST2000DM001-1CH_Z240V8Y5  UNAVAIL
           sdx                                 ONLINE
           sdu                                 ONLINE
           sdt                                 ONLINE
           sdy                                 ONLINE
           sdw                                 ONLINE
           sdv                                 ONLINE
    
    Seems two of the 8 drives are not happy. Note the first listing doesn't show any partitions for them.


    Go to import it.
    Code:
    # zpool import zfs-storage
    cannot import 'zfs-storage': pool was previously in use from another system.
    Last accessed by <unknown> (hostid=0) at Thu Jul  3 22:13:53 2014
    The pool can be imported, use 'zpool import -f' to import the pool.
    
    Ok, pool not from this server, so force it.

    Code:
    # zpool import zfs-storage -f
    cannot import 'zfs-storage': I/O error
       Recovery is possible, but will result in some data loss.
       Returning the pool to its state as of Thu 03 Jul 2014 22:13:46 AEST
       should correct the problem.  Approximately 7 seconds of data
       must be discarded, irreversibly.  Recovery can be attempted
       by executing 'zpool import -F zfs-storage'.  A scrub of the pool
       is strongly recommended after recovery.
    

    Ok, must force and FIX it, and lose 7 seconds of data from 6 years ago.

    Code:
    # zpool import -f -F zfs-storage
    #
    
    Ok. scrub it as requested.

    Code:
    # zpool scrub zfs-storage
    # zpool status zfs-storage
      pool: zfs-storage
     state: DEGRADED
    status: One or more devices could not be used because the label is missing or
       invalid.  Sufficient replicas exist for the pool to continue
       functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
       see: http://zfsonlinux.org/msg/ZFS-8000-4J
      scan: scrub in progress since Thu Apr 30 21:30:18 2020
       686M scanned at 21.4M/s, 491M issued at 15.3M/s, 692M total
       0B repaired, 70.94% done, 0 days 00:00:13 to go
    config:
    
       NAME                      STATE     READ WRITE CKSUM
       zfs-storage               DEGRADED     0     0     0
         raidz2-0                DEGRADED     0     0     0
           3831973345871989454   UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-SATA_ST32000542AS_5XW0FMNZ-part1
           13160028098390289446  UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z240V8Y5-part1
           sdx                   ONLINE       0     0     0
           sdu                   ONLINE       0     0     0
           sdt                   ONLINE       0     0     0
           sdy                   ONLINE       0     0     0
           sdw                   ONLINE       0     0     0
           sdv                   ONLINE       0     0     0
    
    errors: No known data errors
    
    Code:
    # zpool status zfs-storage
      pool: zfs-storage
     state: DEGRADED
    status: One or more devices could not be used because the label is missing or
       invalid.  Sufficient replicas exist for the pool to continue
       functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
       see: http://zfsonlinux.org/msg/ZFS-8000-4J
      scan: scrub repaired 0B in 0 days 00:00:50 with 0 errors on Thu Apr 30 21:31:08 2020
    config:
    
       NAME                      STATE     READ WRITE CKSUM
       zfs-storage               DEGRADED     0     0     0
         raidz2-0                DEGRADED     0     0     0
           3831973345871989454   UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-SATA_ST32000542AS_5XW0FMNZ-part1
           13160028098390289446  UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z240V8Y5-part1
           sdx                   ONLINE       0     0     0
           sdu                   ONLINE       0     0     0
           sdt                   ONLINE       0     0     0
           sdy                   ONLINE       0     0     0
           sdw                   ONLINE       0     0     0
           sdv                   ONLINE       0     0     0
    
    errors: No known data errors
    
    
    scrub finished quickly.
    turns out the pool is empty.
    no data lost. lol no data to lose.

    Code:
    # ll /zfs-storage -a
    total 22
    drwxr-xr-x.  2 root root    2 May 22  2014 .
    dr-xr-xr-x. 22 root root 4096 Apr 30 21:29 ..
    
    was hoping to get some additional space, but with two dead drives, that's a bust already.

    Posted anyway to show the process for importing a pool of disks. Not once did I need to specify device names; it was all autodetected. I had to force the import as it was a pool from a foreign system (one of my previous server builds), and force a fix to import a degraded pool. All nicely documented by the output of the commands, and you're at no stage left wondering what to do, or what your actions will do.
     
    Xenomorph, ae00711 and elvis like this.
  8. timey_timeless

    timey_timeless Member

    Joined:
    Oct 14, 2007
    Messages:
    41
    out of interest

    that's a raidz2 volume with 2 disks missing
    therefore it will function in a degraded state and no data is yet lost.

    if you scrub the data, my understanding is you will not be able to repair anything because there is no parity available (and correct me if this is not correct). but will a scrub even be able to tell you that data is corrupt without any available parity? or am i getting the roles of parity and metadata confused?
     
  9. cvidler

    cvidler Member

    Joined:
    Jun 29, 2001
    Messages:
    14,610
    Location:
    Canberra
    yeah parity and metadata are two different parts.

    The parity is missing (strictly, since parity and data are striped together, we have both parity AND data missing, but enough of each remains for no loss).
    The metadata - in this case block checksums - is enough for a ZFS scrub to tell that no data is lost. If there were, it wouldn't be able to fix it, however.

    Were I to find some more 2TB spinners, I'd 'zpool replace' the dead two, which kicks off what ZFS calls a 'resilver' - rebuilding the missing two disks from the remaining data and parity - and then run 'zpool scrub zfs-storage' again to verify.
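
    Roughly (the new disk paths are placeholders; the long numbers are the GUIDs zpool status prints for the missing disks):

    Code:
    zpool replace zfs-storage 3831973345871989454 /dev/disk/by-id/<new-2TB-disk>
    zpool replace zfs-storage 13160028098390289446 /dev/disk/by-id/<another-new-2TB-disk>
    zpool status zfs-storage    # watch the resilver progress
    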
     
  10. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    You know what would let you rebuild the data without extra disks? BtrFS. That's what.

    Assuming there was enough free space.
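
    A minimal sketch of that recovery, assuming a BtrFS RAID1 array of at least three devices with enough unallocated space left on the survivors (paths are placeholders):

    Code:
    mount -o degraded /dev/sdb /mnt       # mount with one device missing
    btrfs device remove missing /mnt      # re-replicate the lost copies onto the surviving disks
    btrfs filesystem usage /mnt           # confirm everything is back to two copies
    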
     
  11. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    1,007
    Partially correct as far as data goes, because you're effectively at the equivalent of RAID zero: if you strike a bad block you've lost that stripe, as you've gone below minimum redundancy, so you've lost whatever file has the issue. Meanwhile, everything else on there will be 100% OK because .......

    Yes, because in addition to parity, each and every block in the volume is checksummed, whether or not you have parity or mirrors. So even if you're scrubbing your remaining six drives out of the eight-drive RAID-Z2 array with zero parity left, if all six checksums are valid for each relevant block then the entire stripe is still OK.

    This is different to hardware RAID 5 down one drive or RAID 6 down two drives, because there you can't tell whether the blocks are valid or not. Take, for example, a block that is on its way to going bad and is currently returning corrupted data during reads (unlikely, I know, due to internal block ECC, but assume it for the purposes of my argument): ZFS would detect this because the individual block checksum either matches or it doesn't, whereas hardware RAID wouldn't have a clue.

    Metadata is different: with the default setting of copies=1 (N) there is one copy of the data, but metadata is stored at N+1, so for a setting of one there are at least two copies of the metadata. In addition, if the zpool consists of more than one disk, the two copies are also stored on different disks.
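
    The user-visible knob for the data side of that is the "copies" property (a sketch; the dataset name is a placeholder):

    Code:
    zfs set copies=2 tank/important    # keep two copies of every data block in this dataset
    zfs get copies tank/important
    # metadata then gets one more copy again (ditto blocks), spread across
    # different disks where the pool has more than one
    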

    These reasons are why ZFS and probably also BtrFS are way better filesystems than anything else. Microsoft claims ReFS is up there with those two, but since they haven't released any information about how ReFS works internally, I'm a bit dubious about their claim.
     
  12. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    Seeing metadata checksums making it into a few file systems now, which is better than before. Still not the complete data checksumming of ZFS/BtrFS, but better than nothing.

    File systems include:
    * xfs
    * ext4
    * f2fs
    * apfs (covered already in this thread).
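
    For the Linux ones, it's roughly a mkfs option (xfs has had metadata CRCs on by default for a while, ext4's is default in newer e2fsprogs, and the f2fs feature name here is from memory, so check your man pages):

    Code:
    mkfs.xfs -m crc=1 /dev/sdX1
    mkfs.ext4 -O metadata_csum /dev/sdX1
    mkfs.f2fs -O extra_attr,inode_checksum /dev/sdX1
    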
     
  13. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    Work continues on both WinBtrFS and Quibble, an open source bootloader for Windows XP through to 10 (1909 tested and working).

    It's now at a point where you can convert an existing NTFS installed Windows to BtrFS and boot from it. Still highly experimental, and the author warns "don't use this for anything serious".

    https://github.com/maharmstone/btrfs

    https://github.com/maharmstone/quibble
     
    Last edited: May 21, 2020
    ae00711 and MUTMAN like this.
  14. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    Linux kernel 5.4 grants BtrFS the ability to detect memory bit flips without ECC RAM.

    And... my home NAS detected one! Time to upgrade the old clunker to ECC. :)

    (That's the only non-ECC NAS I have access to, so nice to see the feature in action).
     
    ae00711 likes this.
  15. cvidler

    cvidler Member

    Joined:
    Jun 29, 2001
    Messages:
    14,610
    Location:
    Canberra
    howzit do that?

    btrfs does its own ECC on its memory?
     
  16. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    I don't actually know the exact details of it. 5.4 RC1 added a "strict check" to the tree-checker, which apparently can now detect a corruption, but isn't able to fix it.

    https://kernelnewbies.org/Linux_5.4#File_systems

    [edit]

    Well, it was certainly a crystal ball event. As I was typing this message, my NAS blew up, and didn't come back. Swapped it out to another system, and we're off and running again!

    Still no ECC in this one, so it's time to go shopping.
     
  17. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    It's in the 5.4 kernel and the matching BtrFS tools, specifically the scrub component. Both of these are in 20.04 LTS.

    To clarify, it can only detect certain errors after the fact, not during, and it's not foolproof. It's not a replacement for ECC, but it will better identify the results of bit flips with the new strict checker.

    I've said for some time that I think ECC will become a standard in more devices, including low end systems. As speed and number of cores increase, I think manufacturers will be forced to do so, just like we're forced to use checksumming file systems on large drives now.
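
    The day-to-day usage is just the normal scrub tooling (mount point is a placeholder):

    Code:
    btrfs scrub start /data       # kick off a scrub; add -B to keep it in the foreground
    btrfs scrub status /data      # progress, plus any checksum errors found so far
    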
     
    Last edited: May 26, 2020
    ae00711 likes this.
  18. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    Thread is 4 years old, how exciting.

    Some new advice on BtrFS RAID5/6, which has been lagging IMHO, but changes are happening slowly. Still not a great option, and still not recommended for "data you really care about".

    My recommendations remain that BtrFS users should stick to RAID1. It's quite different to traditional RAID1 - more like a clustered file system, where 2 (or more if you wish) copies of every bit of data get made and placed on physically separate devices - and it can scale happily over a large collection of mismatched drives. Particularly for home users on hodge-podge hardware who don't want to purchase multiple expensive drives all at the same time, it's brilliant, and it allows you to scale over time as you replace physical disks. I'm on the "same" BtrFS RAID1 setup I've been on for years, despite three generations of disk changes thanks to hot/online migrations, which is a bit of this sort of thing: https://en.wikipedia.org/wiki/Ship_of_Theseus

    But I digress, on to the late 2020 BtrFS RAID5/6 recommendations:

    https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/

    In summary:

    * Don't use RAID5/6 on metadata. Do so only on data. Use RAID1 (preferably the new RAID1C3 or 1C4 that creates more than 2 copies) on metadata.

    Metadata is tiny compared to data. On my baby home NAS in RAID1:

    Code:
    # btrfs filesystem usage /data
    Overall:
        Device size:                  14.55TiB
        Device allocated:             13.24TiB
        Device unallocated:            1.32TiB
        Device missing:                  0.00B
        Used:                         12.72TiB
        Free (estimated):            926.35GiB      (min: 926.35GiB)
        Data ratio:                       2.00
        Metadata ratio:                   2.00
        Global reserve:              512.00MiB      (used: 0.00B)
    Data,RAID1: Size:6.59TiB, Used:6.34TiB (96.26%)
       /dev/sdb        6.59TiB
       /dev/sda        6.59TiB
    Metadata,RAID1: Size:34.00GiB, Used:23.33GiB (68.63%)
       /dev/sdb       34.00GiB
       /dev/sda       34.00GiB
    System,RAID1: Size:32.00MiB, Used:960.00KiB (2.93%)
       /dev/sdb       32.00MiB
       /dev/sda       32.00MiB
    Unallocated:
       /dev/sdb      674.00GiB
       /dev/sda      674.00GiB
    
    Just shy of 8TB of usable disk space, sitting at 6.4TB used. Metadata takes up around 24GB, or about 0.4% of total used space. It could be more if you have *lots* of small files, but it's still not a problem to store as RAID1C4.
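
    Converting existing metadata over is roughly this (a sketch; assumes a kernel and btrfs-progs new enough, 5.5+, for the raid1c3/raid1c4 profiles, and /data is a placeholder):

    Code:
    btrfs balance start -mconvert=raid1c3 /data    # rewrite metadata chunks with 3 copies
    btrfs filesystem usage /data                   # confirm Metadata,RAID1C3 shows up
    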

    * Scrub often. They don't really say what "often" is. But it sounds like something that should be aimed for at least once a week. However read on for the caveat...

    * Scrub devices one by one. Rather than scrubbing a volume (which scrubs all disks in parallel), sequentially scrub each device. Use the "-B" flag to "not background" (negative options are so stupid), i.e. foreground the task, then move on to the next device once it finishes. Note that this will obviously take *much* longer compared to RAID1/10 if you have a lot of disks - a big consideration for anyone hosting lots of drives. (There's a sketch of this, and of the replace flow, after this list.)

    * Don't fix failed disks with "btrfs device remove". You need to use "replace" instead. That also means not adding your spare disks to your array (either leave them connected to the system unused if you want them "hot", or keep them on the shelf if you want them "cold"). By comparison, other RAID modes allow you to remove a disk and balance (assuming you have the space), which is a far easier way to recover from failure. They also note that this process pretty much renders the system unusable during recovery, which sucks.

    * During a RAID5/6 recovery, incorrect error messages can sometimes be printed to dmesg. Ugh.

    * RAID5 write hole still exists.
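
    The sketch promised above (device paths, devid and mount point are placeholders; check 'btrfs filesystem show' for your real devids):

    Code:
    # scrub one device at a time; -B keeps each scrub in the foreground so they run back to back
    for dev in /dev/sdb /dev/sdc /dev/sdd; do
        btrfs scrub start -B "$dev"
    done

    # replace a failed disk in place rather than removing it
    btrfs replace start 3 /dev/sde /data    # 3 = devid of the dead disk, /dev/sde = the new one
    btrfs replace status /data
    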
     
    Last edited: Oct 8, 2020
    MUTMAN and GumbyNoTalent like this.
  19. cvidler

    cvidler Member

    Joined:
    Jun 29, 2001
    Messages:
    14,610
    Location:
    Canberra
    So reading that, it still says to me: don't use RAID5/6 on BTRFS. Too many caveats to remember - especially at a time when you're trying to recover your array.


    On the topic of updates: OpenZFS is almost at version 2 (it's at 2.0 RC3 now).
    They're skipping v1 for some reason, going straight from ZFS on Linux 0.8.x to OpenZFS 2.0.

    The name change reflects the fact that they're supporting more than just Linux now, with BSD support in and Mac support planned for the future.

    Ubuntu now even supports booting from, and having root on, ZFS volumes. Pretty good to have major distro support in that regard.

    The only thing really holding it back is the license incompatibility, CDDL vs. GPLv2, so it'll likely never be mainlined.
     
  20. OP
    OP
    elvis

    elvis Old school old fool

    Joined:
    Jun 27, 2001
    Messages:
    43,457
    Location:
    Brisbane
    Certainly if you've got big/expensive/important things on there.

    If you're just running a home NAS for your "acquired" media, then meh. The big advantage of BtrFS is that you can run it over mismatching devices and better use the space available, as well as grow+rebalance the array a single disk at a time. Something ZFS can't do (and likely won't due to design considerations).
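
    Adding a mismatched disk and growing into it is about as simple as it gets (paths are placeholders; a full balance can take a long time on big arrays):

    Code:
    btrfs device add /dev/sdc /data
    btrfs balance start /data     # redistribute existing chunks across the new disk
    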

    If you're scrubbing frequently, you can get this sort of info:

    Code:
    # /bin/btrfs device stats /data
    [/dev/sdb].write_io_errs    0
    [/dev/sdb].read_io_errs     0
    [/dev/sdb].flush_io_errs    0
    [/dev/sdb].corruption_errs  0
    [/dev/sdb].generation_errs  0
    [/dev/sda].write_io_errs    0
    [/dev/sda].read_io_errs     0
    [/dev/sda].flush_io_errs    0
    [/dev/sda].corruption_errs  0
    [/dev/sda].generation_errs  0
    
    If you see one of those numbers go up, it's time to replace a device before it fails. Combine that with the "5 SMART values that matter", and you've got a good indication of when a disk is on the way out:
    https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
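
    A quick way to pull just those attributes (5, 187, 188, 197 and 198 are the ones that article highlights; the device path is a placeholder):

    Code:
    smartctl -A /dev/sdb | grep -E '^ *(5|187|188|197|198) '
    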

    I've run ZFS on Mac before. Is this something different? Or just "more official"?

    Lord knows macOS needs some decent options. APFS is rubbish, and most consumer NAS devices targeting Mac are utterly shithouse.
     
    Last edited: Oct 8, 2020
