Increase storage size and move from mdraid+LVM to ZFS - feedback and comments please

Discussion in 'Storage & Backup' started by daehenoc, Aug 9, 2021.

  1. daehenoc

    daehenoc Member

    Joined:
    Nov 4, 2005
    Messages:
    2,879
    Location:
    Mt Gravatt E, BNE, QLD
    Hi all,

    So I have 3x 5TB 2.5" drives and a 4TB 2.5" drive in my shiny Streacom case, which gives me about 9TB of RAID5 with mdraid. I use LVM on top of the md device, and LVM on the 4TB drive as backup for a couple of Apple laptops. I also use about 2.5TB of space off the 4TB as overflow storage, since my RAID5 is pretty much full.

    I've got a few shares I put ISOs on (yes, actual operating system ISO files), media, torrents and MythTV recordings. The primary purpose is to provide the MythTV files to the three MythTV frontends. Secondary is messing around with torrents.

    I've been thinking about how to move to more storage, and I think I'm going to move to 3x 12TB 3.5" drives with ZFS and a 1TB SSD to provide L2ARC (christ it takes a long time to create a torrent on the mdraid sometimes). I'm not fussed about write performance (all my files should live somewhere in the ZFS pool, MPEG2 streams aren't huge amounts of data, torrents get written relatively slowly), so it's mainly the read performance I want to be 'better' than my current RAID5 speeds (about 40MB/s).

    A few questions:
    • I know that ZFS is not a straight equivalent to RAID, but does this sound like a reasonable approach?
    • I should get about 24TB of usable storage, right?
    • Can I replace the 12TB drives with 18TB drives in the future, one at a time, to build out to 36TB of usable storage?
    • If I have 32GB of RAM, should I bother with an SSD for L2ARC?

    PC Specs:
    References:
     
  2. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    1,096
    Yes, three 12TB drives in ZFS RAID-Z1 (the RAID-5 equivalent) will give you a net amount of 24TB of storage (or whatever that works out to be in TiB). And yes, you can replace a drive with an 18TB one, resilver, and repeat for the other two drives to get the extra storage, as long as you set autoexpand=on so the pool uses the extra space.
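    Something like this, roughly - 'tank' and the by-id paths are just placeholders for your pool name and drives (check zpool status for the real ones):
    Code:
    # let the pool grow into the bigger drives once all members have been swapped
    zpool set autoexpand=on tank
    # swap one 12TB member for the new 18TB drive and wait for the resilver
    zpool replace tank /dev/disk/by-id/ata-OLD_12TB /dev/disk/by-id/ata-NEW_18TB
    # watch progress, then repeat for the other two drives
    zpool status tank
    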

    "I know that ZFS is not a straight equivalent to RAID, but does this sounds like a reasonable approach?" yes, its actually better because ZFS is COW (copy on write) with no Raid-5 write hole and also each block is checksummed.

    If you're just running ZFS on top of an OS: I've run a ten-drive ZFS RAID-Z2 with just 4GB of RAM with no problems. You only need more if you're going to be running stuff on top of the server, like VMs or whatever.

    If you're just streaming stuff then ZFS will notice that you are accessing data sequentially and will intelligently read ahead and buffer for you automatically. I'm therefore guessing you probably don't need an L2ARC, but I'd suggest getting someone else's opinion on that.

    For storing bulk data I usually set recordsize=1M, compression=on, and casesensitivity=mixed to accommodate both Windows and Linux clients. I just use the default compression, as I don't want to bog the box down with something CPU-intensive like gzip-9; anything easily compressible gets compressed, and if not, nothing happens and it's stored directly.
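    For example, roughly how I'd create such a dataset ('tank/media' is just a placeholder name, and note casesensitivity can only be set at creation time):
    Code:
    # bulk-storage dataset: large records, default compression, mixed case for SMB clients
    zfs create -o recordsize=1M -o compression=on -o casesensitivity=mixed tank/media
    # confirm the properties took
    zfs get recordsize,compression,casesensitivity tank/media
    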
     
    Last edited: Aug 10, 2021
    davros123 and daehenoc like this.
  3. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,393
    Location:
    Brisbane
    1. Sure.
    2. Thereabouts.
    3. Yes... kinda. Rebalancing data across drives can be a challenge when you add additional drives, but this isn't as big an issue if you're only changing the size of member drives.
    4. L2ARC is pointless in your use case. L2ARC can improve cached reads for hot blocks. A home media server doesn't really have hot blocks - and even if it did, what's the point? The files will spill over the cache size anyway. FWIW I ran 2TB of L2ARC across 28TB of VMs/low-traffic files for years, and saw less than a 5% hit rate (vs 30-35% on the ARC in RAM).

    If you were going to do anything, you'd do a SLOG to manage the ZIL - but that's a fucken nightmare to get right (fundamentally you want unbuffered 4K IO, but it has to be mirrored - really this is 100% an Intel Optane use case), and again, it's largely pointless for media use cases. If you start asking VM or database workloads of it, you need this.
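    If you ever did go down that path, the sketch is just a mirrored log vdev - the pool and device names below are placeholders, and it's only worth doing for sync-heavy workloads:
    Code:
    # add a mirrored SLOG so sync writes (the ZIL) land on fast SSDs instead of the raidz
    zpool add tank log mirror /dev/disk/by-id/nvme-SLOG_A /dev/disk/by-id/nvme-SLOG_B
    # the log vdev shows up separately from the data vdevs
    zpool status tank
    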

    Adding to HobartTas - I'd be setting atime=off as well. It's a bunch of fuck-off tiny writes that 99% of people just don't care about.
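    e.g. (pool name is a placeholder; child datasets inherit the setting):
    Code:
    # stop every read from generating an access-time metadata write
    zfs set atime=off tank
    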
     
    Last edited: Aug 11, 2021
    davros123 and daehenoc like this.
  4. OP
    OP
    daehenoc

    daehenoc Member

    Joined:
    Nov 4, 2005
    Messages:
    2,879
    Location:
    Mt Gravatt E, BNE, QLD
    Awesome, thanks! Is it very similar to: detach a 12TB drive from the array, power down, pull that drive (make sure you take the correct one!), install and connect the 18TB drive, power on, attach the new drive to the array, resilver, waaaiiiiittt... then do it twice more to replace the other 12TB drives?

    Mainly I want the potential speed increase for files that are (or should be) stored in ARC or L2ARC when creating a torrent file (i.e. checksumming 5-30GB of files).

    Cool, thanks for the information - would having an L2ARC make a difference to creating a torrent file (i.e. checksumming a bunch of files)? Sure, I may get a very low hit rate on the L2ARC, but I'm happy with a low hit rate vs not having to wait 5-6 minutes for the checksum to be done... And I've got a 960GB SSD lying here doing nothing, so I figured I could put that in as L2ARC.
    Of course I could be misunderstanding how the L2ARC works. I thought it would act like a FIFO buffer for writes to the HDDs, and as a cache for reads - i.e. if I'm chucking a 1GB file in, say off an external USB-C drive, that file would land on the L2ARC first, lickety-split, then be written from the L2ARC to the HDDs. I don't know if that file, once written to the HDDs, stays on the L2ARC - that is what I imagined would happen, so that if I come back a few hours later (and there haven't been 959GB of files written to the L2ARC) and read that file, it would be delivered to me straight from the L2ARC - is that right?
    (Hmm, 'hot blocks' may be blocks that are often accessed, not recently accessed...?)

    Thanks for the config tips. There are going to be f-all compressible files (should I set compression=off if all I care about is speed?) and I will be accessing these files from Linux and Windows. Obvs tiny writes won't feature much at all :thumbup:
     
  5. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,393
    Location:
    Brisbane
    https://www.dlford.io/linux-zfs-raid-disk-replacement-procedure/

    The ARC/L2ARC is a read cache, which is why it doesn't do writes - it's right there on the tin. Unsync writes are stored in RAM, then destaged to disk in a transaction group. Sync writes are either committed to disk directly or written to a SLOG, then destaged to the pool when it can (there is a logic to it, but think of it like this).

    (here is a good description how it works - https://arstechnica.com/information...-understanding-zfs-storage-and-performance/3/)

    In your example, the file will land either in RAM, ready to be written, or on disk, already written. There is no write caching beyond RAM for unsync writes.

    You might be confused about what copy-on-write means, then. It's not a guarantee that data sent to a ZFS filesystem is written - it's a guarantee that if ZFS says it was written, it was. And it only commits when the block is 100% complete, meaning you don't get filesystem corruption in the middle of a write.

    L2ARC *used* to be volatile across reboots - but I think there was an update that made it persistent; don't quote me on this. Forcing blocks into L2ARC (or even the ARC) is only ever done by using them - this is not tiering, it's a cache.

    Many compression algorithms are basically free with reasonably modern CPUs - and will ultimately result in slightly increased performance, given the disks are probably the slowest part of your storage solution.

    Dedupe is not free - do not do this. ZFS's dedupe is... average, at best.

    Overall, realise that the key focus of ZFS is a modern filesystem that places a priority on data integrity and scale, not performance. Even then, ZFS's architecture was settled just towards the end of the single, monolithic file-server era - and then hampered mid-life by the Sun Microsystems collapse and the 2+ years it spent in the wilderness waiting for Oracle to work out what the fuck they were going to do with it. Then it fractured - mostly due to its licensing scheme and, again, Oracle's lack of care/fucks for FOSS/GPL/Linux.

    Other next-gen filesystems and modules (e.g. bcache and btrfs) may give you a superior result - but fundamentally you're working with fuck-all disk members, so the performance benefits any filesystem can give you are minimal.
     
    Last edited: Aug 11, 2021
    daehenoc and davros123 like this.
  6. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    1,096
    Yes, it should be that straightforward. Don't forget that ZFS has had sequential resilver for quite some time, so the resilver should run at roughly whatever sequential speed the drives manage.

    There's an easy fix for this: just replace your 4TB 2.5" drive, which is probably an SMR one anyway, with an SSD like a Samsung QVO, use that as your work area for everything, and use the RAID-Z for media and other static data. QVOs always maintain full read speeds, and write speeds only drop to about 160 MB/s for QVOs 2TB and larger once you've filled the fairly large SLC cache. The SLC cache is also continuously emptying at that 160 MB/s, writing out to the quad-level cells, so it's probably unlikely you'd ever notice a slowdown.

    I'd turn compression on no matter what, because even the weakest compression won't have a noticeable impact on the CPU, but what it will do is shrink the data and fit it into the smallest number of stripes needed. I don't know why people think you need heavy-duty hardware to run ZFS. I picked up several of those eBay Xeon E5-2670 v1 CPUs, and since my 10-drive RAID-Z2 only has GbE connected to it, I used the file manager on the server itself to test speeds: I picked a couple of hundred GB of large files in one directory and copied it to another using that CPU. The transfer ran at around 460 MB/s - that's simultaneously reading 460 MB/s and writing 460 MB/s with default compression, plus RAID-6-style parity calculations for the RAID-Z2, plus checksums and all the other housekeeping needed to keep the data consistent. Scrubs went at 1.3 GB/s across those ten drives, whereas my previous CPU, a 4820K (faster clocks but half the cores of the Xeon), only managed scrubs of about 1.0 GB/s; even when I was one drive down due to hardware failure, a scrub that then had to reconstruct the missing data still went at about 950 MB/s. It's a very impressive filesystem!
     
    daehenoc likes this.
  7. gea

    gea Member

    Joined:
    May 22, 2011
    Messages:
    222
    There are only a few rules I would always care about with ZFS:

    - Disks: never use SMR disks
    - Disk replace: if possible, never remove a disk from a vdev, as this reduces redundancy.
    Add a new disk (hot, where possible), start a replace, and when finished, (hot) remove the old disk.
    - Compress: enable LZ4 in nearly all cases
    - Dedup: ZFS dedup is realtime dedup. If you cannot hold the dedup table in RAM,
    a snap destroy or pool import can last ages. Calculate 1-5 GB RAM per TB of dedup data.
    You need the dedup RAM in addition to OS needs (e.g. Solaris/OmniOS 2-4 GB),
    write cache (up to 4GB, affects write performance) and ARC (around 80% of free RAM, affects read performance).
    - L2ARC: mainly good for many users and volatile small files (mailserver, office use).
    Current Open-ZFS has persistent L2ARC. Max L2ARC is around 5x RAM (it needs RAM to organise).
    - For databases and VM storage you must enable sync write or you are in danger of corrupt data/guest filesystems;
    then add an Slog for performance, e.g. Intel DC S3700 SATA, WD SS530 SAS SSD or Intel Optane (NVMe)

    more https://napp-it.org/doc/downloads/zfs_design_principles.pdf
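    A minimal sketch of a few of those rules - pool/dataset names are placeholders, adjust to your own layout:
    Code:
    # LZ4 compression in nearly all cases (inherited by child filesystems)
    zfs set compression=lz4 tank
    # VM / database filesystem: smaller recsize and forced sync writes
    zfs create -o recordsize=64k -o sync=always tank/vms
    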
     
  8. luke o

    luke o Member

    Joined:
    Jun 15, 2003
    Messages:
    3,739
    Location:
    WA
    Considered FreeNAS (TrueNAS now)?

    ECC RAM is recommended for any ZFS box, at least for any data you care about. If it's just movies/shows then meh. If it's something like documents that you care about then go ECC. I did.

    My NAS is in a Fractal case with 6x 10TB shucked WD drives, a Supermicro board, a low-power 4-core Xeon and 32GB of ECC RAM. It has never missed a beat and I'm very happy. The Xeon allows for encrypted disks, with hardware-accelerated encryption supported. Very neat!
     
  9. gea

    gea Member

    Joined:
    May 22, 2011
    Messages:
    222
    Any server is recommended to use ECC. With ZFS, which cares about all possible ways to lose data, ECC is recommended to avoid data corruption due to RAM errors, as even ZFS cannot protect against those without ECC.

    The 'scrub of death' for pools without ECC is just that - a myth. When ZFS detects too many checksum errors, it takes disks offline instead of repairing them to death.
     
  10. OP
    OP
    daehenoc

    daehenoc Member

    Joined:
    Nov 4, 2005
    Messages:
    2,879
    Location:
    Mt Gravatt E, BNE, QLD
    Thanks very much for all the replies! Working through them in reverse order:
    • ECC RAM: I'd love to have ECC RAM, but the motherboard I have doesn't support ECC, so that means a replacement motherboard and RAM - and I haven't even got the HDDs yet!

    • Free/TrueNAS: I'm aiming for pretty simple, so having the HDDs inside the MythTV/torrent box is my preference. Surely transfer rates to on-board HDDs are going to be tons quicker than pushing data over the network? (There are two 1GbE NICs on the motherboard, I could team them, but then I'd need to get an enclosure/PC to turn into a NAS.)

    • Disks: SMR disks need some 'idle' time to reorganise the 'shingled' writes, which can impact write times and make the drive 'a lot' 'slower' if there is no idle time for that reorganisation. Is this a concern with ZFS? I don't know if there is any 'idle' time for the drives if they're in a RAID-Z, but if I have an L2ARC, won't there be 'some' idle time?
    • Dedup: I've implemented dedup on NetApp appliances in the past. Dedup is hard to do :rolleyes:
    • Compression: looks like LZ4 is the way to go, thanks for the advice!
    • Do you mean that I store the files on the QVO, work on them there, then manually move them onto the RAID-Z? I'd like to avoid having to move files around between drives.
    So I'm not too fussed about the copy-on-write stuff - I mean, I love the reliability, but what I'm really after is the potential speed boost from having 'recently accessed files' in ARC or L2ARC. I understand that they are replacement cache systems, and this part of the Wikipedia entry on ZFS seems to say that the read capabilities of ARC or L2ARC will do what I want. I understand that these cache writes to the HDDs, but once those files are written, the data should remain in the cache... right? Here's an example:
    1. I pull down 10x 1Gb files and they're written to the RAID-Z over an hour, traversing the ARC and L2ARC as they are written,
    2. I want to create a torrent based on these files, I tell my torrent software to create a torrent, which does a checksum of those ten files.
    Without ZFS, reading directly off an HDD, there's a bit of drive cache which might serve some of that data, but most of it has to come off the actual HDD media. With ZFS in place, I'm hoping that the data I wrote over that hour will come straight from the ARC or L2ARC, greatly speeding up the checksum step. Unless there's something I'm not understanding about how the ARC or L2ARC works! (It only caches blocks, which doesn't translate into files?)
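    If I've understood the tooling right (OpenZFS ships arcstat and arc_summary), I can check rather than guess - watch the hit rates while the torrent client is hashing, something like:
    Code:
    # print ARC/L2ARC hit-rate stats every 5 seconds while the hashing runs
    arcstat 5
    # or a one-shot summary of ARC and L2ARC sizes and hit ratios
    arc_summary
    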
     
    GumbyNoTalent likes this.
  11. xeghia

    xeghia Member

    Joined:
    Nov 7, 2003
    Messages:
    27
    Location:
    Gold Coast
    I think everyone has pretty much covered the big ones: 1MB recordsize, use LZ4, atime off, L2ARC never being as 'useful' as people think it will be, and dedupe being of limited/niche use, broadly speaking. Depending on a few things ZFS really hates torrents, but these days it's not so bad, as torrents are chunked up into 4-16MB pieces and most clients hold each piece in memory and write it out when complete; with a 1MB recordsize you wouldn't really notice for low-speed streaming.

    The one thing I have not seen mentioned yet is using a special device for metadata (literally called a special device; it will also be used for dedupe if you enable that) - honestly that's going to give you a more noticeable improvement than L2ARC, I think. One thing that always gave me the shits was how it was basically impossible to cache metadata 'permanently', as the ARC/L2ARC would constantly 'lose' the metadata, but special devices solve that. They can also store small files up to a configurable size, but it sounds like this will mostly be ISO and media files, so that might be of limited use really - still, it is a night and day difference having the metadata on an SSD. You can also use partitions to split that up, so partition 1 = special, partition 2 = L2ARC, etc.

    Ideally you also want to mirror that, and you can mix and match configurations, i.e. mirrored SSD special + RAID-Z spinning disks. I added a single SSD as a special for testing and liked it so much I kept it, and just bought and attached a second SSD a few months later (attach will replicate an existing single device and create a mirror) - very happy with special devices. I may be biased though, because I still use rsync to do a weekly backup to a second ZFS pool; I'll figure out snapshots and replication one day, but rsync just works for replicating A to B quickly.
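    For anyone after the rough commands, it was basically the below - the device paths and pool name are placeholders, and zpool will want -f when adding a non-redundant special vdev to a raidz pool:
    Code:
    # add a single SSD as a special vdev for metadata
    zpool add -f tank special /dev/disk/by-id/ata-SSD_A
    # later, attach a second SSD to turn that special vdev into a mirror
    zpool attach tank /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
    # optionally let small blocks land on the SSDs too, per filesystem
    zfs set special_small_blocks=64K tank/media
    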
     
  12. gea

    gea Member

    Joined:
    May 22, 2011
    Messages:
    222
    While modern SMR disks perform quite well in short-running benchmarks or average desktop use due to a large cache, never add them to a (ZFS) raid. Their main problem is long-duration writes, as you see on a raid rebuild/resilver, where write performance drops to a fraction after a few minutes of constant writes.

    https://www.servethehome.com/wd-red-smr-vs-cmr-tested-avoid-red-smr/

    about L2Arc
    L2Arc is now persistent. With volatile hot data (e.g. a mailserver or many small office files) this can make a huge difference, as all hot metadata is in cache; see https://github.com/omniosorg/omnios-build/blob/r151038/doc/ReleaseNotes.md

    about recsize=1M
    This means that all your data is split into ZFS blocks of 1M. This is OK for large files but not so good with many small files, databases or VM storage. In many cases the default of 128k is a good compromise. For VMs I would go down to 32-64k (never go below that, as ZFS becomes very inefficient).

    If you want to use special vdevs, use them in mirrors. While you can remove them when their ashift is the same as the pool vdev ashift, the pool is lost when a special vdev is lost, as it holds critical data just like a regular vdev. This is different to L2Arc and Slog. The main uses for special vdevs are small IO, filesystems with a recsize below a threshold, metadata and dedup data. In special use cases this can boost performance; for average use not so much. https://www.napp-it.org/doc/downloads/special-vdev.pdf

    about rsync vs replication
    Replication is pure data streaming of a snap, so it is much faster than rsync, which is based on a file compare.
    Replication preserves many ZFS properties - even raw encryption. On Solaris/OmniOS you lose the Windows NTFS-like ACLs with rsync, as it does not support the extended ACLs (Windows SID as a ZFS property) - one of the main advantages of the Solaris ZFS/kernel-based SMB server is that it preserves Windows AD ACLs when you restore a filesystem from backup.
    Replication allows versioning of backups based on ZFS snaps (which require only the space of modified data blocks). 1000 replications of unmodified data: no space used.
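    A minimal example of snap-based replication (pool/filesystem names are placeholders):
    Code:
    # initial full replication of a snapshot to the backup pool
    zfs snapshot tank/data@rep1
    zfs send tank/data@rep1 | zfs receive backup/data
    # later runs only stream the blocks modified since the last snap
    zfs snapshot tank/data@rep2
    zfs send -i tank/data@rep1 tank/data@rep2 | zfs receive backup/data
    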
     
    Last edited: Aug 15, 2021
    daehenoc likes this.
  13. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    1,096
    Apparently ZFS can't cope with the disks being non-responsive for that long, which is why SMR drives aren't recommended at all.

    In all likelihood the HDDs will be able to cope with all those IOPS.
     
  14. OP
    OP
    daehenoc

    daehenoc Member

    Joined:
    Nov 4, 2005
    Messages:
    2,879
    Location:
    Mt Gravatt E, BNE, QLD
    You're talking about using the QVO SSD as a non-ZFS member, right? Like this:
    • QVO SSD: a place to store files, create torrents, etc., then move it all to the ZFS array once I'm done with it; mounted on /mnt/working
    • ZFS RAID-Z: 3x 12TB (non-SMR) HDDs, mounted on /pool/thing1
    Very interesting article, thanks!

    I've checked, and my current 2.5" 5TB drives appear not to be SMR drives: ST5000LM000-2AN170

    Can I use different record sizes for different parts of the pool, or is that a pool-level setting? i.e. /pool/hugefiles has recordsize=1MB, but /pool/tiddles has recordsize=128KB
     
  15. jtir

    jtir Member

    Joined:
    Oct 1, 2003
    Messages:
    1,180
    Location:
    Sydney
    When I originally set up an L2ARC years ago, it was mainly useless and burnt a lot of the SSD's write cycles/endurance with almost zero gain (less than a 1% hit rate).

    I explored tuning the L2ARC again, once it became persistent across reboots.

    • With NVMe SSDs, it took such a long time to fill the cache with useful content and warm up the L2ARC that it wasn't effective - resulting in a lot of cache 'misses'.
    Code:
    vfs.zfs.l2arc_write_max: 8388608
    vfs.zfs.l2arc_write_boost: 8388608
    
    By default, the L2ARC write speed is limited to 8 MB/s. I have bumped mine to 40MB/s.

    The current warranty of the Samsung 970 EVO 1TB NVMe is 600 TB written over 5 years.
    Even at 40MB/s I've written 3.65TB over 45 days, so I'll keep an eye on this and perform further tuning if required.

    • Persistent L2ARC will also help reduce the write cycles and 'warm-up' time
    Code:
    vfs.zfs.l2arc.rebuild_enabled: 1
    
    • L2ARC does not cache sequential data by default; the logic at the time was that the L2ARC (early single-device SSDs at 300-500 MB/s transfer) would not compete with the sequential read of 5-10 spindles (150 MB/s per spindle).
    With a single current-generation NVMe pushing 3000-5000 MB/s, I enabled prefetch caching.

    Code:
    vfs.zfs.l2arc_noprefetch: 0
    

    My L2ARC stats from this morning: an average of 30% effective.
    The ARC (in memory) is where you should invest your primary resources - 99% effective.


     
    Last edited: Aug 16, 2021
    xeghia and daehenoc like this.
  16. jtir

    jtir Member

    Joined:
    Oct 1, 2003
    Messages:
    1,180
    Location:
    Sydney
    The other way you can expand the storage zpool is:

    • Purchase another 3x 12TB drives (they can be any size, really) and add them as a second vdev to the existing zpool to grow it by another 24TB usable (assuming a RAID-Z1 layout).


    So a total of 48TB of usable storage.
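    Roughly like this (pool name and device paths are placeholders):
    Code:
    # add a second 3-drive RAID-Z1 vdev; new writes are striped across both vdevs
    zpool add tank raidz1 /dev/disk/by-id/ata-NEW_1 /dev/disk/by-id/ata-NEW_2 /dev/disk/by-id/ata-NEW_3
    # note: existing data is not rebalanced automatically, only new writes spread out
    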
     
    daehenoc likes this.
  17. xeghia

    xeghia Member

    Joined:
    Nov 7, 2003
    Messages:
    27
    Location:
    Gold Coast
    Not entirely accurate: recordsize=1M means records of up to 1M. However, if you modify a 1M record it does mean rewriting that whole 1M even if you only changed one byte, so it really does depend on the use case and desired outcome. In short, small files will only use what they need, but they still have some overhead like checksums and padding; raidz width will also impact that.

    Honestly there is no real downside to recordsize=1M, as you will generally get better compression and faster reads; on the flip side you might be rewriting a lot of data, which I feel is perfectly fine for hard disks but not so great for SSDs, obviously.

    Basically it will be heavily data-dependent and use-case specific. You can always create multiple filesystems for multiple use cases, though, and testing is the only way to know for sure what works best.
     
    daehenoc likes this.
  18. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    1,096
    Correct - there's no point mixing SSDs and HDDs together in a RAID-Z. And if the SSD is your boot OS drive and you can boot ZFS as well, then I'd consider making it ZFS instead of ext4 or whatever. As previously mentioned, a smaller recordsize might be the way to go; the default is 128KB, and as gea said, "For VM I would go down to 32-64k (never go below as ZFS is very inefficient then)."

    You can set it for the pool at creation, but that just becomes the default that's subsequently inherited by ZFS filesystems if it's different from the standard 128KB. When you create the filesystems you can specify each one as you like, so yes: for bulk storage you would probably use 1MB, and for a work area 32/64/128KB as required.
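    For example ('tank' and the filesystem names are placeholders):
    Code:
    # bulk media/ISO storage
    zfs create -o recordsize=1M tank/media
    # work area for torrents and scratch files
    zfs create -o recordsize=128K tank/working
    # check the effective values
    zfs get recordsize tank/media tank/working
    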
     
    daehenoc likes this.
  19. OP
    OP
    daehenoc

    daehenoc Member

    Joined:
    Nov 4, 2005
    Messages:
    2,879
    Location:
    Mt Gravatt E, BNE, QLD
    Cool, thanks for the clarifications!
     
  20. OP
    OP
    daehenoc

    daehenoc Member

    Joined:
    Nov 4, 2005
    Messages:
    2,879
    Location:
    Mt Gravatt E, BNE, QLD
    Well, I just need the funds to buy the drives, and to bump from 16GB to 32GB of RAM :lol:

    Can a RAID-Z be 'built' up from a single drive, similarly to how an mdraid can be built? e.g.:
    1. Create a pool with one disk, drop some data on it,
    2. Add another disk to the pool, and ... stripe it, mirror the first one?
    3. Add a third disk to the pool and make a RAID-Z?
    I know how to do this with mdraid and LVM, is it possible with ZFS?

    Oo, this is a good read, despite being from 2010: https://constantin.glez.de/2010/06/04/a-closer-look-zfs-vdevs-and-performance/
     
    Last edited: Aug 18, 2021
