Solaris, ZFS and 4K Drives - A Success Story (LONG)

Discussion in 'Storage & Backup' started by TrenShadow, Sep 14, 2011.

  1. TrenShadow

    TrenShadow Member

    Joined:
    Sep 29, 2003
    Messages:
    78
    With all the negativity out there about using 4K drives with ZFS, I thought I'd share my experience. For those of you who have a ZFS setup with 4K drives and aren't getting what you expect from it, hopefully this can give you some confidence about fixing it up without giving up and falling back to 512B drives. As we move into >2TB drive territory, 4K sectors become mandatory, so setting ZFS up now to deal with them means you won't have to revisit the issue later on.

    A few years ago I built a Windows Home Server in a CM Stacker, 12 drives of varying sizes in 3 CM 4x3 drive bays. WHS drive extender was great for keeping my data secure, as 2 copies of every file were stored on different drives, and I could add any quantity and size of drives to increase the storage pool. However, as read and write operations are only done at the speed of a single drive, I decided I wanted something different.

    Enter the Norco 4220 group buy last year, and the growing popularity of (Open)Solaris. Even though my data size is currently fairly small (~2TB) I wanted to fill the Norco and change to ZFS on Solaris to realise speed gains, and set up for future expansion.

    I'd read a few posts on issues with 4K drives, WD Green in particular with the idle/TLER issues as well, but with my limited budget and (in hindsight) inadequate understanding of the issues, I went ahead and set up my Solaris Express 11 Norco box with a bunch of existing drives, augmented with some new WD Green drives. Initial configuration was 4xraidz1 vdevs of 5 drives each - 5x15EARS, 5x808AAKS, 5x320gb (mix of brands) and 5x200gb (mixed). 2x250gb in a mirror provided the system rpool, leaving 2 slots in the Norco for SSD ZIL/L2ARC/hotspare/backups.

    Migrating from WHS was a fairly painful (slow) process as I was reusing the drives from WHS in the new build, so I had to detach each drive from WHS, let it re-balance, create an initial 5 disk raidz1 vdev, migrate data, then add new vdevs to the new pool as disks became available. Transfer speeds were slow, which I now put down to a combination of 4K issues and trying to do everything on a single server, which entailed first moving WHS into a virtualised VBox machine running on Solaris.

    Finally all the data was migrated and I had a running ZFS pool with sharesmb properties set on the ZFS filesystems so my Windows desktop and laptops could access them. My wifi setup doesn't seem to be working all that well so transfer speeds over wifi are horrible (next on my list to fix), but of greater concern was that transfers with my desktop (1Gb ethernet) and even internal speeds on the server were well below what I was expecting. Typical maximum write speeds were ~100MB/s, with ~2MB/s being quite common. Read wasn't as bad, but still below what I'd expect - a full scrub of 2.71TB would take around 4 hours (~195MB/s average, sometimes running twice that, sometimes at ~30MB/s).

    I put up with this for about a year (work got in the way), but a couple of weeks ago I'd finally had enough. Hunting on Google turned up a lot of good (but some conflicting) information on what exactly the problems are with 4K drives. It became apparent that a lot of people were speculating and just saying "stay away from 4K drives" and only a few were actually testing and investigating. At this point I'd like to throw a huge shout out to sub.mesa for all his work testing the problems and working out the solutions/workarounds; as you'll see below my experience has shown him to be spot on.

    The first issue with the WD Green drives is whether TLER should be on or off, and whether the drives even have the ability to change the setting. I read many people saying it's gotta be on, many saying it's gotta be off, and a few saying it doesn't really matter for a home environment. I decided I didn't care either way, which turned out to be a helpful decision as the 10EARS and 15EARS drives that I now have are unable to change the setting (I don't even know what the drives are set to).

    Next up is the idle timeout/head parking every 8 seconds. I don't think this will affect read/write speeds (except maybe requiring spin-up time at the start if the drive hasn't been accessed recently), but the high number of load cycles appears to be a concern for the longevity of the drives. I can't easily check the current stats, but when I was rebuilding my pool (below) using BSD I ran SMART checks on all drives and found that the load cycle count was roughly an order of magnitude higher than the power-on hours, which is definitely of concern. This problem turned out to be easy to fix for the drives that I have. I found an ISO of FreeDOS with wdidle3, booted to DOS with the BIOS set to IDE mode, then plugged each drive in and ran "C:\>wdidle3 /s0". The wdidle3 usage suggests that "C:\>wdidle3 /d" should disable the timer, but I found it set the timer to 3700 seconds for the 15EARS drives and to 11 seconds for the 10EARS drives, whereas "/s0" returned the message "timer disabled" for both models.
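
    For anyone wanting to check their own drives, the head park count is SMART attribute 193 (Load_Cycle_Count), which you can compare against attribute 9 (Power_On_Hours). A minimal check from the zfsguru/FreeBSD live CD looks something like this (device name is just an example):
    Code:
    zfsguru#smartctl -A /dev/da0 | egrep "Power_On_Hours|Load_Cycle_Count"
            #attribute 9 is hours powered on, attribute 193 is the head park count
            #if 193 is growing an order of magnitude faster than 9, the idle timer is worth fixing
    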

    Now to the actual issue with 4K drives (WD or Samsung) and ZFS on Solaris - sector-aligned I/O. Solaris would have no problem at all with 4K drives if the drives reported their physical arrangement correctly to the OS. "$zpool create" will correctly align things to 4K boundaries if the drive reports as such. Unfortunately these drives report having 512B sectors, so Solaris will align things on 512B boundaries which may or may not correspond to a 4K boundary. Without compiling my own version of zpool there is no way to override this behaviour, so all the vdevs in my original pool had "ashift == 9", i.e. aligned to 2^9 (512B) boundaries.
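
    As a quick check of what a drive is telling the OS, prtvtoc prints the sector size Solaris is working with (device/slice name is just an example):
    Code:
    solaris#prtvtoc /dev/rdsk/c7t1d0s0 | grep bytes/sector
        #expect "512 bytes/sector" even on an Advanced Format drive - the drive is lying to the OS
    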

    sub.mesa has also determined that the number of data (as opposed to parity) drives in the vdev has an effect. Basically the 128K ZFS recordsize needs to be divided evenly between all data drives in the vdev, and if each drive ends up with a multiple of 4K everything is peachy. Otherwise writes end up being slowed by having to write at least one partial sector to each drive per 128K of data, which means a read/modify/write on the next pass of the drive head and a minimum delay of ~7ms for every 128K written (depending on your rotation speed).

    So for a raidz1 of 5 drives, there are 4 data drives, thus every 128K write operation has 32K written to each drive; all is aligned and well with the world. Same thing for 6 drives in raidz2. But for 6 drives in raidz1 or 7 in raidz2, you have 5 data drives, so the 128K write gets divided into 25.6K per drive: 6x4K writes = 24K at full speed, but the last 1.6K write costs a minimum of ~7ms due to the read/modify/write cycle.
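
    If you want to sanity-check a layout before building it, the arithmetic is easy to script - this is just a plain shell sketch of the "does 128K split into 4K multiples" test, nothing ZFS-specific:
    Code:
    solaris$ for n in 2 3 4 5 6 8; do
                 if [ $((131072 % (4096 * n))) -eq 0 ]; then
                     echo "$n data drives: $((131072 / n / 1024))K per drive per 128K record - 4K aligned"
                 else
                     echo "$n data drives: per-drive chunk is not a 4K multiple - expect read/modify/write"
                 fi
             done
    2 data drives: 64K per drive per 128K record - 4K aligned
    3 data drives: per-drive chunk is not a 4K multiple - expect read/modify/write
    4 data drives: 32K per drive per 128K record - 4K aligned
    5 data drives: per-drive chunk is not a 4K multiple - expect read/modify/write
    6 data drives: per-drive chunk is not a 4K multiple - expect read/modify/write
    8 data drives: 16K per drive per 128K record - 4K aligned
    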

    So there are two things I need to do - get Solaris to somehow use 4K alignment (ashift == 12, since 2^12 = 4096), and use the right number of data drives to evenly divide the 128K ZFS recordsize.

    I had already determined that I wanted 4x5 disk vdevs due to the layout of the Norco, allowing two bays for the system rpool and 2 spare. This pretty much locks me into raidz1 (4 data drives = 32K per drive) as raidz2 would mean 3 data drives = 42.667K per drive. The only consideration here is whether I'm happy with that level of redundancy - in my case I will be setting "copies=2" on important ZFS filesystems and using offsite backups to mitigate any risk in this regard.
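
    For reference, the copies setting is just a per-filesystem property (the dataset name here is made up), and it only applies to data written after it is set:
    Code:
    solaris#zfs create tank/important
    solaris#zfs set copies=2 tank/important
    solaris#zfs get copies tank/important
    NAME            PROPERTY  VALUE   SOURCE
    tank/important  copies    2       local
    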

    So, how to get ashift=12 on my pool for 4K alignment? Aside from a self-compiled zpool executable there is no way to do this on Solaris 11. FreeBSD can achieve this by way of the gnop command, but I wanted to stick with Solaris for my OS. In theory zfs pools created on one system should be portable to another system even with a different OS so this formed the basis for my plan.

    As per my previous migration from WHS, all current drives were in use so I needed temporary storage to fit all my data on. 5 Samsung HD204UI were acquired to serve as the migration store, and to be integrated into the new pool once migration was complete.

    My first idea was to use sub.mesa's zfsguru live CD and:
    1) create a 4x5 zpool on 20 64MB files, using gnop to set the ashift
    2) spread those 20 files over 3 of the new HD204 drives
    3) mount the drives containing the files on Solaris and zpool import
    4) set autoexpand=on, then zpool replace each 64MB file with a 200GB file (to spread my data evenly)
    5) zfs send | zfs receive onto the file-backed pool
    6) zpool replace each file onto a physical drive

    Note: I used zfsguru, but due to some limitations I ended up performing most of the commands either at the console shell or via the command line interface in its web interface. Any recent FreeBSD LiveCD should be able to achieve the same thing.

    This had to be aborted because "$zpool replace tank fileX diskX" worked fine up until the last disk in each vdev, when autoexpand should kick in. At that point Solaris would hang indefinitely with no disk activity lights, requiring a hard power off (and the subsequent power on would only boot into single user mode). As a test, I created a 3 disk raidz1 zpool on 64MB files and could successfully grow those files up to 1.5TB (albeit getting slower as it grew, even with no data), but anything above 1.5TB would hang the system whether I was replacing a file with another file or with a disk.

    Then I tried creating a 3xHD204UI raidz1 pool with zfsguru and zpool import on Solaris:
    Code:
    zfsguru#gnop create -S 4096 da0
    da0.nop created
    zfsguru#zpool create migrate raidz1 da0.nop da1 da2
    zfsguru#zdb migrate | grep ashift
        ashift    12
        ashift    12
    zfsguru#zpool export migrate
    
    solaris#zpool import migrate
    solaris#zpool status migrate
    pool:  migrate
        migrate
            raidz1-0
                c7t1d0
                c7t2d0p0    //note the p0 - I'll come back to this later
                c7t3d0
    solaris#zdb -L | grep ashift
        ashift    9         //rpool
        ashift    9         //rpool
        ashift    12        //migrate
        ashift    12        //migrate
    solaris#zfs snapshot -r tank@current
    solaris#zfs send -R tank@current | zfs recv -F migrate
    
    This all worked fine and now I had all my data on the migrate pool. Transfer took around 4 hours at ~150MB/s (~75MB/s per drive) - looking much better already. Now to create my 4x5 disk zpool and migrate the data back onto that.

    In order to spread the data evenly between my vdevs, I elected to create the initial pool with one 320GB drive per vdev, and I only had 2 of the new HD204UIs left at this point. So I wanted 2x808+2xHD204+1x320 for vdev1, 4x15EARS+1x320 for vdev2, 4x10EARS+1x320 for vdev3, and I also bought 5 Hitachi 1TB drives for vdev4 (4 of them plus a 320 initially). Once all the data was back on the big pool I would zpool replace all the 320GB drives with the appropriate larger model, and do the same for the 808GB drives.

    Booting zfsguru with 20 drives I couldn't easily work out which was which, so it was back to Solaris to create the initial pool (-f required due to the different disk sizes), then back to zfsguru to recreate the same structure that I observed:
    Code:
    solaris#zpool create -f tank raidz1 c7t{1..5}d0 raidz1 c8t{1..5}d0 raidz1 c9t{1..5}d0 raidz1 c10t{1..5}d0
    solaris#zpool status tank
    pool:    tank
        tank
            raidz1-0
                5xdisks
            raidz1-1
                5xdisks
            raidz1-2
                5xdisks
            raidz1-3
                5xdisks
    solaris#zdb -L | grep ashift    #zdb -L is much faster than zdb tank
        ashift    9         //repeated multiple times
    solaris#zpool export tank
    
    zfsguru#zpool import            #Import hidden pools using the web interface
    pool:    tank
    version: 31
        tank
            raidz1-0
                5xdisks
            raidz1-1
                5xdisks
            raidz1-2
                5xdisks
            raidz1-3
                5xdisks
    zpool unable to import tank, pool version 31, v28 maximum supported
    zfsguru#           #Not worried about the version as I will recreate anyway
                       #Now I know which disk is which
                       #Create gnop for one disk per vdev
    zfsguru#gnop create -S 4096 da0
    da0.nop created
    zfsguru#gnop create -S 4096 da5
    da5.nop created
    zfsguru#gnop create -S 4096 da10
    da10.nop created
    zfsguru#gnop create -S 4096 da15
    da15.nop created
    
    zfsguru#zpool create -f tank raidz1 da0.nop da{1..4} raidz1 da5.nop da{6..9} raidz1 da10.nop da1{1..4} raidz1 da15.nop da1{6..9}
    zfsguru#zpool status
    pool:    tank
    version: 28
        tank
            raidz1-0
                da0.nop + 4xdisks
            raidz1-1
                da5.nop + 4xdisks
            raidz1-2
                da10.nop + 4xdisks
            raidz1-3
                da15.nop + 4xdisks
    zfsguru#zdb -L | grep ashift
        ashift 12        //repeated multiple times, looking good
    zfsguru#zpool export tank
    
    solaris#zpool import tank
    zpool: warning zpool version 28 not current
    solaris#zpool upgrade tank
    zpool: tank upgraded from v28 to v31
    solaris#zdb -L | grep ashift
        ashift    9         //rpool
        ashift    12        //multiple repeats - tank is looking good
    solaris#zpool status tank
    pool:    tank
        tank
            raidz1-0
                5xdisks, some with d0 some with d0p0
            raidz1-1
                5xdisks, some with d0 some with d0p0
            raidz1-2
                5xdisks, some with d0 some with d0p0
            raidz1-3
                5xdisks, some with d0 some with d0p0
    
            #Hmm, looks ok, but what's with the d0 and d0p0...
    solaris#zpool offline tank c7t1d0
    zpool: device not in pool
    solaris#zpool offline tank c7t2d0p0
    zpool: device not in pool
            #WTF!!!
    
    As you can see, test operations on the drives in the pool don't work; how am I going to expand the pool by replacing drives if the current drives supposedly aren't in the pool?!

    Google suggested referring to the drives by their assigned guid:
    Code:
    solaris#zdb -L | less           #get all the guids
    solaris#zpool offline tank 43651b0987ca872    #c7t1d0
    solaris#zpool status tank
        DEGRADED, one disk offline
    solaris#zpool online tank 43651b0987ca872
    solaris#zpool status tank
        HEALTHY
    solaris#zpool replace tank 43651b0987ca872 c7t1d0
    solaris#zpool offline tank c7t1d0
    solaris#zpool status tank
        DEGRADED, one disk offline
    solaris#zpool online tank c7t1d0
    solaris#zpool status tank
        HEALTHY
    
    That seems to have fixed it for one drive; rinse and repeat for the other 19. Now zpool status lists each drive as cXtXd0 with no p0 suffixes, and each drive can be offlined/onlined/replaced with no problem.
    Code:
    solaris#zdb -L | grep ashift
        ashift    9         //rpool
        ashift    12        //multiple repeats for tank, all good
    solaris#zfs snapshot -r migrate@newcurrent
    solaris#zfs send -R migrate@newcurrent | zfs recv -F tank
    
    A few hours later all my data was sitting on the new zpool, correctly 4K aligned and with the right number of data drives to evenly split the 128K records into 4K chunks. Each vdev still had a 320GB drive in it to spread the data, but before killing the migrate pool and zpool replacing all the bigger drives in, I did a scrub and ran some tests:
    Code:
    solaris#zpool scrub tank
    solaris#zpool iostat -v 5
        #Read operations appear to be running ~750+MB/s
        #Some time later
    solaris#zpool status tank
        scrub: 2.76TB completed in 00h53m       #Average ~900MB/s!!!!
    solaris#date +%H:%M:%S.%N && cp /tank/some8GBfile /tank/tmpfile && date +%H:%M:%S.%N
    22:04:13.823640
    22:04:33.937472                   #~410MB/s write :)
    solaris#date +%H:%M:%S.%N && cp /tank/some12GBfile /tank/copies2FS/tmpfile && date +%H:%M:%S.%N
    22:08:21.018475
    22:09:20.198447                   #~205MB/s write with copies=2
    solaris#zpool set autoexpand=on tank
    solaris#zpool replace tank {320&808GBDrives} {LargerDrives}
    solaris#zpool list
    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT                       
    rpool   232G  17.8G   214G     7%  1.00x  ONLINE  -                             
    tank   25.0T  2.76T  22.2T    11%  1.00x  ONLINE  -                             
    
    I am a happy camper :)

    As a partial aside, I had one of my 808GB drives fail during this whole process - it's possible this drive had been on the blink for a while and was contributing to my initial speed problems.

    Huge thanks go to sub.mesa for his work investigating and organising testing of the 4K issues, and for making zfsguru. Suggestions for improvement: allow creation of zpools on files for testing, and allow creation of a zpool with multiple vdevs (I think the latter is possible by making a single-vdev pool and then adding vdevs via the separate web page, but I didn't try this).

    Thanks also to davros123 for his inspirational zfs/Solaris thread that got me started, and for splitting the shipping costs for our motherboard and RAM last year.

    For those interested in my exact hardware setup:
    Norco 4220
    Corsair HX620 PSU
    Intel S3210SHLC M/B
    Q9450 CPU
    8GB (2x4GB) 800MHz ECC RAM
    3xIBM BR10i (LSI SAS3082E-R) SAS/SATA Controllers
    - Configured for staggered HDD spinup
    2xSeagate 250GB 3.5" drives mirrored rpool
    20xDrives in 4 vdevs each 5xraidz1:
    - 5xSamsung HD204UI #4K drives
    - 5xWD 15EARS #4K drives
    - 5xWD 10EARS #4K drives
    - 5xHitachi 1TB 7200rpm #512B drives
    2x spare bays in the Norco - I have played with an SSD for ZIL/L2ARC but don't see much benefit for my use case, so I'll just use them to plug in backup drives rather than running eSATA from the motherboard.
     
  2. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    Not only that, any current pool with ashift=9 will NOT accept any native 4K sector disk. All current Advanced Format disks physically have 4K sectors, but still report 512-byte sectors and emulate the smaller sector size whenever I/O is not done in exact multiples of 4K. But in the future native 4K sector disks may come out, and you CANNOT add those to an existing pool with ashift=9 set. You can add 512-byte sector disks to an ashift=12 pool though. So you can go lower with sector size, but you cannot go higher. This is a permanent setting for your pool.

    Thanks, that's nice to hear!

    TLER should be off when using any non-Windows Software RAID, for the following reason:

    1) these systems do not require TLER to prevent them being dropped out of the RAID array
    2) it may even be harmful in a case where you have lost your redundancy (i.e. degraded RAID5; degraded RAID-Z) and a remaining disk member has an unreadable sector. There's no other copy of that data, so if the disk fails to retrieve the information, there is partial data loss. In such a case, you would want to give the disk as much time as it needs to try to recover that data. Whereas with TLER, the disk is forced to abandon its recovery after typically 7 seconds, which may not be enough time for a successful recovery. Hence, enabling TLER could be dangerous and lead to unnecessary data loss.

    Others argue that it prevents long hiccups in the case of a bad sector, though. But when using a system like ZFS it shouldn't cause a hiccup, because ZFS writes a redundant copy back over the bad/weak sector, fixing the damage instantly. This is one of the main advantages of ZFS. It is almost impervious to bit rot on your disks.
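
    (For drives that do expose the setting, smartmontools can query and change the error recovery timeout - whether the change survives a power cycle depends on the drive. Device name is just an example:)
    Code:
    zfsguru#smartctl -l scterc /dev/ada0          #show the current read/write recovery limits
    zfsguru#smartctl -l scterc,70,70 /dev/ada0    #7.0 second limit, i.e. TLER-style behaviour
    zfsguru#smartctl -l scterc,0,0 /dev/ada0      #disable the limit and let the drive keep trying
    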

    I've seen 400,000 load cycles on a disk with just over 1 year of spinning time, which is absurd and exceeds the design specification of 300,000 load cycles. I recommend not disabling headparking entirely, however, but increasing the threshold from 7/8 seconds to a more appropriate 120 seconds.

    Headparking will protect better against shock during moments of no activity, and will also reduce the idle power consumption. Both things are nice to have in a 24/7 home NAS environment.

    I may be off here, but I remember reading that 512-byte sectors are hardcoded in Solaris/ZFS, such that when native 4K sector disks come out, Solaris couldn't even use them. I could be wrong about this, however. I certainly do not know Solaris as well as FreeBSD.

    As I understand it, the low performance is a combination of two problems. The first is using badly aligned 'dirty partitions' the way Windows XP created them, which start at sector 63 (a 31.5KiB offset). This is very poor, and such a partition should never be used on modern storage. Windows Vista SP1+ creates partitions the proper way, with a 1 megabyte (1MiB) offset that works well with any sector size.
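
    (On the FreeBSD/ZFSguru side you can see where a partition actually starts with gpart - device name is just an example:)
    Code:
    zfsguru#gpart show da0
        #look at the start sector of the freebsd-zfs partition:
        #a multiple of 2048 sectors means a 1MiB-aligned, 4K-safe start;
        #a start sector of 63 is the old misaligned MS-DOS style layout
    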

    The second problem is that Solaris platforms have no equivalent of the GEOM framework on FreeBSD to 'upgrade' the sector size to 4K during pool creation. There are patched binaries in circulation that force ashift=12 at pool creation, but they will also apply that to existing ashift=9 pools, which may corrupt and destroy them. Needless to say, such patched binaries are dangerous and should not be used.

    Just one little note about using copies=2. If one disk dies and another disk in the same RAID-Z vdev has a permanently unreadable sector, copies=2 will help protect your data. But if two disks die in the same RAID-Z, copies=2 won't help you and your entire pool is lost. Theoretically, losing the whole pool should not be necessary, as enough redundant copies exist to make available at least the data that was written with copies=2. Metadata always has ditto copies, and as such should also be available. In the future I hope that using RAID0 (multiple single-disk vdevs) combined with copies=2 will achieve the desired effect of filesystem redundancy coping with missing/failed disks, simply causing the data written with copies=1 to become (temporarily) unavailable.

    Ouch, that sounds complicated!

    Why not simply:

    1) boot ZFSguru LiveCD, fire up the web-interface
    2) format all disks with GEOM on the Disks page
    3) create a pool on the Pools->Create page, using the 4K sectorsize override feature
    4) export the pool
    5) connect the disks to solaris platform and import the pool

    That should work, but of course requires the disks to be empty due to formatting.

    One note about this: keep your pool at version 28 if you want to be able to migrate to other platforms in the future! ZFS v31 is NOT open source; it is a proprietary Solaris-only release which may never be compatible with any other system, and any future compatibility relies totally on Oracle releasing the source code under the CDDL.

    Thus, if you want the freedom of being able to move between different ZFS platforms, it is highly recommended you create your pool at no higher than version 28, using a command like:
    zpool create tank -o version=28 -O version=5 (...)
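
    For example, with made-up device names (and the gnop trick for the 4K override), plus a check that the version stuck:
    Code:
    zfsguru#zpool create -o version=28 -O version=5 tank raidz1 da0.nop da1 da2 da3 da4
    zfsguru#zpool get version tank
    NAME  PROPERTY  VALUE    SOURCE
    tank  version   28       local
    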

    Appears your hard work finally paid off. :)

    And thank you for writing this guide; hope it helps other people with the same issue!

    As for creating a pool with multiple vdevs; that works as follows:
    1) create the pool with first vdev
    2) expand the pool with additional vdevs

    It makes no difference whether you create the pool in one batch with multiple vdevs or use expansion (the add command), as long as you do not write any data in between - existing data will not automatically be relocated onto new vdevs.
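
    In command form that is roughly (device names made up, with one .nop per vdev so every vdev gets ashift=12):
    Code:
    zfsguru#zpool create tank raidz1 da0.nop da1 da2 da3 da4
    zfsguru#zpool add tank raidz1 da5.nop da6 da7 da8 da9
    zfsguru#zpool status tank        #should now show raidz1-0 and raidz1-1
    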

    Cheers!
     
  3. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    MATE! Long time no chat.

    Thanks, in fact, go to you for getting me started with that board :thumbup:

    Great to see you are up and running and wow, that IS one heck of a post!

    PS. Now come to the dark side and join me on ESXi :Pirate:

    PPS. Hats off to sub.mesa - I have learned so much reading your posts both here and over at Hardforums. Thank you.
     
    Last edited: Sep 14, 2011
  4. OP
    OP
    TrenShadow

    TrenShadow Member

    Joined:
    Sep 29, 2003
    Messages:
    78
    Yep, I was trying to say that, but you did it better ;) If you're creating a pool now with 512B drives, take the time to create it with ashift=12 so that you can just zpool replace the 512B drives with 4K drives later rather than having to re-create your pool.

    That was pretty much my assessment of the arguments, but the fact that I can't change my WD drives makes it a moot point for me. I suspect that it's off, which is good, but if it's actually on I'm not going to be overly concerned to the point of losing money replacing drives just for this.

    I couldn't set the timer higher than 30 seconds, and using /d to disable would set it to ~1 hour on some drives and 11 seconds on others. I decided to just use /s0 to disable it entirely, and I'll use Solaris' power management features to spin the drives down.

    I'm pretty sure (but not 100%) the patch to support native 4K drives made it into Solaris 11, but there's no property defined to override drives that lie about their geometry. I.e., you can't do "#zpool create -o ashift=12 tank ..."

    I considered using a patched binary (or compiling my own version of one) just for pool creation and falling back to the original binary for subsequent operations, but decided against it.

    Yep, that was a consideration. My hope is (and I have done zero research into this) that in the event of multiple drive failure in the one vdev, all copies=2 data, and some copies=1 data might be retrievable; if it's possible I'm sure some fancy command line gymnastics will be required. If it's not possible then I restore all my copies=2 data from offsite backup and re-rip all my media collection :wired:

    Yep :tongue: In some of my early prototyping, I created a zpool on disks in zfsguru and when I imported to Solaris it complained about corrupted GPT labels and I couldn't re-apply from backup labels. Dunno why. So I thought I'd leave Solaris to do all the actual disk processing, and just use BSD to create the GEOM corrected zpool structure on files. As you saw, I ended up doing it the way you suggest anyway (albeit via the CLI to make the pool in one hit and only making one .nop per vdev). I never had that GPT label problem again...

    Again, something I considered, but aside from the performance problems that I have now sorted out, I'm happy with Solaris 11. I'm hoping to basically treat it as an appliance once everything is fully configured. I'll upgrade Solaris when new versions are available if I feel like it and there are compelling reasons to, but otherwise I'm happy with what I've got :)

    I worked in IT for ~10 years, but have never had good luck with my home systems. Despite buying all the recommended components, all my efforts in overclocking have yielded nil results. It's definitely nice to finally have something working the way I've heard it should be able to!

    As you can see from my post count and join date, I tend to lurk a lot and post little; decided it was time to give back to the community I've learnt so much from :thumbup:

    Thought it would probably work like that, but only saw the expand page in zfsguru web interface after I'd already created the pool in one hit via CLI :tongue:

    Funny you say that - Before I decided on Solaris and virtualising other stuff on top via zones and VBox, I was thinking of running Solaris on top of ESXi and iSCSI-ing zfs volumes to Windows Home Server to get the advantages of things like wife acceptance factor and Windows Media Center connector. Ultimately decided I wasn't sure if it was possible at the time so put that idea back in the cupboard. Despite what I said above about running this like an appliance, the idea of ESXi still greatly appeals to my inner-geek (or not so inner :p)
     
  5. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
    What awesome timing for you to have written this post.

    I built a ZFS array on 4 x 3TB disks on my HP microserver (with Solaris 11 Early Adopters release) and was dismayed at the IO performance. Didn't take long to figure out the problem, then note that Sol11 EA also doesn't have an ashift option in it.

    I'm downloading ZFSguru as I type - I'll report back soon :)

    Thanks for the tips!
     
  6. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
    The operation was a complete success, and everything appears to be the way it should be, but I'm still having performance issues.

    It's worth noting that it started out at 30MB/s and has dropped progressively over the past few hours to now averaging around 6-7MB/s.

    It's definitely faster (nearly a 100% improvement) than it was, but a long way short of what I'd expect 4 drives to be able to do.

    The only thing I can think of is the CPU isn't able to do the deduplication hashes fast enough, but then again it's also 60% idle.

    However, executing a sync takes ages, 2-3 minutes... so it seems to be pure IO load, yet iowait shows 0 at the moment.

    Plenty of free RAM - I just don't get it.

    Oh well, I'll investigate more tomorrow. Thanks for your informative posts!
     
  7. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    Do not use dedup.

    You'll need 2GB of free RAM per TB or the tables will be stored on disk... which equals sloooooowwwww.
     
  8. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    If you have not stored real data on your pool and can still destroy the pool, using the (destructive) benchmark feature in ZFSguru is advised.

    On the Disk page you can see the Benchmark tab. This tab allows you to benchmark your disks in various ZFS configurations, generating nice graphs with comparable performance figures. This should tell whether your hardware is working like it should, and what configurations work best on your system.

    As davros said, using dedup is not really recommended due to the RAM and L2ARC requirements and for most users the saved space does not outweigh the performance impact of deduplication.
     
  9. OP
    OP
    TrenShadow

    TrenShadow Member

    Joined:
    Sep 29, 2003
    Messages:
    78
    @frankgobbo:
    What model hard drives are you using? And what is your zpool configuration (i.e. post the output of "zpool status _pool_name_")?
     
  10. DavidRa

    DavidRa Member

    Joined:
    Jun 8, 2002
    Messages:
    3,090
    Location:
    NSW Central Coast
    Just a quick nitpick - ashift is set per-vdev, not per-pool. So you need to follow the same process to create a new vdev as you do for creating the pool in the first place. Check out the output of zdb, or taemun's post on this forum thread to see what I mean.
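
    To illustrate with hypothetical output - a pool where one vdev was created with the 4K workaround and a second was added later without it will show different ashift values per top-level vdev:
    Code:
    solaris#zdb -L | grep ashift
        ashift    12        //raidz1-0, created via the gnop workaround
        ashift    9         //raidz1-1, added later without it
    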
     
  11. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
    It's 4 x Hitachi 5K3000 3TB disks. I know they're not the fastest drives in the world, but the transfer is now down to 1.8MB/s - which is slower than my internet link.

    The zpool config is 4 drives in a RAIDZ configuration. The 'older on-disk format' message is there because I created the pool on ZFSguru and I'm running Solaris 11 Early Adopter (b173 I think it is):

    Code:
      pool: data414
     state: ONLINE
    status: The pool is formatted using an older on-disk format.  The pool can
            still be used, but some features are unavailable.
    action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
            pool will no longer be accessible on older software versions.
      scan: none requested
    config:
    
            NAME          STATE     READ WRITE CKSUM
            data414       ONLINE       0     0     0
              raidz1-0    ONLINE       0     0     0
                c7t0d0p0  ONLINE       -     -     -
                c7t1d0p0  ONLINE       -     -     -
                c7t2d0p0  ONLINE       -     -     -
                c7t3d0p0  ONLINE       -     -     -
    
    errors: No known data errors
    
    I _can_ destroy the zpool but I'm a little loath to do so since it's taken 5 days or something to get to this point. That said, at this speed it's not practical to use anyway and I'd be better off with a RAID5 device instead.

    I'd like to keep dedupe on; it's one of the main reasons I'm looking at ZFS, as a decent number of client machines will be dumping backups to this device. As it stands now it's giving approximately an 18% data saving on disk (and it's only finished copying maybe 20% of the data). The con is the device has 8GB of RAM, so if the 2GB per TB of storage figure is true then it'll never keep up, but right now there's 1.4TB of data in the zpool and the machine isn't heavy on its RAM usage yet

    Memory: 8063M phys mem, 7111M free mem, 2048M total swap, 2048M free swap

    Edit: Some more box stats. You can see the CPU isn't busy, load average is low, IO wait is non-existent and the scan rate is practically zero. The NIC is set to 1000FDX; there's just no reason I can think of that it should be running _this_ slowly.

    Code:
    last pid:  3808;  load avg:  0.20,  0.19,  0.19;  up 4+16:24:48        11:25:20
    63 processes: 62 sleeping, 1 on cpu
    CPU states: 93.2% idle,  0.0% user,  6.8% kernel,  0.0% iowait,  0.0% swap
    Kernel: 1111 ctxsw, 3 trap, 1106 intr, 64 syscall, 3 flt
    Code:
    root@solaris:/home/pulse# vmstat 5
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr cd s0 s1 s2   in   sy   cs us sy id
     0 0 0 8401100 7334816 1  3  0  0  0  0  1  1 77 78 76 2222 3858 2423  0 12 87
     0 0 0 8349688 7283088 3 13  0  0  0  0  0  0 65 61 63 1079   32 1095  0  6 94
     0 0 0 8349608 7283020 0  0  0  0  0  0  0  0 66 63 69 2101 3690 2146  0 10 90
    Code:
    root@solaris:/home/pulse# iostat -Mnxz 5
                        extended device statistics
        r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
        0.2    0.9    0.0    0.0  0.0  0.0    0.4    1.2   0   0 c9d1
       36.5   40.0    0.1    1.7  0.0  0.3    0.0    4.6   0  31 c7t0d0
       39.1   39.2    0.2    1.6  0.0  0.4    0.0    4.7   0  33 c7t1d0
       36.1   40.0    0.1    1.7  0.0  0.3    0.0    4.6   0  31 c7t2d0
       39.5   39.3    0.2    1.6  0.0  0.4    0.0    4.6   0  33 c7t3d0
                        extended device statistics
        r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
       34.7   20.4    0.1    0.9  0.0  0.3    0.0    5.2   0  27 c7t0d0
       32.1   20.4    0.1    0.9  0.0  0.3    0.0    5.6   0  27 c7t1d0
       32.9   20.8    0.1    0.9  0.0  0.3    0.0    5.2   0  26 c7t2d0
       34.1   20.4    0.1    0.9  0.0  0.3    0.0    5.3   0  27 c7t3d0
    Code:
    root@solaris:/home/pulse# zpool iostat 5
                   capacity     operations    bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    rpool       8.33G   224G      0      0  7.23K  6.82K
    data414     1.67T  9.21T    149    296   608K  4.35M
    ----------  -----  -----  -----  -----  -----  -----
    rpool       8.33G   224G      0      0      0      0
    data414     1.67T  9.21T    171     31   686K  2.93M
    ----------  -----  -----  -----  -----  -----  -----
    rpool       8.33G   224G      0      0  13.5K      0
    data414     1.67T  9.21T    143    327   595K  2.74M
    ----------  -----  -----  -----  -----  -----  -----
    Code:
    root@solaris:/home/pulse# dladm show-ether
    LINK              PTYPE    STATE    AUTO  SPEED-DUPLEX                    PAUSE
    net0              current  up       yes   1G-f                            none
    The device it's copying from is capable of R/W speeds over the network in excess of 40MB/s (even just to a desktop on a 1Gb connection), and the transfer is getting slower and slower.
     
    Last edited: Sep 27, 2011
  12. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
    Ok, so interesting test. I tried playing around with the ZIL and added a ramdisk for the log device - no change. The ZIL performance was one of my theories.

    I did some other testing, and disabled dedupe while a write was occurring. The data transfer went from 1.8 to 30MB/s instantly.
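
    (For anyone following along, that on-the-fly change is just a dataset property, something like the below - it only affects blocks written after the change:)
    Code:
    root@solaris:/home/pulse# zfs set dedup=off data414
    root@solaris:/home/pulse# zfs get dedup data414
    NAME     PROPERTY  VALUE  SOURCE
    data414  dedup     off    local
    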

    So there's a huge overhead in deduplication while writing on this system, which is strange because the load doesn't indicate the system is struggling to keep up with it.

    I've tried upgrading the zpool from version 28 to 33 to see if that helps too, but it seems to be much the same. Unfortunately it just looks like dedupe is too big a performance tradeoff on this system... but really it's the entire reason I'm using ZFS to begin with; without it, this box would be more useful as a Linux or FreeBSD machine than a Solaris one.

    I'm pretty disappointed with this result. I can have performance or storage capacity, not both. The thing I don't understand is that because I'm using a RAIDZ pool, it _already_ has to checksum the data as it writes the parity to the 4th disk, so the deduplication checksum should be 'free'.

    My next theory, and I don't know if this would work, is to disable deduplication, copy all the data on, then enable deduplication and perform a scrub. Would this work? I don't even know... then it could sit there and recalculate/scrub when the system is idle and nothing is waiting on it.

    Any thoughts?
     
  13. flain

    flain Member

    Joined:
    Oct 5, 2005
    Messages:
    2,950
    Location:
    Sydney
    I'd imagine checksumming is just the beginning - it then needs to search a table containing checksums for every block in your filesystem that is part of dedupe, so it knows whether data is dedupeable or not. Consider how many blocks there are per 1TB of data; with a few TBs you end up searching a massive table of checksums for *each* block write.
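
    (If you want to see how big that table has actually grown, zdb can dump the DDT statistics for a pool, e.g.:)
    Code:
    solaris#zdb -DD data414
        #prints the dedup table histogram, including how many entries exist and
        #their approximate size on disk and in core - multiply entries by the
        #in-core size for a rough idea of the RAM a fully cached DDT would need
    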
     
  14. DavidRa

    DavidRa Member

    Joined:
    Jun 8, 2002
    Messages:
    3,090
    Location:
    NSW Central Coast
    Bingo - try adding an L2ARC device (a fast SSD - or multiple fast SSDs). It's virtually a requirement for dedupe.
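
    Adding one later is non-destructive - cache devices can be added to (and removed from) a live pool at any time. Something like (device name made up):
    Code:
    solaris#zpool add data414 cache c8t0d0
    solaris#zpool iostat -v data414 5      #watch the cache device warm up
    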
     
  15. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
    Hmm. Yeah... the downside of that being that the L2ARC device would cost more than the server itself :)

    Maybe I'm better off reusing one of the desktops... at least they have 16GB of RAM.

    My issue is that it's not even using the RAM it has available to it... ~800MB used is barely 10%, so surely while the box is running _nothing_ else it should be able to hold the dedupe database in memory?
     
  16. DavidRa

    DavidRa Member

    Joined:
    Jun 8, 2002
    Messages:
    3,090
    Location:
    NSW Central Coast
    You're still running Solaris, right? Or have you moved to FreeBSD/FreeNAS? The latter might need some tweaking. The former should just work.
     
  17. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
  18. DavidRa

    DavidRa Member

    Joined:
    Jun 8, 2002
    Messages:
    3,090
    Location:
    NSW Central Coast
    Hmm. Is that still build 151a? I wonder if there are any (useful) differences if not ...

    Hmm. No, it looks like it's newer, because I'm on pool version 31 and you seem to have 33. So what's new in your build, and does that have an effect on the performance of the pool?
     
  19. frankgobbo

    frankgobbo Member

    Joined:
    Sep 23, 2008
    Messages:
    107
    It's 173 build 2, with a number of issues fixed (though a lot still outstanding).

    I'd argue there's bugger all difference to be honest. There's "improvements", and who knows what they are - Oracle is an enigma. Can't find a change log, and Oracle doesn't specify what build it is (but it's a bit obvious when /etc/motd and uname say it - snv_173). There might be some info in the release notes available on that site though.

    This is the output of 'zpool upgrade -v' for the ZFS differences throughout the ages; not much difference between 31 & 33.

    Code:
    VER  DESCRIPTION
    ---  --------------------------------------------------------
     1   Initial ZFS version
     2   Ditto blocks (replicated metadata)
     3   Hot spares and double parity RAID-Z
     4   zpool history
     5   Compression using the gzip algorithm
     6   bootfs pool property
     7   Separate intent log devices
     8   Delegated administration
     9   refquota and refreservation properties
     10  Cache devices
     11  Improved scrub performance
     12  Snapshot properties
     13  snapused property
     14  passthrough-x aclinherit
     15  user/group space accounting
     16  stmf property support
     17  Triple-parity RAID-Z
     18  Snapshot user holds
     19  Log device removal
     20  Compression using zle (zero-length encoding)
     21  Deduplication
     22  Received properties
     23  Slim ZIL
     24  System attributes
     25  Improved scrub stats
     26  Improved snapshot deletion performance
     27  Improved snapshot creation performance
     28  Multiple vdev replacements
     29  RAID-Z/mirror hybrid allocator
     30  Encryption
     31  Improved 'zfs list' performance
     32  One MB blocksize
     33  Improved share support
     
    Last edited: Sep 27, 2011
