ZFS under FreeBSD: JBOQ (Just a Bunch Of Questions)

Discussion in 'Storage & Backup' started by rowan194, Aug 31, 2011.

  1. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Something I just noticed...

    Code:
    ZFS v15, FreeBSD 8.2-RELEASE, no AHCI
            NAME        STATE     READ WRITE CKSUM
            zfs01       ONLINE       0     0     0
              raidz1    ONLINE       0     0     0
                ad10    ONLINE       0     0     0
                ad12    ONLINE       0     0     0
                ad14    ONLINE       0     0     0
                ad16    ONLINE       0     0     0
    
    
    ZFS v28, FreeBSD 8.2-RELEASE-p2, AHCI enabled, plus L2ARC cache configured
    (same pool, but upgraded)
    
            NAME        STATE     READ WRITE CKSUM
            zfs01       ONLINE       0     0     0
              raidz1-0  ONLINE       0     0     0
                ada2    ONLINE       0     0     0
                ada3    ONLINE       0     0     0
                ada4    ONLINE       0     0     0
                ada5    ONLINE       0     0     0
            cache
              ada0p3    ONLINE       0     0     0
              ada1p3    ONLINE       0     0     0
    
    What's the significance of the "-0" after raidz1?
     
  2. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    The extra -0 just means that ZFS v28 numbers each vdev; you can ignore it.

    If I understand correctly, your source is a GELI encrypted UFS disk and your destination is a ZFS filesystem with compression enabled?

    Let's try to rule out as many variables as possible and stick to a normal test like:

    zfs create tank/testfs
    zfs set compression=off tank/testfs
    dd if=/dev/zero of=/tank/testfs/zerofile.000 bs=1m count=32000
    dd if=/tank/testfs/zerofile.000 of=/dev/null bs=1m
    (replace tank with the name of your pool)

    During both the write and the read, run gstat -a in another terminal (or gstat -f ada if you want the ada devices to stay on screen continuously). During the transfer, look for any disk that has to work longer than the others, e.g. stretches of 2 or 4 seconds where only one disk is busy. You could have a dud drive that performs much worse than the rest, for example.

    Did you enter the tuning variables in /boot/loader.conf, by the way? Even without them, though, your performance looks particularly poor; there must be a reason for it.

    You can also try:
    dd if=/dev/ada2 of=/dev/null bs=1m count=10000 &
    dd if=/dev/ada3 of=/dev/null bs=1m count=10000 &
    dd if=/dev/ada4 of=/dev/null bs=1m count=10000 &
    dd if=/dev/ada5 of=/dev/null bs=1m count=10000 &

    Execute all four dd commands at once so that you start reading from ada2-ada5 at the same time. When they finish, you should see the speeds. Are they all about the same? Are they what you would expect? Note that you can run gstat -a on another terminal to watch the current throughput.
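
    If it's easier, here is a rough /bin/sh version of the same test (assuming your disks really are ada2-ada5; adjust the names to suit):

    Code:
    #!/bin/sh
    # Read 10GB from each disk simultaneously; each dd prints its own
    # throughput figure when it finishes.
    for d in ada2 ada3 ada4 ada5; do
        dd if=/dev/$d of=/dev/null bs=1m count=10000 &
    done
    wait   # block until all four background dd processes have finished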
     
  3. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Here's the individual drive speed tests...

    Code:
    10485760000 bytes transferred in 65.509083 secs (160065742 bytes/sec)
    10485760000 bytes transferred in 79.781890 secs (131430329 bytes/sec)
    10485760000 bytes transferred in 81.281972 secs (129004744 bytes/sec)
    10485760000 bytes transferred in 81.434371 secs (128763320 bytes/sec)
    
    As you can see, one drive is faster; it is a slightly newer model (or at least revision). I don't know if this would have such a remarkable effect on performance.

    Zero-fill test:

    Code:
    nas01# zfs create zfs01/testfs
    nas01# zfs set compression=off zfs01/testfs
    nas01# dd if=/dev/zero of=/nas01/testfs/zerofile.000 bs=1m count=32000
    32000+0 records in
    32000+0 records out
    33554432000 bytes transferred in 258.341519 secs (129884008 bytes/sec)
    nas01# dd if=/nas01/testfs/zerofile.000 of=/dev/null bs=1m
    32000+0 records in
    32000+0 records out
    33554432000 bytes transferred in 177.750907 secs (188772212 bytes/sec)
    
    (Note the pool is zfs01, but the mount point is /nas01)

    When writing, the array is thrashing, and I can see from the individual activity lights that it's not I/O bound... activity is bursty, about 75%/25% on/off over a period of about 1-2 seconds. The same pattern shows on all of the drives' activity lights, including the slightly faster one.

    When reading the activity lights stay on almost constantly, and I can't hear any seeking at all, let alone thrashing.

    Is the thrashing on write normal for ZFS? I presume it has to do more activity than most file systems because of its block integrity tree.

    Now gstat is interesting... this is for read:

    Code:
    dT: 1.002s  w: 1.000s
     L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
        9    610    610  46761   15.9      0      0    0.0  100.6| ada2
       10    599    599  45973   16.5      0      0    0.0   99.5| ada3
        0    625    625  46633    3.1      0      0    0.0   22.3| ada4
        0    631    631  47421    7.4      0      0    0.0   50.0| ada5
    
    ada2=HDS722020ALA330 Firmware JKAOA20N
    ada3=HDS722020ALA330 Firmware JKAOA20N
    ada4=HDS723020BLA642 Firmware MN6OA5C0
    ada5=HDS722020ALA330 Firmware JKAOA3EA
    
    ada4 is the newer drive but it looks like ada5 is also different - it's showing 50% util while the other seemingly identical models are showing 100%.

    For writes the %busy is more consistent, 3 drives show about 85-90% while the faster drive shows about 5% less.

    Even with this slightly oddball mix of drives, the basic sequential read and write speed for zero-filled data is SIGNIFICANTLY faster than the real-world experience.

    Trying another test, copying nonzero data from SSD (ZFS mirror) to RAIDZ...

    1. dd if=/dev/random of=/var/rand.bin bs=1m count=8000 (/var is a ZFS file system on SSD)
    2. unmount UFS file system to free all stolen Inact memory -> Mem: 28M Active, 884K Inact, 3671M Wired, 827M Buf, 4213M Free
    3. dd if=/var/rand.bin of=/nas01/testfs/rand.bin bs=1m

    8388608000 bytes transferred in 53.614366 secs (156461945 bytes/sec)

    Again, transfer rates which are many times greater than the speeds I've been achieving when copying my data.

    Benching the encrypted UFS gstripe system does show up something unusual...

    Code:
    nas01# dd if=/var/rand.bin of=/mnt/rand.bin bs=1m
    8000+0 records in
    8000+0 records out
    8388608000 bytes transferred in 53.614366 secs (156461945 bytes/sec)
    nas01# umount /mnt  (to clear cache)
    nas01# mount -o rw,noatime /dev/stripe/localsync.eli /mnt
    nas01# dd if=/mnt/rand.bin of=/dev/null bs=1m
    8000+0 records in
    8000+0 records out
    8388608000 bytes transferred in 156.529683 secs (53591165 bytes/sec)
    
    Why is a read 3 times SLOWER than a write???


    Anyway... I think I'm just going to leave this for now. Once all my data is (slowly) copied over, I'll kill the encrypted UFS array and recreate it as ZFS. No point trying to figure out this weird issue if it disappears when the UFS array does.

    Sorry for rambling on. Hopefully this prob fixes itself. :thumbup:
     
  4. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    If reading from UFS is slower than writing, set:
    sysctl vfs.read_max=64

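    For example, something like this applies it now and keeps it across reboots (64 is just a starting value; try 128 as well):

    Code:
    # apply immediately
    sysctl vfs.read_max=64
    # make the setting persistent across reboots
    echo 'vfs.read_max=64' >> /etc/sysctl.conf
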
    As for those individual dd benchmarks: did you run them all at the same time? That is important! The & (ampersand) at the end of each dd command makes it run in the background, so you can run all four dd's at once. Then look at the gstat -a output.

    As for the gstat output you posted, I can see 3.1ms for the fastest disk and 16.5ms for the slowest; those are wildly different latencies and are likely down to the mixed disks. RAID-Z is very sensitive to latency, so having identical disks is important for maximum performance. Firmware differences can cause performance changes, but even with the same firmware you can see differences. Keep in mind that different firmware almost always means a different hardware revision, and minor unadvertised hardware changes can have different performance characteristics as well.

    With the read_max bump you should get better read performance from UFS, though.
     
  5. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I've tried both 64 and 128, but neither seems to have made much difference. Write speed depends on what rsync is copying at any given moment; right now it's doing a series of smaller files and ZFS write speed is barely hitting 10Mbytes/sec. Holy cow. :( Luckily there's not too much more to go, maybe another day at this speed.

    Yes, they were started as a block (cut & paste) and the 4 dd's ran concurrently.

    I'll worry about that once I'm rid of UFS. :)

    I did briefly stop rsync and umount the UFS volume, then copied a file via NFS from another server... 20-25Mbytes/sec write speed. Not really that great considering I was able to achieve nearly 130Mbytes/sec locally.

    Hmmm, I wonder if you could export /dev/null to measure raw NFS write speeds. :)

    ---

    edit: I have 4 x 5400RPM Hitachis, bought at the same time, which I could progressively move into the array to replace the 7200RPM drives. I wonder whether the performance benefit of truly identical models would be enough to overcome the speed and latency drop from the lower RPM. Probably not. But it's still interesting to consider. :)
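
    If I did try it, I gather it would be one zpool replace per disk, something like the following (ada6 is just a placeholder for wherever the new drive attaches):

    Code:
    # swap one member at a time and wait for each resilver to finish
    zpool replace zfs01 ada2 ada6
    zpool status zfs01   # shows resilver progress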
     
    Last edited: Sep 17, 2011
  6. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    It's been more than 3 days now, just to copy 2.5TB worth of data...

    I left "zpool iostat zfs01 1800" (30 min period) running overnight...

    Code:
                   capacity     operations    bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    zfs01       2.96T  4.29T     65    559   152K  3.53M
    zfs01       2.96T  4.29T     63    587   151K  3.74M
    zfs01       2.96T  4.29T     64    599   168K  3.98M
    zfs01       2.97T  4.28T    155    408   314K  2.96M
    zfs01       2.97T  4.28T     91    465   253K  2.57M
    zfs01       2.97T  4.28T     58    622   157K  2.89M
    zfs01       2.97T  4.28T     65    541   152K  2.65M
    zfs01       2.97T  4.28T     74    458   162K  2.07M
    zfs01       2.97T  4.28T     66    474   148K  2.55M
    zfs01       2.98T  4.27T     68    451   148K  2.36M
    zfs01       2.98T  4.27T     63    500   146K  2.64M
    zfs01       2.98T  4.27T     69    460   154K  2.28M
    zfs01       2.98T  4.27T     69    462   154K  2.24M
    zfs01       2.98T  4.27T     66    460   159K  2.55M
    zfs01       2.98T  4.27T     69    464   158K  2.24M
    zfs01       2.98T  4.27T     68    473   151K  2.34M
    zfs01       2.98T  4.27T     72    490   171K  2.39M
    
    If I can't fix this excruciatingly slow write speed issue once the data is copied I'll have no choice but to go back to RAID3 with verify. That setup doesn't self heal, but at least it checks parity on read and signals an error when the data and parity don't match... which HAS happened before.

    One thought: does ZFS control the drives' write-back cache under FreeBSD? How do I check that it hasn't disabled it? Would there be some other reason ZFS might be forcing synchronous writes, such as the volume being exported over NFS? Thrash, thrash, thrash on write... but it's still not I/O bound; none of the HDs are at 100%...
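
    I'm guessing something like this would show the cache state for each drive, and whether the dataset is forcing sync semantics (assuming camcontrol is the right tool for ada devices and this v28 build has the per-dataset sync property):

    Code:
    # look for the "write cache" line in the supported/enabled columns
    camcontrol identify ada2 | grep -i "write cache"
    # check whether the pool's datasets are forcing synchronous writes
    zfs get sync zfs01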
     
  7. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Something REALLY strange just happened.

    1. I paused (using ^Z) the rsync from the geli+UFS array to the new ZFS array
    2. I copied some new files onto the ZFS array from my Windows workstation (via NFS)
    3. I restarted the rsync

    At this point the rsync was flying, the list of files scrolling faster than I could read, even though it was copying a bunch of relatively small <5k files. I managed to get a zpool iostat started and it reported 32.3Mbytes/sec average write over a 10 second period, before dropping back down to the usual crawl (half a screen of files scrolls past, then it pauses for a couple of seconds; average write speeds are <5Mbytes/sec and sometimes even measured in kilobytes).

    So writing new data via NFS seems to have kicked it into gear, even if just momentarily? Maybe there's hope yet.

    ^&*#... just got a delayed write failed error on my winbox. What next? :(


    edit: I think the sudden increase in speed after a break in copying and the delayed write error may be related. I notice that when I stop the copy, the HDs keep thrashing for some time (more than half a minute), which I guess means there's some serious caching going on; it seems like ZFS may be accepting writes faster than it can commit them, so things start to clog up. The delayed write error may be because the write request effectively hung while ZFS had no cache left and/or was busy flushing the existing cache contents to disk. The question is: why is the I/O so slow and bursty when there's a heap of data waiting to be committed?

    I might add the ZIL back in just to see what happens.
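
    If I do, I believe the commands are roughly these (the SSD log partitions here are placeholders; I'd use whatever partitions I've actually set aside):

    Code:
    # ada0p2/ada1p2 stand in for the SSD partitions reserved for the log
    zpool add zfs01 log mirror ada0p2 ada1p2
    # and the cache partitions, if they were removed at some point
    zpool add zfs01 cache ada0p3 ada1p3
    zpool status zfs01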
     
    Last edited: Sep 18, 2011
  8. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Update: adding the ZIL and cache locked up the machine. (sigh)

    When I rebooted, things were suddenly much faster, even after the UFS file system had grabbed a few gigs of memory for itself. I don't know whether it was adding the ZIL or the reboot that cured it.

    I completed the copy, verified that every file was accounted for, destroyed all traces of geli+UFS, and created a new ZFS file system on the two drives that previously held RAID0+geli+UFS.

    Then I rebooted and... it hung. I waited 5 minutes, then hit ^C hoping it would cancel whatever it was stuck on and finish booting... but that didn't work either.

    Code:
    Trying to mount root from zfs:boot/root
    Setting hostuuid: 00000000-0000-0000-0000-00241d544643
    Setting hostid: 0xc7cf7863
    
    ^C
    
    ^C^C^Cadsfpokj9io^C^C\d\dfssdf
    
    Same thing happens every boot.

    I've had just about enough of ZFS, I am seriously considering going back to what I had: a working and stable NAS...
     
  9. snick

    snick Member

    Joined:
    Sep 8, 2001
    Messages:
    114
    As stated before, you could give ZFSguru a try; you could also try Solaris 11 Express.

    What I'm trying to say is that it's probably not ZFS itself that is at fault, as many people are running it successfully.
     
  10. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    Yep, as many others have said many times in this thread, you have done this the hard way... which is fine if you like doing things rough (hell, I had plenty of long-way-round stuffups on my NAS journey)... but don't complain when it turns into a mess.

    Just install Solaris 11 Express and napp-it... or give ZFSguru a go, and stop stuffing about with your roll-your-own half-solution...

    PS: why did you add a ZIL and cache? Do you actually know you need them? Is the ZIL a supercap SSD (pref. MLC) or a similar battery-backed unit? Otherwise there's no point... you might as well just disable sync for NFS if you are dead set on NFS. IIRC, NFS shares on ZFS then behave the same as CIFS or Linux NFS... yes, you lose the ZFS bonus of guaranteed sync-write integrity over NFS, but no one sweats about this for iSCSI, Samba, or NFS on Linux, so it's worth it for the risk/reward IMHO.
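
    if you do go that route, and assuming your v28 build exposes the per-dataset sync property, it's just:

    Code:
    # trade sync-write guarantees over NFS for throughput; already-committed
    # data stays safe, only the last few seconds of writes are at risk on power loss
    zfs set sync=disabled zfs01
    zfs get sync zfs01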

    However, if the copies are coming from a Windows box, why not use CIFS/Samba shares for those?

    My view is that you're just adding and changing stuff on an unstable and overly complex baseline... little wonder there are issues.
     
    Last edited: Sep 19, 2011
  11. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I don't understand what is so complex and roll-your-own about my setup. Remember, I scrapped my existing install once I realised there were serious problems running UFS and ZFS together; after that I downloaded mfsbsd, ran the magic script "zfsinstall" with the appropriate parameters, rebooted, changed the config to enable NFS, and started rsync. It can't get much simpler than that.

    Everything past that has been trying to figure out why write performance was abysmal. The ZIL was basically added in desperation, as the effective write speed at that point was only a couple of meg per second. I can see it's being written to, but I don't know whether it was that or the reboot that made everything suddenly come good.

    I've got everything booting again properly, and I'm happy to say I'm now copying from the RAIDZ1 array back to the ZFS file system I just created and achieving write speeds of about 75-80Mbytes/sec. NFS is also back to being nicely responsive and snappy rather than totally stalling during heavy writes. The speeds are probably not ideal as far as potential performance benchmarks go, but they're acceptable, and I'm not about to fiddle any more! :thumbup: :D
     
  12. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    great to hear it's all working...well done.
     
  13. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I noticed when I first switched to this setup that "du /" dies with an out-of-memory error. In fact, it gets killed by the OS rather than erroring out gracefully. I've now just seen the same thing happen to rsync.

    Sep 30 17:21:25 nas01 kernel: pid 33875 (rsync), uid 0, was killed: out of swap space

    I'd never previously seen either of these utilities run out of memory, and rsync has worked on this setup for the past couple of weeks, until now.

    The previous UFS-based setup had exactly the same set of files but less RAM (4GB versus 8GB), so I don't think it's an rsync thing. ZFS is supposed to release memory back to the system as it runs low, right?

    Any ideas? I'm watching 'top' right now as rsync runs, it's only consuming 71MB with 38MB resident...

    edit: I can see the value for 'Free' memory decreasing, and 'Inact' increasing. The only non-ZFS file system is /devfs.

    Here are some periodic lines from 'top'; I can see that ZFS appears to be releasing memory (ZFS caching is under 'Wired', right?):

    Mem: 13M Active, 140M Inact, 7364M Wired, 14M Cache, 29M Buf, 381M Free
    Mem: 30M Active, 167M Inact, 6637M Wired, 14M Cache, 29M Buf, 1064M Free
    Mem: 51M Active, 194M Inact, 6712M Wired, 14M Cache, 29M Buf, 940M Free
    Mem: 71M Active, 211M Inact, 6740M Wired, 14M Cache, 29M Buf, 876M Free
    Mem: 87M Active, 231M Inact, 6914M Wired, 14M Cache, 29M Buf, 666M Free
    Mem: 93M Active, 235M Inact, 5808M Wired, 14M Cache, 29M Buf, 1762M Free

    rsync is still running. Doesn't seem to be any issues with free memory...
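
    For what it's worth, I'm also keeping an eye on the ARC itself alongside top, assuming these are the right sysctls to watch:

    Code:
    # current ARC size versus its configured ceiling, in bytes
    sysctl kstat.zfs.misc.arcstats.size
    sysctl vfs.zfs.arc_max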
     
    Last edited: Sep 30, 2011
  14. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    Not sure why your Inact is that big, but do you have any SWAP volumes?

    If not, simply create it:
    zfs create -V 4g tank/SWAP
    zfs set org.freebsd:swap=on tank/SWAP
    swapon /dev/zvol/tank/SWAP

    That should create a 4GiB swap ZVOL and activate it automatically, including after a reboot.
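
    You can verify it took with something like:

    Code:
    swapinfo -k   # the new zvol should show up as /dev/zvol/tank/SWAP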
     
  15. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Inact grew with rsync's memory usage; the amount rsync was using was closer to the value of Inact than to Active.

    No swap - thanks for the info on how to create one. So you don't even need a separate slice/partition? More pleasant surprises with ZFS. :)

    FWIW I restarted the rsync and it ran fine. I'm still trying to understand why it was forcibly killed rather than terminating gracefully when a malloc failed. Perhaps there was a moment where a memory request pulled the available memory low enough for the kernel to say "time to die" before ZFS could release more...

    Is there a tunable to tell ZFS to take a little less memory for itself?
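
    (I'm guessing it's vfs.zfs.arc_max in /boot/loader.conf, something like the line below, but I haven't tried it yet.)

    Code:
    # /boot/loader.conf -- cap the ARC below its default; value is just an example
    vfs.zfs.arc_max="6400M"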
     
  16. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Happened again...

    Oct 1 03:07:56 nas01 kernel: pid 35088 (find), uid 0, was killed: out of swap space
    Oct 1 23:18:53 nas01 kernel: pid 37321 (rsync), uid 0, was killed: out of swap space


    'find' would have been automated daily maint. 'rsync' was manual.

    Why are utilities that traverse the entire file system getting killed due to lack of memory??! I have never seen this before.

    edit: I have set vfs.zfs.arc_max="6400M" in loader.conf, which has reduced the value by about 500MB (it was 7,222,816,768 without being explicitly set, and is now 6,710,886,400). Hopefully the extra headroom will stop these processes being killed.

    edit2: I (deliberately) destroyed a major part of the file system in order to regenerate it with compression, which would have required traversing a large number of dirs, although I didn't really think about that side effect at the time... when I got to the computer this morning, the screen showed the OS had killed off relatively benign processes like getty, sshd and ntpd, and the machine seemed to be up but unresponsive (possibly because it had killed getty).

    This message http://lists.freebsd.org/pipermail/freebsd-stable/2011-April/062195.html seems to agree with what I suspect: that ZFS isn't releasing memory fast enough when the system suddenly needs it. I will add swap and see what happens.
     
    Last edited: Oct 2, 2011
  17. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Oh man.

    Since it crashed I had to reboot, and the array has now been thrashing for about 2 hours while waiting to mount root. I assumed it was completing the interrupted destroy and making sure things were consistent before mounting the file system and handing over to the OS. No problem there, just the inconvenience of having to wait. But after a couple of hours this appeared on the screen...

    pid 38 (zfs), uid 0, was killed: out of swap space

    It hasn't even fully booted yet. :upset: I'm not sure there's any point trying to add swap right now, because (a) I'd have to interrupt the current operation (whatever it's doing is file-system related, so I'd rather leave it alone) and (b) I may not even be able to boot into single-user mode without it starting to thrash again, so it's back to square one.

    It's still thrashing, even after killing zfs, so I'm going to let it keep going and see what happens. Then I'll add the swap.
     
    Last edited: Oct 2, 2011
  18. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    It should not do that unless you have very aggressive tuning in your /boot/loader.conf. Can you show me your full loader.conf?

    For now, why not disable the tuning; an unstable system is not what you want, I imagine. If you are using compression or de-duplication, your memory usage can far exceed the ARC itself. That can deplete all memory, and an application that asks for a couple of megs of memory is denied the request, after which it is killed.

    To be honest I've not seen this happen often, at least not with the severity you describe, and in all cases adding some swap solved the issue. Can you confirm that the output of 'top' shows the swap as active after a fresh boot?
     
  19. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Sure.

    Code:
    zfs_load="YES"
    vfs.root.mountfrom="zfs:boot/root"
    ahci_load="YES"
    geom_concat_load="YES"
    vfs.zfs.txg.synctime=5
    vfs.zfs.txg.timeout=10
    vfs.zfs.txg.write_limit_override=67108864
    vfs.zfs.arc_max="6400M"
    
    I think the vfs.zfs.* tunables were all recommended by you, apart from the last one.

    Swap is now added and active, including after a reboot. Hopefully that will provide enough headroom. The out-of-memory situation seems to happen very suddenly - I spent some time watching top and the change in memory was very gradual. I never managed to catch the moment a process was killed, but it wasn't expected (i.e. memory had been decreasing very slowly and would have taken hours or days to run out at that rate...)

    The thrashing during boot went on for a total of about 5 1/2 hours before it fully booted into multi-user mode. Some other process was "Killed" during that time too. I'm not sure exactly what it was doing all that time, or why it carried on after 'zfs' and that other process were killed, but it seems to have completed the destroy successfully. I'm copying the data back now, which should give it a good workout (rsync traversing lots of dirs).
     
  20. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    Well, that tuning looks very reasonable, except that there's no kmem tuning or arc_min.

    Please tell me if you still have issues now that your swap is working properly. It's quite strange that you have this problem at boot time, though. For my ZFSguru project I use up to 75% of RAM for the aggressive memory-tuning profile, and that did not exhibit any such behaviour.

    Should you still have problems, I would recommend lowering the arc_max; especially when using deduplication you have to lower it.
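
    If it comes to that, this is roughly what I mean; the values are examples only and would need sizing for your 8GB box:

    Code:
    # /boot/loader.conf -- example values only
    vm.kmem_size="7G"
    vm.kmem_size_max="7G"
    vfs.zfs.arc_min="1G"
    vfs.zfs.arc_max="4G"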
     
