ZFS under FreeBSD: JBOQ (Just a Bunch Of Questions)

Discussion in 'Storage & Backup' started by rowan194, Aug 31, 2011.

  1. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    something is very wrong. you should not get those speeds.

    not sure what to suggest though.
     
  2. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    rowan194, can I see your setup in more detail? For example:

    - zpool status <poolname> output
    - fdisk <disk> output if you use partitions
    - what controller do you use?
    - have a look at gstat during copying files; are the disks highly loaded?
    - have a look at top output; do you see much memory going to InAct instead of 90%+ memory to Wired?

    Can you try some standard benchmarks:

    # write 32GB
    dd if=/dev/zero of=/poolname/zerofile.000 bs=1m count=32000

    # read 32GB
    dd if=/poolname/zerofile.000 of=/dev/null bs=1m

    (replace poolname with the name of your pool)
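
    For the gstat/top/zpool checks above, roughly something like this (pool and disk names are only examples, substitute your own):

    Code:
    # per-disk load, refreshed every second, while the copy runs
    gstat -I 1s
    
    # memory split: Wired (mostly ZFS ARC) vs Inact
    top -b | head -15
    
    # pool health and layout
    zpool status zfs01
    
    # partition layout, if you use partitions
    fdisk ad10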
     
  3. flain

    flain Member

    Joined:
    Oct 5, 2005
    Messages:
    2,950
    Location:
    Sydney
    while you are at it, do a

    cat /var/adm/messages

    Anything in there? It should be empty on a healthy system. Also, what does "iostat -en" give you?
     
  4. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    all good suggestions, sub.mesa.

    might also be worth testing with a single disk in the pool, and also checking for errors with "iostat -exn"

    edit: flain got in first...was having dinner and didn't press post :)
     
  5. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Using bare disks only. Controller is Intel ICH9R in AHCI mode.

    atapci2: <Intel ICH9 SATA300 controller> port 0xe600-0xe607,0xe700-0xe703,0xe800-0xe807,0xe900-0xe903,0xea00-0xea1f mem 0xec102000-0xec1027ff irq 19 at device 31.2 on pci0
    atapci2: AHCI v1.20 controller with 6 3Gbps ports, PM supported


    Never used gstat before so I'm not sure how to interpret the output, but "iostat -w 1 -x" doesn't suggest that any of the disks are highly loaded. In fact, the writing to the RAIDZ members is bursty (a big write to all members over 1 sec, then nothing for the next 4) rather than continuous. During the burst of writing, the source drives go to 0% util. (As well as seeing it in iostat, I also have separate drive activity indicators.)

    I forgot to add that I have vfs.zfs.txg.timeout="5" and vfs.zfs.txg.write_limit_override=1073741824 set, as suggested in the ZFS tuning guide.

    Here's zpool status and top:

    Code:
    nas01-b# zpool status zfs01
      pool: zfs01
     state: ONLINE
     scrub: scrub completed after 0h1m with 0 errors on Thu Sep 15 02:25:57 2011
    config:
    
            NAME        STATE     READ WRITE CKSUM
            zfs01       ONLINE       0     0     0
              raidz1    ONLINE       0     0     0
                ad10    ONLINE       0     0     0
                ad12    ONLINE       0     0     0
                ad14    ONLINE       0     0     0
                ad16    ONLINE       0     0     0
    
    errors: No known data errors
    
    -----------------
    
    Mem: 63M Active, 5441M Inact, 2193M Wired, 133M Cache, 828M Buf, 88M Free
    Swap: 1025M Total, 212K Used, 1024M Free

    Sure:

    Code:
    # write 32GB
    nas01-b# dd if=/dev/zero of=/zfs01/zerofile.000 bs=1m count=32000
    32000+0 records in
    32000+0 records out
    33554432000 bytes transferred in 265.619973 secs (126324958 bytes/sec)
    
    
    
    # read 32GB
    nas01-b# dd if=/zfs01/zerofile.000 of=/dev/null bs=1m
    32000+0 records in
    32000+0 records out
    33554432000 bytes transferred in 99.437083 secs (337443849 bytes/sec)
    
    I also benched the SOURCE array just in case there was an issue, but there's plenty of speed available there.

    Benchmarks look good, so why is my real world data so slow? :(
     
  6. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    3,048
    are you SURE compression is off? that would account for the diff... as dd will write highly compressible data (zeroes)

    also, does iostat -en show any errors (hard, transport, etc.)?
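
    A quick way to check (pool/filesystem name is yours, of course):

    Code:
    zfs get compression zfs01
    zfs get -r compression zfs01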
     
    Last edited: Sep 15, 2011
  7. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    Some issues I can see:

    - 1GiB transaction groups are too large; try again with:
    vfs.zfs.txg.synctime=5
    vfs.zfs.txg.timeout=10
    vfs.zfs.txg.write_limit_override=67108864

    - you do not have the ahci kernel module loaded; add to your /boot/loader.conf:
    ahci_load="YES"
    (note: this will rename disks from /dev/ad to /dev/ada; you might have problems if you use a static /etc/fstab without using labels!)

    - you use UFS + ZFS, which is not recommended on FreeBSD; right now only about 2GiB is available to ZFS, as can be seen in your top memory output: 5441M Inact, 2193M Wired. A 100% ZFS Root-on-ZFS configuration is highly recommended on the FreeBSD platform. Inact should be 10MB at most, leaving the rest for ZFS. Both ZFSguru and mfsBSD allow for a 100% Root-on-ZFS installation.
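
    Put together, the /boot/loader.conf additions would look something like this (just a sketch using the values above; tune to taste):

    Code:
    ahci_load="YES"                             # new AHCI driver; disks become /dev/adaX
    vfs.zfs.txg.synctime=5
    vfs.zfs.txg.timeout=10
    vfs.zfs.txg.write_limit_override=67108864   # 64MiB transaction groups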
     
  8. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Changed, but then...

    That royally FUBAR'd things: it can no longer mount root on /dev/mirror/gm0s1a, and pressing ? to list valid devices shows NO valid devices to boot from! This happens even if I escape to the loader prompt and do "unload ahci". Help...!!!

    This I didn't know... UFS restricts ZFS to only 2GB? WTF? Do you have any links which explain this situation and why it occurs?
     
  9. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    You can try booting without AHCI driver again:
    unload ahci
    disable-module ahci
    boot

    But are you sure you load the geom_mirror kernel module?
    geom_mirror_load="YES"

    Because when using the mirror device (in your /etc/fstab) it should have worked.

    When using UFS, FreeBSD will use Inactive memory to cache any file you read. So with 1000 gigabytes of RAM, after reading a 500GB file your Inact will be a little over 500GB. When using both UFS and ZFS, UFS will steal memory away from ZFS, and FreeBSD lacks a way to properly share memory between the two. It's a well-known problem; for some background see:
    http://lists.freebsd.org/pipermail/freebsd-stable/2010-July/057688.html
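
    If you have to keep UFS around for a while, one partial workaround (not a real fix, and not as good as going 100% ZFS) is to pin the ARC size with loader tunables so UFS cannot squeeze ZFS out completely. A rough sketch for /boot/loader.conf; the values are only examples and depend on how much RAM you have:

    Code:
    vfs.zfs.arc_min="2048M"
    vfs.zfs.arc_max="5120M"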
     
  10. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I ended up booting a liveCD on a memory stick and edited loader.conf that way...

    It's definitely there...

    Code:
    #geom_raid3_load="YES"
    #geom_cache_load="YES"
    geom_mirror_load="YES"
    geom_stripe_load="YES"
    #geom_concat_load="YES"
    #geom_journal_load="YES"
    #kern.geom.stripe.maxmem=33554432
    #ahci_load="YES"
    vfs.zfs.txg.synctime=5
    vfs.zfs.txg.timeout=10
    vfs.zfs.txg.write_limit_override=67108864
    
    When it fails to boot, "?" to show the list of devices is completely empty... it doesn't even offer either of the gmirror members (ada0 or ada1)

    *sigh* well so much for my test rig to play with first... reinstalling from scratch with a root ZFS only file system is not going to be a small thing... in the meantime my NAS is down, and my old RAID3 array is gone. It's going to be a very long night.
     
  11. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    $@(@$...removing one SSD I managed to pull the lid off it. No screws, it's just held in by a couple of dimples and a warranty sticker!!!

    Now the BIOS is hanging again. FFS. What next? I'm starting to wish I'd never touched my NAS.
     
  12. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    Sorry you are having so much trouble with your NAS. One piece of advice: when fiddling with stupid computer problems, do not spend too much time on it. You'll only frustrate yourself. Why don't you do something else first and start again fresh when you've calmed down? In most cases I find that starting fresh after a rest helps me clear my mind and spot the obvious problems (or solutions) I overlooked the first time.

    And seriously, why don't you make it easy on yourself and try ZFSguru? Not wanting to advertise my own creation, but it does make installing Root-on-ZFS a breeze and pretty much handles all the nasty things you don't want to do yourself. In 10 minutes you have a working setup where you can start copying data, with none of the UFS problems or the geom_mirror boot issues and other crap that must be annoying the hell out of you.
     
  13. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I was looking at mfsbsd first, wrote the image to a USB stick then realised it's actually an ISO... then discovered it seems I have no blank CDs left to burn.

    So I'm left wondering how I'm supposed to boot it.

    This is the kind of shit that happens when I try to do things: little problems that bring everything to an abrupt halt. :)

    I'll look at zfsguru if I can't figure out a workaround.

    edit: seems zfsguru is a cd thing too, so I'm up the creek either way :)

    ...and my windows workstation is half dead because of the NAS... the start button freezes for a couple of minutes...

    Then... found a CD-RW from about 10 years ago still in its original packing. Gave it a try, but the write failed at the same point each time, even after a full erase. I figured it was kaput because of its age. So I went to the supermarket to get some new CD-Rs, but with one of those the write failed at exactly the same point. Turns out there's something wrong with my writer or computer. Anyway, successfully burned it on another comp and have booted mfsbsd. the fun is only just starting.
     
    Last edited: Sep 15, 2011
  14. sub.mesa

    sub.mesa Member

    Joined:
    Jun 23, 2010
    Messages:
    271
    Location:
    Europe
    Well if you do not want to burn a CD, you can try installing ZFSguru on a USB stick:

    1) download ZFSguru .iso
    2) download+install Virtualbox (latest version)
    3) create a VM for 64-bit FreeBSD (you need CPU hardware virtualization for this!)
    4) assign your USB stick to the VM (you may need the Oracle Extension Pack, I don't remember)
    5) boot the VM from the ZFSguru .iso
    6) you should see a screen with IP address, fire up your browser and browse to that address (http://10.0.0.50 for example)
    7) format the USB stick on the Disks page with GPT, giving it a name like USBBOOT
    8) create a pool on the USB stick on the Pools->Create page
    9) install Root-on-ZFS on the USB stick on the System->Install page
    10) shutdown the VM gracefully and plug the USB stick in your target system and boot from it

    Note that booting from a USB stick via Virtualbox never worked for me, and other systems may have problems with USB booting as well, so this isn't guaranteed to work. But if you don't want to buy blank CDs for now, this is what you can try. If it works, you can use the USB stick to install to another device, or just keep using the USB stick as your system/OS disk. Note that in that case you would want to perform the memory tuning again under System->Tuning! This is done automatically when you perform a Root-on-ZFS installation, but it is system dependent, so installing in Virtualbox will assign lower values than are suitable for your target system.
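
    If you would rather script the VM part (steps 2-5) than click through the GUI, something like this with VBoxManage should get a VM booted from the .iso (the VM name and ISO path are just examples; the USB passthrough itself is still easiest via the GUI with the Extension Pack):

    Code:
    VBoxManage createvm --name zfsguru --ostype FreeBSD_64 --register
    VBoxManage modifyvm zfsguru --memory 2048 --boot1 dvd
    VBoxManage storagectl zfsguru --name IDE --add ide
    VBoxManage storageattach zfsguru --storagectl IDE --port 0 --device 0 --type dvddrive --medium ZFSguru.iso
    VBoxManage startvm zfsguru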

    I have to say though; you don't make it easy for yourself! Going the hard route often isn't the most joyful. :)
     
    Last edited: Sep 15, 2011
  15. TrenShadow

    TrenShadow Member

    Joined:
    Sep 29, 2003
    Messages:
    78
    This almost sounds like the sector alignment type issues we normally associate with 4K drives; maybe the same problem also occurs with native 512B sectors (assuming your Hitachi drives really are native 512B drives?).

    A 128K recordsize is 128 * 1024 = 131072 bytes; spread over 3 data drives that's 43690.667 bytes written per drive, which is obviously not a multiple of the 512B sector size. 7200rpm = 120 revs per second = 1 rotation every 8.333 milliseconds. If each record ends up costing a full rotation, 128K per 8.333 milliseconds works out to roughly 15 MBps.

    I wouldn't have thought that native 512B drives should be doing a read/modify/write cycle, but you never know.

    As a couple of tests, try making your 4 drives into a raidz2 - 2 data and 2 parity drives. Try adding an extra drive for a 5 disk raidz1 (any old disk you have handy will work). Try just 3 drives in raidz1. If this theory is correct, all of the above should see vast improvement.

    Assuming the above holds true, you might find that 4 disks in a raidz1 could work if you set the zfs filesystem recordsize property to 96K:
    Code:
    $ zpool create tank raidz1 disk1 disk2 disk3 disk4
    $ zfs set recordsize=96K tank
    Then, every other zfs filesystem you create on the pool would also need that property set (or inherited)
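
    If you want to verify the sector size and what the pool was built with, something along these lines should show it on FreeBSD (device name is just an example):

    Code:
    diskinfo -v ada2     # sector size the drive reports
    zdb | grep ashift    # ashift=9 means 512B sectors, ashift=12 means 4K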
     
    Last edited: Sep 15, 2011
  16. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I've got mfsbsd running, and have created a native ZFS root file system on my 2 SSDs, mirrored over 2 GPT partitions (one on each drive), with the remainder of the space allocated to another partition which is used for the L2ARC cache.

    ZFS root file systems and GPT partitions are all new stuff for a dinosaur like me; I'm used to the old-school UFS file system and MBR partitions... I also learned that you can't kill GPT partitions by zeroing out the first few sectors of a disk. That works for MBR, but with GPT they just keep coming back! :lol: I ended up zero-filling one of the disks just to get rid of the remnant partitions...

    Wish I'd done all this fiddling a few days ago before I attempted to put it into production, because apart from the basics of ZFS I've pretty much started again from scratch tonight. Not done yet, by a long shot...
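
    (Apparently GPT keeps a backup copy of the partition table at the end of the disk, which is why zeroing the first few sectors isn't enough. Something like this would have been a lot quicker than zero-filling the whole disk, if I'm reading the gpart man page right:)

    Code:
    gpart destroy -F ada1    # disk name is just an example; wipes both the primary and backup GPT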
     
    Last edited: Sep 16, 2011
  17. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    I've just realised (facepalm/head in hands) that, since the source array I'm copying from is a UFS file system, I'm going to be stuck with this low RAM/low speed situation until I can copy all the data. All this work to upgrade to a whiz-bang ZFS-only root file system, and the copy is still plodding along at 25Mbytes/sec.

    I still think there's something else amiss because it's writing at that low speed before UFS ends up swiping a huge chunk of the memory for its own cache. ZFS write speed is <30Mbytes/sec from the very start, even after a boot or an unmount+mount (which frees up the Inact RAM)

    Hopefully just some weird quirk copying from RAID0+geli+UFS to ZFS.

    edit: a ZFS-ZFS copy (same pool, doing it to force compression) is peaking at 60Mbytes/sec, so there's hope yet.

    I still don't understand why it's so bursty: a flurry of activity, then nothing for a couple of seconds. You would think a same-system copy would have the activity light virtually stuck on, as the copy would be I/O bound, either reading or writing.
     
    Last edited: Sep 16, 2011
  18. flain

    flain Member

    Joined:
    Oct 5, 2005
    Messages:
    2,950
    Location:
    Sydney
    What do you get from an iostat -en?

    Something seems not right. Also, what does cat /var/adm/messages show?
     
  19. OP
    OP
    rowan194

    rowan194 Member

    Joined:
    Jan 5, 2009
    Messages:
    2,046
    Remember it's FreeBSD, no -en option for iostat or /var/adm here. :)

    Average speed seems to have picked up slightly since I redid everything from 8.2-RELEASE/ZFS v15 to 8.2-RELEASE-p2/ZFS v28 with an all ZFS file system... but it's still sllllooowwwww.

    Code:
                   capacity     operations    bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    zfs01       2.10T  5.15T      0    203      0  19.8M
    zfs01       2.10T  5.15T      0    242      0  23.9M
    zfs01       2.10T  5.15T      0    237      0  23.3M
    zfs01       2.11T  5.14T      0    229      0  23.2M
    
    Here's the output of a couple of minutes' worth of iostat -w 60 -x (extended stats, averaged over a 60 second window):

    Code:
                            extended device statistics
    device     r/s   w/s    kr/s    kw/s wait svc_t  %b
    ada0       0.0  96.1     3.7 11915.1    0   8.3   9
    ada1       0.0  98.8     1.4 12213.8    0   8.1   9
    ada2       0.0 114.4     0.0  7832.2    0   7.1   9
    ada3       0.0 112.7     0.0  7828.9    0   6.9   8
    ada4       0.0 114.6     0.0  7832.3    0   6.9   9
    ada5       0.0 113.4     0.0  7829.3    0   7.3   9
    ada6     237.8   0.0 11922.5     0.0    0   0.3   7
    ada7     250.4   0.0 11921.2     0.0    0   0.3   7
    
                            extended device statistics
    device     r/s   w/s    kr/s    kw/s wait svc_t  %b
    ada0       0.1  93.5     3.6 11354.8    0   7.6   8
    ada1       0.1 100.0     3.0 12030.2    0   7.6   9
    ada2       0.0 117.7     0.0  8076.6   10   6.8   9
    ada3       0.0 116.2     0.0  8070.8   10   6.7   9
    ada4       0.0 116.8     0.0  8016.2    9   7.0   9
    ada5       0.0 116.0     0.0  8061.4   10   7.2   9
    ada6     273.3   0.0 11887.5     0.0    0   0.3   7
    ada7     244.0   0.0 11886.2     0.0    0   0.3   7
    
    ada0, 1: 2 x 32GB SSD. boot, L2ARC
    ada2, 3, 4, 5: ZFS 4 x 2TB RAIDZ (destination array)
    ada6, 7: gstripe+geli+UFS 2 x 2TB encrypted RAID0 (source array)

    Drives are at <10% average util... CPU is 50%+ idle... so what on earth is causing all this slowdown?
     
  20. flain

    flain Member

    Joined:
    Oct 5, 2005
    Messages:
    2,950
    Location:
    Sydney
    Doh! silly me :p
     
