Help! FreeNAS 8 ZFS status DEGRADED!

Discussion in 'Storage & Backup' started by W0MB13, Jun 25, 2012.

  1. W0MB13

    W0MB13 Member

    Joined:
    Apr 2, 2006
    Messages:
    1,207
    Location:
    Sydney
    My original build thread here
    http://forums.overclockers.com.au/showthread.php?t=957083&

    6x2TB Hitachi drives in RAID-Z2


    I fired up my ZFS box to back up some files the other night, and I could not get it to go :o

    I did some troubleshooting, and it seems a possibly dodgy video card was causing the issue. For some reason, with this particular motherboard, if there is no video card present the PC will not actually boot properly in the background past a certain point. I took the card out and put it back in. Voila! Working again. I was about 8 BIOS versions behind, so I decided I'd do an upgrade. This seemed to fix problems I had with the USB auto-booting, so all good as far as I'm concerned.

    Now...

    I'm not sure if this is just PURE COINCIDENCE... but once I booted into FreeNAS (embedded), I was greeted with a nasty ZFS pool = DEGRADED..

    It looks like one of the drives is "unavailable". I had a look through the serial numbers that FreeNAS could see, and cancelled them all out until I was left with one missing drive/serial number. I've physically marked the drive with a paint marker so I know which one it is.

    It seems a bit odd to me that this has just suddenly happened. I thought for a moment it might be a problem with one of the onboard SATA controllers, however there is another drive on the same controller that is working fine. (6 ports in total: 4 on one controller, 2 on another.) I tried bringing the unavailable drive online using "zpool online mypoolsname weirdlongnumbershowingup", but that doesn't seem to work.

    Anyone have any ideas? Any further steps/checks I can do? Does it sound like the drive is simply dead and needs to be swapped out? Or perhaps something is bugging out?
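For anyone following along, here is a rough sketch of the checks described above. The pool name `tank` and the GUID are placeholders (assumptions, not values from this thread); substitute the pool name and the long number that `zpool status` shows for the missing device. The `run` helper is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only. POOL and GUID are placeholders, not values from this thread.
POOL="${POOL:-tank}"
GUID="${GUID:-1234567890123456789}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

check_missing_disk() {
    run zpool status -v "$POOL"        # shows which device is UNAVAIL
    run camcontrol devlist             # FreeBSD: lists the disks the OS can see
    run zpool online "$POOL" "$GUID"   # can only work if the disk is detected
}

check_missing_disk
```

If the disk does not show up in the `camcontrol devlist` output at all, `zpool online` has nothing to bring online, which would match the behaviour described in the post.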

    Your opinions and help greatly appreciated :lol:


    Last edited: Jun 25, 2012
  2. HobartTas

    HobartTas Member

    Joined:
    Jun 22, 2006
    Messages:
    603
    Greetings

    I doubt it happened at the same time. I'm guessing it happened a while ago and you only just got around to noticing. I had a similar problem: Solaris 11 didn't notify me that the pool was degraded either, and I only noticed when I went to do some maintenance (scrubs, snapshots etc).

    In my case I concluded that I either had a dodgy cable or it wasn't seated properly, and after a sufficient number of I/O errors ZFS offlined the drive.

    There are just 2 possibilities:

    (1) The hard drive is actually dead or useless.

    Does it no longer appear in the BIOS?

    What happens when you swap SATA ports?

    What happens when you swap SATA cables?

    What happens if you attach the drive to a different PC?

    What happens if you put it into an external USB case? Does the PC recognise it's there? Obviously it won't recognise the partition, but Disk Management in Administrative Tools should still pick up that the drive exists.

    Can you download a bootable diagnostic diskette/CD-ROM from the manufacturer's website, or something similar, to do a NON-DESTRUCTIVE test of the hard drive? E.g. I use ESTOOL for my Samsung drives. Does it pass or fail?

    (2) If it's physically OK hardware-wise, then it's probably the scenario I outlined above.

    In my case I would zpool clear the error. Does your version allow you to do the same?

    http://docs.oracle.com/cd/E19253-01/819-5461/gazge/index.html

    After I did this, with some more usage the drive would have errors and get offlined again. I can't remember exactly whether it was "offline" or "unavailable"; all I know is that I would clear the error and it would recur within a short period of time.

    In your case, if you get it back online, put it through its paces and do something like a scrub. Fortunately ZFS keeps track of when each block was written (by transaction group), so resilvering is very quick: all it needs to do is write the changes made since the drive went offline, which I presume was not all that long ago.
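The clear-then-scrub routine above could look roughly like this. The pool name `tank` is a placeholder, and `run` is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set, so nothing touches a real pool by accident.

```shell
#!/bin/sh
# Sketch only. POOL is a placeholder pool name.
POOL="${POOL:-tank}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

clear_and_scrub() {
    run zpool clear "$POOL"       # reset the error counts on the pool's devices
    run zpool scrub "$POOL"       # read back and verify every block in the pool
    run zpool status -v "$POOL"   # check scrub progress and any new errors
}

clear_and_scrub
```

If the error counters in `zpool status` start climbing again during the scrub, that points back at the cable, port or drive rather than a one-off glitch.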

    If it does go offline again then:

    (a) Try a different cable and make sure you've inserted it properly. If this doesn't work then

    (b) Export the pool, change the ports around and then re-import the pool. It will re-detect the drives in their new positions.

    (c) Does the same drive go offline again with a different port and a different cable? If it does, then it's more likely that it's physically faulty. If it's now a different drive, then it's most likely the port (or another dodgy cable).
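Step (b) might be sketched like this. The pool name `tank` is a placeholder and `run` is a hypothetical dry-run helper. FreeNAS normally expects volumes to be managed through its web GUI, so treat the command-line form as an illustration of what happens underneath rather than the recommended procedure on that appliance.

```shell
#!/bin/sh
# Sketch only. POOL is a placeholder pool name.
POOL="${POOL:-tank}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

reimport_pool() {
    run zpool export "$POOL"      # cleanly detach the pool from the system
    # Power down here, swap the cables/ports around, then boot again.
    run zpool import              # with no arguments: list pools available for import
    run zpool import "$POOL"      # disks are found by their on-disk labels, not by port
    run zpool status -v "$POOL"   # see which device, if any, is still unavailable
}

reimport_pool
```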

    In my case the problem did not recur, so I concluded I hadn't put the cable on the connector properly. I also had an additional problem at the time in that, for whatever reason, ZFS took a second drive offline as well (maybe it could have been dropping bits on the transfer and was spoofing the second drive! Who knows for sure what the hell was going on?)

    It's not a fun feeling to initially think you have two dead hard drives in your RAID-Z2 array and that you're only one more drive away from complete disaster.

    Anyway, removing the first problem drive from the system meant that the second one resumed working normally, so once I fixed the problem with the first drive as described above, everything was working again.

    Now that the quality of drives is a bit more suspect, given they can have warranties as short as a year and there's not a lot to pick from, I'm thinking the next array I build will most likely be a RAID-Z3 with cheaper (green/5400 RPM) drives rather than a RAID-Z2 with the more expensive enterprise drives. If I was going to get a 24 bay case and completely fill it with drives, I'd rather do one big RAID-Z4 or RAID-Z5 array if such a thing existed.

    Cheers
     
  3. OP
    OP
    W0MB13

    W0MB13 Member

    Joined:
    Apr 2, 2006
    Messages:
    1,207
    Location:
    Sydney
    None of the SATA cabling has been touched since the initial install, so it seems a bit odd if it is a dud cable. I've never really experienced such a thing before though.

    I'll try a different cable and also the ZFS error clearing tonight, and see how I go. It seems unlikely to me that the drive is actually dead, as the pool was totally fine until last night, when this one drive showed up as unavailable. Note that I don't run this box often; it goes online for maybe an hour once every two months. Could just be wishful thinking though.
     
  4. davros123

    davros123 Member

    Joined:
    Jun 18, 2008
    Messages:
    2,837
    Not sure why it's happened...

    However, you might also try a zpool export and a zpool import to see if that will bring it back online... perhaps with a reboot in between export and import.

    I get some weird thing on my ESXi server with my LSI card... must look into it some time... but I assume it's a dud port/cable/drive...
     
  5. BBITS

    BBITS Member

    Joined:
    Oct 18, 2007
    Messages:
    914
    Location:
    Brisbane Southside
    You say you changed a video card. Perhaps you nudged a cable?
     
  6. OP
    OP
    W0MB13

    W0MB13 Member

    Joined:
    Apr 2, 2006
    Messages:
    1,207
    Location:
    Sydney
    Thanks for all the suggestions guys.

    Oddly enough (to me at least) the drive is totally dead. It isn't seen in the BIOS by the server. I put it into my main PC and the onboard SATA controller sits there for a very brief moment and does not detect the drive.

    RMA time I guess :Paranoid:

    Having only dealt with Samsung HDD RMA in the past (who were EXCELLENT/SO FAST), this could be interesting.

    My question now is... I've taken this drive out. When I put a new drive in the server in its place, will FreeNAS/ZFS by design start rebuilding automatically? Or does the replace command somehow still apply? I assumed it only would in hot-swap situations.
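A hedged sketch of what a manual replacement could look like. As far as the zpool man page goes, the pool's `autoreplace` property defaults to off, in which case a rebuild has to be started by hand with `zpool replace`, hot-swap or not. The pool name `tank`, the old device's GUID and the new device name `ada3` are all placeholders, and `run` is a hypothetical dry-run helper. I believe FreeNAS 8 also offers disk replacement through its web GUI, which may be the safer route on that appliance.

```shell
#!/bin/sh
# Sketch only. POOL, OLD and NEW are placeholders, not values from this thread.
POOL="${POOL:-tank}"
OLD="${OLD:-1234567890123456789}"   # GUID of the missing disk, from zpool status
NEW="${NEW:-ada3}"                  # device name of the replacement disk

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

replace_disk() {
    run zpool get autoreplace "$POOL"         # "off" means replacement is manual
    run zpool replace "$POOL" "$OLD" "$NEW"   # starts the resilver onto the new disk
    run zpool status -v "$POOL"               # shows resilver progress
}

replace_disk
```

The pool stays DEGRADED while the resilver runs and should return to ONLINE once it completes without errors.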
     
  7. sreg0r

    sreg0r Member

    Joined:
    Jul 9, 2001
    Messages:
    1,146
    Location:
    Melbourne
