random hard reboots

Discussion in 'Troubleshooting Help' started by dogthinker, Aug 18, 2017.

  1. dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    Spec:
    i7-4790K, GB Z97X-SOC, 32GB DDR3, GTX 1070, 3xHDD, 2xSSD, sound card (Creative Recon3D), wifi card, USB kb/mouse, Noctua D14 cooler, a couple of case fans, Antec EDGE 750W PSU.

    Backhistory:
    This system has been somewhat unreliable since I built it ~2 years ago. The 4790K was overheating in a fraction of a second until I disabled turbo boost, dialed down the processor current limit from 1000A to 105A, and set core voltage to 1.105V (following suggestions in an Intel forum thread). Since doing that it's been 'effectively' stable: I've done plenty of machine learning and plenty of gaming, even triple screens @ 1440p, although it still wouldn't pass a Prime95 stress test.

    Fast forward to today...

    The system randomly resets. No error. There's nothing interesting in the system logs (just the 'oh hey, your system didn't shut down properly last time' logs). It seems pretty obvious that it's a hardware issue of some form, rather than a software crash. It almost (not quite) exclusively happens when the system is under heavy load.

    Examples:
    - DX:HR kills it in single digit minutes. A few months back I could play it indefinitely without issues.
    - Aven Colony dies in minutes on max settings, but can last over an hour on minimal settings. Just 2 days ago it was fine to play indefinitely on max. No software changes between then and the problem starting.
    - Low use of the computer (browsing, spreadsheets, etc.) works 100% fine.

    Gaming at low settings instead of ultra seems to reduce, but not eliminate, the problem.

    Halving the CPU multiplier from 40x to 20x didn't help, so despite the history, I'm leaning towards thinking it's probably not the CPU.

    A cat bumped heavily into all the cabling at the back very recently, but I think that's a red herring. I pulled the 1070, took a good look, and reseated it just in case. No effect. Visually checked the power connections to the mobo, couldn't see anything that looked loose.

    I've been watching the temps. CPU and GPU look quite reasonable right up to the moment of reset: logging at 200ms intervals, there's no nasty spike logged before the crash, and neither GPU nor CPU goes much over ~60°C, which is quite tolerable.
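A rough sketch of that kind of check, for anyone wanting to run it over their own logs: look only at the last few seconds before the final sample and flag any sudden jump. The CSV layout (columns `time_s`, `cpu_c`, `gpu_c`), the 5 second window and the 10°C jump threshold are all assumptions made up for illustration, not the output format of any particular logging tool.

```python
import csv
import io


def load_log(text):
    """Parse CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: float(v) for k, v in row.items()} for row in reader]


def find_spikes(rows, window_s=5.0, jump_c=10.0):
    """Return (time, column, delta) for each sample-to-sample rise of at
    least jump_c degrees within the final window_s seconds of the log."""
    if not rows:
        return []
    last_t = rows[-1]["time_s"]
    tail = [r for r in rows if last_t - r["time_s"] <= window_s]
    spikes = []
    for prev, cur in zip(tail, tail[1:]):
        for col in ("cpu_c", "gpu_c"):  # assumed column names
            delta = cur[col] - prev[col]
            if delta >= jump_c:
                spikes.append((cur["time_s"], col, delta))
    return spikes


# Made-up sample at 200ms intervals, ending where the log stops.
SAMPLE = """time_s,cpu_c,gpu_c
0.0,58,59
0.2,59,60
0.4,59,60
0.6,60,61
"""

rows = load_log(SAMPLE)
print(find_spikes(rows))  # prints [] for this sample: no jump of 10°C or more
```

An empty result only says the sensors didn't report a spike at the logging interval; it can't rule out something hotter than the sensor location.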

    It seems quite hard not to blame the PSU, but without a suitable spare, I'm not quite sure how to go about verifying this. The PSU *ought* to be fine - it's a premium-ish model, 2 years into a 5 year warranty. Power calculators suggest 500W is enough to drive this system, and it's rated at 750W, which is a pretty hefty margin. 6 months ago I had a much more power hungry 980Ti in this system without issues.
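The margin arithmetic, spelled out as a small sketch using the two figures above (500W estimated, 750W rated). The function name is just for illustration. It only covers the sustained estimate from a calculator, so it says nothing about short transient spikes or a unit that isn't delivering its rating.

```python
def psu_headroom(rated_w, estimated_load_w):
    """Return (spare watts, load as a fraction of the rated output)."""
    spare_w = rated_w - estimated_load_w
    load_fraction = estimated_load_w / rated_w
    return spare_w, load_fraction


spare, fraction = psu_headroom(rated_w=750, estimated_load_w=500)
print(f"{spare} W spare, running at {fraction:.0%} of rated output")
# prints: 250 W spare, running at 67% of rated output
```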

    It's been a long time since I troubleshot hardware... Any advice, before I have to descend into the pit of semi-randomly buying replacement parts until the problem vanishes...?
     
  2. enzo_450

    enzo_450 Member

    Joined:
    Jun 13, 2007
    Messages:
    2,560
    Location:
    Brisbane Southside
  3. UnCoNoob

    UnCoNoob Member

    Joined:
    Jun 10, 2016
    Messages:
    107
    I would start by updating the BIOS if any new versions are out. Maybe also run a memtest to test the memory. Check the hard drive SMART data if available; if not, do a hard drive surface test.

    If all that passes and the issue is still there, maybe try to find someone with a PSU that they can lend you for testing and go from there.

    Kinda does sound like a power issue, but it could be anything.
     
  4. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    On latest BIOS, so nothing I can try there. I'd say it couldn't possibly be that, because it's been fine for well over a year on this BIOS already, but that doesn't explain your experience. :wired:

    Interesting! I do have my switch set to 'instant off', I'll change it to '5 seconds'. This is probably not it though, since it (almost) never resets except under load. I leave the system running 24/7. (Resets overnight are rare enough, with multiple months between incidents, that I'm inclined to put them down to mains power brownouts.)

    Running latest BIOS already (version F7, from way back in 2015.)

    To be honest I wouldn't be surprised if this system failed a memtest, given the history I had with the CPU. However, I would expect a memory fault (or HDD bad sectors) to result in a software crash (i.e. bluescreen, memory dump, stuff in the logs, etc.) at least some of the time, rather than nothing but hard resets. Do you have experience to the contrary?
     
    Last edited: Aug 18, 2017
  5. jpw007

    jpw007 Member

    Joined:
    Nov 23, 2008
    Messages:
    2,858
    Location:
    Melbourne
    My rig kept restarting under the recent Nvidia drivers. I ended up doing a full reinstall etc. only to get the same issue again. Then I remembered I'd recently updated drivers, so I installed the ones I'd used previously and it's been stable as a rock again.

    I wouldn't rule that out if you've changed them recently :thumbdn:

    I nearly went and bought a larger PSU too as a result (since this one was 6 months old, so I thought the load exceeded the PSU's max wattage)
     
    Last edited: Aug 18, 2017
  6. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    My problem started well after my last driver update (May), but trying a different driver is nice and easy, so I'll try updating to the current one.
     
  7. Bold Eagle

    Bold Eagle Member

    Joined:
    Jun 28, 2008
    Messages:
    6,733
    Location:
    Brisbane
    In the BIOS, look at Hardware Monitor (or Health): what is the CPU temp at?

    What is your specific OS version - open the run box and enter Winver.

    Consider running memtest86+ via a CD/DVD pre-OS - you can DL it as part of UBCD and then boot the disc:
    http://www.ultimatebootcd.com/

    When in the OS have a look at Reliability Monitor flags:
    https://www.howtogeek.com/166911/re...windows-troubleshooting-tool-you-arent-using/

    When you note "system logs", is this via Event Viewer? The flag that states "your system didn't shut down properly last time" is logged after the fact; normally the issue has occurred just before that, and those earlier flags are the ones of interest. For example, look at the picture and comments I noted:
    http://forums.overclockers.com.au/showthread.php?t=1210045
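One possible way to find the timestamps worth looking around, sketched in Python: build a `wevtutil` query for the "unexpected shutdown" entries in the System log, then inspect whatever was logged just before each of them in Event Viewer. The event IDs used here (41 from Kernel-Power and 6008 from EventLog) are my assumption about which entries correspond to the improper-shutdown flag; they are not taken from this thread.

```python
import subprocess
import sys


def build_query(event_ids=(41, 6008), count=10, log="System"):
    """Build a wevtutil command that lists the newest matching events as text.

    event_ids defaults to 41 and 6008, which is an assumption about which
    entries mark an unexpected shutdown on the machine in question.
    """
    id_filter = " or ".join(f"EventID={i}" for i in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return ["wevtutil", "qe", log, f"/q:{xpath}", f"/c:{count}", "/rd:true", "/f:text"]


if __name__ == "__main__":
    cmd = build_query()
    if sys.platform == "win32":
        # Run the query and show the newest events first.
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)
    else:
        # Not on Windows: just show the command that would be run.
        print(" ".join(cmd))
```

This only locates the shutdown markers themselves; the entries of interest are the ones immediately before each timestamp it reports.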

    Also turn off Fast Startup as it is a significant cause of stability issues for many users:
    https://www.tenforums.com/tutorials/4189-turn-off-fast-startup-windows-10-a.html

    I would also be running DDU in Safe Mode and then reinstalling the latest nVidia Drivers "385.28-desktop-win10-64bit-international-whql".
     
    Last edited: Aug 19, 2017
  8. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    Temps are fine, as mentioned I've been watching/logging those and there's nothing out of the ordinary there.

    Win 10, 1703, 15063.540

    I'll do this if the problem persists (see below), although I wouldn't expect to only see problems when gaming if this was the case.

    Yeah, I used to do this sort of thing as a profession (dialing back most of a decade...) The Reliability Monitor post-dates that though, and I do like it! It's nice to confirm I wasn't going crazy (the reliability score did indeed go from 9-10, off a cliff to basically 0-1, the day before I posted here).

    Yep, long ago.

    As jpw007 also pointed out, it could be the Nvidia driver. Seems to be it too, things have been all good since I updated it.

    That makes me feel pretty stupid :lol:

    It used to be, driver problems = nice crash dump to analyse, thanks to the HAL, so I didn't even consider looking at the driver in this case.

    I'm a bit bemused why I was able to game (e.g. Aven Colony) for hours the day before without issues, then suddenly only for minutes at a time, without doing anything at all to the system in between, but everything does seem fine since updating the display driver. I guess I'll just have to wait and see.

    Thanks for your help everyone. :thumbup:
     
  9. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    Well, it seems it probably wasn't driver problems... It's just really intermittent. My perception is that everything is fine for days/weeks, but then it'll start happening frequently (minutes!) until I walk away and leave it for a day. Then everything'll be fine again for a while.

    e.g. play D:OS 2 all night no problems... Then suddenly can't play it for more than a few minutes at a time. Wait a day, fine again, for a while...

    I tried stress testing the GPU with Furmark, immediately after two resets in 10 minutes, and it seems stable, happily puttering along at 100% utilization & power consumption. That's not the same as playing a game, of course, since there's basically no data flowing over the bus in that benchmark.

    I'll test the memory next, but I'll have to dig around my desk for a working flash drive first :)

    EDIT: aaaaand spoke too soon. Hard reset while running Furmark, about 20 minutes in (GPU temps stable at ~70°C). Not sure what conclusion I can draw from that... PSU/GPU/mobo/CPU/memory/drivers all still feel like plausible causes to me.
     
    Last edited: Sep 22, 2017
  10. Bold Eagle

    Bold Eagle Member

    Joined:
    Jun 28, 2008
    Messages:
    6,733
    Location:
    Brisbane
    Careful with Furmark - I killed a card using that.

    What are the specifics of your GTX1070 (make/model)?

    Remember those units had some stability issues at release and many had to flash the BIOS...

    Use GPU-Z to obtain your BIOS revision and to do a Lookup of your card, and then assess (you have to be very specific with your Device ID, etc.) whether there are any later BIOSes.
     
    Last edited: Sep 22, 2017
  11. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    EVGA GeForce GTX 1070 SC GAMING ACX 3.0 Black Edition, 08G-P4-5173-KR, 8GB GDDR5

    BIOS version is 86.04.50.00.72, which is already current, as far as I can tell.

    I might try disabling the factory overclock for a while, it's not like I need it.
     
  12. Bold Eagle

    Bold Eagle Member

    Joined:
    Jun 28, 2008
    Messages:
    6,733
    Location:
    Brisbane
    Pop over to this dedicated thread if you get bored:
    http://www.overclock.net/t/1601546/official-nvidia-gtx-1070-owners-club

    There may be something meaningful there.

    EVGA GeForce GTX 1070 BIOS Update - 86.04.50.00.70/86.04.50.01.70 (10/17/2016)
    https://forums.evga.com/EVGA-GeForce-GTX-1070-BIOS-Update-v8604500070-m2565056.aspx

    Here are all the BIOS revisions:
    https://www.techpowerup.com/vgabios...rface=PCI-E&memType=GDDR5&memSize=8192&since=

    Not sure which of those may be the Black Edition though.

    Also be mindful that the GPU BIOS issue is only an educated guess, but there were quite a few people having stability issues with the GTX1070 (mine was fine though, a GAINWARD GTX1070).
     
    Last edited: Sep 22, 2017
  13. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    I'll have a look at that later.

    System is currently rebooting consistently the moment I close furmark (I'm not inclined to blame furmark for that, this problem started way before it even crossed my mind to bench or stress test it.) This is even with both the GPU clock and its memory underclocked by 25%... Smells like an RMA to me. I'll see if I can persuade a friend to let me test it in their system first though.

    EDIT: Thanks for the links. I think my BIOS already has a higher version number than the ones listed there. I don't think my problem is the same problem (temps appear to be fine, at least at the sensor...)

    EDIT2: OK. I love it when I can finally make a problem reproducible... Use Afterburner to set a power limit of 80% on the GPU, then I can start and stop Furmark just fine. Set it to 90%+, and it consistently reboots the system when I halt it... It feels quite hard to blame anything other than the GPU or PSU for that...
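With a known-good setting (80%) and a known-bad one (90%), the threshold in between could be narrowed by bisection. A minimal sketch, assuming the fault is consistent at a given limit and only gets worse as the limit rises, which may not hold for an intermittent problem like this. `reboots_at` is a hypothetical callback standing in for the manual step of setting the limit, running the start/stop test and noting whether the system reset.

```python
def narrow_threshold(reboots_at, ok_limit=80, bad_limit=90):
    """Bisect integer power limits (percent) between a known-good and a
    known-bad setting. Returns the adjacent (last good, first bad) pair.

    reboots_at(limit) is a hypothetical callback: True if the system
    rebooted when tested at that power limit.
    """
    while bad_limit - ok_limit > 1:
        mid = (ok_limit + bad_limit) // 2
        if reboots_at(mid):
            bad_limit = mid
        else:
            ok_limit = mid
    return ok_limit, bad_limit


# Stand-in for the manual test: pretend the fault appears from 87% upwards.
def simulated(limit):
    return limit >= 87


print(narrow_threshold(simulated))  # prints (86, 87)
```

Between 80% and 90% that is at most four manual test runs rather than nine.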
     
    Last edited: Sep 22, 2017
  14. Bold Eagle

    Bold Eagle Member

    Joined:
    Jun 28, 2008
    Messages:
    6,733
    Location:
    Brisbane
    OCCT will give the PSU a stress test and you can look at the voltage plots even if the system crashes I believe.

    Late edit: if you can get that card into someone else's system, that will isolate the GPU.
     
    Last edited: Sep 23, 2017
  15. Bold Eagle

    Bold Eagle Member

    Joined:
    Jun 28, 2008
    Messages:
    6,733
    Location:
    Brisbane
    any updates on this?

    did you test the card in another PC?
     
  16. Benergy

    Benergy New Member

    Joined:
    Sep 25, 2017
    Messages:
    18
    Location:
    Queensland
    Good Afternoon, dogthinker.

    If your computer took a hefty knock of the feline variety, there's a chance one of the spring/tension screws which hold the GPU HSF tightly in position has given way. I'd check they're all still tight and evenly tensioned. Although your temps look good, that's only at the sensor - as you've correctly noted. Some of these spring/tension screws getting around on graphics cards are pretty poor quality (They're not feline-moving-at-full-noise-collision-spec).

    Ben
     
  17. OP
    dogthinker

    dogthinker Member

    Joined:
    May 5, 2006
    Messages:
    1,995
    Location:
    Sydney (Glebe)
    I went the other way and grabbed a replacement PSU to try out (I figured if that wasn't it, it's still a useful spare part to have around). I haven't had a chance to swap it into the system yet though - I've been away for a few days, and have some critical work to get out of the way first. But hopefully I'll get to try that today.

    Good call, I'll check that out while I have it opened up.
     
