Overclockers Australia Forums

Old 18th August 2017, 3:30 AM   #1
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default random hard reboots

Spec:
i7-4790k, GB Z97X-SOC, 32GB DDR3, GTX 1070, 3xHDD, 2xSSD, sound card (creative recon 3d), wifi card, usb kb/mouse, noctua d14 cooler, a couple of case fans, antec EDGE 750W PSU.

Backhistory:
This system has been somewhat unreliable since I built it ~2 years ago. The 4790k was overheating in a fraction of a second, until I disabled turbo boost, dialed down the processor current limit from 1000A to 105A, and set core voltage to 1.105V (following suggestions in an Intel forum thread). Since doing that it's been 'effectively' stable: I've done plenty of machine learning, and plenty of gaming, even triple screens @ 1440p, although it still wouldn't pass a prime95 stress test.

Fast forward to today...

The system randomly resets. No error. There's nothing interesting in the system logs (just the 'oh hey, your system didn't shut down properly last time' logs). It seems pretty obvious that it's a hardware issue of some form, rather than a software crash. It almost (not quite) exclusively happens when the system is under heavy load.

Examples:
- DX:HR kills it in single digit minutes. A few months back I could play it indefinitely without issues.
- Aven Colony dies in minutes on max settings, but can last over an hour on minimal settings. Just 2 days ago it was fine to play indefinitely on max. No software changes between then and the problem starting.
- Low use of the computer (browsing, spreadsheets, etc.) works 100% fine.

Gaming at low settings instead of ultra seems to reduce, but not eliminate, the problem.

Halving the CPU multiplier from 40x to 20x didn't help, so despite the history, I'm leaning towards thinking it's probably not the CPU.

A cat bumped heavily into all the cabling at the back very recently, but I think that's a red herring. I pulled the 1070, took a good look, and reseated it just in case. No effect. Visually checked the power connections to the mobo, couldn't see anything that looked loose.

I've been watching the temps. CPU and GPU look quite reasonable right up to the moment of reset (logging at 200ms intervals, there's no nasty spike before the crash, and neither GPU nor CPU goes much over ~60°C, which is quite tolerable).
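For anyone wanting to run the same "no spike before the crash" check on their own log, here is a minimal Python sketch. The column names (`cpu_c`, `gpu_c`), the sample rows, and the helper name are all made up for illustration; they are not taken from the actual log, and a real monitoring tool's export will probably need different column names.

```python
import csv
import io


def max_jump_before_end(rows, column, window=25):
    """Return the largest sample-to-sample rise in `column` over the last `window` rows."""
    values = [float(r[column]) for r in rows[-window:]]
    return max((b - a for a, b in zip(values, values[1:])), default=0.0)


# Hypothetical log excerpt: one row per 200 ms sample, ending at the reset.
SAMPLE = """time_ms,cpu_c,gpu_c
0,55,58
200,56,59
400,56,60
600,57,60
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for col in ("cpu_c", "gpu_c"):
    print(col, "largest rise:", max_jump_before_end(rows, col), "C")
```

With 200ms samples and the default window of 25 rows, this looks at roughly the last 5 seconds before the log stops. A flat result only rules out a spike the sensor could see; it says nothing about hot spots away from the sensor.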

It seems quite hard not to blame the PSU, but without a suitable spare, I'm not quite sure how to go about verifying this. The PSU *ought* to be fine - it's a premium-ish model, 2 years into a 5 year warranty. Power calculators suggest 500W is enough to drive this system, and it's rated at 750W, which is a pretty hefty margin. 6 months ago I had a much more power hungry 980Ti in this system without issues.
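The headroom arithmetic, as a rough Python sketch using only the two figures above (the 500W calculator estimate and the 750W rating). The function name is just for illustration.

```python
def headroom(psu_rated_w, estimated_load_w):
    """Return (spare watts, load as a fraction of the PSU rating)."""
    spare = psu_rated_w - estimated_load_w
    return spare, estimated_load_w / psu_rated_w


spare_w, load_fraction = headroom(750, 500)
print(f"spare: {spare_w} W, load: {load_fraction:.0%} of rating")
# prints "spare: 250 W, load: 67% of rating"
```

This is a steady-state estimate only. It doesn't capture brief load spikes or a unit that no longer delivers its rated output, so a comfortable number here doesn't clear the PSU by itself.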

It's been a long time since I troubleshot hardware... Any advice, before I have to descend into the pit of semi-randomly buying replacement parts until the problem vanishes...?
Old 18th August 2017, 7:23 AM   #2
enzo_450
Member
 
Join Date: Jun 2007
Location: Brisbane Southside
Posts: 2,557
Default

I've had two issues with 'random' restarts.

Suspected BIOS.

Dodgy power switch

Good luck.
__________________
||Z77 Extreme 4|i5 3570k|NH-D14|STRIX GTX970|16GB Ripjaws|120GB Intel 520, 2x2TB Seagates|Seasonic 860w Platinum|47" LG|P182||
Old 18th August 2017, 12:56 PM   #3
UnCoNoob
Member
 
Join Date: Jun 2016
Posts: 105
Default

I would start by updating the BIOS if any new versions are out. Maybe run a memtest also to test memory. Check hard drive SMART data if available, or if not, do a hard drive surface test.

If all that passes and the issue is still there, maybe try to find someone with a PSU that they can lend you for testing and go from there.

Kinda does sound like a power issue but it could be anything.
__________________
Gaming PC-Intel I7 6700k @ 4.6ghz - Asus Z170-AR - 16GB G.SKILL Ripjaws V 3200mhz C14 - Gigabyte R9 Fury- Custom Watercooled loop cooling CPU and GPU
Media PC- Intel I5 6500k - Asus Z170 PRO Gaming - 16GB Kingston HyperX Fury 2400mhz - Custom Watercooled
Old 18th August 2017, 1:47 PM   #4
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default

Quote:
Originally Posted by enzo_450 View Post
I've had two issues with 'random' restarts.

Suspected BIOS.
On latest BIOS, so nothing I can try there. I'd say it couldn't possibly be that, because it's been fine for well over a year on this BIOS already, but that doesn't explain your experience.

Quote:
Dodgy power switch
Interesting! I do have my switch set to 'instant off'; I'll change it to '5 seconds'. This is probably not it though, since it (almost) never resets except under load. I leave the system running 24/7. Resets overnight are rare enough (multiple months between incidents) that I'm inclined to put them down to mains power brownouts.

Quote:
Originally Posted by UnCoNoob View Post
I would start by updating the BIOS if any new versions are out. Maybe run a memtest also to test memory. Check hard drive SMART data if available, or if not, do a hard drive surface test.

If all that passes and the issue is still there, maybe try to find someone with a PSU that they can lend you for testing and go from there.

Kinda does sound like a power issue but it could be anything.
Running latest BIOS already (version F7, from way back in 2015.)

To be honest I wouldn't be surprised if this system failed a memtest, given the history I had with the CPU. However, I would expect a memory fault (or HDD bad sectors) to result in a software crash (i.e. bluescreen, memory dump, stuff in the logs, etc.) at least some of the time, rather than nothing but hard resets. Do you have experience to the contrary?

Last edited by dogthinker; 18th August 2017 at 1:56 PM.
Old 18th August 2017, 3:59 PM   #5
jpw007
Member
 
Join Date: Nov 2008
Location: Melbourne
Posts: 2,607
Default

My rig kept restarting under the recent Nvidia drivers. I ended up doing a full reinstall etc., only to get the same issue again. Then I remembered I'd recently updated drivers, so I installed the ones I'd used previously and it's been stable as a rock again.

I wouldn't rule that out if you've changed them recently.

I nearly went and bought a larger PSU too as a result (this one was 6 months old, so I thought the load exceeded the PSU's max wattage).
__________________
| Asrock z170 Extreme7+ | i7 6700k | 32GB Corsair Vengeance LPX | 2x Strix 1080 Ti OC | 1x 500GB 850 Evo M.2 + 2x 1TB MX300 M.2 | HX850i | EKWB Custom Loop | X34 |Caselabs S8
Custom White w/ Red Internals

Last edited by jpw007; 18th August 2017 at 4:09 PM.
Old 18th August 2017, 4:19 PM   #6
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default

Quote:
Originally Posted by jpw007 View Post
My rig kept restarting under the recent Nvidia drivers. I ended up doing a full reinstall etc., only to get the same issue again. Then I remembered I'd recently updated drivers, so I installed the ones I'd used previously and it's been stable as a rock again.

I wouldn't rule that out if you've changed them recently.

I nearly went and bought a larger PSU too as a result (this one was 6 months old, so I thought the load exceeded the PSU's max wattage).
My problem started well after my last driver update (May), but trying a different driver is nice and easy, so I'll try updating to the current one.
Old 19th August 2017, 11:18 AM   #7
Bold Eagle
Member
 
Join Date: Jun 2008
Location: Brisbane
Posts: 6,229
Default

In the BIOS, look at Hardware Monitor (or Health): what is the CPU temp at?

What is your specific OS version - open the run box and enter Winver.

Consider running memtest86+ via a CD/DVD pre-OS - you can DL it as part of UBCD and then boot the disc:
http://www.ultimatebootcd.com/

When in the OS have a look at Reliability Monitor flags:
https://www.howtogeek.com/166911/rel...u-arent-using/

When you say "system logs", is this via Event Viewer? The flag that states 'your system didn't shut down properly last time' is normally logged after the issue has occurred; the entries just before it are the flags of interest. For example, look at the picture and comments I noted:
http://forums.overclockers.com.au/sh....php?t=1210045
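If you'd rather pull those entries from a script than click through Event Viewer, here is a rough Python sketch. It assumes the unclean-shutdown entry in question is the usual Kernel-Power event with ID 41 in the System log, and it calls the built-in `wevtutil` tool. The function name and the default count are just for illustration.

```python
import subprocess
import sys


def kernel_power_query(count=5):
    """Build a wevtutil command listing the newest Kernel-Power (ID 41) events."""
    # Assumption: the "didn't shut down properly" entry is Kernel-Power event 41.
    xpath = "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]"
    return ["wevtutil", "qe", "System", f"/q:{xpath}", f"/c:{count}", "/rd:true", "/f:text"]


# Only attempt the query on Windows; elsewhere the command does not exist.
if __name__ == "__main__" and sys.platform == "win32":
    result = subprocess.run(kernel_power_query(), capture_output=True, text=True)
    print(result.stdout)
```

The timestamps it prints tell you where each reset happened, which is the point in the System and Application logs to look just before for anything logged ahead of the crash.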

Also turn off Fast Startup as it is a significant cause of stability issues for many users:
https://www.tenforums.com/tutorials/...dows-10-a.html

I would also be running DDU in Safe Mode and then reinstalling the latest nVidia Drivers "385.28-desktop-win10-64bit-international-whql".
__________________
PC3: Cardboard Box, peanut dispenser, highly conc caffine intravenous drip, little monkey w "electro El Shocko rectal probe", 3DMarkVantage=276818768

Last edited by Bold Eagle; 19th August 2017 at 11:22 AM.
Old 21st August 2017, 5:05 PM   #8
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default

Quote:
Originally Posted by Bold Eagle View Post
In the BIOS, look at Hardware Monitor (or Health): what is the CPU temp at?
Temps are fine, as mentioned I've been watching/logging those and there's nothing out of the ordinary there.

Quote:
What is your specific OS version - open the run box and enter Winver
Win 10, 1703, 15063.540

Quote:
Consider running memtest86+ via a CD/DVD pre-OS - you can DL it as part of UBCD and then boot the disc:
http://www.ultimatebootcd.com/
I'll do this if the problem persists (see below), although I wouldn't expect to only see problems when gaming if this was the case.

Quote:
When in the OS have a look at Reliability Monitor flags:
https://www.howtogeek.com/166911/rel...u-arent-using/

When you say "system logs", is this via Event Viewer? The flag that states 'your system didn't shut down properly last time' is normally logged after the issue has occurred; the entries just before it are the flags of interest. For example, look at the picture and comments I noted:
http://forums.overclockers.com.au/sh....php?t=1210045
Yeah, I used to do this sort of thing as a profession (dialing back most of a decade...). The Reliability Monitor post-dates that though, and I do like it! It's nice to confirm I wasn't going crazy (the reliability score did indeed go from 9-10, off a cliff to basically 0-1, the day before I posted here).

Quote:
Also turn off Fast Startup as it is a significant cause of stability issues for many users:
https://www.tenforums.com/tutorials/...dows-10-a.html
Yep, long ago.

Quote:
I would also be running DDU in Safe Mode and then reinstalling the latest nVidia Drivers "385.28-desktop-win10-64bit-international-whql".
As jpw007 also pointed out, it could be the nvidia driver. Seems to be it too; things have been all good since I updated it.

That makes me feel pretty stupid

It used to be, driver problems = nice crash dump to analyse, thanks to the HAL, so I didn't even consider looking at the driver in this case.

I'm a bit bemused why I was able to game (e.g. Aven Colony) for hours the day before without issues, then suddenly only for minutes at a time, without doing anything at all to the system in between, but everything does seem fine since updating the display driver. I guess I'll just have to wait and see.

Thanks for your help everyone.
Old 22nd September 2017, 10:26 AM   #9
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default

Well, it seems it probably wasn't driver problems... It's just really intermittent. My perception is that everything is fine for days/weeks, but then it'll start happening frequently (minutes!) until I walk away and leave it for a day. Then everything'll be fine again for a while.

e.g. play D:OS 2 all night no problems... Then suddenly can't play it for more than a few minutes at a time. Wait a day, fine again, for a while...

I tried stress testing the GPU with Furmark, immediately after two resets in 10 minutes, and it seems stable, happily puttering along at 100% utilization & power consumption. That's not the same as playing a game, of course, since there's basically no data flowing over the bus in that benchmark.

I'll test the memory next, but I'll have to dig around my desk for a working flash drive first

EDIT: aaaaand spoke too soon. Hard reset while running Furmark, about 20 minutes in (GPU temps stable at ~70°C). Not sure what conclusion I can draw from that... PSU/GPU/mobo/CPU/memory/drivers all still feel like plausible causes to me.

Last edited by dogthinker; 22nd September 2017 at 10:35 AM.
Old 22nd September 2017, 1:20 PM   #10
Bold Eagle
Member
 
Join Date: Jun 2008
Location: Brisbane
Posts: 6,229
Default

Careful with Furmark - I killed a card using that.

What are the specifics of your GTX1070 (make/model)?

Remember those units had some stability issues at release and many had to flash the BIOS...

Use GPU-z to obtain your BIOS revision and to do a Lookup of your card and then assess (you have to be very specific with your Device ID, etc) whether there are any later BIOSes.
__________________
PC3: Cardboard Box, peanut dispenser, highly conc caffine intravenous drip, little monkey w "electro El Shocko rectal probe", 3DMarkVantage=276818768

Last edited by Bold Eagle; 22nd September 2017 at 1:30 PM.
Old 22nd September 2017, 4:05 PM   #11
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default

Quote:
Originally Posted by Bold Eagle View Post
Careful with Furmark - I killed a card using that.

What are the specifics of your GTX1070 (make/model)?

Remember those units had some stability issues at release and many had to flash the BIOS...

Use GPU-z to obtain your BIOS revision and to do a Lookup of your card and then assess (you have to be very specific with your Device ID, etc) whether there are any later BIOSes.
EVGA GeForce GTX 1070 SC GAMING ACX 3.0 Black Edition, 08G-P4-5173-KR, 8GB GDDR5

BIOS version is 86.04.50.00.72, which is already current, as far as I can tell.

I might try disabling the factory overclock for a while, it's not like I need it.
Old 22nd September 2017, 4:46 PM   #12
Bold Eagle
Member
 
Join Date: Jun 2008
Location: Brisbane
Posts: 6,229
Default

Pop over to this dedicated thread if you get bored:
http://www.overclock.net/t/1601546/o...70-owners-club

There may be something meaningful there?

EVGA GeForce GTX 1070 BIOS Update - 86.04.50.00.70/86.04.50.01.70 (10/17/2016)
https://forums.evga.com/EVGA-GeForce...-m2565056.aspx

Here are all the BIOS revisions:
https://www.techpowerup.com/vgabios/...ze=8192&since=

Not sure which of those may be the Black Edition though.

Also be mindful the GPU BIOS issue is only an educated guess, but there were quite a few people having stability issues with the GTX1070 (mine was fine though, a GAINWARD GTX1070).
__________________
PC3: Cardboard Box, peanut dispenser, highly conc caffine intravenous drip, little monkey w "electro El Shocko rectal probe", 3DMarkVantage=276818768

Last edited by Bold Eagle; 22nd September 2017 at 4:51 PM.
Old 22nd September 2017, 4:54 PM   #13
dogthinker Thread Starter
Member
 
Join Date: May 2006
Location: Sydney (Glebe)
Posts: 1,993
Default

Quote:
Originally Posted by Bold Eagle View Post
Pop over to this dedicated thread if you get bored:
http://www.overclock.net/t/1601546/o...70-owners-club

There may be something meaningful there?
I'll have a look at that later.

System is currently rebooting consistently the moment I close furmark (I'm not inclined to blame furmark for that, this problem started way before it even crossed my mind to bench or stress test it.) This is even with both the GPU clock and its memory underclocked by 25%... Smells like an RMA to me. I'll see if I can persuade a friend to let me test it in their system first though.

EDIT: Thanks for the links. I think my BIOS already has a higher version number than the ones listed there. I don't think my problem is the same problem (temps appear to be fine, at least at the sensor...)

EDIT2: OK. I love it when I can finally make a problem reproducible... Use Afterburner to set a power limit of 80% on the GPU, and I can start and stop Furmark just fine. Set it to 90%+, and it consistently reboots the system when I halt it... It feels quite hard to blame anything other than the GPU or PSU for that...
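With a known-good limit (80%) and a known-bad one (90%), the gap can be narrowed by bisection. A minimal Python sketch of the bookkeeping follows. `survives` is a hypothetical stand-in for the manual test (set the limit, start and stop the load, see whether the system stays up), and the 87% cut-off in the fake version is invented purely so the example runs.

```python
def find_threshold(stable, unstable, survives, resolution=1):
    """Narrow the gap between a known-good and a known-bad power limit (percent)."""
    while unstable - stable > resolution:
        mid = (stable + unstable) // 2
        if survives(mid):
            stable = mid      # mid passed: move the good bound up
        else:
            unstable = mid    # mid rebooted: move the bad bound down
    return stable, unstable


# Stand-in for the manual test: pretend the system reboots at 87% or above.
def fake_survives(limit):
    return limit < 87


print(find_threshold(80, 90, fake_survives))  # prints (86, 87)
```

Going from 80/90 to a 1% gap takes at most four manual tests. Since the fault is intermittent, each "pass" probably deserves a few repeats before it is trusted.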

Last edited by dogthinker; 22nd September 2017 at 5:12 PM.
Old 22nd September 2017, 5:31 PM   #14
Bold Eagle
Member
 
Join Date: Jun 2008
Location: Brisbane
Posts: 6,229
Default

OCCT will give the PSU a stress test and you can look at the voltage plots even if the system crashes I believe.
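As a rough guide to reading those voltage plots: the ATX spec allows the +12V, +5V and +3.3V rails to vary by 5% either side of nominal. A small Python sketch of that check is below. The readings in it are invented for illustration, not taken from this system, and the function name is just a placeholder.

```python
# ATX tolerance for the positive rails: +/-5% of nominal.
RAILS = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}
TOLERANCE = 0.05


def out_of_spec(samples):
    """Return the (rail, voltage) pairs that fall outside nominal +/- 5%."""
    bad = []
    for rail, volts in samples:
        nominal = RAILS[rail]
        if abs(volts - nominal) > nominal * TOLERANCE:
            bad.append((rail, volts))
    return bad


# Made-up readings: one +12V sample sags below the 11.4V lower limit.
readings = [("+12V", 12.10), ("+12V", 11.30), ("+5V", 5.02), ("+3.3V", 3.28)]
print(out_of_spec(readings))  # prints [('+12V', 11.3)]
```

Bear in mind that software readings come from the motherboard's sensors and can be off, so a sag in the plot is a hint rather than proof. A multimeter or a swapped PSU is the firmer test.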

Late edit: if you can get that card into someone else's system, that will isolate the GPU.
__________________
PC3: Cardboard Box, peanut dispenser, highly conc caffine intravenous drip, little monkey w "electro El Shocko rectal probe", 3DMarkVantage=276818768

Last edited by Bold Eagle; 23rd September 2017 at 1:56 AM.