21 October 2013

523. Random Reboots -- troubleshooting. Diagnosed: incompatible motherboard.

Update 8 Jan 2014: I've been putting the FX8350 through its paces together with the other mobo and it's completely stable.  The FX8150 box is also stable. I thought I had a crash a few days after making the swap, but I have not had any issues whatsoever since, in spite of running very heavy jobs. Either way, it should remind me to check whether a mobo is compatible with a CPU before making my purchase in the future.

Update 18 Nov 2013: I swapped the CPUs between the two boxes, so that I was now using a mobo that officially supported FX8350. Only the CPU moved, nothing else.

Update 5 Nov 2013: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to look at the list of supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:
 http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

Also see this thread: http://www.techpowerup.com/forums/showthread.php?t=184061
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:

Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs...
and
my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp.

and it seemed that low (CPU) voltages precipitated crashes.

Update 4 Nov 2013: swapped CPUs with a different box. Will test in a couple of days.
Update 4 Nov 2013: Changing the multiplier back to 20 from 17 (but keeping voltage stable) caused a crash -- this time in a record 13 minutes.
Update 4 Nov 2013: System stable with new voltage/multiplier settings.
Update 27 Oct 2013: I'm currently looking at BIOS voltage.
---

I recently built a new node (http://verahill.blogspot.com.au/2013/10/520-new-node-amd-fx-835032-gb-ram990-fx.html). While that's always exciting, it quickly left a sour taste due to random reboots when running long (days) computational jobs.

Note that the motherboard (asrock 990 fx extreme3) does not officially support FX8350, which is something that I shouldn't have ignored. I might eventually move my fx 8350 to my gigabyte 990 fxa and put my 8150 on my asrock instead.

Short description
* Both Gaussian 09 and NWChem 6.3 cause the reboots.
* I've set up a cron job that logs a lot of data every minute and there's nothing odd in there. No overheating, the wattage seems ok etc.
* Running only smaller jobs (even though they are running non-stop) which take less than a day, the node has stayed up for 11 days now.
* I have never seen it reboot, so I don't know if there's any beeping etc.
* There's nothing in the logs, and nothing in the output from tailing dmesg using a cronjob.
* The only real output is in last:

reboot   system boot  3.11.5           Fri Oct 18 14:08 - 11:57 (2+21:48)   
reboot   system boot  3.8.10           Fri Oct 18 13:23 - 14:07  (00:44)    
reboot   system boot  3.8.10           Tue Oct  8 10:46 - 13:18 (10+02:31)  
me       tty1                          Mon Oct  7 13:25 - crash  (21:21)    
me       pts/0        beryllium        Mon Oct  7 12:29 - crash  (22:17)    
reboot   system boot  3.8.10           Mon Oct  7 12:27 - 13:18 (11+00:51)  
me       pts/0        beryllium        Sat Oct  5 20:59 - crash (1+14:27)   
reboot   system boot  3.8.10           Sat Oct  5 20:58 - 13:18 (12+15:19)
reboot   system boot  3.8.10           Tue Oct  1 14:09 - 11:54 (19+20:45)  
me       pts/0        beryllium        Sun Sep 29 11:39 - crash (2+02:29)   
reboot   system boot  3.8.10           Sun Sep 29 11:39 - 11:54 (21+23:14)  
me       pts/0        beryllium        Mon Sep 23 11:09 - crash (6+00:30)   
reboot   system boot  3.8.10           Mon Sep 23 11:07 - 11:54 (27+23:46)  
me       pts/0        beryllium        Fri Sep 20 12:59 - crash (2+22:08)   
reboot   system boot  3.8.0            Fri Sep 20 12:50 - 11:54 (30+22:04)  
reboot   system boot  3.8.0            Fri Sep 20 12:49 - 12:49  (00:00)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 11:52 - 12:48  (00:56)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 06:29 - 08:08  (01:38)    
me       pts/0        beryllium        Wed Sep 18 14:51 - crash (1+15:38)   
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 14:40 - 08:08 (1+17:27)   
me       pts/8        beryllium        Wed Sep 18 09:02 - crash  (05:38)    
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 01:51 - 08:08 (2+06:17)   
me       pts/0        beryllium        Tue Sep 17 18:11 - crash  (07:40)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 18:08 - 08:08 (2+14:00)   
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 17:55 - 17:56  (00:01)    
me       pts/0        beryllium        Tue Sep 17 13:12 - crash  (04:43)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 12:23 - 17:56  (05:33)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 20:05 - 12:17  (16:12)    
me       pts/0        beryllium        Mon Sep 16 16:03 - crash  (04:02)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:31 - 12:17  (20:46)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:20 - 15:30  (00:09)

Looking at the output it does seem that the crashes are happening less frequently. Part of the reason for this is probably a change in how I use the node, but I don't think that explains everything, and I don't like the idea of a piece of electronic hardware 'fixing' itself.
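To put rough numbers on that, the output of last can be tallied. A small sketch, with made-up function names, that counts boot records and sessions marked as ending in a crash. Several sessions (e.g. tty1 and pts/0) can end in the same crash, so the second number is an upper bound on the number of crashes rather than an exact count:

```shell
# Both helpers read the output of `last` on stdin.
# The function names are hypothetical, chosen for this example only.

# Number of 'reboot   system boot' records, i.e. how often the node booted.
count_boots() {
    awk '$1 == "reboot" { n++ } END { print n + 0 }'
}

# Number of login sessions that `last` marks as ended by a crash.
count_crashed_sessions() {
    awk '/- crash/ { n++ } END { print n + 0 }'
}
```

Typical use would be `last | count_boots` and `last | count_crashed_sessions`.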

Another thing that puzzles me is the repeating numbers -- e.g. 08:08, 11:54 and 13:18  -- in the output. There's no cronjob or anything like that running at any of those times.

Other things that have changed are the kernel versions and that I removed the UPS around the 1st of October  (the UPS died, which is a bad sign, power-wise. I should probably also look into the warranty on it).

The chief challenge here is that I can't reliably trigger the reboots, which makes it difficult to see whether I've solved the issue or not.

On an older node I could trigger errors by compiling the kernel, but not using any other technique. On that node the RAM was faulty: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

==> <== indicates what I'm currently doing.


0.  RAM
The most common reason for unstable nodes is faulty RAM, so if your computer is behaving strangely and randomly crashes, always suspect the RAM first. It's a more likely culprit than software, and the most likely of the hardware components to be at fault.

I ran a full cycle of memtest86+ which took some 4-5 hours if I remember correctly. No errors shown. Note that if memtest86+ does not show any errors it is no guarantee that the RAM is fine. However, the likelihood that it is indeed corrupt goes down.

1. Overheating
The second thing to investigate when something like this happens, in particular if it's associated with prolonged and heavy use, is the possibility of overheating. You can install lm-sensors and configure it to track various temperatures. Note that these readings aren't always correct.

At any rate, I've logged the output from sensors every minute and there's nothing indicating that the temperature is rising prior to a crash.
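The cron job itself isn't shown in this post, so the following is only a minimal sketch of what such per-minute logging could look like. The function name log_snapshot and the file names in the usage note are made up for illustration; sensors is the command provided by lm-sensors.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped snapshot to a log file.
log_snapshot() {
    logfile="$1"
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        # sensors comes from lm-sensors; skip quietly if it is not installed
        if command -v sensors >/dev/null 2>&1; then
            sensors
        fi
        # load average, to relate the readings to how hard the node was working
        if [ -r /proc/loadavg ]; then
            cat /proc/loadavg
        fi
    } >> "$logfile" 2>&1
    # flush buffers to disk so the last entry has a chance of surviving a hard reset
    sync
}

# When run as a script with a log file argument, take one snapshot.
if [ -n "${1:-}" ]; then
    log_snapshot "$1"
fi
```

Saved as, say, ~/bin/node-log.sh, a crontab line such as `* * * * * /bin/sh $HOME/bin/node-log.sh $HOME/node-health.log` would take a snapshot every minute. After a crash the last few entries in the log are the ones to look at.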

--------------------------------------------------
Intermission -- trying to trigger a reboot

* It's stable while compiling a kernel (in my case 3.11.5). Not surprising, as compiling is intense but short.

* Prime95
Number of torture test threads to run (8):
Choose a type of torture test to run.
  1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much).
  2 = In-place large FFTs (maximum heat and power consumption, some RAM tested).
  3 = Blend (tests some of everything, lots of RAM tested).
  11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the small FFT test then your problem is likely bad memory or a bad memory controller.
Type of torture test to run (3): 2
Accept the answers above? (Y): y

I ran this for three days and the node was stable.
I then ran test type 3 for 30 hours and it too was stable.

I accidentally ran the tests above without mounting ~/oxygen to the head node using NFS. Shouldn't matter, but in order to troubleshoot it's better to keep everything as constant as possible.

* PES scan
I think I saw reboots triggered using all sorts of jobs, but due to their long run times, I saw it more consistently with PES scans.

So I ran a long PES scan in nwchem 6.3, and lo, it crashed after just under two days running this job (and having been up for 6 days and 30 minutes). It's not quite the quick, efficient way of crashing the computer that I was looking for, but it will do.

Note that this crash didn't lead to a reboot, but simply to the computer locking up and becoming unresponsive. No screen, no network, no hard drive activity.

The only errors I can spot in the dmesg are two warnings about 'perf samples too long (2545>2500)' at 30 minutes and at 14 hours of uptime, i.e. well before I started the PES job.

me       pts/1        beryllium        Fri Oct 25 08:59   still logged in
me       pts/0        beryllium        Fri Oct 25 08:58   still logged in
me       pts/0        beryllium        Fri Oct 25 08:57 - 08:58  (00:01)
me       tty1                          Fri Oct 25 08:52   still logged in
reboot   system boot  3.11.5           Fri Oct 25 08:52 - 08:59  (00:07)
me       pts/4        beryllium        Sun Oct 20 17:32 - 17:32  (00:00)

--------------------------------------------------

3. BIOS version
The BIOS version (1.5) at the time of purchase of the motherboard was the same version as the BIOS available at the motherboard manufacturer's site. Since then an update has been released, as pointed out by a commentator. Nothing in the description of the update indicates that it would fix the issue I'm having, but upgrading the bios is just one of those things that should be tried.

mkdir -p ~/tmp/bios
cd ~/tmp/bios
wget 'ftp://download.asrock.com/bios/AM3+/990FX%20Extreme3(1.70)ROM.zip'
unzip 990FX\ Extreme3\(1.70\)ROM.zip 
Archive: 990FX Extreme3(1.70)ROM.zip
  inflating: 990EX31.70

I copied the file to a USB stick formatted with FAT32 since my guess is that the uefi might not recognise extX. Booting with the USB stick plugged in and hitting F6 ('Instant flash') led to the UEFI finding the flash file. Now click on the file name -- don't click on the buttons (e.g. 'Configuration' and 'Refresh device'). During the bios update the usual rules apply: don't power off during the update, and make sure that your usb stick isn't old and damaged.

Once the update is done you get a message saying 'Programming success, press Enter to reboot system'.

I reran the PES scan, and I had a crash after less than two days (ca 40 hours). This crash caused a reboot.
me       tty1                          Sun Oct 27 12:00   still logged in   
me       pts/1        beryllium        Sun Oct 27 11:55   still logged in   
me       pts/0        beryllium        Sun Oct 27 11:54   still logged in   
reboot   system boot  3.11.5           Sun Oct 27 11:52 - 12:14  (00:22)    
me       tty1                          Fri Oct 25 09:36 - crash (2+02:16)   
me       pts/0        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
me       pts/2        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
reboot   system boot  3.11.5           Fri Oct 25 09:17 - 12:14 (2+02:57) 

4. BIOS settings -- voltage
The older I get, the more comfortable I become with admitting when I don't really know what I'm doing. This  -- the tweaking of voltage settings -- is one of those areas where I definitely lack expertise.

Luckily, I've got some advice from a commentator: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

Anyway, I've always treated electricity a bit like magic (I always had issues with electrochemistry as a youngster, which is why I'm forcing myself to teach it these days), and the older I get the more I wish I had done chemical engineering rather than chemistry. Perhaps we want to benefit society more with age, rather than just benefit from it?

Anyway, here are some literal screenshots -- taken with my trusty old phone:

Main
 I don't see anything odd here.

First half of OC Tweaker
The alternative to 'Manual' in the OC Mode is 'CPU OC Mode', which sounds like something I want to avoid. Anyway, what bothers me is that there's no 'OC OFF' button. I don't know if the BIOS is doing something odd.

More OC Tweaker

HW Monitor. The Vcore and +12v lines fluctuate by about 5-10 mV.

Changes:
* turn off Cool'n'Quiet.
* Change Multiplier/Voltage Change from Automatic to Manual
** Set CPU Freq multiplier to 17.0x (3400 MHz) instead of 20x (4000 MHz) under OC Tweaker
** Set CPU voltage to 1.35 V instead of the stock 1.3750 V (OC Tweaker/CPU Voltage)
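The frequencies quoted for the two multipliers follow from multiplying the reference clock by the multiplier. A minimal sketch of that arithmetic, assuming the 200 MHz reference clock implied by 17.0x giving 3400 MHz (core_mhz is a made-up helper name):

```shell
# Core clock in MHz = reference clock (assumed to be 200 MHz) x multiplier.
# $1 is the multiplier as shown in the BIOS, e.g. 17.0 or 20.
core_mhz() {
    awk -v m="$1" -v ref=200 'BEGIN { printf "%d\n", m * ref }'
}

core_mhz 17.0   # prints 3400
core_mhz 20     # prints 4000
```

So dropping the multiplier from 20x to 17.0x is a 15% reduction in clock speed, on top of the small reduction in voltage.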

I've managed to run a full PES scan -- which hasn't worked before -- and re-ran it without issue. Looks like the issue was solved.

I then set the multiplier back to 20 (4000 MHz)  while keeping the CPU voltage at 1.35 V, and relaunched the PES scan. Almost immediate crash:
me       pts/0        beryllium        Mon Nov  4 13:48   still logged in   
reboot   system boot  3.11.5           Mon Nov  4 13:48 - 14:30  (00:42)    
me       pts/0        beryllium        Mon Nov  4 13:35 - crash  (00:13)    
me       tty1                          Mon Nov  4 13:34 - 13:34  (00:00)    
reboot   system boot  3.11.5           Mon Nov  4 13:34 - 14:30  (00:56)

Not sure what to do now. Did it crash faster because the CPU voltage is low while the multiplier is high? Can it be solved by increasing -- rather than decreasing -- the voltage?

Re-running the PES job again gave a more interesting result -- the job crashed, but not the node (note that the job is exactly the same each time, so it's not a matter of the input):

 Grid integrated density:     191.999968853836
 Requested integration accuracy:   0.10E-06
 d= 0,ls=0.0,diis    13  -1092.4755719318 -1.63D-05  8.49D-06  2.73D-05   719.5
 Grid integrated density:     191.999968846937
 Requested integration accuracy:   0.10E-06
  Singularity in Pulay matrix. Error and Fock matrices removed. 
 PeIGS error from dstebz 4 ...trying dsterf 
 error from dsterf 516  
 error from dsterf 516  
 Error in pstein5. me = 0 argument 10 has an illegal value. 
 Error in pstein5. me = 1 argument 10 has an illegal value. 
 Error in pstein5. me = 2 argument 10 has an illegal value. 
  ME =                     2  Exiting via  
 Error in pstein5. me = 4 argument 10 has an illegal value. 
  ME =                     4  Exiting via  
 Error in pstein5. me = 5 argument 10 has an illegal value. 
  ME =                     5  Exiting via  
5:5: peigs error: mxpend:: -1
 Error in pstein5. me = 6 argument 10 has an illegal value. 
  ME =                     6  Exiting via  
 Error in pstein5. me = 7 argument 10 has an illegal value. 
  ME =                     7  Exiting via  
7:7: peigs error: mxpend:: -1
(rank:7 hostname:oxygen pid:3741):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
  ME =                     0

And, while the frequency affects the thermal output, a thermal issue shouldn't lead to garbled stuff. This is looking more and more like what the poster wazoo42 recounted: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

5. Swapping the CPU to an approved MOBO.
This is a bit of a cop-out. Most people don't have multiple computers which happen to have compatible hardware. On the other hand, I have a job to do.

To be able to continue to follow this post you'll need to know this:
There are two nodes (cpu, mobo, psu):
* oxygen: FX 8350, asrock 990fx extreme3, corsair GS700.
* neon: FX 8150, gigabyte GA-FX990D3, corsair GS800.
Both nodes have otherwise similar hardware: 32 gb ram, GT210 nvidia, one PCI network card. Both motherboards support 8150, but only GA-FX990D3 supports 8350.

So we'll move FX8350 to neon, and FX 8150 to oxygen.

I first set the Multiplier/Voltage Change back to Automatic on oxygen in preparation for the FX 8150.
I then shut down the two nodes and unplugged them. And here's where it's not funny anymore: I tried to gently remove the heatsink but the heatsink together with the CPU popped off in spite of the lever being locked. On both nodes.

The CPU was solidly glued to the heatsink in both cases. I managed to get the FX 8350 off its heatsink by gently scraping off the excess thermal paste (dry and solid), but the FX 8150 was a real struggle. In the end I used the back of a knife as a lever (gently). Not ideal.

Anyway, cleaned the fx 8350 heatsink and cpu, applied new thermal paste and installed on the gigabyte fxa990-d3 motherboard. Turned on -- no lights on the mobo. Fans etc all working. Dammit. Googled and saw that bios only supports 8350 from version 9 (incorrect -- looked at wrong mobo, but didn't discover that until later)

So now I have a CPU that'd work, but which I can't install since it's stuck to the heatsink, and one which I can install but not use, since the bios is wrong.

When I put the old CPU (fx8150) back in neon it wouldn't boot either -- the fans were spinning but no motherboard lights went on (e.g. LAN). PCI cards lit up, but nothing on Mobo. Took out the CPU, put back in, took out, put back in.

Put the FX8350 back in the oxygen case. Didn't work either, although here the LAN mobo light was on. Still didn't work though -- no video output and couldn't connect via LAN. Great. Killed two working nodes in one afternoon.

Finally, somehow, I managed to get neon working again. Popped a USB stick in. Set up a USB stick with a 1 GB W95 partition as shown in this post: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

Downloaded bios etc. Couldn't install -- BIOS check error. Googled again -- dammit. BIOS for wrong mother board. And the bios that's installed actually supports 8350.

OK, installed all the CPUs, and now they booted up. I must have installed the CPUs badly -- which doesn't speak well of my attention to detail.

For some reason the card that used to be eth0 now gets assigned as eth2 on oxygen. Checked udev -- doesn't make sense. Turned everything off and checked that the pci card (eth0) was seated properly. Booted -- now ok.

Not sure if I have to recompile all the computational code but I did anyway -- the only difference, according to the acml cpuid.exe util, is that 8350 supports FMA3 while 8150 doesn't. Both support SSE, SSE2, SSE3, AVX, FMA4.
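On Linux the same comparison can also be made without the ACML utility by looking at the flags line in /proc/cpuinfo, where FMA3 shows up as fma, FMA4 as fma4 and SSE3 as pni. A rough sketch (has_flag is a name made up for illustration):

```shell
# has_flag FLAG "FLAGS": succeed if FLAG appears as a whole word in FLAGS.
# Matching whole words matters here, since 'fma' is a prefix of 'fma4'.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the instruction sets mentioned above for the first core listed.
if [ -r /proc/cpuinfo ]; then
    flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
    for f in sse sse2 pni avx fma fma4; do
        if has_flag "$f" "$flags"; then
            echo "$f: yes"
        else
            echo "$f: no"
        fi
    done
fi
```

Running this on both nodes and comparing the output should show whether code built on one box relies on anything the other CPU lacks.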





[edit]
After ca two months both boxes are stable in spite of being subjected to heavy work loads. The original crashes/reboots must have been due to the incompatible mobo/cpu combination.

2 comments:

  1. Sorry to hear all of the troubles, I've never had a cpu not seat properly. As for the cpu sticking to the hsf, what kind of thermal paste were you using? I use arctic silver 5 and it hasn't given me too many troubles with such things (it takes some initial twisting of the hsf to get it to loosen, but that's it).

    I'd say your results from the frequency and voltage test might need one more data point. If lower frequency and higher voltage has any crashes then I'd bet on the VRMs, so I'm curious how the mobo switch ends up (seeing as that should give similar info).

    Replies
    1. I've never had a CPU not seat properly before either, so not sure that was really the case. It was all very odd.

      The thermal paste was the one that came with the CPUs i.e. pre-applied to the heatsink -- it was completely dried out. I haven't had any similar issues in the past, so it might just be AMD being cheap and saving a few cents per unit.

      When reassembling the CPUs/heatsinks I cleaned them with methanol and applied Deepcool THP-Z9 paste.

      I've queued the PES job on both nodes to see what will happen. The results should be in in less than a week. Hopefully that'll solve it.

      Looking at http://www.techpowerup.com/forums/showthread.php?t=184061 I see
      "Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable." whereas the ones Asrock rate as ok for FX8350 (e.g. extreme4; http://www.asrock.com/mb/AMD/990FX%20Extreme4/?cat=Specifications) are 8+2. On the other hand, the Gigabyte FX990A-D3 supports FX8350 and is 4+1.
