Showing posts with label random reboots. Show all posts

09 June 2015

612. Randomly Rebooting Router (E2500-AU v1.0 w/ TomatoUSB)

Rolling update:
* 24 June 2015: 7 days uptime with wifi working perfectly. I did reboot it last night because my work computer lost contact with the router somehow (it connects via a reverse tunnel). The issue with the Randomly Rebooting Router can be considered solved. Obviously, it's solved by crippling the router: turning off the 5 GHz band and switching from AES to TKIP (the latter may not be related though).

* Submitted a bug report:

* 16 June 2015 12:42 AEST.  The router has been up for two days and four hours (and counting) in spite of heavy use of our phones. Seems like turning off 5 GHz and/or switching from AES to TKIP has worked. A fair criticism is that I don't have much of a baseline to compare with when it comes to reboots, but subjectively there's a lot less swearing over crappy wifi the past two days.

* 14 June 2015 08:05 AEST. After two days of uptime while radio silence was enforced, we turned our phones back on. The router rebooted later that night. The same thing happened the next night. After briefly putting dd-wrt on the router, I put tomatousb back on it, turned off 5 GHz and changed from AES to TKIP. The router has been up since 9.30 pm last night (10 h and counting).

Found this bug report:

Also read this with interest:

I've seen posts that find that dd-wrt doesn't have the randomly rebooting issue. dd-wrt doesn't support dual band, at least on e2500. I was surprised that v1 of the cisco linksys firmware had the same exact issue (random reboots when 5 GHz is on). It's all pointing in a specific direction.

Not sure why using the 5 GHz channel with my laptop didn't trigger the reboots. Maybe it did, but less frequently, since we had fewer 5 GHz capable devices before we got the phones.

* 10 June 2015 16:24 AEST. Since turning off wifi on the Samsung Galaxy S4 phones (but using the two laptops and the tablet listed below) the router has stayed up for 1 day, 8 hours and 11 minutes, and counting. The night between Monday and Tuesday, when we were using our phones, the router rebooted at least twice.

This is another one of those posts that don't offer a solution, but rather state a problem. I'm doing this in the hope that others who are making similar observations will see this and...well, feel slightly less alone at the very least. In the best case, someone will have a solution and offer it as a comment.

So, here's the issue: 
* I have a Linksys E2500-AU v1.0 that is running TomatoUSB (howto)
Tomato v1.28.0000 MIPSR2-128 K26 USB Max
========================================================
Welcome to the Linksys E2500 v1.0 [TomatoUSB]
Uptime: 08:14:26 up 1 min
Load average: 0.52, 0.18, 0.06
Mem usage: 28.4% (used 17.06 of 59.96 MB)
WAN : @ 58:6D:8F:D3:XX:XX
LAN : @ DHCP: -
WL0 : volatile @ channel: AU13 @ 58:6D:8F:D3:XX:XX
WL1 : volatile50 @ channel: AU153 @ 00:01:36:1F:XX:XX
========================================================
* It has a "Broadcom BCM5357 chip rev 2 pkg 8"

* For a long time it, and its predecessor (a WRT-54GL), were running just fine. The predecessor got replaced due to a fried power supply.

* Over the past six to seven months there have been issues with the wireless signal dropping. It isn't just the wireless transmission being stopped and restarted: the router actually reboots (according to uptime).

* We used to have the following wireless devices: Fujitsu lifebook (v100?), Thinkpad SL410, Google Nexus One and an HTC Legend. At some point we also got a Samsung Galaxy Tab 2. This configuration was running for a few years.

* Coinciding roughly with the perceived start of the rebooting issue was me purchasing a Samsung Galaxy S4 (i9505).

* The issue got a lot worse recently.

* Recently my partner also got a Samsung Galaxy S4 (i9505).

* I have an almost identical router (v2) at work, and the current uptime is 140 days. I do connect very occasionally via wireless to it using my Samsung Galaxy S4. However, this router has a "Broadcom BCM5357 chip rev 1 pkg 8".

What seems to be happening:
The Samsung Galaxy S4 phones seem to be destabilising the router and causing reboots. No, wait, hear me out. It shouldn't happen, and the adage about 'correlation vs causation' may well apply in this case too, but there are precedents (apparently) when it comes to Intel wireless devices:

On from 2010
Hrm...routers used to spontaneously reboot when the wireless driver failed on Tomato as a result of an Intel (mobile) wireless driver bug on Windows. Maybe similar?
On from 2007
Currently using DD-wrt V24, it's been up for 25 days I can confirm it has something to do with the broadcom wireless drivers.

What kind of wireless device does your laptop have? Is it Intel 2100/2200 by any chance?
And in the end the thread concludes that it was due to users with Intel 2100/2200 cards.

The Samsung Galaxy S4 has a Qualcomm Snapdragon 600 APQ8064AB, which is a system-on-a-chip. The Nexus One and HTC Legend also had Snapdragons, but obviously much older models. The Galaxy Tab 2 seems to have a Texas Instruments chip (TI OMAP 4430).

Could it be that the phones are causing the issues?

Luckily it's something that's reasonably easy to test, so I'm looking forward to reporting back in a couple of days (of enforced radio silence).

A different test will be to swap routers (but not power supplies) between work and home and see if the behaviour is location dependent. That will take a lot more effort though due to the very specific set-ups.

As the logs get erased on reboot I'm tracking the uptime from now on using autossh and logging from a work computer that's always on.
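
A minimal sketch of how reboots can be picked out of such a remote log. It assumes (my assumption, not a description of the actual setup) that the always-on machine records a timestamp together with the router's uptime in seconds, e.g. the first field of /proc/uptime, once a minute. The function name and the sample format are hypothetical.

```python
from datetime import datetime, timedelta


def detect_reboots(samples):
    """Return the estimated boot time for every reboot seen in the samples.

    samples: list of (timestamp, uptime_seconds) tuples in chronological
    order. This format is an assumption made for the sketch.
    """
    reboots = []
    previous_uptime = None
    for timestamp, uptime_seconds in samples:
        # Uptime can only shrink if the router restarted between two samples.
        if previous_uptime is not None and uptime_seconds < previous_uptime:
            reboots.append(timestamp - timedelta(seconds=uptime_seconds))
        previous_uptime = uptime_seconds
    return reboots


if __name__ == "__main__":
    start = datetime(2015, 6, 9, 8, 15)
    log = [
        (start, 120.0),
        (start + timedelta(minutes=1), 180.0),
        (start + timedelta(minutes=2), 30.0),  # uptime dropped: a reboot
        (start + timedelta(minutes=3), 90.0),
    ]
    for boot_time in detect_reboots(log):
        print("router came back up around", boot_time)
```

Comparing consecutive uptime values rather than looking for gaps in the log means a dropped ssh connection on its own is not mistaken for a reboot.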

Some more:
Below is a post regarding iphones, and rebooting routers.

While that post is not related to 5 GHz causing issues, it's got me thinking: none of the Fujitsu, Galaxy Tab, Nexus One or HTC Legend supports 5 GHz, but the Samsung Galaxy S4 phones do, so the issue may be related to that. There are obviously quite a lot of things to test.
ADDITIONAL INFORMATION: in the mean time we have been checking and elimination as well. We have been trying to connect certain wireless devices to the network through the DAP and something odd has come up. We have been trying mobile phones (smartphones) at first. My own telephone (Samsung Galaxy S3) seems to cause no troubles. With that phone connected for a whole day, internet connection does not fail once (the router does not reboot). I have been trying both 5Ghz and 2.4 Ghz. bands, both worked okay. One colleague also wanted to connect his Apple Iphone 3G (S?) to the network. I told him he could but this phone could not find the 5Ghz network, so I have switched back to 2.4Ghz again. The iPhone connected and within 5 minutes the connection interrupted. I have set the network back to 5Ghz (so the iPhone could no longer connect) and changed the network settings again. This morning I switched back to 2.4 with only my phone connected. Not a problem. This afternoon I let my colleague connect his phone again and disconnected my Samsung. Within 5 minutes the router started to reboot! After I had the iPhone disconnected again and let another colleague connect his phone, a Samsung Galaxy S(1). So far no problems.

Tomato Anon
Somehow the Spontaneously Rebooting Router doesn't show up here, while the stable one does:

Either way, the anon database is a great way of quickly finding out what Tomato version you can put on your router.

21 October 2013

523. Random Reboots -- troubleshooting. Diagnosed: incompatible motherboard.

Update 8 Jan 2014: I've been putting the FX8350 through its paces together with the other mobo and it's completely stable.  The FX8150 box is also stable. Note that I thought I had a crash a few days after making the swap -- I have not had any issues since whatsoever in spite of running very heavy jobs. Either way, it should remind me to check whether a mobo is compatible with a CPU before making my purchase in the future.

Update 18 Nov 2013: I swapped the CPUs between two boxes, so that I was now using a mobo that officially supported FX8350. Only the CPU moved, nothing else.

Update 5 Nov 2013: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to look at the list of supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:

Also see this thread:
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:

Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs...
my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp.

and it seemed that low (CPU) voltages precipitated crashes.

Update 4 Nov 2013: swapped CPUs with a different box. Will test in a couple of days.
Update 4 Nov 2013: Changing the multiplier back to 20 from 17 (but keeping voltage stable) caused a crash -- this time in a record 13 minutes.
Update 4 Nov 2013: System stable with new voltage/multiplier settings.
Update 27 Oct 2013: I'm currently looking at BIOS voltage.

I recently built a new node. While that's always exciting, it quickly left a sour taste due to random reboots when running long (days) computational jobs.

Note that the motherboard (asrock 990 fx extreme3) does not officially support FX8350, which is something that I shouldn't have ignored. I might eventually move my fx 8350 to my gigabyte 990 fxa and put my 8150 on my asrock instead.

Short description
* Both Gaussian 09 and NWChem 6.3 cause the reboots.
* I've set up a cron job that logs a lot of data every minute and there's nothing odd in there. No overheating, the wattage seems ok etc.
* Running only smaller jobs (even though they are running non-stop) which take less than a day, the node has stayed up for 11 days now.
* I have never seen it reboot, so I don't know if there's any beeping etc.
* There's nothing in the logs, and nothing in the output from tailing dmesg using a cronjob.
* The only real output is in last:

reboot   system boot  3.11.5           Fri Oct 18 14:08 - 11:57 (2+21:48)   
reboot   system boot  3.8.10           Fri Oct 18 13:23 - 14:07  (00:44)    
reboot   system boot  3.8.10           Tue Oct  8 10:46 - 13:18 (10+02:31)  
me       tty1                          Mon Oct  7 13:25 - crash  (21:21)    
me       pts/0        beryllium        Mon Oct  7 12:29 - crash  (22:17)    
reboot   system boot  3.8.10           Mon Oct  7 12:27 - 13:18 (11+00:51)  
me       pts/0        beryllium        Sat Oct  5 20:59 - crash (1+14:27)   
reboot   system boot  3.8.10           Sat Oct  5 20:58 - 13:18 (12+15:19)
reboot   system boot  3.8.10           Tue Oct  1 14:09 - 11:54 (19+20:45)  
me       pts/0        beryllium        Sun Sep 29 11:39 - crash (2+02:29)   
reboot   system boot  3.8.10           Sun Sep 29 11:39 - 11:54 (21+23:14)  
me       pts/0        beryllium        Mon Sep 23 11:09 - crash (6+00:30)   
reboot   system boot  3.8.10           Mon Sep 23 11:07 - 11:54 (27+23:46)  
me       pts/0        beryllium        Fri Sep 20 12:59 - crash (2+22:08)   
reboot   system boot  3.8.0            Fri Sep 20 12:50 - 11:54 (30+22:04)  
reboot   system boot  3.8.0            Fri Sep 20 12:49 - 12:49  (00:00)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 11:52 - 12:48  (00:56)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 06:29 - 08:08  (01:38)    
me       pts/0        beryllium        Wed Sep 18 14:51 - crash (1+15:38)   
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 14:40 - 08:08 (1+17:27)   
me       pts/8        beryllium        Wed Sep 18 09:02 - crash  (05:38)    
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 01:51 - 08:08 (2+06:17)   
me       pts/0        beryllium        Tue Sep 17 18:11 - crash  (07:40)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 18:08 - 08:08 (2+14:00)   
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 17:55 - 17:56  (00:01)    
me       pts/0        beryllium        Tue Sep 17 13:12 - crash  (04:43)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 12:23 - 17:56  (05:33)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 20:05 - 12:17  (16:12)    
me       pts/0        beryllium        Mon Sep 16 16:03 - crash  (04:02)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:31 - 12:17  (20:46)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:20 - 15:30  (00:09)

Looking at the output it does seem that the crashes are happening less frequently. Part of the reason for this is probably a change in how I use the node, but I don't think that explains everything, and I don't like the idea of a piece of electronic hardware 'fixing' itself.

Another thing that puzzles me is the repeating numbers -- e.g. 08:08, 11:54 and 13:18 -- in the output. There's no cronjob or anything like that running at any of those times.
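
To put a number on 'less frequently', here is a rough sketch that pulls the boot timestamps out of the last output and prints the time between consecutive boots. It only uses the start column of the reboot rows and ignores the end-time column, which is where the repeated values show up. The year is not printed by last in this output, so it is passed in by hand, and the function names are made up for this example. Note that the gaps include deliberate reboots and any downtime, so they are only an upper bound on how long the node stayed up.

```python
import re
from datetime import datetime

# Matches rows such as:
# reboot   system boot  3.8.10   Tue Oct  8 10:46 - 13:18 (10+02:31)
REBOOT_ROW = re.compile(
    r"^reboot\s+system boot\s+\S+\s+"
    r"\w{3}\s+(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2})"
)


def boot_times(last_output, year):
    """Extract boot timestamps from `last` output, oldest first."""
    times = []
    for line in last_output.splitlines():
        match = REBOOT_ROW.match(line.strip())
        if match:
            stamp = f"{year} {match['month']} {match['day']} {match['time']}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M"))
    return sorted(times)


def gaps_between_boots(times):
    """Time elapsed between each boot and the next one."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    sample = """\
reboot   system boot  3.11.5           Fri Oct 18 14:08 - 11:57 (2+21:48)
reboot   system boot  3.8.10           Fri Oct 18 13:23 - 14:07  (00:44)
reboot   system boot  3.8.10           Tue Oct  8 10:46 - 13:18 (10+02:31)
"""
    for gap in gaps_between_boots(boot_times(sample, 2013)):
        print(gap)
```

Feeding it the full listing (e.g. saved from `last -x` or plain `last` into a file) gives one gap per boot, which is easier to eyeball for a trend than the raw table.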

Other things that have changed are the kernel versions and that I removed the UPS around the 1st of October  (the UPS died, which is a bad sign, power-wise. I should probably also look into the warranty on it).

The chief challenge here is that I can't reliably trigger the reboots, which makes it difficult to see whether I've solved the issue or not.

On an older node I could trigger errors by compiling the kernel, but not using any other technique. On that node the RAM was faulty:

==> <== indicates what I'm currently doing.

0.  RAM
The most common reason for unstable nodes is faulty RAM, so if your computer is behaving strangely and randomly crashes, always suspect the RAM first. It's a more likely culprit than software, and the most likely of the hardware components to be at fault.

I ran a full cycle of memtest86+ which took some 4-5 hours if I remember correctly. No errors shown. Note that if memtest86+ does not show any errors it is no guarantee that the RAM is fine. However, the likelihood that it is indeed corrupt goes down.

1. Overheating
The second thing to investigate when something like this happens, in particular if it's associated with prolonged and heavy use, is the possibility of overheating. You can install lm-sensors and configure it to track various temperatures. Note that the readings aren't always correct.

At any rate, I've logged the output from sensors every minute and there's nothing indicating that the temperature is rising prior to a crash.
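
As an illustration of how such a per-minute log can be checked, here is a small sketch that extracts temperature readings from sensors-style output and flags a climb over the last few samples. The sample text, the label name (temp1), the window and the 5 °C threshold are all assumptions chosen for the example, not values from my logs, and the exact output format differs between boards and drivers.

```python
import re

# Matches readings such as "temp1:        +35.0°C  (high = +70.0°C)".
TEMP_LINE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*°C"
)


def parse_temperatures(sensors_output):
    """Map each sensor label to its temperature in °C."""
    readings = {}
    for line in sensors_output.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            readings[match["label"].strip()] = float(match["value"])
    return readings


def rose_before_end(readings, window=10, threshold=5.0):
    """True if the last reading is more than `threshold` °C above the
    lowest reading among the final `window` samples."""
    recent = readings[-window:]
    return len(recent) >= 2 and recent[-1] - min(recent) > threshold


if __name__ == "__main__":
    # Hypothetical snippet of sensors output, for illustration only.
    sample = (
        "k10temp-pci-00c3\n"
        "Adapter: PCI adapter\n"
        "temp1:        +35.0°C  (high = +70.0°C)\n"
    )
    print(parse_temperatures(sample))
    print(rose_before_end([40.0, 41.0, 40.0, 52.0], window=4))
```

Running `rose_before_end` on the samples logged just before each crash is a quick way to confirm (or not) the 'nothing is rising' impression from reading the log by eye.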

Intermission -- trying to trigger a reboot

* It's stable while compiling a kernel (in my case 3.11.5). Not surprising as it is intense, but short.

* Prime95
Number of torture test threads to run (8):
Choose a type of torture test to run.
  1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much).
  2 = In-place large FFTs (maximum heat and power consumption, some RAM tested).
  3 = Blend (tests some of everything, lots of RAM tested).
  11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the
small FFT test then your problem is likely bad memory or a bad memory
controller.
Type of torture test to run (3): 2
Accept the answers above? (Y): y
I ran this for three days and the node was stable.
I then ran test type 3 for 30 hours and it too was stable.

I accidentally ran the tests above without mounting ~/oxygen to the head node using NFS. Shouldn't matter, but in order to troubleshoot it's better to keep everything as constant as possible.

* PES scan
I think I saw reboots triggered using all sorts of jobs, but due to their long run times, I saw it more consistently with PES scans.

So I ran a long PES scan in nwchem 6.3, and lo, it crashed after just under two days running this job (and having been up for 6 days and 30 minutes). It's not quite the quick, efficient way of crashing the computer that I was looking for, but it will do.

Note that this crash didn't lead to a reboot, but simply to the computer locking up and becoming unresponsive. No screen, no network, no harddrive activity.

The only errors I can spot in the dmesg are two warnings about 'perf samples too long (2545>2500)' at 30 minutes and at 14 hours of uptime, i.e. well before I started the PES job.

me       pts/1        beryllium        Fri Oct 25 08:59   still logged in
me       pts/0        beryllium        Fri Oct 25 08:58   still logged in
me       pts/0        beryllium        Fri Oct 25 08:57 - 08:58  (00:01)
me       tty1                          Fri Oct 25 08:52   still logged in
reboot   system boot  3.11.5           Fri Oct 25 08:52 - 08:59  (00:07)
me       pts/4        beryllium        Sun Oct 20 17:32 - 17:32  (00:00)


3. BIOS version
The BIOS version (1.5) at the time of purchase of the motherboard was the same version as the BIOS available at the motherboard manufacturer's site. Since then an update has been released, as pointed out by a commentator. Nothing in the description of the update indicates that it would fix the issue I'm having, but upgrading the bios is just one of those things that should be tried.

mkdir ~/tmp/bios -p
cd ~/tmp/bios
wget ''
unzip 990FX\ Extreme3\(1.70\) 
Archive:  990FX Extreme3(1.70)
  inflating: 990EX31.70

I copied the file to a USB stick formatted with FAT32 since my guess is that the UEFI might not recognise extX. Booting with the USB stick plugged in and hitting F6 ('Instant flash') led to the UEFI finding the flash file. Now click on the file name -- don't click on the buttons (e.g. 'Configuration' and 'Refresh device'). During the BIOS update the usual advice applies: don't power off during the update, and make sure that your USB stick isn't old and damaged.

Once the update is done you get a message saying 'Programming success, press Enter to reboot system'.

I reran the PES scan, and I had a crash after less than two days (ca 40 hours). This crash caused a reboot.
me       tty1                          Sun Oct 27 12:00   still logged in   
me       pts/1        beryllium        Sun Oct 27 11:55   still logged in   
me       pts/0        beryllium        Sun Oct 27 11:54   still logged in   
reboot   system boot  3.11.5           Sun Oct 27 11:52 - 12:14  (00:22)    
me       tty1                          Fri Oct 25 09:36 - crash (2+02:16)   
me       pts/0        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
me       pts/2        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
reboot   system boot  3.11.5           Fri Oct 25 09:17 - 12:14 (2+02:57) 

4. BIOS settings -- voltage
The older I get, the more comfortable I become with admitting when I don't really know what I'm doing. This  -- the tweaking of voltage settings -- is one of those areas where I definitely lack expertise.

Luckily, I've got some advice from a commentator:

Anyway, I've always treated electricity a bit like magic (I always had issues with electrochemistry as a youngster, which is why I'm forcing myself to teach it these days), and the older I get the more I wish I had done chemical engineering rather than chemistry. Perhaps we want to benefit society more with age, rather than just benefit from it?

Anyway, here are some literal screenshots -- taken with my trusty old phone:

 I don't see anything odd here.

First half of OC Tweaker
The alternative to 'Manual' in the OC Mode is 'CPU OC Mode', which sounds like something I want to avoid. Anyway, what bothers me is that there's no 'OC OFF' button. I don't know if the BIOS is doing something odd.

More OC Tweaker

HW Monitor. The Vcore and +12 V lines fluctuate by about 5-10 mV.
* turn off Cool'n'Quiet.
* Change Multiplier/Voltage Change from Automatic to Manual
** Set CPU Freq multiplier to 17.0x (3400 MHz) instead of 20x (4000 MHz) under OC Tweaker
** Set CPU voltage to 1.35 V instead of the stock 1.3750 V (OC Tweaker/CPU Voltage)
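
For a feel of what those two settings do, here is a back-of-the-envelope sketch. The 200 MHz reference clock is implied by the numbers above (17.0x gives 3400 MHz, 20x gives 4000 MHz). The power estimate uses the textbook rule of thumb that dynamic CMOS power scales with voltage squared times frequency; it ignores leakage and is only a rough guide, not a measurement of this board.

```python
REFERENCE_CLOCK_MHZ = 200.0  # implied by 17.0x -> 3400 MHz, 20x -> 4000 MHz


def core_frequency_mhz(multiplier):
    """Core clock for a given CPU multiplier."""
    return multiplier * REFERENCE_CLOCK_MHZ


def relative_dynamic_power(multiplier, voltage,
                           stock_multiplier=20.0, stock_voltage=1.375):
    """Rough estimate relative to stock, assuming power ~ V^2 * f.

    This is a rule of thumb that ignores leakage current.
    """
    return (voltage / stock_voltage) ** 2 * (multiplier / stock_multiplier)


if __name__ == "__main__":
    print(core_frequency_mhz(17.0), "MHz")
    print(round(relative_dynamic_power(17.0, 1.35), 3), "x stock dynamic power")
```

Under that assumption the new settings draw roughly four fifths of the stock dynamic power, almost all of it from the lower multiplier, since the voltage change is small.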

I've managed to run a full PES scan -- which hasn't worked before -- and re-ran it without issue. Looks like the issue was solved.

I then set the multiplier back to 20 (4000 MHz) while keeping the CPU voltage at 1.35 V, and relaunched the PES scan. Almost immediate crash:
me       pts/0        beryllium        Mon Nov  4 13:48   still logged in   
reboot   system boot  3.11.5           Mon Nov  4 13:48 - 14:30  (00:42)    
me       pts/0        beryllium        Mon Nov  4 13:35 - crash  (00:13)    
me       tty1                          Mon Nov  4 13:34 - 13:34  (00:00)    
reboot   system boot  3.11.5           Mon Nov  4 13:34 - 14:30  (00:56)

Not sure what to do now. Did it crash faster because the CPU voltage is low while the multiplier is high? Can it be solved by increasing -- rather than decreasing -- the voltage?

Re-running the PES job again gave a more interesting result -- the job crashed, but not the node (note that the job is exactly the same each time, so it's not a matter of the input):

 Grid integrated density:     191.999968853836
 Requested integration accuracy:   0.10E-06
 d= 0,ls=0.0,diis    13  -1092.4755719318 -1.63D-05  8.49D-06  2.73D-05   719.5
 Grid integrated density:     191.999968846937
 Requested integration accuracy:   0.10E-06
  Singularity in Pulay matrix. Error and Fock matrices removed. 
 PeIGS error from dstebz 4 ...trying dsterf 
 error from dsterf 516  
 error from dsterf 516  
 Error in pstein5. me = 0 argument 10 has an illegal value. 
 Error in pstein5. me = 1 argument 10 has an illegal value. 
 Error in pstein5. me = 2 argument 10 has an illegal value. 
  ME =                     2  Exiting via  
 Error in pstein5. me = 4 argument 10 has an illegal value. 
  ME =                     4  Exiting via  
 Error in pstein5. me = 5 argument 10 has an illegal value. 
  ME =                     5  Exiting via  
5:5: peigs error: mxpend:: -1
 Error in pstein5. me = 6 argument 10 has an illegal value. 
  ME =                     6  Exiting via  
 Error in pstein5. me = 7 argument 10 has an illegal value. 
  ME =                     7  Exiting via  
7:7: peigs error: mxpend:: -1
(rank:7 hostname:oxygen pid:3741):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
  ME =                     0

And, while the frequency affects the thermal output, a thermal issue shouldn't lead to garbled stuff. This is looking more and more like what the poster wazoo42 recounted:

5. Swapping the CPU to an approved MOBO.
This is a bit of a cop-out. Most people don't have multiple computers which happen to have compatible hardware. On the other hand, I have a job to do.

To be able to continue to follow this post you'll need to know this:
There are two nodes (cpu, mobo, psu):
* oxygen: FX 8350, asrock 990fx extreme3, corsair GS700.
* neon: FX 8150, gigabyte GA-FX990D3, corsair GS800.
Both nodes have otherwise similar hardware: 32 GB RAM, GT210 nvidia, one PCI network card. Both motherboards support 8150, but only GA-FX990D3 supports 8350.

So we'll move FX8350 to neon, and FX 8150 to oxygen.

I first set the Multiplier/Voltage Change back to Automatic on oxygen in preparation for the FX 8150.
I then shut down the two nodes and unplugged them. And here's where it's not funny anymore: I tried to gently remove the heatsink but the heatsink together with the CPU popped off in spite of the lever being locked. On both nodes.

The CPU was solidly glued to the heatsink in both cases. I managed to get the FX 8350 off its heatsink by gently scraping off the excess thermal paste (dry and solid), but the FX 8150 was a real struggle. In the end I used the back of a knife as a lever (gently). Not ideal.

Anyway, cleaned the fx 8350 heatsink and cpu, applied new thermal paste and installed on the gigabyte fxa990-d3 motherboard. Turned on -- no lights on the mobo. Fans etc all working. Dammit. Googled and saw that bios only supports 8350 from version 9 (incorrect -- looked at wrong mobo, but didn't discover that until later)

So now I have a CPU that'd work, but which I can't install since it's stuck to the heatsink, and one which I can install but not use, since the bios is wrong.

When I put the old CPU (fx8150) back in neon it wouldn't boot either -- the fans were spinning but no motherboard lights went on (e.g. LAN). PCI cards lit up, but nothing on Mobo. Took out the CPU, put back in, took out, put back in.

Put the FX8350 back in the oxygen case. Didn't work either, although here the LAN mobo light was on. Still didn't work though -- no video output and couldn't connect via LAN. Great. Killed two working nodes in one afternoon.

Finally, somehow, I managed to get neon working again. Popped a USB stick in. Set up a USB stick with a 1 GB W95 partition as shown in this post:

Downloaded bios etc. Couldn't install -- BIOS check error. Googled again -- dammit. BIOS for wrong mother board. And the bios that's installed actually supports 8350.

OK, installed all the CPUs, and now they booted up. I must have installed the CPUs badly -- which doesn't speak well of my attention to detail.

For some reason the card that used to be eth0 now gets assigned as eth2 on oxygen. Checked udev -- doesn't make sense. Turned everything off and checked that the pci card (eth0) was seated properly. Booted -- now ok.

Not sure if I have to recompile all the computational code but I did anyway -- the only difference, according to the acml cpuid.exe util, is that 8350 supports FMA3 while 8150 doesn't. Both support SSE, SSE2, SSE3, AVX, FMA4.
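
The same comparison can be made without the acml util by looking at the flags line in /proc/cpuinfo, where Linux reports FMA3 as fma and FMA4 as fma4. A minimal sketch follows; the two flag strings in the example are shortened and made up for illustration, not copied from the actual nodes.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of
    /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def flag_differences(flags_a, flags_b):
    """Flags only in A, and flags only in B, each sorted."""
    return sorted(flags_a - flags_b), sorted(flags_b - flags_a)


if __name__ == "__main__":
    # Shortened, hypothetical flag lists for the two CPUs.
    fx8350 = cpu_flags("flags\t\t: sse sse2 pni avx fma4 fma\n")
    fx8150 = cpu_flags("flags\t\t: sse sse2 pni avx fma4\n")
    only_8350, only_8150 = flag_differences(fx8350, fx8150)
    print("only on 8350:", only_8350)
    print("only on 8150:", only_8150)
```

On the real nodes the input would be the contents of /proc/cpuinfo copied from each box. If the only difference is an extra flag on the newer CPU, code built on the older one should still run, but code built with CPU-specific optimisations on the newer one may not run on the older one, which is a reasonable argument for recompiling.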

After ca two months both boxes are stable in spite of being subjected to heavy work loads. The original crashes/reboots must have been due to the incompatible mobo/cpu combination.