15 October 2015

624. Gaussian fails with "traps: l502.exe[12449] general protection ip:d75df7 sp:7f7de40dcce0 error:0 in l502.exe[400000+dc8000]" on i7-5820K

Update: 
* The system is rock solid with NWChem and ADF. Only G09 crashes.
* Gaussian has now released G09E, and the release notes say: "A Sandybridge/Haswell binary distribution is also available". It remains to be seen whether this new version solves the issue. I can't check, as I don't have access to that version.

Update 18 Oct 2015:
TL;DR version: G09D (both EM64T and AMD64) crashes somewhere between 30 minutes and 4 hours into a run. An NWChem job has so far run 6 days without crashing and is still going strong.

Original post:
This isn't much of a post yet. I'm mostly posting this so that people searching online will see that they aren't alone.


I just built a new node:
AU$559 Intel BX80648I75820K 6 Core i7-5820K 3.3Ghz 15MB LGA-2011-V3 (No Heatsink)
AU$407 Gigabyte X99-SLI Intel X99 S2011-3 8xDDR4/4xPCI-E/Intel GBLan/ATX Motherboard
AU$50 DeepCool FrostWin v2.0 CPU cooler
AU$155x2 Patriot 16G Kit (8Gx2) DDR4 2133 Desktop RAM
AU$185 Antec HCG-900 High Current Gamer Gaming PSU
AU$39 Gigabyte N210SL-1GI 1GB GT210 PCI-E VGA Card
AU$68 Seagate 3.5" Barracuda 1TB ST1000DM003 SATA3 7200RPM 64MB HDD (carbon)
AU$76 Antec GX500B-W Dominator Window USB3.0 Gaming Case without PSU

I've got an installation of up-to-date Jessie on it, with the following kernel: Debian 3.16.7-ckt11-1+deb8u5 (2015-10-09) x86_64 GNU/Linux.


When running G09D rev. 01 EM64T I keep getting random errors along these lines (these are collected over a couple of days and between restarts):
[100433.566789] traps: l703.exe[11236] general protection ip:df18ca sp:7fc96f595268 error:0 in l703.exe[400000+a46000]
[ 2587.899019] traps: l703.exe[3727] general protection ip:9c9757 sp:7fa6fc436ce0 error:0 in l703.exe[400000+a46000]
[26439.755347] traps: l502.exe[3235] general protection ip:ab8a55 sp:7ffe29504c10 error:0 in l502.exe[400000+dc8000]
[43030.457126] traps: l502.exe[427] general protection ip:11565a7 sp:7f2ec1fff268 error:0 in l502.exe[400000+dc8000]
[ 2587.899019] traps: l703.exe[3727] general protection ip:9c9757 sp:7fa6fc436ce0 error:0 in l703.exe[400000+a46000]
[37460.207608] traps: l703.exe[14649] general protection ip:a38ae0 sp:7f1a813cf8c0 error:0 in l703.exe[400000+a46000]
[ 8865.403861] traps: l502.exe[12449] general protection ip:d75df7 sp:7f7de40dcce0 error:0 in l502.exe[400000+dc8000]
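For anyone wanting to compare their own crashes against these, here is a minimal Python sketch that pulls the trap entries out of dmesg-style text and tallies them per Gaussian link. It assumes the bracketed pair after the executable name is the mapping's start address and length, so that ip minus start gives the offset of the faulting instruction inside the binary. The names parse_traps, summarize and SAMPLE are made up for this illustration; they are not part of Gaussian or any kernel tooling.

```python
import re
from collections import Counter

# Matches kernel "traps: ... general protection" lines in the format shown above.
TRAP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+traps:\s+(?P<exe>\S+)\[(?P<pid>\d+)\]\s+"
    r"general protection\s+ip:(?P<ip>[0-9a-f]+)\s+sp:(?P<sp>[0-9a-f]+)\s+"
    r"error:(?P<err>\d+)\s+in\s+(?P<image>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_traps(text):
    """Return one dict per trap entry found anywhere in `text`."""
    events = []
    for m in TRAP_RE.finditer(text):
        ip = int(m["ip"], 16)
        base = int(m["base"], 16)
        size = int(m["size"], 16)
        events.append({
            "uptime_s": float(m["uptime"]),   # seconds since boot
            "exe": m["exe"],
            "pid": int(m["pid"]),
            "ip": ip,
            # Assumption: base is the start of the mapping, so this is the
            # offset of the faulting instruction within that mapping.
            "offset": ip - base,
            "in_image": base <= ip < base + size,
        })
    return events


def summarize(events):
    """Count crashes per executable (i.e. per Gaussian link)."""
    return Counter(e["exe"] for e in events)


# Two of the entries from this post, used as sample input.
SAMPLE = (
    "[100433.566789] traps: l703.exe[11236] general protection ip:df18ca "
    "sp:7fc96f595268 error:0 in l703.exe[400000+a46000]\n"
    "[ 8865.403861] traps: l502.exe[12449] general protection ip:d75df7 "
    "sp:7f7de40dcce0 error:0 in l502.exe[400000+dc8000]\n"
)

if __name__ == "__main__":
    for e in parse_traps(SAMPLE):
        print(f'{e["exe"]} pid={e["pid"]} offset={e["offset"]:#x} in_image={e["in_image"]}')
    print(summarize(parse_traps(SAMPLE)))
```

Feeding it the output of dmesg saved to a file should work the same way. If the offsets cluster around a few addresses, that would point towards a particular code path; if they are scattered, as they appear to be here, that is more in line with a hardware or instruction-set problem, though that is only a rough heuristic.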


Sometimes the crashes happen after 30 minutes, sometimes after 3 hours. Most happen within four hours. I seem to remember that one ran up to 12 hours, but nothing's gone beyond that. Some short (1h 30 min) calculations have managed to run to completion.

I've checked each RAM stick with memtest86+ -- they are fine -- and they are installed in the slots recommended in the motherboard manual.

The CPU temperature stays below 40 degrees Celsius.

The hard drive is fine according to SMART.

I log everything every two minutes, so I can go back and look at what happened right before each crash, but there's nothing odd.


My current best three hypotheses are:

* There's an issue with G09D EM64T specifically and the new generation of LGA2011-v3 Intel i7 CPUs.

* There's an issue with any version of G09D and the new generation of LGA2011-v3 Intel i7 CPUs.

* There's an issue with my system which is independent of G09D.

To test, I'll:
* Do runs with G09D rev. 01 AMD64
                         Done -- this also crashed.

* Do runs with NWChem 6.5 (ifort, mkl)
                         Running -- 6 days so far without a crash!

* Update the BIOS (long shot)

* Remove CPU and check for bent pins (long, arduous shot)

I'll be posting updates...