Lindqvist -- a blog about Linux and Science. Mostly.

23 October 2013

524. Generating bonds, angles and dihedrals for a molecule for GROMACS

Generating bond, angle and dihedral parameters for GROMACS molecular dynamics simulations is a real PITA when it comes to reasonably large and complex molecules.

Since I am currently looking at the solvation dynamics of a series of isomers of a metal oxide cluster I quickly got tired of working out the bonds and bond constraints for my 101 atom clusters by hand.

So here's a python (2.7.x) script that generates parameters suitable for a GROMACS .top or .itp file.

Let's call the script genparam. You can call it as follows:

genparam molecule.xyz 0.21 3

where molecule.xyz is the molecule in xyz coordinates. 0.21 (nm) is used to generate bonds -- atoms that are 0.21 nm or closer to each other are bonded. The final number should be 1, 2 or 3. If it is 1, then only bonds (and constraints for bonds that are 0.098 nm or shorter -- such as O-H bonds. It's specific for my use, so you can edit the code.) are printed. If it is 2, then bonds and angles are printed. If it is 3, then bonds, angles and dihedrals are printed.

A simple and understandable example using methanol looks like this:

./genparams.py methanol.xyz 0.15 3

[bonds]
[bonds]
1 2 6 0.143 50000.00 ;C OH
1 3 6 0.108 50000.00 ;C H
1 4 6 0.108 50000.00 ;C H
1 5 6 0.108 50000.00 ;C H
[constraints]
2 6 2 0.096 ;OH HO

[ angles ]
2 1 3 1 109.487 50000.00 ;OH C H
2 1 4 1 109.487 50000.00 ;OH C H
2 1 5 1 109.466 50000.00 ;OH C H
1 2 6 1 109.578 50000.00 ;C OH HO
3 1 4 1 109.542 50000.00 ;H C H
3 1 5 1 109.534 50000.00 ;H C H
4 1 5 1 109.534 50000.00 ;H C H
[ dihedrals ]
6 2 1 3 1 60.01 50000.00 2 ;HO OH C H
6 2 1 4 1 60.01 50000.00 2 ;HO OH C H
6 2 1 5 1 180.00 50000.00 2 ;HO OH C H

where methanol.xyz looks like this:


6
Methanol
C      -0.000000010000       0.138569980000       0.355570700000   
OH     -0.000000010000       0.187935770000      -1.074466460000  
H       0.882876920000      -0.383123830000       0.697839450000  
H      -0.882876940000      -0.383123830000       0.697839450000  
H      -0.000000010000       1.145042790000       0.750208830000  
HO     -0.000000010000      -0.705300580000      -1.426986340000

Note that the bond length value may need to be tuned for each molecule -- e.g. 0.21 nm is fine for my metal oxides which have long M-O bonds, while 0.15 nm is better for my methanol.xyz. In the end, you'll probably have to do some manual editing to remove excess bonds, angles and dihedrals.

Also note that there's a switch (prevent_hhbonds) which prevent the formation of bonds between atoms that start with H -- this is to prevent bonds between e.g. H in CH3 units (177 pm). You can change this in the code in the __main__ section.

Finally, note that the function types, multiplicities and forces may need to be edited. I suggest not doing that in the code, but in the output as it will depend on each molecule.

This script won't do science for you, but it'll take care of some of the drudgery of providing geometric descriptors.

Anyway, having said all that, here's the code:


#!/usr/bin/python
import sys
from math import sqrt, pi
from itertools import permutations
from math import acos,atan2

# see table 5.5 (p 132, v 4.6.3) in the gromacs manual
# for the different function types


#from
#http://stackoverflow.com/questions/1984799/cross-product-of-2-different-\
#vectors-in-python
def crossproduct(a, b):
 c = [a[1]*b[2] - a[2]*b[1],
  a[2]*b[0] - a[0]*b[2],
  a[0]*b[1] - a[1]*b[0]]
 return c
#end of code snippet

# mostly from 
# http://www.emoticode.net/python/calculate-angle-between-two-vectors.html
def dotproduct(a,b):
 return sum([a[i]*b[i] for i in range(len(a))])

def veclength(a):
 length=sqrt(a[0]**2+a[1]**2+a[2]**2)
 return length

def ange(a,b,la,lb,angle_unit):
 dp=dotproduct(a,b)
 costheta=dp/(la*lb)
 angle=acos(costheta)
 if angle_unit=='deg':
  angle=angle*180/pi
 return angle
# end of code snippet

def diheder(a,b,c,angle_unit):
 dihedral=atan2(veclength(crossproduct(crossproduct(a,b),\
 crossproduct(b,c))), dotproduct(crossproduct(a,b),crossproduct(b,c)))
 if angle_unit=='deg':
  dihedral=dihedral*180/pi
 return dihedral

def readatoms(infile):
 positions=[]
 f=open(infile,'r')
 atomno=-2
 for line in f:
  atomno+=1
  if atomno >=1:
   position=filter(None,line.rstrip('\n').split(' '))
   if len(position)>3:
    positions+=[[position[0],int(atomno),\
    float(position[1]),float(position[2]),\
    float(position[3])]]
 return positions

def makebonds(positions,rcutoff,prevent_hhbonds):
 
 bonds=[]
 
 for firstatom in positions:
  for secondatom in positions:
   distance=round(sqrt((firstatom[2]-secondatom[2])**2\
+(firstatom[3]-secondatom[3])**2+(firstatom[4]-secondatom[4])**2)/10.0,3)
   xyz=[[firstatom[2],firstatom[3],firstatom[4]],\
[secondatom[2],secondatom[3],secondatom[4]]]
   
   if distance<=rcutoff and firstatom[1]!=secondatom[1]:
    if prevent_hhbonds and (firstatom[0][0:1]=='H' and\
     secondatom[0][0:1]=='H'):
      pass
    elif firstatom[1]<secondatom[1]:
     bonds+=[[firstatom[1],secondatom[1],\
distance,firstatom[0],secondatom[0],xyz[0],xyz[1]]]
    else:
     bonds+=[[secondatom[1],firstatom[1],\
distance,firstatom[0],secondatom[0],xyz[1],xyz[0]]]
 return bonds

def dedupe_bonds(bonds):
 
 newbonds=[]
 for olditem in bonds:
  dupe=False
  for newitem in newbonds:
   if newitem[0]==olditem[0] and newitem[1]==olditem[1]:
    dupe=True
    break;
  if dupe==False:
   newbonds+=[olditem]
 return(newbonds)

def genvec(a,b):
 vec=[b[0]-a[0],b[1]-a[1],b[2]-a[2]]
 return vec

def findangles(bonds,angle_unit):
 # for atoms 1,2,3 we can have the following situations
 # 1-2, 1-3
 # 1-2, 2-3
 # 1-3, 2-3
 # The indices are sorted so that the lower number is always first
 
 angles=[]
 for firstbond in bonds:
    
  for secondbond in bonds:
   if firstbond[0]==secondbond[0] and not (firstbond[1]==\
secondbond[1]): # 1-2, 1-3
    vec=[genvec(firstbond[6],firstbond[5])]
    vec+=[genvec(secondbond[6],secondbond[5])]
    angle=ange(vec[0],vec[1],firstbond[2]*10,\
secondbond[2]*10,angle_unit)
    angles+=[[firstbond[1],firstbond[0],\
secondbond[1],angle,firstbond[4],firstbond[3],secondbond[4],firstbond[6],\
firstbond[5],secondbond[6]]]
   
   if firstbond[0]==secondbond[1] and not (firstbond[1]==\
secondbond[1]): # 1-2, 3-1
    #this should never be relevant since we've sorted the atom numbers
    pass
   
   if firstbond[1]==secondbond[0] and not (firstbond[0]==\
secondbond[1]): # 1-2, 2-3
    vec=[genvec(firstbond[5],firstbond[6])]
    vec+=[genvec(secondbond[6],secondbond[5])]
    angle=ange(vec[0],vec[1],firstbond[2]*10,\
secondbond[2]*10,angle_unit)
    angles+=[[firstbond[0],firstbond[1],\
secondbond[1],angle,firstbond[3],firstbond[4],secondbond[4],firstbond[5],\
firstbond[6],secondbond[6]]]
   
   if firstbond[1]==secondbond[1] and not (firstbond[0]==\
secondbond[0]): # 1-3, 2-3
    vec=[genvec(firstbond[6],firstbond[5])]
    vec+=[genvec(secondbond[6],secondbond[5])]
    angle=ange(vec[0],vec[1],firstbond[2]*10,\
secondbond[2]*10,angle_unit)
    angles+=[[firstbond[0],firstbond[1],\
secondbond[0],angle,firstbond[3],firstbond[4],secondbond[3],firstbond[5],\
firstbond[6],secondbond[5]]]    
 return angles

def dedupe_angles(angles):
 dupe=False
 newangles=[]
 for item in angles:
  dupe=False
  for anotheritem in newangles:
   if item[0]==anotheritem[2] and (item[2]==anotheritem[0]\
 and item[1]==anotheritem[1]):
    dupe=True
    break
  if dupe==False:
   newangles+=[item]
 
 newerangles=[]
 dupe=False
 
 for item in newangles:
  dupe=False
  for anotheritem in newerangles:
   if item[2]==anotheritem[2] and (item[0]==anotheritem[1]\
 and item[1]==anotheritem[0]):
    dupe_item=anotheritem
    dupe=True
    break
  
  if dupe==False:
   newerangles+=[item]
  elif dupe==True:
   if dupe_item[3]>item[3]:
    pass;
   else:
    newerangles[len(newerangles)-1]=item

 newestangles=[]
 dupe=False
 
 for item in newerangles:
  dupe=False
  for anotheritem in newestangles:
   if (sorted(item[0:3]) == sorted (anotheritem[0:3])):
    dupe_item=anotheritem
    dupe=True
    break
  
  if dupe==False:
   newestangles+=[item]
  elif dupe==True:
   if dupe_item[3]>item[3]:
    pass;
   else:
    newestangles[len(newestangles)-1]=item
 return newestangles

def finddihedrals(angles,bonds,angle_unit):
 dihedrals=[]
 for item in angles:
  for anotheritem in bonds:
   if item[2]==anotheritem[0]:
    vec=[genvec(item[7],item[8])]
    vec+=[genvec(item[8],item[9])]
    vec+=[genvec(item[9],anotheritem[6])]
    dihedral=diheder(vec[0],vec[1],vec[2],\
angle_unit)
    dihedrals+=[ [ item[0],item[1],item[2],\
anotheritem[1], dihedral, item[4],item[5],item[6], anotheritem[4] ] ]

   if item[0]==anotheritem[0] and not item[1]==anotheritem[1]:
    vec=[genvec(anotheritem[6],item[7])]
    vec+=[genvec(item[7],item[8])]
    vec+=[genvec(item[8],item[9])]
    dihedral=diheder(vec[0],vec[1],vec[2],\
angle_unit)
    dihedrals+=[ [ anotheritem[1],item[0],item[1],\
item[2], dihedral, anotheritem[4],item[4],item[5], item[6] ] ]
 return dihedrals

def dedup_dihedrals(dihedrals):
 newdihedrals=[]
 
 for item in dihedrals:
  dupe=False
  for anotheritem in newdihedrals:
   rev=anotheritem[0:4]
   rev.reverse()
   if item[0:4] == rev:
    dupe=True
  if not dupe:
   newdihedrals+=[item]
 return newdihedrals

def print_bonds(bonds,func):
 constraints=""
 funcconstr='2'
 print '[bonds]'
 
 for item in bonds:
  if item[2]<=0.098:
   constraints+=(str(item[0])+'\t'+str(item[1])+'\t'+\
funcconstr+'\t'+str("%1.3f"% item[2])+'\t;'+str(item[3])+'\t'+str(item[4])+'\n')
  else: 
   print str(item[0])+'\t'+str(item[1])+'\t'+func+'\t'+\
str("%1.3f"% item[2])+'\t'+'50000.00'+'\t;'+str(item[3])+'\t'+str(item[4])
 print '[constraints]'
 print constraints
 return 0

def print_angles(angles,func):
 print '[ angles ]'
 for item in angles:
  print str(item[0])+'\t'+str(item[1])+'\t'+str(item[2])+'\t'+\
func+'\t'+str("%3.3f" % item[3])+'\t'+'50000.00'+'\t;'\
  +str(item[4])+'\t'+str(item[5])+'\t'+str(item[6])
 return 0

def print_dihedrals(dihedrals,func):
 print "[ dihedrals ]"
 force='50000.00'
 mult='2'
 for item in dihedrals:
  print str(item[0])+'\t'+str(item[1])+'\t'+str(item[2])+'\t'+\
str(item[3])+\
  '\t'+func+'\t'+str("%3.2f" % item[4])+'\t'+force+'\t'+mult+\
  '\t;'+str(item[5])+'\t'+str(item[6])+\
  '\t'+str(item[7])+'\t'+str(item[8])
 return 0

if __name__ == "__main__":
 infile=sys.argv[1]
 rcutoff=float(sys.argv[2]) # in nm
 itemstoprint=int(sys.argv[3])
 
 angle_unit='deg' #{'rad'|'deg'}
 prevent_hhbonds=True # False is safer -- it prevents bonds between
 # atoms whose names start with H

 positions=readatoms(infile)

 bonds=makebonds(positions,rcutoff,prevent_hhbonds)
 bonds=dedupe_bonds(bonds) 
 
 print_bonds(bonds,'6')
 
 angles=findangles(bonds,angle_unit)
 angles=dedupe_angles(angles)
 
 if itemstoprint>=2:
  print_angles(angles,'1')

 dihedrals=finddihedrals(angles,bonds,angle_unit)
 dihedrals=dedup_dihedrals(dihedrals)
 if itemstoprint>=3:
  print_dihedrals(dihedrals,'1')

21 October 2013

523. Random Reboots -- troubleshooting. Diagnosed: incompatible motherboard.

Update 8 Jan 2014: I've been putting the FX8350 through its paces together with the other mobo and it's completely stable. The FX8150 box is also stable. Note that I thought I had a crash a few days after making the swap -- I have not had any issues since whatsoever in spite of running very heavy jobs. Either way, it should remind me to check whether a mobo is compatible with a CPU before making my purchase in the future.

Update 18 Nov 2013: I swapped the CPUs between to boxes, so that I was now using a mobo that officially supported FX8350. Only the CPU moved, nothing else.

Update 5 Nov 2013: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to look at the list over supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:
http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

Also see this thread: http://www.techpowerup.com/forums/showthread.php?t=184061
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:


Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs...

and


my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp.

and it seemed that low (CPU) voltages precipitated crashes.

Update 4 Nov 2013: swapped CPUs with a different box. Will test in a couple of days.
Update 4 Nov 2013: Changing the multiplier back to 20 from 17 (but keeping voltage stable) caused a crash -- this time in a record 13 minutes.
Update 4 Nov 2013: System stable with new voltage/multiplier settings.
Update 27 Oct 2013: I'm currently looking at BIOS voltage.
---

I recently built a new node (http://verahill.blogspot.com.au/2013/10/520-new-node-amd-fx-835032-gb-ram990-fx.html). While that's always exciting, it quickly left a sour taste due to random reboots when running long (days) computational jobs.

Note that the motherboard (asrock 990 fx extreme3) does not officially support FX8350, which is something that I shouldn't have ignored. I might eventually move my fx 8350 to my gigabyte 990 fxa and put my 8150 on my asrock instead.

Short description
* Both Gaussian 09 and NWChem 6.3 cause the reboots.
* I've set up a cron job that logs a lot of data every minute and there's nothing odd in there. No overheating, the wattage seems ok etc.
* Running only smaller jobs (even though they are running non-stop) which take less than a day, the node has stayed up for 11 days now.
* I have never seen it reboot, so I don't know if there's any beeping etc.
* There's nothing in the logs, and nothing in the output from tailing dmesg using a cronjob.
* The only real output is in last:


reboot   system boot  3.11.5           Fri Oct 18 14:08 - 11:57 (2+21:48)   
reboot   system boot  3.8.10           Fri Oct 18 13:23 - 14:07  (00:44)    
reboot   system boot  3.8.10           Tue Oct  8 10:46 - 13:18 (10+02:31)  
me       tty1                          Mon Oct  7 13:25 - crash  (21:21)    
me       pts/0        beryllium        Mon Oct  7 12:29 - crash  (22:17)    
reboot   system boot  3.8.10           Mon Oct  7 12:27 - 13:18 (11+00:51)  
me       pts/0        beryllium        Sat Oct  5 20:59 - crash (1+14:27)   
reboot   system boot  3.8.10           Sat Oct  5 20:58 - 13:18 (12+15:19)
reboot   system boot  3.8.10           Tue Oct  1 14:09 - 11:54 (19+20:45)  
me       pts/0        beryllium        Sun Sep 29 11:39 - crash (2+02:29)   
reboot   system boot  3.8.10           Sun Sep 29 11:39 - 11:54 (21+23:14)  
me       pts/0        beryllium        Mon Sep 23 11:09 - crash (6+00:30)   
reboot   system boot  3.8.10           Mon Sep 23 11:07 - 11:54 (27+23:46)  
me       pts/0        beryllium        Fri Sep 20 12:59 - crash (2+22:08)   
reboot   system boot  3.8.0            Fri Sep 20 12:50 - 11:54 (30+22:04)  
reboot   system boot  3.8.0            Fri Sep 20 12:49 - 12:49  (00:00)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 11:52 - 12:48  (00:56)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 06:29 - 08:08  (01:38)    
me       pts/0        beryllium        Wed Sep 18 14:51 - crash (1+15:38)   
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 14:40 - 08:08 (1+17:27)   
me       pts/8        beryllium        Wed Sep 18 09:02 - crash  (05:38)    
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 01:51 - 08:08 (2+06:17)   
me       pts/0        beryllium        Tue Sep 17 18:11 - crash  (07:40)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 18:08 - 08:08 (2+14:00)   
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 17:55 - 17:56  (00:01)    
me       pts/0        beryllium        Tue Sep 17 13:12 - crash  (04:43)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 12:23 - 17:56  (05:33)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 20:05 - 12:17  (16:12)    
me       pts/0        beryllium        Mon Sep 16 16:03 - crash  (04:02)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:31 - 12:17  (20:46)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:20 - 15:30  (00:09)

Looking at the output it does seems that the crashes are happening less frequently. Part of the reason for the is probably a change in how I use the node, but I don't think that explains everything, and I don't like the idea of a piece of electronic hardware 'fixing' itself.

Another thing that puzzles me is the repeating numbers -- e.g. 08:08, 11:54 and 13:18 -- in the ouput. There's no cronjob or anything like that running at any of those times.

Other things that have changed are the kernel versions and that I removed the UPS around the 1st of October (the UPS died, which is a bad sign, power-wise. I should probably also look into the warranty on it).

The chief challenge here is that I can't reliable trigger the reboots, which makes it difficult to see whether I've solved the issue or not.

On an older node I could trigger errors by compiling the kernel, but not using any other technique. On that node the RAM was faulty: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

==> <== indicates what I'm currently doing.

0. RAM
The most common reason for unstable nodes if faulty RAM, so if your computer is behaving strangely and randomly crashes, always suspect the RAM first. It's a more likely culprit than software, and the most likely of the hardware components to be at fault.

I ran a full cycle of memtest86+ which took some 4-5 hours if I remember correctly. No errors shown. Note that if memtest86+ does not show any errors it is no guarantee that the RAM is fine. However, the likelihood that it is indeed corrupt goes down.

1. Overheating
The second thing to investigate when something like this happens, in particular if it's associated with prolonged and heavy use, is the possibility of overheating. You can install sensors-lm and configure it to track various temperatures. Note that these aren't always correct.

At any rate, I've logged the output from sensors every minute and there's nothing indicating that the temperature is rising prior to a crash.

--------------------------------------------------
Intermission -- trying to trigger a reboot

* It's stable while compiling a kernel (in my case 11.5). Not surprising as it is intense, but short.

* Prime95


Number of torture test threads to run (8): 
Choose a type of torture test to run.
1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much).
2 = In-place large FFTs (maximum heat and power consumption, some RAM 
tested).
3 = Blend (tests some of everything, lots of RAM tested).
11,12,13 = Allows you to fine tune the above three selections.
Blend is the default.  NOTE: if you fail the blend test, but can pass the small FFT test then your problem is likely bad memory or a bad memory controller.
Type of torture test to run (3): 2

Accept the answers above? (Y): y

I ran this for three days and the node was stable.
I then ran test type 3 for 30 hours and it too was stable.

I accidentally ran the tests above without mounting ~/oxygen to the head node using NFS. Shouldn't matter, but in order to troubleshoot it's better to keep everything as constant as possible.

* PES scan
I think I saw reboots triggered using all sort of jobs, but due to their long run times, I saw it more consistently with PES scans.

So I ran a long PES scan in nwchem 6.3, and lo, it crashed after just under two days running this job (and having been up for 6 day and 30 minutes). It's not quite the quick, efficient way of crashing the computer that I was looking for, but it will do.

Note that this crash didn't lead to a reboot, but simply to the computer locking up and become unresponsive. No screen, no network, no harddrive activity.

The only errors I can spot in the dmesg are two warnings about 'perf samples too long (2545>2500)' at 30 minutes and at 14 hours of uptime, i.e. well before I started the PES job.


me       pts/1        beryllium        Fri Oct 25 08:59   still logged in   
me       pts/0        beryllium        Fri Oct 25 08:58   still logged in   
me       pts/0        beryllium        Fri Oct 25 08:57 - 08:58  (00:01)    
me       tty1                          Fri Oct 25 08:52   still logged in   
reboot   system boot  3.11.5           Fri Oct 25 08:52 - 08:59  (00:07)    
me       pts/4        beryllium        Sun Oct 20 17:32 - 17:32  (00:00)

--------------------------------------------------

3. BIOS version
The BIOS version (1.5) at the time of purchase of the motherboard was the same version as the BIOS available at the motherboard manufacturer's site. Since then an update has been released, as pointed out by a commentator. Nothing in the description of the update indicates that it would fix the issue I'm having, but upgrading the bios is just one of those things that should be tried.

mkdir ~/tmp/bios -p
cd ~/tmp/bios
wget 'ftp://download.asrock.com/bios/AM3+/990FX%20Extreme3(1.70)ROM.zip'
unzip 990FX\ Extreme3\(1.70\)ROM.zip 

Archive:  990FX Extreme3(1.70)ROM.zip
  inflating: 990EX31.70

I copied the file to a USB stick formatted with FAT32 since my guess is that the uefi might not recognise extX. Booting with the USB stick plugged in and hitting F6 ('Instant flash') lead to the UEFI finding the flash file. Now click on the file name -- don't click on the buttons (e.g. 'Configuration' and 'Refresh device'). During the bios update the usual goes: don't power off during the update, and make sure that your usb stick isn't old and damaged.

Once the update is done you get a message saying 'Programming success, press Enter to reboot system'.

I reran the PES scan, and I had a crash after less than two days (ca 40 hours). This crash caused a reboot.

me       tty1                          Sun Oct 27 12:00   still logged in   
me       pts/1        beryllium        Sun Oct 27 11:55   still logged in   
me       pts/0        beryllium        Sun Oct 27 11:54   still logged in   
reboot   system boot  3.11.5           Sun Oct 27 11:52 - 12:14  (00:22)    
me       tty1                          Fri Oct 25 09:36 - crash (2+02:16)   
me       pts/0        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
me       pts/2        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
reboot   system boot  3.11.5           Fri Oct 25 09:17 - 12:14 (2+02:57)

.

4. BIOS settings -- voltage
The older I get, the more comfortable I become with admitting when I don't really know what I'm doing. This -- the tweaking of voltage settings -- is one of those areas where I definitely lack expertise.

Luckily, I've got some advice from a commentator: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

Anyway, I've always treated electricity a bit like magic (I always had issues with electrochemistry as a youngster, which is why I'm forcing myself to teach it these days), and the older I get the more I wish I had done chemical engineering rather than chemistry. Perhaps we want to benefit society more with age, rather than just benefit from it?

Anyway, here are some literal screenshots -- taken with my trusty old phone:

Main

I don't see anything odd here.

First half of OC Tweaker

The alternative to 'Manual' in the OC Mode is 'CPU OC Mode', which sounds like something I want to avoid. Anyway, what bothers me is that there's no 'OC OFF' button. I don't know if the BIOS is doing something odd.

More OC Tweaker

HW Monitor. The Vcore and +12v lines fluctuate by about 5-10 mv.

Changes:
* turn off Cool'n'Quiet.
* Change Multiplier/Voltage Change from Automatic to Manual
** Set CPU Freq multiplier to 17.0x (3400 MHz) instead of 20x (4.0 GHZ) under OC Tweaker
** Set CPU voltage to 1.35 instead of stock 1.3750 V, (OC Tweaker/CPU Voltage)

I've managed to run a full PES scan -- which hasn't worked before -- and re-ran it without issue. Looks like the issue was solved.

I then set the multiplier back to 20 (4000 MHz) while keeping the CPU voltage at 1.35 V, and relaunched the PES scan. Almost immediate crash:.

me       pts/0        beryllium        Mon Nov  4 13:48   still logged in   
reboot   system boot  3.11.5           Mon Nov  4 13:48 - 14:30  (00:42)    
me       pts/0        beryllium        Mon Nov  4 13:35 - crash  (00:13)    
me       tty1                          Mon Nov  4 13:34 - 13:34  (00:00)    
reboot   system boot  3.11.5           Mon Nov  4 13:34 - 14:30  (00:56)

Not sure what to do now. Did it crash faster because the CPU voltage is low while the multiplier is high? Can it be solved by increasing -- rather than decreasing -- the voltage?

Re-running the PES job again gave a more interesting result -- the job crashed, but not the node (note that the job is exactly the same each time, so it's not a matter of the input):

 Grid integrated density:     191.999968853836
 Requested integration accuracy:   0.10E-06
 d= 0,ls=0.0,diis    13  -1092.4755719318 -1.63D-05  8.49D-06  2.73D-05   719.5
 Grid integrated density:     191.999968846937
 Requested integration accuracy:   0.10E-06
  Singularity in Pulay matrix. Error and Fock matrices removed. 
 PeIGS error from dstebz 4 ...trying dsterf 
 error from dsterf 516  
 error from dsterf 516  
 Error in pstein5. me = 0 argument 10 has an illegal value. 
 Error in pstein5. me = 1 argument 10 has an illegal value. 
 Error in pstein5. me = 2 argument 10 has an illegal value. 
  ME =                     2  Exiting via  
 Error in pstein5. me = 4 argument 10 has an illegal value. 
  ME =                     4  Exiting via  
 Error in pstein5. me = 5 argument 10 has an illegal value. 
  ME =                     5  Exiting via  
5:5: peigs error: mxpend:: -1
 Error in pstein5. me = 6 argument 10 has an illegal value. 
  ME =                     6  Exiting via  
 Error in pstein5. me = 7 argument 10 has an illegal value. 
  ME =                     7  Exiting via  
7:7: peigs error: mxpend:: -1
(rank:7 hostname:oxygen pid:3741):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
  ME =                     0

And, while the frequency affects the thermal output, a thermal issue shouldn't lead to garbled stuff. This is looking more and more like what the poster wazoo42 recounted: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

5. Swapping the CPU to an approved MOBO.
This is a bit of a cop-out. Most people don't have multiple computers and which happen to have compatible hardware. On the other hand, I have a job to do.

To be able to continue to follow this post you'll need to know this:
There are two nodes (cpu, mobo, psu):
* oxygen: FX 8350, asrock 990fx extreme3, corsair GS700.
* neon: FX 8150, gigabyte GA-FX990D3, corsair GS800.
Both nodes have otherwise similar hardware: 32 gb ram, GT210 nvidia, one PCI network card. Both motherboards support 8150, but only GA-FX990D3 supports 8350.

So we'll move FX8350 to neon, and FX 8150 to oxygen.

I first set the Multiplier/Voltage Change back to Automatic on oxygen in preparation for the FX 8150.
I then shut down the two nodes and unplugged them. And here's where it's not funny anymore: I tried to gently remove the heatsink but the heatsink together with the CPU popped off in spite of the lever being locked. On both nodes.

The CPU was solidly glued to the heatsink in both cases. I managed to get the FX 8350 off its heatsink by gently scraping off the excess thermal paste (dry and solid), but the FX 8150 was a real struggle. In the end I used the back of a knife as a lever (gently). Not ideal.

Anyway, cleaned the fx 8350 heatsink and cpu, applied new thermal paste and installed on the gigabyte fxa990-d3 motherboard. Turned on -- no lights on the mobo. Fans etc all working. Dammit. Googled and saw that bios only supports 8350 from version 9 (incorrect -- looked at wrong mobo, but didn't discover that until later)

So now I have a CPU that'd work, but which I can't install since it's stuck to the heatsink, and one which I can install but not use, since the bios is wrong.

When I put the old CPU (fx8150) back in neon it wouldn't boot either -- the fans were spinning but no motherboard lights went on (e.g. LAN). PCI cards lit up, but nothing on Mobo. Took out the CPU, put back in, took out, put back in.

Put the FX8350 back in the oxygen case. Didn't work either, although here the LAN mobo light was on. Still didn't work though -- no video output and couldn't connect via LAN. Great. Killed two working nodes in on afternoon.

Finally, somehow, I managed to get neon working again. Popped a USB stick in. Set up a USB stick with a 1 GB W95 partition as shown in this post: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

Downloaded bios etc. Couldn't install -- BIOS check error. Googled again -- dammit. BIOS for wrong mother board. And the bios that's installed actually supports 8350.

OK, installed all the CPUs, and now they booted up. I must have installed the CPUs badly -- which doesn't speak well of my attention to detail.

For some reason the card that used to be eth0 now gets assigned as eth2 on oxygen. Checked udev -- doesn't make sense. Turned everything off and checked that the pci card (eth0) was seated properly. Booted -- now ok.

Not sure if I have to recompile all the computational code but I did anyway -- the only difference, according to the acml cpuid.exe util, is that 8350 supports FMA3 while 8150 doesn't. Both support SSE, SSE2, SSE3, AVX, FMA4.

[edit]
After ca two months both boxes are stable in spite of being subjected to heavy work loads. The reason for the crashes/reboots originally must have been due to incompatible mobo/cpu.

18 October 2013

522. Briefly: nvidia installer holding up dpkg

I'm normally using smxi to handle my graphics drivers on debian.

Because I'm working on a QM/MM problem at the moment I wanted to use gromacs (4.6) to generate a solvated molecule, but I kept getting errors about missing libcudart4.so. So I figured I'd install it using apt-get install libcudart4

Everything went fine until I got the following prompt:

Selecting 'No' stops everything, and selecting Yes hangs it.

You can't install or remove any other programs until this has been resolved, as everything ends with

E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

Running

sudo dpkg --configure -a

takes you back to the screenshot above.

The fix:
Run

nvidia-installer --uninstall

At this point you get

You can now select No and go through everything:

After this, you can run

sudo dpkg --configure -a

and now your system is working again so that packages can be installed and removed.

Pages