Showing posts with label simple queue manager. Show all posts
Showing posts with label simple queue manager. Show all posts

27 March 2012

115. Very Simple Python Queue Manager

I suppose we can call it the VSPQM, which sounds a bit like a Roman initialism, akin to SPQR.

I've spent the past few days trying to get to grips with the Sun Gridengine (SGE) but have given up for now. While it seems capable, it's just overkill for my purposes, especially taking into account the difficulties in simply configuring it. It's a bit similar to my experience with OpenDX, a very capable plotting program, but which I couldn't make work to satisfaction in spite of being one of the lucky few in possession of the "Open DX -- Paths to Visualisation" book.

Long story short -- I wrote a small script in python. It
- reads a file, list, with the name of shell scripts
- the shell scripts, job1.sh..jobn.sh, are executed sequentially - when the execution of one script is finished, the next one is executed
- jobs can be added and removed from list during execution

It's a 'dumb' script -- it does not try to balance jobs across nodes or look for idle cpus/cores. It just executes one job after the other, and mark jobs as done after execution.

To test it:
create a file called list and put the following lines in it:
pi40.sh
pi400.sh
pi2000.sh
The scripts are the following:

pi40.sh
echo "pi to 40 decimals"
echo "scale=40; 4*a(1)" | bc -l -q
echo "done"
pi400.sh
echo "scale=400; 4*a(1)" | bc -l -q
pi200.sh
 echo "scale=2000; 4*a(1)" | bc -l -q
The python code for vspqm.py is below

I've aliased my vspqm (edit ~/.bashrc):
alias vspqm='/home/me/work/vspqm/vspqm.py'
Then sourced ~/.bashrc

Launch in the directory you keep your list file using
me@beryllium:~/work/vspqm/jobs$ vspqm list > log &
[1] 23925
me@beryllium:~/work/vspqm/jobs$ cat log
pi to 40 decimals
3.1415926535897932384626433832795028841968
done
3.141592653589793238462643383279502884197169399375105820974944592307\
[..]
3.141592653589793238462643383279502884197169399375105820974944592307\
81640628620899862803482534211706798214808651328230664709384460955058\
[..]

An nwchem example would be
list:
ac.sh
bn.sh
ac.sh:
cd acetone/
mpirun -n 4 nwchem ac.nw>ac.out
cd ../
bn.sh:
cd benzene/
mpirun -n 4 nwchem bn.nw>bn.out
cd ../


Our python queue manager (which we'll call vspqm.py and chmod +x to make executable) is below. Don't forget to change #!/usr/bin/python2.4 if necessary -- I use 2.4 on ROCKS and 2.7 on Debian testing/wheezy

#!/usr/bin/python2.4
# rudimentary queue manager. Handles a single node,
# submitting a series of jobs in sequence. use python v2.4-2.7
import os
import time
import sys
infile=sys.argv[1]
print "pyqm v 0.0.3"
def launchjob(job):
        i=0
        print "######"
        job=job.rstrip('\n')
     
        i=os.system("sh "+job)
        if i==0:
                print "Job successful"
        else:
                print "Job failed"
        print "######"
        return i
def remake_list(infile):
        qfile=open(infile,"w")
        bakfile=open(infile+".bak",'r')
        for i in bakfile:
                qfile.write(i)
        return 0
def rewind(infile):
        qfile=open(infile,"w")
        bakfile=open(infile+".bak",'r')
        for i in bakfile:
                qfile.write(i[1:])
        return 0
def get_next_job(infile):
        qfile=open(infile,"r")
        bakfile=open(infile+".bak",'w')
        lines=""
        job=""
        for line in qfile:
                if line[0]=="*":
                        print "Marked as done: ",line[1:]
                if line[0]!="*" and job=="":
                        print "Launching: ", line
                        job=line
                        line="*"+line
                lines+=line
        bakfile.write(lines)
        qfile.close
        bakfile.close
        return job
def main(infile):
        jobs=1
        while (jobs==1):
                newjob=get_next_job(infile)
                remake_list(infile)
                if newjob!="":
                        jobs=1
                        echojob=launchjob(newjob)
                else:
                        print "No more jobs found at "+str(time.asctime())    
                        jobs=0
        return 0

if __name__ == "__main__":
        main(infile)
        rewind(infile)