01 March 2013

349. SGE: removed node while jobs were queued

The Problem
I remotely manage a cluster (running ROCKS with Sun Grid Engine) which I did not set up myself; the IT people at that uni did the initial configuration. For some reason they named the nodes
compute-0-0.local
compute-0-1.local
compute-0-2.local
compute-0-3.local
compute-0-6.local
compute-0-7.local

Recently a few extra disks were added to the system, so all jobs were suspended. However, while installing the disks the local IT peep decided to change the node names without consulting us. Now the nodes are called

compute-0-0.local
compute-0-1.local
compute-0-2.local
compute-0-3.local
compute-0-4.local
compute-0-5.local

instead. Suddenly there were two node queues (for the old compute-0-6 and compute-0-7) with jobs in them, but with no corresponding nodes. Trying to delete the jobs in those queues only led to this in the qstat -f output:

all.q@compute-0-5.local        BIP   0/8/8          9.12     lx26-amd64    
   5142 0.55500 submit__v3 me         r     02/27/2013 15:02:11     8        
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BIP   0/8/8          -NA-     lx26-amd64    auo
   5074 0.55500 submit__nb me         dr    02/02/2013 21:53:59     8      
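
The last column of the queue line is the queue state. The auo on all.q@compute-0-6.local stands for alarm, unknown (the host can't be reached) and orphaned (the queue instance is no longer part of the configuration, but is kept around because jobs are still attached to it). As a rough sketch, the orphaned queue instances could be picked out of that output with a bit of awk. The function name orphaned_queues is made up for this example, and it assumes the column layout shown above:

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): read `qstat -f` output on stdin
# and print the name of every queue instance whose state column contains "o".
orphaned_queues() {
    # Queue lines start with name@host; the state is the optional 6th field.
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /o/ { print $1 }'
}

# Intended use on the head node:
#   qstat -f | orphaned_queues
```

Job lines and the dashed separators are skipped because their first field has no @ in it, and healthy queue instances are skipped because they have no state column at all.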

The Solution
It wasn't immediately obvious how to fix this, but it turned out to be simple:
qconf -cq all.q@compute-0-6.local

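That cleans the stuck jobs out of the orphaned queue instance, which then disappears from the qstat output. Repeat for the other stale queue instance (all.q@compute-0-7.local in my case). That's all.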
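
Since there was more than one stale queue instance, the same command can be wrapped in a small loop. This is only a sketch: the function name clean_stale_queues and the DRY_RUN switch are my own inventions, and the queue name all.q and the host names are simply the ones from this cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around `qconf -cq` for several stale hosts.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# on the head node to actually run them.
DRY_RUN="${DRY_RUN:-1}"

clean_stale_queues() {
    for host in "$@"; do
        qi="all.q@${host}"              # queue instance name, e.g. all.q@compute-0-6.local
        if [ "$DRY_RUN" = "1" ]; then
            echo "qconf -cq ${qi}"      # show what would be run
        else
            qconf -cq "${qi}"           # clean the leftover jobs from this queue instance
        fi
    done
}

# The two old node names that no longer exist after the renaming:
clean_stale_queues compute-0-6.local compute-0-7.local
```

Keeping the dry run as the default makes it easy to check the host list before anything is touched. A quick qstat -f afterwards should confirm that the orphaned entries are gone.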
