[Bioclusters] Queue Problems / Dead Processes

Mon, 14 Apr 2003 16:44:41 +0100

Hi all,

We have a 16-node Linux cluster here on which we have installed Sun Grid Engine.  While testing Grid Engine, we have been running NCBI standalone Blast jobs against a local database.

The problem we are having is that when we submit a number of Blast jobs to Grid Engine, a queue that is in use will sometimes go into an alarm state.  The qstat -alarm command then displays the error message "no load value for threshold np_load_avg", and the job on that node continues to use 99.9% of processor time.

Attempts to kill the job process, or to shut down or restart the Grid Engine daemons, are ignored by the machine, and the only method we have found to reset the node is to power it down.
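
In case it helps anyone compare notes, here is a rough sketch of how the kernel-reported state of the stuck process could be checked on a node before powering it down. It only reads /proc/<pid>/stat, which Linux provides. The command name "blastall" and the helper names are assumptions made for illustration, not something taken from our setup, and the script does not explain why the process ignores signals.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    command = line[open_idx + 1:close_idx]
    # The first field after the command name is the one-letter state,
    # e.g. R (running), S (sleeping), D (uninterruptible), Z (zombie).
    state = line[close_idx + 1:].split()[0]
    return pid, command, state


def states_for_command(name, proc_root="/proc"):
    """List (pid, state) for every process whose command name equals name."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc, nothing to report
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            continue  # process exited or was unreadable while scanning
        if command == name:
            matches.append((pid, state))
    return matches


if __name__ == "__main__":
    # "blastall" is an assumed command name; substitute the real one.
    for pid, state in states_for_command("blastall"):
        print(pid, state)
```

Knowing whether the process shows as R, D or Z at the time it stops responding to kill might help narrow down whether the fault lies with the job or with the node itself.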

Has anyone seen this problem before?  We are not sure whether it is a problem with our installation of Grid Engine or with standalone Blast, so any suggestions or pointers to relevant documentation would be appreciated.

Thanks

David

David Speed
Programmer
Roslin Institute
Bioinformatics Group
Roslin, 
Midlothian, 
EH25 9PS, 
UK
Telephone: +44 (0)131 527 4200 (switchboard) 
Fax: +44 (0)131 440 0434
