[Bioclusters] PBS abnormal after a failed node

Zhiliang Hu zhu at iastate.edu
Mon Jun 9 14:32:09 EDT 2008


We have a situation where the PBS queue hangs after a node failure:

Last week we had a bad node that failed to NFS-mount the shared drives.
After numerous attempts we determined (with help from the vendor) that the cause is either a bad motherboard or a bad node disk.  While that is being fixed, I tried to have PBS queue jobs without this node by:
(1) > pbsnodes -o node006
which gives the error: Error marking node node006 - Unauthorized Request
(I was logged in as root, via 'su - root')

(2) Deleting the line for the node in:
  /var/spool/torque/server_priv/nodes
and restarting PBS:
  /etc/init.d/pbs stop
  /etc/init.d/pbs start
which appeared to start all right.
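
For reference, the edit in step (2) amounts to something like the sketch below. It assumes each node has a single line in the nodes file that begins with its hostname (for example "node006 np=4", where the np value is only an illustration). remove_node is a hypothetical helper name, not a Torque command.

```shell
#!/bin/sh
# Sketch only: drop one node's entry from a Torque nodes file, keeping a backup.
# remove_node is a hypothetical helper, not part of Torque/PBS.
remove_node() {
    nodes_file=$1
    node=$2
    # Keep a copy of the original file next to it.
    cp "$nodes_file" "$nodes_file.bak" || return 1
    # Write back every line whose first field is not the node name.
    awk -v n="$node" '$1 != n' "$nodes_file.bak" > "$nodes_file"
}

# Example, run against a scratch copy rather than the live file
# (/var/spool/torque/server_priv/nodes):
#   remove_node /tmp/nodes node006
```

The server still has to be stopped and started as above for the change to be read.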

Now the problem is that all jobs queued (by qsub) just hang there without being dispatched to any node. I tried deleting everything in the queue and resubmitting, but the result is the same.  Any hint as to what the problem could be?

Thanks in advance,

Zhiliang

--
Zhi-Liang Hu (PhD)
Associate Scientist,
Assistant to NAGRP Bioinformatics Coordinators,
National Animal Genome Research Program,
Department of Animal Science,
Center for Integrated Animal Genomics,
Iowa State University
Tel: 901-759-0643 (H,O) 901-212-2820 (C) 
Web: http://www.animalgenome.org

"Not everything that counts can be counted, and
    not everything that can be counted counts." 

"If you torture the data long enough, 
it will confess."  -- Ronald Coase




