[Bioclusters] PBS abnormal after a failed node
Zhiliang Hu
zhu at iastate.edu
Mon Jun 9 14:32:09 EDT 2008
We have a situation where the PBS queue hangs after a node failure:
Last week we had a bad node that failed to NFS-mount the shared drives.
After numerous attempts we (with the help of the vendor) determined that it is either a bad motherboard or a bad disk on the node. While that is being fixed, I tried to make PBS queue jobs without this node by:
(1) > pbsnodes -o node006
which gives the error: Error marking node node006 - Unauthorized Request
(I was logged in as root, via 'su - root')
(2) Deleted the line for the node in:
/var/spool/torque/server_priv/nodes
and restarted PBS:
/etc/init.d/pbs stop
/etc/init.d/pbs start
which appeared to start all right.
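For illustration, the edit in step (2) amounts to filtering one entry out of the nodes file. A minimal Python sketch, assuming each entry starts with the host name followed by optional attributes (the `np=4` values and the function name `drop_node` are hypothetical, not taken from the actual cluster):

```python
def drop_node(nodes_text: str, hostname: str) -> str:
    """Return the nodes-file text without the entry for `hostname`."""
    kept = []
    for line in nodes_text.splitlines():
        fields = line.split()
        # Assumption: the first whitespace-separated field is the host name.
        # Blank lines and any other entries are kept unchanged.
        if fields and fields[0] == hostname:
            continue
        kept.append(line)
    return "".join(line + "\n" for line in kept)


# Hypothetical file contents, for illustration only.
sample = "node005 np=4\nnode006 np=4\nnode007 np=4\n"
print(drop_node(sample, "node006"), end="")
```

Matching on the first field rather than on a substring avoids accidentally removing a different host whose name merely contains "node006".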
Now the problem is that all jobs queued (with qsub) just sit in the queue and are never dispatched to any node. I tried deleting all the queued jobs and resubmitting them, but the result is the same. Any hint as to what the problem could be?
Thanks in advance,
Zhiliang
--
Zhi-Liang Hu (PhD)
Associate Scientist,
Assistant to NAGRP Bioinformatics Coordinators,
National Animal Genome Research Program,
Department of Animal Science,
Center for Integrated Animal Genomics,
Iowa State University
Tel: 901-759-0643 (H,O) 901-212-2820 (C)
Web: http://www.animalgenome.org
"Not everything that counts can be counted, and
not everything that can be counted counts."
"If you torture the data long enough,
it will confess." -- Ronald Coase