[Bioclusters] PBS abnormal after a failed node

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Fri Jun 13 08:55:15 EDT 2008


I'd be curious what the system says about the reason jobs are being held in the queue.  If you're using the Maui scheduler paired with Torque, the command "checkjob" is a really useful one (when run as root).  Torque has a similar command, "tracejob," though in my experience it doesn't give quite as much info.  Run `<command> $JOBID` and see if you get any helpful info about what's holding up your jobs.
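
Roughly, that could look like the sketch below. It assumes a Maui + Torque install with checkjob and tracejob on root's PATH; the helper name why_held and the job ID are made up for illustration and are not part of either tool.

```shell
# why_held: hypothetical helper (not a Torque or Maui command).
# Prints what the scheduler and the server logs say about one job.
# Intended to be run as root on the head node.
why_held() {
    jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: why_held JOBID" >&2
        return 2
    fi
    # Maui's view: the job's scheduling state and why it is not running.
    if command -v checkjob >/dev/null 2>&1; then
        checkjob "$jobid"
    fi
    # Torque's view: log entries recorded for the job.
    if command -v tracejob >/dev/null 2>&1; then
        tracejob "$jobid"
    fi
}
```

Calling it as `why_held 1234` (1234 being a placeholder job ID taken from qstat) runs whichever of the two commands is installed and skips the other.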
 
--Joe

________________________________

From: bioclusters-bounces at bioinformatics.org on behalf of Zhiliang Hu
Sent: Thu 6/12/2008 8:27 AM
To: HPC in Bioinformatics
Subject: Re: [Bioclusters] PBS abnormal after a failed node



Thanks Joe,

I did it as 'root':

[root at cluster ~]# qmgr -c "set node node006 state=offline"
qmgr obj=node006 svr=default: Unauthorized Request

[root at cluster ~]# qmgr
Max open servers: 4
Qmgr: set node node006 state=offline
qmgr obj=node006 svr=default: Unauthorized Request

Any idea what is causing the error?

Also, after I removed the node from /var/spool/torque/server_priv/nodes
and restarted pbs, 'pbsnodes' no longer lists it.  However, the queued jobs still don't get dispatched to any node.  I think we have a bigger problem ... will update later.

Thanks!
Zhiliang


At 08:57 AM 6/10/2008 -0500, Greenseid, Joseph M. wrote:
>did you try to mark the node offline in qmgr (qmgr -c "set node node006 state=offline")?  that's how i mark my nodes offline if there are problems.
>
>after you deleted the node from the nodes file, does pbsnodes still list it?  if so, torque may have the node's name stored somewhere else that you missed.
>
>--Joe
>
>________________________________
>
>From: bioclusters-bounces at bioinformatics.org on behalf of Zhiliang Hu
>Sent: Mon 6/9/2008 2:32 PM
>To: HPC in Bioinformatics
>Subject: [Bioclusters] PBS abnormal after a failed node
>
>
>
>We have a situation where the PBS queue hangs after a node failure:
>
>Last week we had a bad node that failed to NFS-mount the shared drives.
>After numerous efforts we (with help from the vendor) determined that it's either a bad motherboard or a bad node disk.  While that's being fixed, I tried to make PBS run jobs without this node, by
>(1) > pbsnodes -o node006
>which gives error: Error marking node node006 - Unauthorized Request
>(I was root, via 'su - root')
>
>(2) Deleted the line for the node in:
>  /var/spool/torque/server_priv/nodes
>and restarted PBS:
>  /etc/init.d/pbs stop
>  /etc/init.d/pbs start
>which appeared to start all right.
>
>Now the problem is -- all jobs queued (by qsub) hang there without being dispatched to any node.  I tried deleting all the queued jobs and resubmitting, but the result is the same.  Any hint as to what the problem could be?
>
>Thanks in advance,
>
>Zhiliang
>
>--
>Zhi-Liang Hu (PhD)
>Associate Scientist,
>Assistant to NAGRP Bioinformatics Coordinators,
>National Animal Genome Research Program,
>Department of Animal Science,
>Center for Integrated Animal Genomics,
>Iowa State University
>Tel: 901-759-0643 (H,O) 901-212-2820 (C)
>Web: http://www.animalgenome.org
>
>"Not everything that counts can be counted, and
>    not everything that can be counted counts."
>
>"If you torture the data long enough,
>it will confess."  -- Ronald Coase
>
>
>
>_______________________________________________
>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>http://www.bioinformatics.org/mailman/listinfo/bioclusters

