[Bioclusters] How can I make blast job running short time on Gridengine

Fri, 22 Nov 2002 14:25:33 -0600


Grace,
 -->"But this should be done through a lot of work, and make some scripts,
right?"

Yes, unfortunately. The process is simple, but the implementation does take
a little work. We (RLX Technologies) actually put together a package called
the RLX BLAST Cluster Solution, which runs on top of LSF and already does
this for you, with plenty of configuration options built in. You use the
same command line as you would for NCBI BLAST, and the job runs on multiple
nodes depending on the resources that are available.

    -->"I have no cluster in hand; I just installed SGE on two separate Unix
machines. So far, I do not think SGE can shorten a given job's running time.
It just distributes the job to an idle host. Of course, if one host is too
busy, the job does not need to wait, which avoids jobs piling up on some
hosts. But suppose all the hosts are idle: the submitted job still runs on
one host. So I want to know whether there is a big efficiency difference
between a collection of computers and a real cluster."

It's sometimes hard to say what a "real cluster" is, but most of the time it
is just a collection of computers with something like SGE on it. These
clusters can be made to do things quite a bit faster than running a job on a
single node, but that comes back to how you use them. If you are not doing
any scripting that partitions jobs, you will not see a real speed increase
in your BLAST runs. SGE alone will not shorten a job's run time. It is only
a toolbox that allows multiple computers to be used together, on top of
which creative scripting can make jobs run faster.
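
As a rough illustration of why partitioning is what matters, here is a
back-of-the-envelope sketch in Python. The function name, the overhead
parameter, and the numbers are assumptions made up for illustration, not
measurements, and the model assumes the work splits evenly across nodes.

```python
def estimated_wall_time(single_node_minutes, parts, overhead_minutes=0.0):
    """Rough wall-clock estimate for a job split into equal parts.

    Assumes the parts run at the same time on separate nodes and that
    splitting, data transfer, and merging add a fixed overhead.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    return single_node_minutes / parts + overhead_minutes


# One unsplit job: the scheduler picks a host, but the run time is unchanged.
print(estimated_wall_time(60, parts=1))                      # 60.0
# The same job split in two, with 5 minutes of split/merge overhead.
print(estimated_wall_time(60, parts=2, overhead_minutes=5))  # 35.0
```

With a single part the estimate is simply the single-node time, which is
the situation described in the question.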
--
Mike McCardle 
Systems Software Engineer 
RLX Technologies, Inc. 
mike.mccardle@rlxtechnologies.com
http://www.rlxtechnologies.com

-----Original Message-----
From: bioinfo Gu [mailto:bioinfowistar@yahoo.com]
Sent: Thursday, November 21, 2002 2:00 PM
To: bioclusters@bioinformatics.org
Subject: RE: [Bioclusters] How can I make blast job running short time on
Gridengine

Hi Mike, 


Hi Grace,
    A popular method for running a job on two different machines at once is
to divide the input into parts, send each part to a different machine, run
the program on each machine using that machine's segment of the input, and
then combine the results. This is usually called an embarrassingly parallel
method, where each job has 5 parts:
1) pre-processing (preparing data) on the submission host
2) data transfer to the nodes
3) processing on the execution nodes
4) data transfer back to the submission node, queuing of the results
5) post-processing (combining the results) on the submission host
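
One possible way to drive steps 3 and 5 with Grid Engine is an array job
followed by a dependent merge job. The sketch below only builds the two
qsub command lines and does not submit anything. The script names
blast_part.sh and merge.sh and the job name are hypothetical, and it
assumes a file system shared by both hosts so that steps 2 and 4 need no
explicit copying. The -t, -N, -cwd and -hold_jid options should be checked
against the qsub man page of your Grid Engine version, in particular
whether -hold_jid accepts a job name there.

```python
def build_submissions(n_parts, worker_script, merge_script, name="blastpart"):
    """Return two qsub command lines: an array job and a dependent merge job."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Array job: one task per input part; each task can read $SGE_TASK_ID
    # to decide which part of the input it should process.
    array_job = ["qsub", "-N", name, "-t", f"1-{n_parts}", "-cwd", worker_script]
    # Merge job: held until the array job has finished.
    merge_job = ["qsub", "-N", name + "_merge", "-hold_jid", name,
                 "-cwd", merge_script]
    return array_job, merge_job


array_job, merge_job = build_submissions(2, "blast_part.sh", "merge.sh")
print(" ".join(array_job))  # qsub -N blastpart -t 1-2 -cwd blast_part.sh
print(" ".join(merge_job))  # qsub -N blastpart_merge -hold_jid blastpart -cwd merge.sh
```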

But this should be done through a lot of work, and make some scripts, right?

So, for the case where you are using BLAST as the application, the database
(or query) can be split on sequence boundaries and sent to each of the
nodes for BLASTing, with the result files sent back to the submission host
and combined to get the final result. This brings the run time down to
roughly 1/N of the single-machine time, depending on the efficiency of your
configuration, where N is the number of nodes in your cluster. This would
be one way of getting roughly 2x the performance out of your two-machine
SGE cluster compared with a single machine.
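
Splitting on sequence boundaries could look something like the following
Python sketch for a FASTA query file. The function name and the round-robin
assignment of records to parts are illustrative choices, not part of any
BLAST tool, and the input here is a tiny made-up example.

```python
def split_fasta(text, n_parts):
    """Split FASTA text into n_parts chunks, never cutting inside a record."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])       # a header line starts a new record
        elif records and line.strip():
            records[-1].append(line)     # sequence line of the current record
    chunks = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        chunks[i % n_parts].extend(record)   # deal records out round-robin
    return ["\n".join(chunk) + "\n" if chunk else "" for chunk in chunks]


example = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n>seq3\nATAT\n"
for part in split_fasta(example, 2):
    print(part)
```

Splitting the query is usually the simpler case: each part is searched
against the full database, so the per-part reports can generally just be
concatenated. Splitting the database instead changes the database size that
BLAST uses for its statistics, so the merged results need more care.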

    Another nice feature of some Distributed Resource Management (DRM) tools
(LSF, SGE, ...) on clusters is that they do some level of load balancing.
So, if you need to run a job and want it to have a fair chance of running
alongside whatever everyone else is doing, the DRM figures out which
machine will give your job the best service. One nice feature of some
schedulers in DRM packages (and they are not all equal!!) is that each
user, group, job, ... can have a priority placed on it that will actually
preempt other jobs, reorder the queue, ... to get the right resources into
the hands of the people who need them most.

I have no cluster in hand; I just installed SGE on two separate Unix
machines. So far, I do not think SGE can shorten a given job's running time.
It just distributes the job to an idle host. Of course, if one host is too
busy, the job does not need to wait, which avoids jobs piling up on some
hosts. But suppose all the hosts are idle: the submitted job still runs on
one host. So I want to know whether there is a big efficiency difference
between a collection of computers and a real cluster.

I look forward to your suggestions.

Grace

Combining (embarrassingly) parallel job execution with the
scheduling/load-balancing features of DRM tools is really the key to
achieving the efficiency in a cluster that makes it a valuable resource for
doing things like BLAST.
________________________________________ 

Mike McCardle 
Systems Software Engineer 
RLX Technologies, Inc. 
mike.mccardle@rlxtechnologies.com
http://www.rlxtechnologies.com

From: bioinfo Gu [mailto:bioinfowistar@yahoo.com]
Sent: Thursday, November 21, 2002 10:08 AM
To: bioclusters@bioinformatics.org
Subject: [Bioclusters] How can I make blast job running short time on
Gridengine

Hi all,

I have two machines with SGE installed. athena is the master and an
execution host; apollo is an execution host only. When I submit a job from
the master host with qsub, the job is dispatched to one queue (one host)
and runs on that machine for the whole process. I cannot see how Grid
Engine can save execution time when I launch a BLAST job on it. How can I
reduce BLAST running time on Grid Engine? Do I have to use a Parallel
Environment?

Also, how can I set up an environment variable for a specific execution
host in a batch job script? For example:

On one execution host I need BLASTDB to point to path1; on the second
execution host I want to set BLASTDB to path2. How can I do that?
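
One way this could be handled inside the job itself is to look at the host
the job landed on and set BLASTDB before BLAST is started. The Python
sketch below assumes the job script is (or calls) a Python wrapper. The
host names come from this post, but which host gets path1 and which gets
path2 is an assumption, and path1/path2 are the placeholders used above,
not real paths.

```python
import os
import socket

# Hypothetical mapping from short host name to BLASTDB location.
BLASTDB_BY_HOST = {"athena": "path1", "apollo": "path2"}


def blastdb_for_host(hostname, table=BLASTDB_BY_HOST):
    """Return the BLASTDB path for a host, or None if the host is unknown."""
    short_name = hostname.split(".")[0].lower()   # drop any domain suffix
    return table.get(short_name)


path = blastdb_for_host(socket.gethostname())
if path is not None:
    # Child processes started from this script inherit the variable.
    os.environ["BLASTDB"] = path
print("BLASTDB =", os.environ.get("BLASTDB"))
```

The same test on the host name can be written directly in a shell job
script; the point is only that the decision is made at run time on the
execution host rather than at submission time.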

Thank you very much in advance.

Grace

