[Bioclusters] Reducing the memory of BLAST

Eitan Rubin ERubin at CGR.Harvard.edu
Wed Mar 9 23:31:48 EST 2005


Hi,

  Why the -F F? You are increasing the number of hits by including
low-complexity regions in the search, and I'd bet that requires building
some fat index along the way.
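If you don't actually need the unfiltered hits, I would first try the
default low-complexity filtering (SEG, for blastp) and watch the memory
footprint; something like (untested on your data):

   blastall -i $input -o $output -d $db -p blastp -m 8 -F T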

   Eitan

--------------------
Eitan Rubin, PhD
Head of Bioinformatics
The Bauer Center for Genomics Research
Harvard University
Tel: 617-496-5649 Fax: 617-495-2196
 

-----Original Message-----
From: bioclusters-request at bioinformatics.org
[mailto:bioclusters-request at bioinformatics.org] 
Sent: Wednesday, March 09, 2005 6:46 PM
To: bioclusters at bioinformatics.org
Subject: Bioclusters Digest, Vol 5, Issue 9

Send Bioclusters mailing list submissions to
	bioclusters at bioinformatics.org

To subscribe or unsubscribe via the World Wide Web, visit
	https://bioinformatics.org/mailman/listinfo/bioclusters
or, via email, send a message with subject or body 'help' to
	bioclusters-request at bioinformatics.org

You can reach the person managing the list at
	bioclusters-owner at bioinformatics.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioclusters digest..."


Today's Topics:

   1. multiple inputs to MPIBLAST (Lik Mui)
   2. Memory Usage for Blast - question (Dinanath Sulakhe)
   3. Re: multiple inputs to MPIBLAST (Aaron Darling)
   4. Re: Memory Usage for Blast - question (Hrishikesh Deshmukh)
   5. Re: Memory Usage for Blast - question (Dinanath Sulakhe)
   6. Re: Memory Usage for Blast - question (Lucas Carey)
   7. Re: Memory Usage for Blast - question (Dinanath Sulakhe)
   8. Re: Memory Usage for Blast - question (Lucas Carey)


----------------------------------------------------------------------

Message: 1
Date: Wed,  9 Mar 2005 13:27:10 -0800
From: Lik Mui <lmui at stanford.edu>
Subject: [Bioclusters] multiple inputs to MPIBLAST
To: bioclusters at bioinformatics.org
Message-ID: <1110403630.422f6a2e8ddce at webmail.stanford.edu>
Content-Type: text/plain; charset=ISO-8859-1


Hello, I tried to feed multiple inputs to mpiblast (all in a single FASTA
file).  I found that when the number of inputs is > 15, mpiblast's
performance GREATLY deteriorates.  For example, using a single head node, I
get a blastall output in about 20 seconds.  When I feed an input of 20
sequences to MPIBLAST on a 24-node cluster, the result takes 3
minutes to come back.  This is hardly the super-linear scaling I expected.

I am running on a 24-node Platform ROCKS cluster with MPICH 1.2.6 and the
latest MPIBLAST, 1.3.0.

Can anyone explain why this is, or how to get around MPIBLAST slowing down
with multiple inputs?

Thanks in advance.

           Lik Mui


p.s. Because my genome db is about 1 GB, it seems to make sense to process a
batch of inputs together with a single read of the db; hence I am feeding
multiple inputs in one file.  If this is not correct reasoning, please comment.
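In the meantime, since things hold up to about 15 sequences, my fallback is
to split the batch into ~15-sequence chunks and submit those separately.  A
rough sketch (the file names are made up):

   # split queries.fa into chunk000.fa, chunk001.fa, ... of 15 sequences each
   awk '/^>/{if(n%15==0)f=sprintf("chunk%03d.fa",n/15);n++}{print > f}' queries.fa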





------------------------------

Message: 2
Date: Wed, 09 Mar 2005 16:09:41 -0600
From: Dinanath Sulakhe <sulakhe at mcs.anl.gov>
Subject: [Bioclusters] Memory Usage for Blast - question
To: bioclusters at bioinformatics.org
Message-ID: <6.0.0.22.2.20050309154748.04a271b0 at pop.mcs.anl.gov>
Content-Type: text/plain; charset="us-ascii"; format=flowed

Hi,
I am not sure if this is the right place to ask this question!
I am running BLAST (NCBI) in parallel on a cluster with 80 nodes (I am 
running NCBI NR against itself). Each node has two processors.

I am using Condor to submit the jobs to this cluster. The problem I am 
running into is that whenever two blast jobs (each blast job has 100 
sequences) are assigned to one node (one on each processor), the node 
cannot handle the amount of memory used by the two blast jobs. The PBS mom 
daemons on the nodes cannot allocate the memory they need to monitor the 
jobs on the node, and they fail, thus killing the jobs.

Condor doesn't recognize this failure and assumes that the job completed 
successfully, but actually only a few sequences get processed 
before the job is killed.

Now the admin of the site is asking me if it's possible to reduce the amount 
of memory these blast jobs use. He says these jobs are requesting about 
600-700MB of RAM, and he is asking me to reduce that to at most 500MB.

Is it possible to reduce the amount of RAM requested by tweaking any 
of the parameters in blast?

My blast options are:

blastall -i $input -o $output -d $db -p blastp -m 8 -F F

Please let me know,
Thank you,
Dina



------------------------------

Message: 3
Date: Wed, 09 Mar 2005 16:11:25 -0600
From: Aaron Darling <darling at cs.wisc.edu>
Subject: Re: [Bioclusters] multiple inputs to MPIBLAST
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Message-ID: <422F748D.2090403 at cs.wisc.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Lik

The bad behavior could be due to any one of a number of factors (extra 
fragment copies, startup overhead, etc.).  To pin down what's 
going wrong on your setup, it would be helpful to have a debug log as 
generated by adding the --debug command line option.  Debug output goes to 
stderr; redirect as appropriate for whatever shell you use.  As the 
mpiblast debug log can get lengthy, you may want to send it directly to 
me or post it on a web server somewhere...
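With a Bourne-style shell that would look something like the following (the
mpirun invocation is only a guess at your setup; substitute your own
database and query file):

   mpirun -np 24 mpiblast --debug -p blastn -d yourdb -i queries.fa -o results.txt 2> mpiblast-debug.log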

-Aaron

Lik Mui wrote:

>Hello, I tried to feed multiple inputs to mpiblast (all in a single FASTA
>file).  I found that when the number of inputs is > 15, mpiblast's
>performance GREATLY deteriorates.  For example, using a single head node, I
>get a blastall output in about 20 seconds.  When I feed an input of 20
>sequences to MPIBLAST on a 24-node cluster, the result takes 3
>minutes to come back.  This is hardly the super-linear scaling I expected.
>
>I am running on a 24-node Platform ROCKS cluster with MPICH 1.2.6 and the
>latest MPIBLAST, 1.3.0.
>
>Can anyone explain why this is, or how to get around MPIBLAST slowing down
>with multiple inputs?
>
>Thanks in advance.
>
>           Lik Mui
>
>
>p.s. Because my genome db is about 1 GB, it seems to make sense to process a
>batch of inputs together with a single read of the db; hence I am feeding
>multiple inputs in one file.  If this is not correct reasoning, please comment.
>
>
>
>_______________________________________________
>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/bioclusters
>  
>


------------------------------

Message: 4
Date: Wed, 9 Mar 2005 17:50:37 -0500
From: Hrishikesh Deshmukh <hdeshmuk at gmail.com>
Subject: Re: [Bioclusters] Memory Usage for Blast - question
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Message-ID: <829d7fb6050309145051155606 at mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII

Hi,

You haven't said how long each sequence is!  You can tweak the word size (-W)
to make BLAST faster, but it then becomes less sensitive.  I suggest you
take a look at the book BLAST by Ian Korf et al.
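For blastp the default word size is 3, so raising it slightly, e.g. the line
below (a sketch, not tuned for your data), trades sensitivity for speed:

   blastall -i $input -o $output -d $db -p blastp -m 8 -W 4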

Thanks,
Hrishi


On Wed, 09 Mar 2005 16:09:41 -0600, Dinanath Sulakhe
<sulakhe at mcs.anl.gov> wrote:
> Hi,
> I am not sure if this is the right place to ask this question!
> I am running BLAST (NCBI) in parallel on a cluster with 80 nodes (I am
> running NCBI NR against itself). Each node has two processors.
> 
> I am using Condor to submit the jobs to this cluster. The problem I am
> running into is that whenever two blast jobs (each blast job has 100
> sequences) are assigned to one node (one on each processor), the node
> cannot handle the amount of memory used by the two blast jobs. The PBS mom
> daemons on the nodes cannot allocate the memory they need to monitor the
> jobs on the node, and they fail, thus killing the jobs.
> 
> Condor doesn't recognize this failure and assumes that the job completed
> successfully, but actually only a few sequences get processed
> before the job is killed.
> 
> Now the admin of the site is asking me if it's possible to reduce the amount
> of memory these blast jobs use. He says these jobs are requesting about
> 600-700MB of RAM, and he is asking me to reduce that to at most 500MB.
> 
> Is it possible to reduce the amount of RAM requested by tweaking any
> of the parameters in blast?
> 
> My blast options are:
> 
> blastall -i $input -o $output -d $db -p blastp -m 8 -F F
> 
> Please let me know,
> Thank you,
> Dina
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>


------------------------------

Message: 5
Date: Wed, 09 Mar 2005 16:59:28 -0600
From: Dinanath Sulakhe <sulakhe at mcs.anl.gov>
Subject: Re: [Bioclusters] Memory Usage for Blast - question
To: Hrishikesh Deshmukh <hdeshmuk at gmail.com>, "Clustering,	compute
	farming & distributed computing in life science informatics"
	<bioclusters at bioinformatics.org>, "Clustering,	compute farming &
	distributed computing in life science informatics"
	<bioclusters at bioinformatics.org>
Message-ID: <6.0.0.22.2.20050309165745.03d35090 at pop.mcs.anl.gov>
Content-Type: text/plain; charset="us-ascii"; format=flowed

This is a blast run of NCBI NR against itself, so the sequence size varies.
Thanks for the reply; I will look into the word size.

Dina

At 04:50 PM 3/9/2005, Hrishikesh Deshmukh wrote:
>Hi,
>
>You haven't said how long each sequence is!  You can tweak the word size (-W)
>to make BLAST faster, but it then becomes less sensitive.  I suggest you
>take a look at the book BLAST by Ian Korf et al.
>
>Thanks,
>Hrishi
>
>
>On Wed, 09 Mar 2005 16:09:41 -0600, Dinanath Sulakhe
><sulakhe at mcs.anl.gov> wrote:
> > Hi,
> > I am not sure if this is the right place to ask this question!
> > I am running BLAST (NCBI) in parallel on a cluster with 80 nodes (I am
> > running NCBI NR against itself). Each node has two processors.
> >
> > I am using Condor to submit the jobs to this cluster. The problem I am
> > running into is that whenever two blast jobs (each blast job has 100
> > sequences) are assigned to one node (one on each processor), the node
> > cannot handle the amount of memory used by the two blast jobs. The PBS mom
> > daemons on the nodes cannot allocate the memory they need to monitor the
> > jobs on the node, and they fail, thus killing the jobs.
> >
> > Condor doesn't recognize this failure and assumes that the job completed
> > successfully, but actually only a few sequences get processed
> > before the job is killed.
> >
> > Now the admin of the site is asking me if it's possible to reduce the amount
> > of memory these blast jobs use. He says these jobs are requesting about
> > 600-700MB of RAM, and he is asking me to reduce that to at most 500MB.
> >
> > Is it possible to reduce the amount of RAM requested by tweaking any
> > of the parameters in blast?
> >
> > My blast options are:
> >
> > blastall -i $input -o $output -d $db -p blastp -m 8 -F F
> >
> > Please let me know,
> > Thank you,
> > Dina
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters at bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
> >
>_______________________________________________
>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/bioclusters

===============================
Dinanath Sulakhe
Mathematics & Computer Science Division
Argonne National Laboratory
Ph: (630)-252-7856      Fax: (630)-252-5986



------------------------------

Message: 6
Date: Wed, 9 Mar 2005 18:06:44 -0500
From: Lucas Carey <lcarey at odd.bio.sunysb.edu>
Subject: Re: [Bioclusters] Memory Usage for Blast - question
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Message-ID: <20050309230644.GA27139 at odd.bio.sunysb.edu>
Content-Type: text/plain; charset=us-ascii

Hi Dina,
I don't know how many of the results you actually need. You may free up some
memory by limiting the e-value and the number of returned and aligned results:
blastall -e 0.0001 -b 25 -v 25
Another option, if you can limit Condor to a single job per machine, would
be to run 'blastall -a 2' to use both CPUs with only one process.
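Combined with your current command line, that would be something like this
(a sketch; I haven't verified how much memory it actually saves):

   blastall -i $input -o $output -d $db -p blastp -m 8 -F F -e 0.0001 -b 25 -v 25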
-Lucas

On Wednesday, March 09, 2005 at 16:09 -0600, Dinanath Sulakhe wrote:
> Hi,
> I am not sure if this is the right place to ask this question!
> I am running BLAST (NCBI) in parallel on a cluster with 80 nodes (I am 
> running NCBI NR against itself). Each node has two processors.
> 
> I am using Condor to submit the jobs to this cluster. The problem I am 
> running into is that whenever two blast jobs (each blast job has 100 
> sequences) are assigned to one node (one on each processor), the node 
> cannot handle the amount of memory used by the two blast jobs. The PBS mom 
> daemons on the nodes cannot allocate the memory they need to monitor the 
> jobs on the node, and they fail, thus killing the jobs.
> 
> Condor doesn't recognize this failure and assumes that the job completed 
> successfully, but actually only a few sequences get processed 
> before the job is killed.
> 
> Now the admin of the site is asking me if it's possible to reduce the amount 
> of memory these blast jobs use. He says these jobs are requesting about 
> 600-700MB of RAM, and he is asking me to reduce that to at most 500MB.
> 
> Is it possible to reduce the amount of RAM requested by tweaking any 
> of the parameters in blast?
> 
> My blast options are:
> 
> blastall -i $input -o $output -d $db -p blastp -m 8 -F F
> 
> Please let me know,
> Thank you,
> Dina
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters


------------------------------

Message: 7
Date: Wed, 09 Mar 2005 17:23:51 -0600
From: Dinanath Sulakhe <sulakhe at mcs.anl.gov>
Subject: Re: [Bioclusters] Memory Usage for Blast - question
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>,
"Clustering,
	compute farming & distributed computing in life science informatics"
	<bioclusters at bioinformatics.org>
Message-ID: <6.0.0.22.2.20050309171325.04bcdbd0 at pop.mcs.anl.gov>
Content-Type: text/plain; charset="us-ascii"; format=flowed

At 05:06 PM 3/9/2005, Lucas Carey wrote:
>Hi Dina,
>I don't know how many of the results you actually need. You may free up
>some memory by limiting the e-value and the number of returned and aligned results:
>blastall -e 0.0001 -b 25 -v 25

Would limiting the e-value and the other parameters actually reduce the RAM usage?

>Another option, if you can limit Condor to a single job per machine, would 
>be to run 'blastall -a 2' to use both CPUs with only one process.

These jobs are assigned by the scheduler. Initially I had used the '-a 2' 
option, but while such a job was running on a node, the scheduler would assign 
another job by some other user to the same node, assuming the other 
processor to be free, and blast would then starve the other job. So we 
can't use the '-a n' option here.


Thanks,
Dina

>-Lucas
>
>On Wednesday, March 09, 2005 at 16:09 -0600, Dinanath Sulakhe wrote:
> > Hi,
> > I am not sure if this is the right place to ask this question!
> > I am running BLAST (NCBI) in parallel on a cluster with 80 nodes (I am
> > running NCBI NR against itself). Each node has two processors.
> >
> > I am using Condor to submit the jobs to this cluster. The problem I am
> > running into is that whenever two blast jobs (each blast job has 100
> > sequences) are assigned to one node (one on each processor), the node
> > cannot handle the amount of memory used by the two blast jobs. The PBS mom
> > daemons on the nodes cannot allocate the memory they need to monitor the
> > jobs on the node, and they fail, thus killing the jobs.
> >
> > Condor doesn't recognize this failure and assumes that the job completed
> > successfully, but actually only a few sequences get processed
> > before the job is killed.
> >
> > Now the admin of the site is asking me if it's possible to reduce the amount
> > of memory these blast jobs use. He says these jobs are requesting about
> > 600-700MB of RAM, and he is asking me to reduce that to at most 500MB.
> >
> > Is it possible to reduce the amount of RAM requested by tweaking any
> > of the parameters in blast?
> >
> > My blast options are:
> >
> > blastall -i $input -o $output -d $db -p blastp -m 8 -F F
> >
> > Please let me know,
> > Thank you,
> > Dina
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters at bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
>_______________________________________________
>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/bioclusters

===============================
Dinanath Sulakhe
Mathematics & Computer Science Division
Argonne National Laboratory
Ph: (630)-252-7856      Fax: (630)-252-5986



------------------------------

Message: 8
Date: Wed, 9 Mar 2005 18:33:38 -0500
From: Lucas Carey <lcarey at odd.bio.sunysb.edu>
Subject: Re: [Bioclusters] Memory Usage for Blast - question
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Message-ID: <20050309233338.GC27139 at odd.bio.sunysb.edu>
Content-Type: text/plain; charset=us-ascii


On Wednesday, March 09, 2005 at 17:23 -0600, Dinanath Sulakhe wrote:
> At 05:06 PM 3/9/2005, Lucas Carey wrote:
> >Hi Dina,
> >I don't know how many of the results you actually need. You may free up
> >some memory by limiting the e-value and the number of returned and aligned results:
> >blastall -e 0.0001 -b 25 -v 25
> 
> Would limiting the e-value and the other parameters actually reduce the RAM usage?
With mpiBLAST, -e and -b can limit the memory usage on the master node. I
don't have a free CPU right now to check with blastall.
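i.e., something like this (untested; mpiBLAST takes the usual
blastall-style options):

   mpiblast -p blastp -d nr -i queries.fa -o results.txt -e 0.0001 -b 25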
> 
> >Another option, if you can limit Condor to a single job per machine, would
> >be to run 'blastall -a 2' to use both CPUs with only one process.
> 
> These jobs are assigned by the scheduler. Initially I had used the '-a 2'
> option, but while such a job was running on a node, the scheduler would assign
> another job by some other user to the same node, assuming the other
> processor to be free, and blast would then starve the other job. So we
> can't use the '-a n' option here.
I used to use an OpenPBS cluster that would do that, but it allowed me to
specify which nodes I wanted to run on. I would start up my compute job on
one processor and park a do-nothing loop on the second to hold the CPU,
e.g. in shell:

	while true; do sleep 1000; done

-Lucas

> 
> 
> Thanks,
> Dina
> 
> >-Lucas
> >
> >On Wednesday, March 09, 2005 at 16:09 -0600, Dinanath Sulakhe wrote:
> >> Hi,
> >> I am not sure if this is the right place to ask this question !!
> >> I am running Blast (NCBI) parallely on a cluster with 80 nodes. (I am
> >> running NCBI NR against Itself). Each node is a dual processor.
> >>
> >> I am using Condor to submit the jobs to this cluster. The problem I am
> >> coming across is, whenever two blast jobs (each blast job has 100
> >> sequences) are assigned on One node (one on each processor), the node
> >> cannot handle the amount of memory used by the two blast jobs. PBS mom
> >> daemon on the nodes cannot allocate the memory they need to monitor the
> >> jobs on the node and they fail, thus killing the jobs.
> >>
> >> Condor doesn't recognize this failure and assumes that the job was
> >> successfully completed, but actually only few sequences get processed
> >> before the job is killed.
> >>
> >> Now the Admin of the Site is asking me if its possible to reduce the 
> >amount
> >> of memory these blast jobs use? He says these jobs are requesting about
> >> 600-700MB of RAM, and he is asking me to reduce it to atmost 500MB.
> >>
> >> Is it possible to reduce the amount of RAM it is requesting by tweaking

> >any
> >> of the parameters in blast??
> >>
> >> My blast options are :
> >>
> >> blastall -i $input -o $output -d $db -p blastp -m 8 -F F
> >>
> >> Please let me know,
> >> Thank you,
> >> Dina
> >>
> >> _______________________________________________
> >> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> >> https://bioinformatics.org/mailman/listinfo/bioclusters
> >_______________________________________________
> >Bioclusters maillist  -  Bioclusters at bioinformatics.org
> >https://bioinformatics.org/mailman/listinfo/bioclusters
> 
> ===============================
> Dinanath Sulakhe
> Mathematics & Computer Science Division
> Argonne National Laboratory
> Ph: (630)-252-7856      Fax: (630)-252-5986
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters


------------------------------

_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters


End of Bioclusters Digest, Vol 5, Issue 9
*****************************************

