[Bio-Linux] Blasting Multiple Fasta Files

Martin Gollery mgollery at unr.edu
Wed May 6 00:37:55 EDT 2015


Be sure to include -num_threads

I still think that this will be slower overall, but it will be interesting
to hear your results!
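
For reference, here is a minimal sketch of the one-file-at-a-time loop from the quoted message with -num_threads added, followed by a step that merges the per-file results. The function names (blast_each_fasta, combine_outputs), the database path and the thread count are placeholders for illustration, not something taken from the thread, so adjust them to your own setup.

```shell
#!/usr/bin/env bash
# Sketch only: run blastx on every *.fa file in the current directory,
# one file at a time, letting each run use several threads.

blast_each_fasta() {
    local db="$1"        # placeholder, e.g. /path_to_db
    local threads="$2"   # placeholder, e.g. 16
    local input

    for input in *.fa; do
        # If no *.fa file exists the glob stays literal, so skip it.
        [ -e "$input" ] || continue
        blastx -db "$db" -query "$input" -num_threads "$threads" \
               -out "$input.blastx_output"
    done
}

# Merge the per-file results into one file.
combine_outputs() {
    local combined="$1"  # placeholder, e.g. Final_BlastxOutput.blastx_output
    local f

    : > "$combined"      # start from an empty file
    # Match only the per-file outputs (*.fa.blastx_output), so the merged
    # file is never read back as one of its own inputs.
    for f in *.fa.blastx_output; do
        [ -e "$f" ] || continue
        cat "$f" >> "$combined"
    done
}
```

Typical use would be "blast_each_fasta /path_to_db 16" followed by "combine_outputs Final_BlastxOutput.blastx_output". Merging with a plain "cat *.blastx_output > Final_BlastxOutput.blastx_output" is best avoided, because the merged file itself matches that pattern.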

Marty


On Tue, May 5, 2015 at 9:33 PM, Zain A Alvi <zain.alvi at student.shu.edu>
wrote:

> Hi Everyone,
>
> Thank you for all the great and helpful recommendations, especially Tim,
> Tony, Dr. Beall, and Andreas.  I am trying to do exactly what Tim has
> shown: have BLASTX run on each fasta file one at a time, not all at
> the same time.  It should run BLASTX on one fasta file, then move on to
> the next, until there are no fasta files left in the folder.
>
> Would something like this work as well:
>
> for input in *.fa; do blastx -db /path_to_db -query "$input" -out "$input.blastx_output"; done
>
> Then concatenate all *.blastx_output > Final_BlastxOutput.blastx_output
>
> Thank you for the very interesting information about parallel on Bio
> Linux. Would parallel work well for de-novo assemblers like Velvet and
> Spades (as examples)? Especially Velvet after reading about:
> https://www.biostars.org/p/86907/
>
> Also, would creating multiple copies of the same database, each with a
> different name/title, get around the problem of accessing the same
> database and the memory problems when trying to run multiple BLASTX jobs?
> I know it is not recommended, but would this quasi method be of any
> benefit? Or should I just stick with the script above or the script that
> Tim kindly shared?
>
> For example:
>
> Folder A:
> blastx -db /path_to_db01 -query input_seq_001-100 -out output_seq_001-100.blastx_output
> blastx -db /path_to_db01 -query input_seq_101-200 -out output_seq_101-200.blastx_output
> etc to
> blastx -db /path_to_db01 -query input_seq_401-499 -out output_seq_401-499.blastx_output
>
> Folder B:
> blastx -db /path_to_db02 -query input_seq_501-600 -out output_seq_501-600.blastx_output
> blastx -db /path_to_db02 -query input_seq_601-700 -out output_seq_601-700.blastx_output
> etc to
> blastx -db /path_to_db02 -query input_seq_901-1000 -out output_seq_901-1000.blastx_output
>
> On a side note, how does BLASTX from the BLAST+ package compare to
> MPI-BLAST? I thought MPI-BLAST is based on the older version of BLAST,
> hence it might return fewer results. This is our major concern, as I am
> going for tabular output format with all sequence titles and information
> (-outfmt 6 salltitles). This will be helpful for filtering the contigs for
> the viral genome using some simple grep -w filtering techniques.
>
> Also, there are some interesting points about using xargs to parallelize
> BLAST+ (the last example): https://www.biostars.org/p/76009/ Has anyone
> tried this?
>
> Thank you Prash for the recommendation for mpich. It's definitely
> interesting how it works.  My mentor and I are trying to accomplish this
> on a 32-thread workstation (Intel Xeon E5-2640v2 (16 cores)) with 128 GB
> of RAM. For the viral genome, I am planning on running BLASTX against the
> viral RefSeq protein sequences from NCBI.
>
> Thank you Dr. Beall. If you don't mind sharing, I would definitely be
> interested in taking a look at the script to see what it is like. Many
> thanks.  If I am able to successfully hack the script, I am more than
> willing to share it with the rest of the community.
>
> Thank you again Andreas, Tim, Tony, Dr. Beall, and Prash. I really
> appreciate all the suggestions and help.
>
> Kind regards,
>
> Zain
>
> ________________________________________
> From: Tony Travis <tony.travis at minke-informatics.co.uk>
> Sent: Tuesday, May 5, 2015 12:19 PM
> To: bio-linux at nebclists.nerc.ac.uk
> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files
>
> On 05/05/15 16:08, Tim Booth wrote:
> > [...]
> > You want to run:
> >
> > blastx -db foo -query seqs_000000_to_000999.fsa -out seqs_000000_to_000999.blastx
> > ...then...
> > blastx -db foo -query seqs_001000_to_001999.fsa -out seqs_001000_to_001999.blastx
> > ...then...
> > blastx -db foo -query seqs_002000_to_002999.fsa -out seqs_002000_to_002999.blastx
> > ...then...
> > blastx -db foo -query seqs_003000_to_003999.fsa -out seqs_003000_to_003999.blastx
> > ...etc
> > [...]
>
> Hi, Tim.
>
> It's not good to run multiple instances of BLAST on the same machine
> because each invocation of BLAST will have a copy of the same database
> stored in memory. MPI-BLAST avoids this by loading different parts of
> the database into each worker process.
>
> The time-consuming part of BLAST is the initial exact word match and
> both the old and new versions of BLAST allow you to specify how many
> threads to run to speed this up:
>
>   BLAST  uses "-a nn"
>   BLAST+ uses "-num_threads nn"
>
> I compared "blastall", "blastn", "blat", "pblat" and "bowtie" for
> mapping microRNA and mRNA to a custom database in:
>
> Travis, A. J., Moody, J., Helwak, A., Tollervey, D., & Kudla, G. (2013).
> Hyb: A bioinformatics pipeline for the analysis of CLASH (crosslinking,
> ligation and sequencing of hybrids) data. Methods (San Diego, Calif.).
> http://doi.org/10.1016/j.ymeth.2013.10.015
>
> ["pblat" is a parallel/multi-threaded version of BLAT]
>
> You will need a script like this one by Jonathan Moody to convert
> "bowtie2" alignments to equivalent tabular BLAST output:
>
>   https://github.com/gkudla/hyb/blob/master/bin/sam2blast
>
> Bye,
>
>   Tony.
>
> --
> Minke Informatics Limited, Registered in Scotland - Company No. SC419028
> Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
> tel. +44(0)19755 63548                    http://minke-informatics.co.uk
> mob. +44(0)7985 078324        mailto:tony.travis at minke-informatics.co.uk
> _______________________________________________
> Bio-Linux mailing list
> Bio-Linux at nebclists.nerc.ac.uk
> http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux
>



-- 
Martin Gollery
Senior Bioinformatics Scientist
Tahoe Informatics
www.bioinformaticist.biz
www.hiddenmarkovmodels.com