[Bio-Linux] Blasting Multiple Fasta Files

Andreas Leimbach aleimba at gwdg.de
Wed May 6 03:58:39 EDT 2015


Hi,

if you drop the "-" before blastx, it will work:
for input in *.fa; do blastx -db /path_to_db -query "$input" -out "$input.blastx_output"; done
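
A slightly more defensive version of that loop, as a sketch only: the function names, the BLASTX override variable and the merged file name are my own assumptions, not anything blast+ provides. It quotes the file names, skips the case where no *.fa file matches, and merges the per-file results into a file whose name does not match *.blastx_output, so a second run cannot feed the merged file back into itself.

```shell
# Sketch only. BLASTX is a hypothetical override so a different binary
# (or "echo" for a dry run) can be substituted; it defaults to blastx.

# blast_each DIR DB: run blastx on every DIR/*.fa, one file after another.
blast_each() (
    # The ( ) body runs in a subshell, so these variables stay local.
    dir=$1 db=$2
    for input in "$dir"/*.fa; do
        [ -e "$input" ] || continue   # the glob matched nothing
        "${BLASTX:-blastx}" -db "$db" -query "$input" \
            -out "$input.blastx_output" || exit 1
    done
)

# merge_outputs DIR MERGED: concatenate all per-file results into MERGED.
# MERGED should not itself match DIR/*.blastx_output.
merge_outputs() (
    dir=$1 merged=$2
    cat "$dir"/*.blastx_output > "$merged"
)
```

Usage would be something like: blast_each . /path_to_db && merge_outputs . Final_BlastxOutput.txt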

And as Tony/Martin recommended, you should really use '-num_threads'. The
blast+ programs should be faster than legacy blast, and you want the
extra output options anyway. Still, parallel will be faster than the
plain loop.
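
To make the parallel suggestion concrete, here is a rough sketch, assuming GNU parallel is installed and a machine with 32 hardware threads like Zain's workstation. The two function names and the idea of splitting the threads evenly between jobs are my own illustration; which split is actually fastest has to be benchmarked.

```shell
# Sketch only: function names and the jobs/threads split are assumptions.

# threads_per_job TOTAL JOBS: integer share of TOTAL threads for each job,
# never less than 1.
threads_per_job() (
    t=$(( $1 / $2 ))
    [ "$t" -ge 1 ] || t=1
    echo "$t"
)

# parallel_blast DB TOTAL JOBS FILE...: run JOBS blastx processes at once
# with GNU parallel, giving each one its share of the TOTAL threads.
# Note: parallel re-parses the command line, so DB must not contain spaces.
parallel_blast() (
    db=$1 total=$2 jobs=$3
    shift 3
    parallel -j "$jobs" \
        blastx -db "$db" -query {} -out {}.blastx_output \
               -num_threads "$(threads_per_job "$total" "$jobs")" \
        ::: "$@"
)
```

For example, parallel_blast /path_to_db 32 4 *.fa would run four blastx jobs at a time with '-num_threads 8' each.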

Having several copies of the database under different names won't make
a difference; every copy will be held in memory anyway.

I don't think you can run a single assembly through parallel: the
assembler has to look at the whole data set. In any case, assemblers are
designed for multi-threaded use; they all have an *option* for how
many threads you want to use (in the case of velvet through OpenMP). For
Illumina data I'd recommend SPAdes; it has a nice workflow (including
error correction etc.) and is thus quite user-friendly.
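
Roughly, setting the thread count looks like this for the two assemblers. This is a sketch under assumptions: the file names, the k-mer size of 31 and the RUN hook are made up for illustration, and velvet only honours OMP_NUM_THREADS if it was compiled with OpenMP support.

```shell
# Sketch only. File names, the k-mer size 31 and the RUN hook are assumptions.
# RUN is a hypothetical prefix: leave it unset to execute the commands,
# or set RUN=echo to print them instead.

# velvet_threads N DIR READS: velvet takes its thread count from the
# OpenMP environment variable OMP_NUM_THREADS. READS is assumed to be an
# interleaved paired-end FASTQ file.
velvet_threads() (
    n=$1 dir=$2 reads=$3
    ${RUN:-} env OMP_NUM_THREADS="$n" velveth "$dir" 31 -fastq -shortPaired "$reads" &&
    ${RUN:-} env OMP_NUM_THREADS="$n" velvetg "$dir"
)

# spades_threads N DIR R1 R2: SPAdes takes the thread count as -t.
spades_threads() (
    n=$1 dir=$2 r1=$3 r2=$4
    ${RUN:-} spades.py -t "$n" -1 "$r1" -2 "$r2" -o "$dir"
)
```

With RUN=echo you can check the generated command lines before starting a long assembly.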

The xargs example won't give you anything that parallel can't do.
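
For comparison, an xargs version might look like the sketch below. It is my own illustration, not the code from the linked biostars post; the function name and the BLASTX override are assumptions, and it relies on the -0 and -P extensions found in GNU and BSD xargs.

```shell
# Sketch only. BLASTX is a hypothetical override (defaults to blastx);
# -0 and -P are GNU/BSD xargs extensions, not POSIX.

# xargs_blast DB JOBS FILE...: run up to JOBS blastx processes at a time.
xargs_blast() (
    db=$1 jobs=$2
    shift 2
    # NUL-separated names keep file names with spaces intact.
    printf '%s\0' "$@" |
        xargs -0 -P "$jobs" -I{} \
            "${BLASTX:-blastx}" -db "$db" -query {} -out {}.blastx_output
)
```

Calling xargs_blast /path_to_db 4 *.fa does the same job as
parallel -j 4 blastx -db /path_to_db -query {} -out {}.blastx_output ::: *.fa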

mpiBLAST is mainly meant for clustered computers (i.e. several servers
being used for a single program run). IMO, it won't give you a speed
advantage on a single computer with several cores in comparison to the
aforementioned possibilities.

HTH,
Andreas

--
Andreas Leimbach
Universität Münster
Institut für Hygiene
Mendelstr. 7
D-48149 Münster
Germany

Tel.: +49 (0)551 39 33843
E-Mail: aleimba at gwdg.de

On 06.05.2015 06:33, Zain A Alvi wrote:
> Hi Everyone,
> 
> Thank you for all the great and helpful recommendations, especially Tim, Tony, Dr. Beall, and Andreas.  I am trying to do exactly what Tim has shown: have BLASTx run on each fasta file one at a time, not all at the same time.  It should go through the fasta files one by one, running BLASTX on each and then moving on to the next, until there are no fasta files left in the folder.
> 
> Would something like this work as well: 
> 
> for input in *.fa; do -blastx -db /path_to_db -query $input -out $input.blastx_output; done
> 
> Then concatenate all *.blastx_output > Final_BlastxOutput.blastx_output
> 
> Thank you for the very interesting information about parallel on Bio Linux. Would parallel work well for de-novo assemblers like Velvet and Spades (as examples)? Especially Velvet after reading about: https://www.biostars.org/p/86907/ 
> 
> Also, what about creating multiple copies of the same database under different names/titles? Would that get around the problem of accessing the same database, and the memory problems, when trying to run multiple BLASTx jobs? I know it is not recommended, but would this quasi-method be of any benefit? Or should I just stick with the script above or the script that Tim kindly shared?
> 
> For example: 
> 
> Folder A 
> blastx -db /path_to_db01 -query input_seq_001-100 -out output_seq_001-100.blastx_output
> blastx -db /path_to_db01 -query input_seq_101-200 -out output_seq_101-200.blastx_output
> etc to
> blastx -db /path_to_db01 -query input_seq_401-499 -out output_seq_401-499.blastx_output
> 
> Folder B: 
> blastx -db /path_to_db02 -query input_seq_501-600 -out output_seq_501-600.blastx_output
> blastx -db /path_to_db02 -query input_seq_601-700 -out output_seq_601-700.blastx_output
> etc
> blastx -db /path_to_db02 -query input_seq_901-1000 -out output_seq_901-1000.blastx_output
> 
> On a side note, how does BLASTX from the BLAST+ package compare to MPI-BLAST? I thought MPI-BLAST is based on the older version of BLAST, hence it might return fewer results. This is our major concern, as I am going for tabular output format with all sequence titles and information (-outfmt "6 salltitles"). This will be helpful for filtering the viral genome contigs with some simple grep -w filtering techniques.
> 
> Also, there are some interesting points about using xargs to parallelize BLAST+ (the last example): https://www.biostars.org/p/76009/ Has anyone tried this?
> 
> Thank you Prash for the recommendation for mpich. It's definitely interesting how it works.  My mentor and I are trying to accomplish this on a 32-thread workstation (Intel Xeon E5-2640v2 (16 cores)) with 128 GB of RAM, for a viral genome that I am planning to BLASTX against the viral RefSeq protein sequences from NCBI.
> 
> Thank you Dr. Beall. If you don't mind sharing, I would definitely be interested in taking a look to see what the script is like. Many thanks.  If I am able to successfully hack the script, I am more than willing to share it with the rest of the community.
> 
> Thank you again Andreas, Tim, Tony, Dr. Beall, and Prash. I really appreciate all the suggestions and help. 
> 
> Kind regards,
> 
> Zain 
> 
> ________________________________________
> From: Tony Travis <tony.travis at minke-informatics.co.uk>
> Sent: Tuesday, May 5, 2015 12:19 PM
> To: bio-linux at nebclists.nerc.ac.uk
> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files
> 
> On 05/05/15 16:08, Tim Booth wrote:
>> [...]
>> You want to run:
>>
>> blastx -db foo -query seqs_000000_to_000999.fsa -out seqs_000000_to_000999.blastx
>> ...then...
>> blastx -db foo -query seqs_001000_to_001999.fsa -out seqs_001000_to_001999.blastx
>> ...then...
>> blastx -db foo -query seqs_002000_to_002999.fsa -out seqs_002000_to_002999.blastx
>> ...then...
>> blastx -db foo -query seqs_003000_to_003999.fsa -out seqs_003000_to_003999.blastx
>> ...etc
>> [...]
> 
> Hi, Tim.
> 
> It's not good to run multiple instances of BLAST on the same machine
> because each invocation of BLAST will have a copy of the same database
> stored in memory. MPI-BLAST avoids this by loading different parts of
> the database into each worker process.
> 
> The time-consuming part of BLAST is the initial exact word match and
> both the old and new versions of BLAST allow you to specify how many
> threads to run to speed this up:
> 
>   BLAST  uses "-a nn"
>   BLAST+ uses "-num_threads nn"
> 
> I compared "blastall", "blastn", "blat", "pblat" and "bowtie" for
> mapping microRNA and mRNA to a custom database in:
> 
> Travis, A. J., Moody, J., Helwak, A., Tollervey, D., & Kudla, G. (2013).
> Hyb: A bioinformatics pipeline for the analysis of CLASH (crosslinking,
> ligation and sequencing of hybrids) data. Methods (San Diego, Calif.).
> http://doi.org/10.1016/j.ymeth.2013.10.015
> 
> ["pblat" is a parallel/multi-threaded version of BLAT]
> 
> You will need a script like this one by Jonathan Moody to convert
> "bowtie2" alignments to equivalent tabular BLAST output:
> 
>   https://github.com/gkudla/hyb/blob/master/bin/sam2blast
> 
> Bye,
> 
>   Tony.
> 
> --
> Minke Informatics Limited, Registered in Scotland - Company No. SC419028
> Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
> tel. +44(0)19755 63548                    http://minke-informatics.co.uk
> mob. +44(0)7985 078324        mailto:tony.travis at minke-informatics.co.uk
> _______________________________________________
> Bio-Linux mailing list
> Bio-Linux at nebclists.nerc.ac.uk
> http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux
> 


