[Bio-Linux] Blasting Multiple Fasta Files

Tim Booth tbooth at ceh.ac.uk
Tue May 5 11:08:20 EDT 2015


Hi Zain,

So, I think you are saying that if you have a directory of files like
this:

seqs_000000_to_000999.fsa
seqs_001000_to_001999.fsa
seqs_002000_to_002999.fsa
seqs_003000_to_003999.fsa
...etc

...you want to run:

blastx -db foo -query seqs_000000_to_000999.fsa -out seqs_000000_to_000999.blastx
...then...
blastx -db foo -query seqs_001000_to_001999.fsa -out seqs_001000_to_001999.blastx
...then...
blastx -db foo -query seqs_002000_to_002999.fsa -out seqs_002000_to_002999.blastx
...then...
blastx -db foo -query seqs_003000_to_003999.fsa -out seqs_003000_to_003999.blastx
...etc

This can be done with a shell loop.  The tricky bit is generating the output file name:

$ for f in *.fsa ; do
>   outname=$(basename "$f" .fsa).blastx
>   blastx -db foo -query "$f" -out "$outname"
> done
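
To make the naming step concrete, here is a minimal dry-run sketch. The
helper name blast_outname is just something made up for illustration, and
it assumes the files end in .fsa as in the listing above. It only prints
the blastx commands, so you can check the names before running anything:

```shell
#!/bin/sh
# Hypothetical helper: derive the BLAST output name from a query file name.
blast_outname() {
    # basename strips the directory part and the given suffix (.fsa)
    printf '%s.blastx\n' "$(basename "$1" .fsa)"
}

# Dry run: print each command instead of executing it.
for f in seqs_000000_to_000999.fsa seqs_001000_to_001999.fsa; do
    echo blastx -db foo -query "$f" -out "$(blast_outname "$f")"
done
```

Once the printed commands look right, drop the 'echo' to run them for real.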

A nifty way of running jobs like this is with GNU 'parallel', which is
pre-installed on Bio-Linux 8.  It can run multiple jobs at once and even
send them to remote machines for you.  Here's the basic invocation
(yes, it's a bit cryptic - it's based on the xargs tool):

$ ls *.fsa | parallel --results out blastx -db foo -query

Then, to see which output files were written:

$ find out -name stdout
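
Since you wanted everything in one file at the end, something like the
following might do it.  This is only a sketch: collect_results is a
made-up helper name, all.blastx is an arbitrary output file name, and it
assumes none of the paths contain newlines.  It gathers every file called
stdout under a directory, sorts the paths so the order is repeatable, and
concatenates them:

```shell
#!/bin/sh
# Hypothetical helper: concatenate every 'stdout' file found under
# directory $1 into the single file $2, in sorted path order.
collect_results() {
    find "$1" -type f -name stdout | sort | while IFS= read -r path; do
        cat "$path"
    done > "$2"
}

# Example usage (uncomment once the 'out' directory exists):
# collect_results out all.blastx
```

Write the combined file somewhere outside the 'out' directory so it does
not get mixed in with the per-job results.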

Hope that helps.

(Just before sending this, I see that Andreas recommended parallel too!)

TIM

On Tue, 2015-05-05 at 15:31 +0100, Zain A Alvi wrote:
> Hi Marty,
> 
> 
> I apologize for the confusion. I am splitting a fasta file that
> contains approximately 100,000 fasta sequences into 100 fasta files
> that contain 1000 sequences each.  I am hoping this will expedite the
> BLASTx process. 
> 
> 
> Kind regards,
> 
> 
> 
> Zain
> 
> 
> 
> ______________________________________________________________________
> From: Martin Gollery <mgollery at unr.edu>
> Sent: Tuesday, May 5, 2015 10:23 AM
> To: Bio-Linux help and discussion
> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files 
>  
> Running a million BLASTX jobs on one sequence each is not going to
> save you time. It is better to run one BLASTX job on a million
> sequences. 
> 
> 
> -Marty
> 
> 
> 
> 
> On Tue, May 5, 2015 at 7:00 AM, Zain A Alvi
> <zain.alvi at student.shu.edu> wrote:
>         Dear Sir or Madam,
>         
>         
>         
>         I hope everything is well. I have downloaded all the viral
>         protein sequences from the NCBI refseq database using
>         their script from their E-book.  I have de-novo assembled some
>         viral genomes and I know BLASTX takes a long time if the fasta
>         is large.  I have been able to split the large fasta file
>         based on a user-specified number of contigs in each new fasta
>         file. 
>         
>         
>         I was wondering whether there is a method to run BLASTX automatically
>         on each of the fasta files one at a time so that it will be
>         able to complete in a "shorter" amount of time as compared to
>         BLASTing the whole large de-novo assembled fasta file.  Then I
>         was hoping to concatenate all the results into one file.
>         
>         
>         
>         Sincerely,
>         
>         
>         
>         Zain
>         
>         
>         
>         
>         _______________________________________________
>         Bio-Linux mailing list
>         Bio-Linux at nebclists.nerc.ac.uk
>         http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux
>         
> 
> 
> 
> 

-- 
Tim Booth <tbooth at ceh.ac.uk>
NERC Environmental Bioinformatics Centre 

Centre for Ecology and Hydrology
Maclean Bldg, Benson Lane
Crowmarsh Gifford
Wallingford, England
OX10 8BB 

http://environmentalomics.org/bio-linux
+44 1491 69 2297



