[ssml] blastclust usage

Tue Feb 21 13:07:49 EST 2006

Joanna Sharman wrote:
> Hi,
> 
> I've run blastclust on a set of my own sequences and I would like to 
> know how to
> interpret the results in the hits file.  Is this file not supposed to be 
> readable as a plain text file, or have I done something wrong?  If not, 
> is there a way to read the contents of the file (any programs that can 
> interpret the file into human readable format)?

Hi, whenever I run blastclust I do so with a 'super clean' fasta format 
sequence database, like this...

 >14983
slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk
gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq
q
 >100067
slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk
gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq
q
 >79572
stlyeklggttavdlavdkfyervlqddrikhffadvdmakqrahqkafltyafggtdky
dgrymreahkelvenhglngehfdavaedllatlkemgvpedliaevaavagapahkrdv
lnq
 >97827
stlyeklggttavdlavdkfyervlqddrikhffadvdmakqrahqkafltyafggtdky
dgrymreahkelvenhglngehfdavaedllatlkemgvpedliaevaavagapahkrdv
lnq

And I run formatdb on the database with the following options...

formatdb -i Data/domain.sequence.1.69.fa -s -V

After running blastclust (I run a batch of jobs like this...

#! /usr/bin/perl -w

use strict;
use Getopt::Long;
##
## Runs blastclust with a set of thresholds (specified below) on a
## pre-formatted (formatdb) fasta database. The options chosen are
## suitable for removing sequence fragments (-b F)
##
## NB: blastclust has a problem with sequences less than 20 residues
## long. It can't cluster them, and puts each one into a separate
## cluster.
##

# Default directory for bcl results
my $OUT         = '.';

GetOptions(
   "out-dir|o=s"     => \$OUT,
)
or die
   "Problem with command line options!\n";

# Output directory!
die &usage
   unless -d $OUT;

die "Pass fasta *database* to cluster! (Use formatdb)\n"
   unless @ARGV;

my $inFile = shift @ARGV;

die "Pass fasta *database* to cluster! (Use formatdb)\n"
   unless -e $inFile;

my $outFile = `basename $inFile`;
chomp($outFile);

# Select clustering thresholds...
foreach ( 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ){

   my $cmd =
     "blastclust ".
     "-d $inFile ".                      # After running formatdb!
     "-o $OUT/$outFile.$_.bcl ".         # Standardized for convenience
     "-L 0.9 ".                          # Length of coverage threshold
     "-S $_ ".                           # Score threshold (identity if > 3)
     "-a 4 ".                            # Number of CPU's to use
     "-b F";                             # Don't require coverage threshold
                                         # on *both* sequences.
   print "$cmd\n";

   sleep(2);

   my $run = system( $cmd );

   unless ( $run == 0 ) {
     die "Killed by $run\n";
   }
}

)

This creates output files which looks like this

...
68314 68315 28225 28226 28233 28231 28232
26927 26928 26929 26930 26932 26931 26933
79649 79652 79655 79658 79661 79664 79667
87598 70661 70663 70665 87600 87602 87604
...
37279 78328 78329
44814 44810 44812
83064 83067 83070
21764 85533 85535
38898 76734 76735
...
44158 44159
44173 73569
65287 60623
84904 97444
44321 44323
83461 44074
27805 27806
45028 45026
...
74243
103858
19345
65300
106382
42707
105949
...

Each row is a cluster,  and the number of id's on a row is the number of 
sequences in the cluster. It is then very easy to make a tab delimited 
'cluster file' as follows...

#!/usr/bin/perl -w

use strict;

use Getopt::Long;

my $fastaFile = '';

GetOptions(
  "fasta=s"      => \$fastaFile,
);

die "Pass blastclust results files (.bcl)!\n"
   unless @ARGV;

warn "Found ". scalar(@ARGV) . " files\n";

my %seqLen;

if ($fastaFile && -s $fastaFile){

   warn "Parsing fasta file $fastaFile for sequence lengths\n\n";

   open FASTA, $fastaFile
     or die "$!:$fastaFile\n";

   my $id;

   while(<FASTA>){
     if (/^>(\S+)/){
       $id = $1;
     }
     else{
       $seqLen{$id} += length($_);
     }
   }
}
else{
   warn "Not parsing fasta file for chain lengths\n\n";
}

foreach my $file (@ARGV){

   warn "Doing $file\n";

   die "Cant parse filename ($file) for threshold\n"
     unless $file =~ /\.(\d+)\.bcl$/;

   my $thresh = $1;

   open (FILE, "<$file")
     or die "$file:$!\n";

   while(<FILE>){
     chomp;
     my @ID = split(" ", $_);

     foreach (@ID){
       print
         join("\t",
              $_,
              $seqLen{$_} || 0,
              $thresh,
              $ID[0],
              $seqLen{$ID[0]} || 0,
             ), "\n";
     }
   }
}

warn "OK\n";

Which gives me files looking like this...

21026   110     100     21026   110
76651   110     100     21026   110
105275  109     100     21026   110
106116  109     100     21026   110
107193  109     100     21026   110
107216  109     100     21026   110
107221  109     100     21026   110
20874   109     100     21026   110
20878   109     100     21026   110
20922   109     100     21026   110
20924   109     100     21026   110
20926   109     100     21026   110
...
40771   363     100     40771   363
40772   363     100     40771   363
40773   363     100     40771   363
40774   363     100     40771   363
73428   363     100     40771   363
73431   363     100     40771   363
35866   362     100     35866   362
35868   362     100     35866   362
35870   362     100     35866   362
35872   362     100     35866   362
35874   362     100     35866   362
66037   362     100     35866   362
35444   361     100     35444   361
35445   361     100     35444   361
35446   361     100     35444   361
35447   361     100     35444   361
35448   361     100     35444   361
35449   361     100     35444   361

Which says (for example) that ID 35444 has IDs 35444 35445 35446 35447 
35448 35449 as 'groupies' at 100 percent identity. You can then load the 
above into MySQL for convenient analysis.

I am sorry if you can't read Perl, but hopefully I answered some of your 
questions.

Dan.

> I am currently running this under Linux, although I would also like to 
> know how to run blastclust under Windows.  Everytime I try to run it 
> under Windows I get an error message saying that blastclust has 
> encountered a problem and needs to close.  Has anyone used blastclust 
> under Windows before and know what could be causing it to crash?  I've 
> tried creating both a .ncbirc file in the blast bin directory and a 
> ncbi.ini file under WINDOWS.
> 
> I appreciate your help with either of these problems.
> 
> Regards,
> Joanna
> 
> _______________________________________________
> ssml-general mailing list
> ssml-general at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/ssml-general