[ssml] blastclust usage
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Tue Feb 21 13:07:49 EST 2006
Joanna Sharman wrote:
> Hi,
>
> I've run blastclust on a set of my own sequences and I would like to
> know how to
> interpret the results in the hits file. Is this file not supposed to be
> readable as a plain text file, or have I done something wrong? If not,
> is there a way to read the contents of the file (any programs that can
> interpret the file into human readable format)?
Hi, whenever I run blastclust I do so with a 'super clean' fasta format
sequence database, like this...
>14983
slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk
gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq
q
>100067
slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk
gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq
q
>79572
stlyeklggttavdlavdkfyervlqddrikhffadvdmakqrahqkafltyafggtdky
dgrymreahkelvenhglngehfdavaedllatlkemgvpedliaevaavagapahkrdv
lnq
>97827
stlyeklggttavdlavdkfyervlqddrikhffadvdmakqrahqkafltyafggtdky
dgrymreahkelvenhglngehfdavaedllatlkemgvpedliaevaavagapahkrdv
lnq
And I run formatdb on the database with the following options...
formatdb -i Data/domain.sequence.1.69.fa -s -V
After running blastclust (I run a batch of jobs like this...
#! /usr/bin/perl -w
use strict;
use Getopt::Long;
##
## Runs blastclust with a set of thresholds (specified below) on a
## pre-formatted (formatdb) fasta database. The options chosen are
## suitable for removing sequence fragments (-b F)
##
## NB: blastclust has a problem with sequences less than 20 residues
## long. It can't cluster them, and puts each one into a separate
## cluster.
##
# Default directory for bcl results
my $OUT = '.';
GetOptions(
"out-dir|o=s" => \$OUT,
)
or die
"Problem with command line options!\n";
# Output directory!
die &usage
unless -d $OUT;
die "Pass fasta *database* to cluster! (Use formatdb)\n"
unless @ARGV;
my $inFile = shift @ARGV;
die "Pass fasta *database* to cluster! (Use formatdb)\n"
unless -e $inFile;
my $outFile = `basename $inFile`;
chomp($outFile);
# Select clustering thresholds...
foreach ( 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ){
my $cmd =
"blastclust ".
"-d $inFile ". # After running formatdb!
"-o $OUT/$outFile.$_.bcl ". # Standardized for convenience
"-L 0.9 ". # Length of coverage threshold
"-S $_ ". # Score threshold (identity if > 3)
"-a 4 ". # Number of CPU's to use
"-b F"; # Don't require coverage threshold
# on *both* sequences.
print "$cmd\n";
sleep(2);
my $run = system( $cmd );
unless ( $run == 0 ) {
die "Killed by $run\n";
}
}
)
This creates output files which looks like this
...
68314 68315 28225 28226 28233 28231 28232
26927 26928 26929 26930 26932 26931 26933
79649 79652 79655 79658 79661 79664 79667
87598 70661 70663 70665 87600 87602 87604
...
37279 78328 78329
44814 44810 44812
83064 83067 83070
21764 85533 85535
38898 76734 76735
...
44158 44159
44173 73569
65287 60623
84904 97444
44321 44323
83461 44074
27805 27806
45028 45026
...
74243
103858
19345
65300
106382
42707
105949
...
Each row is a cluster, and the number of id's on a row is the number of
sequences in the cluster. It is then very easy to make a tab delimited
'cluster file' as follows...
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
my $fastaFile = '';
GetOptions(
"fasta=s" => \$fastaFile,
);
die "Pass blastclust results files (.bcl)!\n"
unless @ARGV;
warn "Found ". scalar(@ARGV) . " files\n";
my %seqLen;
if ($fastaFile && -s $fastaFile){
warn "Parsing fasta file $fastaFile for sequence lengths\n\n";
open FASTA, $fastaFile
or die "$!:$fastaFile\n";
my $id;
while(<FASTA>){
if (/^>(\S+)/){
$id = $1;
}
else{
$seqLen{$id} += length($_);
}
}
}
else{
warn "Not parsing fasta file for chain lengths\n\n";
}
foreach my $file (@ARGV){
warn "Doing $file\n";
die "Cant parse filename ($file) for threshold\n"
unless $file =~ /\.(\d+)\.bcl$/;
my $thresh = $1;
open (FILE, "<$file")
or die "$file:$!\n";
while(<FILE>){
chomp;
my @ID = split(" ", $_);
foreach (@ID){
print
join("\t",
$_,
$seqLen{$_} || 0,
$thresh,
$ID[0],
$seqLen{$ID[0]} || 0,
), "\n";
}
}
}
warn "OK\n";
Which gives me files looking like this...
21026 110 100 21026 110
76651 110 100 21026 110
105275 109 100 21026 110
106116 109 100 21026 110
107193 109 100 21026 110
107216 109 100 21026 110
107221 109 100 21026 110
20874 109 100 21026 110
20878 109 100 21026 110
20922 109 100 21026 110
20924 109 100 21026 110
20926 109 100 21026 110
...
40771 363 100 40771 363
40772 363 100 40771 363
40773 363 100 40771 363
40774 363 100 40771 363
73428 363 100 40771 363
73431 363 100 40771 363
35866 362 100 35866 362
35868 362 100 35866 362
35870 362 100 35866 362
35872 362 100 35866 362
35874 362 100 35866 362
66037 362 100 35866 362
35444 361 100 35444 361
35445 361 100 35444 361
35446 361 100 35444 361
35447 361 100 35444 361
35448 361 100 35444 361
35449 361 100 35444 361
Which says (for example) that ID 35444 has IDs 35444 35445 35446 35447
35448 35449 as 'groupies' at 100 percent identity. You can then load the
above into MySQL for convenient analysis.
I am sorry if you can't read Perl, but hopefully I answered some of your
questions.
Dan.
> I am currently running this under Linux, although I would also like to
> know how to run blastclust under Windows. Everytime I try to run it
> under Windows I get an error message saying that blastclust has
> encountered a problem and needs to close. Has anyone used blastclust
> under Windows before and know what could be causing it to crash? I've
> tried creating both a .ncbirc file in the blast bin directory and a
> ncbi.ini file under WINDOWS.
>
> I appreciate your help with either of these problems.
>
> Regards,
> Joanna
>
> _______________________________________________
> ssml-general mailing list
> ssml-general at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/ssml-general
More information about the ssml-general
mailing list