Joanna Sharman wrote: > Hi, > > I've run blastclust on a set of my own sequences and I would like to > know how to > interpret the results in the hits file. Is this file not supposed to be > readable as a plain text file, or have I done something wrong? If not, > is there a way to read the contents of the file (any programs that can > interpret the file into human readable format)? Hi, whenever I run blastclust I do so with a 'super clean' fasta format sequence database, like this... >14983 slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq q >100067 slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq q >79572 stlyeklggttavdlavdkfyervlqddrikhffadvdmakqrahqkafltyafggtdky dgrymreahkelvenhglngehfdavaedllatlkemgvpedliaevaavagapahkrdv lnq >97827 stlyeklggttavdlavdkfyervlqddrikhffadvdmakqrahqkafltyafggtdky dgrymreahkelvenhglngehfdavaedllatlkemgvpedliaevaavagapahkrdv lnq And I run formatdb on the database with the following options... formatdb -i Data/domain.sequence.1.69.fa -s -V After running blastclust (I run a batch of jobs like this... #! /usr/bin/perl -w use strict; use Getopt::Long; ## ## Runs blastclust with a set of thresholds (specified below) on a ## pre-formatted (formatdb) fasta database. The options chosen are ## suitable for removing sequence fragments (-b F) ## ## NB: blastclust has a problem with sequences less than 20 residues ## long. It can't cluster them, and puts each one into a separate ## cluster. ## # Default directory for bcl results my $OUT = '.'; GetOptions( "out-dir|o=s" => \$OUT, ) or die "Problem with command line options!\n"; # Output directory! die &usage unless -d $OUT; die "Pass fasta *database* to cluster! (Use formatdb)\n" unless @ARGV; my $inFile = shift @ARGV; die "Pass fasta *database* to cluster! (Use formatdb)\n" unless -e $inFile; my $outFile = `basename $inFile`; chomp($outFile); # Select clustering thresholds... foreach ( 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ){ my $cmd = "blastclust ". "-d $inFile ". # After running formatdb! "-o $OUT/$outFile.$_.bcl ". # Standardized for convenience "-L 0.9 ". # Length of coverage threshold "-S $_ ". # Score threshold (identity if > 3) "-a 4 ". # Number of CPU's to use "-b F"; # Don't require coverage threshold # on *both* sequences. print "$cmd\n"; sleep(2); my $run = system( $cmd ); unless ( $run == 0 ) { die "Killed by $run\n"; } } ) This creates output files which looks like this ... 68314 68315 28225 28226 28233 28231 28232 26927 26928 26929 26930 26932 26931 26933 79649 79652 79655 79658 79661 79664 79667 87598 70661 70663 70665 87600 87602 87604 ... 37279 78328 78329 44814 44810 44812 83064 83067 83070 21764 85533 85535 38898 76734 76735 ... 44158 44159 44173 73569 65287 60623 84904 97444 44321 44323 83461 44074 27805 27806 45028 45026 ... 74243 103858 19345 65300 106382 42707 105949 ... Each row is a cluster, and the number of id's on a row is the number of sequences in the cluster. It is then very easy to make a tab delimited 'cluster file' as follows... #!/usr/bin/perl -w use strict; use Getopt::Long; my $fastaFile = ''; GetOptions( "fasta=s" => \$fastaFile, ); die "Pass blastclust results files (.bcl)!\n" unless @ARGV; warn "Found ". scalar(@ARGV) . " files\n"; my %seqLen; if ($fastaFile && -s $fastaFile){ warn "Parsing fasta file $fastaFile for sequence lengths\n\n"; open FASTA, $fastaFile or die "$!:$fastaFile\n"; my $id; while(<FASTA>){ if (/^>(\S+)/){ $id = $1; } else{ $seqLen{$id} += length($_); } } } else{ warn "Not parsing fasta file for chain lengths\n\n"; } foreach my $file (@ARGV){ warn "Doing $file\n"; die "Cant parse filename ($file) for threshold\n" unless $file =~ /\.(\d+)\.bcl$/; my $thresh = $1; open (FILE, "<$file") or die "$file:$!\n"; while(<FILE>){ chomp; my @ID = split(" ", $_); foreach (@ID){ print join("\t", $_, $seqLen{$_} || 0, $thresh, $ID[0], $seqLen{$ID[0]} || 0, ), "\n"; } } } warn "OK\n"; Which gives me files looking like this... 21026 110 100 21026 110 76651 110 100 21026 110 105275 109 100 21026 110 106116 109 100 21026 110 107193 109 100 21026 110 107216 109 100 21026 110 107221 109 100 21026 110 20874 109 100 21026 110 20878 109 100 21026 110 20922 109 100 21026 110 20924 109 100 21026 110 20926 109 100 21026 110 ... 40771 363 100 40771 363 40772 363 100 40771 363 40773 363 100 40771 363 40774 363 100 40771 363 73428 363 100 40771 363 73431 363 100 40771 363 35866 362 100 35866 362 35868 362 100 35866 362 35870 362 100 35866 362 35872 362 100 35866 362 35874 362 100 35866 362 66037 362 100 35866 362 35444 361 100 35444 361 35445 361 100 35444 361 35446 361 100 35444 361 35447 361 100 35444 361 35448 361 100 35444 361 35449 361 100 35444 361 Which says (for example) that ID 35444 has IDs 35444 35445 35446 35447 35448 35449 as 'groupies' at 100 percent identity. You can then load the above into MySQL for convenient analysis. I am sorry if you can't read Perl, but hopefully I answered some of your questions. Dan. > I am currently running this under Linux, although I would also like to > know how to run blastclust under Windows. Everytime I try to run it > under Windows I get an error message saying that blastclust has > encountered a problem and needs to close. Has anyone used blastclust > under Windows before and know what could be causing it to crash? I've > tried creating both a .ncbirc file in the blast bin directory and a > ncbi.ini file under WINDOWS. > > I appreciate your help with either of these problems. > > Regards, > Joanna > > _______________________________________________ > ssml-general mailing list > ssml-general at bioinformatics.org > https://bioinformatics.org/mailman/listinfo/ssml-general