Hi Tristan,

>-----Original Message-----
>From: ssml-general-admin at bioinformatics.org
>[mailto:ssml-general-admin at bioinformatics.org] On Behalf Of Tristan Fiedler
>Sent: Tuesday, December 09, 2003 4:17 PM
>To: ssml-general at bioinformatics.org
>Subject: [ssml] Finding Matches using N-term & C-term sequences
>
>I am interested in finding any homologs to a protein I am working on,
>however, I have only an N-terminal sequence of about 15 amino acids, and 3
>internal peptides from tryptic digests.

How big are the internal fragments? Can you share any of the sequence information?

>I have used the default scoring matrices, gap existence & extension
>penalties, and word sizes for the NCBI blastp web interface as well as for
>the 'search short nearly exact matches' using the blastcl3 client-server
>interface :
>
>../blastcl3 -p blastp -e 10 -d swissprot -F T -T T -M BLOSUM62 -G 11 -E 1 -W 3
>../blastcl3 -p blastp -e 10 -d nr -F T -T T -M BLOSUM62 -G 11 -E 1 -W 3
>
>../blastcl3 -p blastp -e 20000 -d swissprot -F F -T T -M PAM30 -G 9 -E 1 -W 2
>../blastcl3 -p blastp -e 20000 -d nr -F F -T T -M PAM30 -G 9 -E 1 -W 2
>
>Although many 'hits' were returned, none had e-values less than 0.1.
>
>What is the threshold for 'significance' with such short peptides? Is
>there a preferred method to find homologs when dealing with these short
>fragments?

I'm working on this. Two colleagues and I wrote an O'Reilly book on BLAST, and we have a Perl module (BlastStats.pm) that can help with your problem. You can find this and other useful BLAST Perl code, for free, at http://examples.oreilly.com/blast/

I have 2 ideas for this:

1. String your peptides together into one peptide and let the BLAST sum statistics give you the combined significance (use ~10 X's between fragments and -g F so you don't get extension across the different fragments).

2. Do the separate BLAST searches, then calculate the significance using BlastStats.pm, which can convert the raw scores into sum scores (combined significance).

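
For idea #1, building the combined query might look something like the rough Python sketch below. The function names (join_peptides, to_fasta) are just made up for illustration, and the three fragments are invented placeholders of 14, 13 and 18 aa, not real sequence data.

```python
def join_peptides(peptides, spacer_len=10, spacer_char="X"):
    """Concatenate peptide fragments, separated by a run of X residues."""
    # Drop empty entries and normalise case before joining.
    cleaned = [p.strip().upper() for p in peptides if p.strip()]
    return (spacer_char * spacer_len).join(cleaned)


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder fragments (invented for this example, NOT real data):
# one N-terminal peptide followed by two internal tryptic peptides.
fragments = ["MKTAYIAKQRQISF", "VKSHFSRQLEERL", "GLIEVQAPILSRVGDGTQ"]

query = join_peptides(fragments)
print(to_fasta("combined_peptides", query), end="")
```

The printed record can be saved to a file and used as the blastp query, with -g F as suggested above. Note that the order of the internal fragments in the real protein is unknown, so the order chosen here is arbitrary.
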
I'm testing out #1 now, using an in silico (copy and paste) N-term seq (14 aa) and tryptic digests (13 aa and 18 aa) with a known protein.

Also, what level of % identity do you expect to find for homologs? If you expect anything 80% and above, it would actually be better to use BLOSUM80 instead of BLOSUM62 because each amino acid match would then carry more information, thus increasing the significance (lowering the E-value).

You're definitely going about it the right way by trying different parameters, varying the word size, increasing the E, etc.

I'll get back to you on what I find.

Joey

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Joseph A Bedell, Ph.D.
Director, Bioinformatics
Orion Genomics, LLC
4041 Forest Park Ave.
St. Louis, MO 63108
(314)615-6979; fax:(314)615-6975
http://www.oriongenomics.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~