[ssml] Finding Matches using N-term & C-term sequences

Wed Dec 10 09:53:28 EST 2003

Hi Tristan,

>-----Original Message-----
>From: ssml-general-admin at bioinformatics.org [mailto:ssml-general-admin at bioinformatics.org] On Behalf Of Tristan Fiedler
>Sent: Tuesday, December 09, 2003 4:17 PM
>To: ssml-general at bioinformatics.org
>Subject: [ssml] Finding Matches using N-term & C-term sequences
>
>I am interested in finding any homologs to a protein I am working on,
>however, I have only an N-terminal sequence of about 15 amino acids, and 3
>internal peptides from tryptic digests.

How big are the internal fragments? Can you share any of the sequence
information?

>I have used the default scoring matrices, gap existence & extension
>penalties, and word sizes for the NCBI blastp web interface as well as for
>the 'search short nearly exact matches' using the blastcl3 client-server
>interface:
>
>../blastcl3 -p blastp -e 10 -d swissprot -F T -T T -M BLOSUM62 -G 11 -E 1 -W 3
>../blastcl3 -p blastp -e 10 -d nr -F T -T T -M BLOSUM62 -G 11 -E 1 -W 3
>
>../blastcl3 -p blastp -e 20000 -d swissprot -F F -T T -M PAM30 -G 9 -E 1 -W 2
>../blastcl3 -p blastp -e 20000 -d nr -F F -T T -M PAM30 -G 9 -E 1 -W 2
>
>
>Although many 'hits' were returned, none had e-values less than 0.1.
>
>What is the threshold for 'significance' with such short peptides?  Is
>there a preferred method to find homologs when dealing with these short
>fragments?

I'm working on this. Two colleagues and I wrote an O'Reilly book on
BLAST, and we have a Perl module (BlastStats.pm) that can help with your
problem. You can find this and other useful BLAST Perl code, for free,
at http://examples.oreilly.com/blast/

I have two ideas for this:
1. String your peptides together into a single query and let the BLAST
sum statistics give you the combined significance (put ~10 X's between
fragments and use -g F, i.e. ungapped alignment, so that alignments don't
extend across the different fragments).

2. Run the separate BLAST searches, then calculate the significance using
BlastStats.pm, which can convert the raw scores into sum scores (combined
significance).
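
Idea #1 could be set up roughly as follows. This is only a sketch: the
helper names, the query name, and the three peptide strings are made-up
placeholders (not real data), and the spacer length of 10 simply follows
the "~10 X's" suggestion above.

```python
def join_fragments(fragments, spacer_len=10, spacer_char="X"):
    """Concatenate peptide fragments, separated by a run of X's."""
    spacer = spacer_char * spacer_len
    cleaned = [frag.strip().upper() for frag in fragments]
    return spacer.join(cleaned)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder peptides only; substitute the real N-terminal
    # sequence and tryptic fragments here.
    peptides = ["MKTAYIAKQRQISF", "LLDEGHTNVAYPR", "GSWVQDLTNAEMFHIPCK"]
    query = join_fragments(peptides)
    with open("combined_query.fasta", "w") as handle:
        handle.write(to_fasta("combined_fragments", query))
```

The resulting file would then be used as the query for a search run with
-g F.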

I'm testing out #1 now, using an in silico (copy and paste) N-terminal
sequence (14 aa) and two tryptic fragments (13 aa and 18 aa) taken from a
known protein.

Also, what level of % identity do you expect to find for homologs? If
you expect 80% identity or higher, it would actually be better to use
BLOSUM80 instead of BLOSUM62, because each amino acid match would then
carry more information, thus increasing the significance (lowering the
E-value).
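
To see why extra information per match matters so much for short
queries, here is a rough sketch based on the Karlin-Altschul relation in
bit-score form, E = m * n * 2^(-S'), where m is the query length, n the
database length, and S' the bit score. The function names and the
database size are assumptions for illustration, and BLAST itself uses
effective lengths, so real E-values will differ somewhat.

```python
import math


def required_bit_score(query_len, db_len, e_value):
    """Bit score S' needed so that E = m * n * 2**(-S') equals e_value."""
    return math.log2(query_len * db_len / e_value)


def expected_e_value(query_len, db_len, bit_score):
    """E-value for a given bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    m = 15            # roughly the N-terminal peptide length
    n = 50_000_000    # hypothetical database size in residues
    needed = required_bit_score(m, n, 0.1)
    print("bits needed for E = 0.1: %.1f" % needed)
    # Each additional bit of score halves the E-value.
    print("E with one more bit: %.3f" % expected_e_value(m, n, needed + 1))
```

Since a 15-residue peptide has only 15 positions with which to
accumulate that score, a matrix that yields more bits per matched residue
makes a noticeable difference.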

You're definitely going about it the right way by trying different
parameters: varying the word size, raising the E-value cutoff, etc.

I'll get back to you on what I find.

Joey

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Joseph A Bedell, Ph.D.
Director, Bioinformatics
Orion Genomics, LLC
4041 Forest Park Ave.
St. Louis, MO 63108
(314)615-6979; fax:(314)615-6975
http://www.oriongenomics.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~