[ssml] Finding Matches using N-term & C-term sequences

Joseph Bedell jbedell at oriongenomics.com
Fri Dec 12 01:42:54 EST 2003

Hi Tristan,

Is your only ambiguity the K or R from the tryptic digest? If so, then
when using BLOSUM80 you get a positive score for either case (+6 for
exact match and +2 for the similar match). So, you could put either
amino acid there and it should be okay. 
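
If the fragments are written down with bracketed ambiguities such as [KR], one way to turn them into a plain query is to keep a single residue from each bracket. Below is a minimal Python sketch of that idea. The bracket notation, the function name collapse_ambiguities, and the example peptide are all assumptions for illustration; as far as I know blastp does not read brackets in a query itself.

```python
import re


def collapse_ambiguities(peptide, pick=0):
    """Replace each [..] group with one of its residues (the first by default)."""
    def choose(match):
        options = match.group(1)
        # Clamp the index so a short group never raises IndexError.
        return options[min(pick, len(options) - 1)]
    return re.sub(r"\[([A-Za-z]+)\]", choose, peptide)


# Made-up peptide ending in a tryptic K/R ambiguity (not real data).
example = "GSLEDFVNW[KR]"
print(collapse_ambiguities(example))          # keeps K
print(collapse_ambiguities(example, pick=1))  # keeps R
```

Either output can go into the FASTA file as an ordinary sequence; with a matrix that scores K against R positively, the choice should matter little.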

Have you had a chance to try the XXX's in between the fragments? My
tests show that this works well. Since BLAST is a local alignment tool,
you don't have to worry about getting the order of the fragments
correct. The 3 fragments will show up as separate HSPs but the E-value
will represent the combined significance.

I have attached a FASTA file containing a protein plucked out of GenBank
(a lectin from pea). I have split it into 3 short pieces (Nterm,
internal_1, and internal_2) and also made one combined piece
(lectin_combined) in which the 3 frags are strung together with 10 X's
between them.
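
For anyone without the attachment, a combined query like lectin_combined can be put together along these lines. This is only a sketch in Python: the fragment sequences are made-up placeholders, not the real pea lectin pieces, and the helper names combine_fragments and to_fasta are hypothetical.

```python
# Sketch: join peptide fragments with a run of X's to make one BLAST query.
# The sequences below are placeholders, NOT the real lectin fragments.

SPACER = "X" * 10  # 10 X's between fragments, as in lectin_combined


def combine_fragments(fragments, spacer=SPACER):
    """Return the fragments strung together with the spacer between them."""
    return spacer.join(seq.strip().upper() for seq in fragments)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


fragments = {
    "Nterm": "MKTAYIAKQR",        # placeholder sequence
    "internal_1": "GSLEDFVNWK",   # placeholder sequence
    "internal_2": "TTPELYSAGR",   # placeholder sequence
}

combined = combine_fragments(fragments.values())
records = [to_fasta(name, seq) for name, seq in fragments.items()]
records.append(to_fasta("lectin_combined", combined))
fasta_text = "".join(records)
print(fasta_text)
```

The printed text holds four records: the three separate pieces and the combined one, so the same file can be used to compare the searches side by side.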

You can use this with a blastp search of NR and you'll see the advantage
of stringing the parts together. 

The command line parameters I use are:

blastall -p blastp -i lectin_parts.pep -d nr -M BLOSUM80 -g F -F 'mS' -W


>-----Original Message-----
>From: ssml-general-admin at bioinformatics.org [mailto:ssml-general-
>admin at bioinformatics.org] On Behalf Of Tristan Fiedler
>Sent: Wednesday, December 10, 2003 3:29 PM
>To: Kevin Karplus
>Cc: dmb at mrc-dunn.cam.ac.uk; t.fiedler at umiami.edu; ssml-
>general at bioinformatics.org
>Subject: Re: [ssml] Finding Matches using N-term & C-term sequences
>Greetings and Thank you all for the information.  I am working it up
>Using standard fasta files and blast queries, is it possible to
>sequence ambiguities such as :
>where [KR] means either Lys or Arg at that position?
>If possible, this would make my searching *much* simpler, since each of
>the sequence fragments I am working with has a few residues which are
>Happy Holidays,
>> Dan,
>> You mentioned the "product of p values" method for combining hits from
>> one query to different sequences in the same family:
>> @inproceedings{product-of-p-values,
>> 	title="Classifying proteins by family using the product of
>> p-values",
>> 	author="Bailey, Timothy L. and Grundy, William N.",
>> 	booktitle=recomb99,
>> 	month="April 11-14",
>> 	year="1999",
>> 	pages="10-14",
>> 	publisher="ACM Press"
>> 	}
>> That is a useful technique, but different from what I was proposing,
>> which is to combine search results from independent queries (the
>> so that different queries bringing up the same sequence will strongly
>> reinforce the signal for that sequence.
>> Perhaps the best bet is to do as Joseph Bedell suggests, and
>> concatenate the peptides with XXXXXXXXXX spacers, and use the already
>> written multi-hit functions in BLAST.  Since the order of the
>> is unknown, 6 searches should be done, one for each order of the
>> residues.
>> I may be misunderstanding the problem, but I was assuming that the
>> problem was to identify a protein from an organism that did NOT have a
>> genomic sequencing project near completion.  Thus the need to look for
>> homologs in other organisms (which may not be very similar).  If there
>> is some genomic data, the full-length putative homologs may be used to
>> search the genome of the organism for a match.  Once a putative homolog
>> is found, an HMM based on its full-length sequence (using SAM-T2K or
>> PSI-BLAST and HMMer) could be used for the search,
>> and to identify any regions likely to be highly conserved in the
>> protein.  The highly conserved regions may allow designing a primer to
>> fish out the gene itself.
>> Kevin Karplus 	karplus at soe.ucsc.edu
>> Professor of Computer Engineering, University of California, Santa Cruz
>> Undergraduate and Graduate Director, Bioinformatics
>> Affiliations for identification only.
>Tristan J. Fiedler, Ph.D.
>Postdoctoral Research Fellow - Walsh Laboratory
>NIEHS Marine & Freshwater Biomedical Sciences Center
>Rosenstiel School of Marine & Atmospheric Sciences
>University of Miami
>tfiedler at rsmas.miami.edu
>t.fiedler at umiami.edu (alias)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lectin_parts.pep
Type: application/octet-stream
Size: 501 bytes
Desc: lectin_parts.pep
Url : http://bioinformatics.org/pipermail/ssml-general/attachments/20031212/d0ecd480/lectin_parts.obj
