[ssml] Finding Matches using N-term & C-term sequences

Joseph Bedell jbedell at oriongenomics.com
Fri Dec 12 01:42:54 EST 2003


Hi Tristan,

Is your only ambiguity the K or R from the tryptic digest? If so, then
when using BLOSUM80 you get a positive score for either case (+6 for
exact match and +2 for the similar match). So, you could put either
amino acid there and it should be okay. 
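
A minimal sketch of how to turn the bracket notation into plain
sequences that BLAST will accept, assuming ambiguities are written as in
Tristan's example ([KR] meaning K or R). The function names are
illustrative only and not part of any BLAST tool. collapse() keeps the
first residue at each ambiguous position, and expand() lists every
variant in case you would rather search each one separately.

```python
import itertools
import re


def split_ambiguous(peptide):
    """Return one string of allowed residues per position.

    'A[KR]T' -> ['A', 'KR', 'T']
    """
    tokens = re.findall(r"\[([A-Z]+)\]|([A-Z])", peptide)
    return [group or single for group, single in tokens]


def collapse(peptide):
    """Pick the first listed residue at every ambiguous position."""
    return "".join(choices[0] for choices in split_ambiguous(peptide))


def expand(peptide):
    """List every unambiguous sequence the pattern can stand for."""
    return ["".join(p) for p in itertools.product(*split_ambiguous(peptide))]


if __name__ == "__main__":
    pattern = "RLTGVDA[KR]TEIDKLSE"  # example from this thread
    print(collapse(pattern))
    for variant in expand(pattern):
        print(variant)
```

The number of variants doubles with each two-way ambiguity, so
collapsing to one residue is usually the more practical choice when the
alternatives score as similar in the matrix.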

Have you had a chance to try the XXX's in between the fragments? My
tests show that this works well. Since BLAST is a local alignment tool,
you don't have to worry about getting the order of the fragments
correct. The 3 fragments will show up as separate HSPs but the E-value
will represent the combined significance.

I have attached a FASTA file which has a protein plucked out of GenBank
(a lectin from pea). I have then split it into 3 short pieces (Nterm,
internal_1, and internal_2) and also made one combined piece
(lectin_combined) which has the 3 frags strung together with 10 X's.
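
A minimal sketch of how such a combined query could be built, assuming
the fragments are already available as plain strings. The sequences
below are made-up placeholders, not the lectin fragments from the
attachment, and the function names are illustrative only. The
every_order option writes one record per ordering of the fragments (6
for 3 fragments), for anyone who wants to try each order.

```python
from itertools import permutations

SPACER = "X" * 10  # 10 X's between fragments, as described above

# Placeholder peptides for illustration only (not the attached lectin data).
PLACEHOLDER_FRAGMENTS = {
    "Nterm": "MKTAYIAKQR",
    "internal_1": "RLTGVDAKTEIDKLSE",
    "internal_2": "GSTLNDEVFK",
}


def combine(fragments, spacer=SPACER):
    """Join fragment sequences with the spacer between each pair."""
    return spacer.join(fragments)


def fasta_record(name, sequence, width=60):
    """Format one FASTA record, wrapping the sequence at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def combined_records(fragments, every_order=False):
    """Return FASTA records for the given order, or for every ordering."""
    names = list(fragments)
    orders = list(permutations(names)) if every_order else [tuple(names)]
    return [
        fasta_record("combined_" + "_".join(order),
                     combine(fragments[n] for n in order))
        for order in orders
    ]


if __name__ == "__main__":
    print("".join(combined_records(PLACEHOLDER_FRAGMENTS, every_order=True)),
          end="")
```

Redirecting the output to a file gives a query file that can be passed
to the search in the same way as lectin_parts.pep.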

You can use this with a blastp search of NR and you'll see the advantage
of stringing the parts together. 

The command line parameters I use are:

blastall -p blastp -i lectin_parts.pep -d nr -M BLOSUM80 -g F -F 'mS' -W 2

Cheers,
Joey


>-----Original Message-----
>From: ssml-general-admin at bioinformatics.org [mailto:ssml-general-
>admin at bioinformatics.org] On Behalf Of Tristan Fiedler
>Sent: Wednesday, December 10, 2003 3:29 PM
>To: Kevin Karplus
>Cc: dmb at mrc-dunn.cam.ac.uk; t.fiedler at umiami.edu; ssml-
>general at bioinformatics.org
>Subject: Re: [ssml] Finding Matches using N-term & C-term sequences
>
>Greetings and Thank you all for the information.  I am working it up
>currently.
>
>Using standard fasta files and blast queries, is it possible to indicate
>sequence ambiguities such as:
>
>RLTGVDA[KR]TEIDKLSE
>
>where [KR] means either Lys or Arg at that position?
>
>If possible, this would make my searching *much* simpler, since each of
>the sequence fragments I am working with has a few residues which are
>ambiguous.
>
>Happy Holidays,
>
>Tristan
>
>
>>
>> Dan,
>>
>> You mentioned the "product of p values" method for combining hits with
>> one query to different sequences in the same family:
>> @inproceedings{product-of-p-values,
>> 	title="Classifying proteins by family using the product of correlated p-values",
>> 	author="Bailey, Timothy L. and Grundy, William N.",
>> 	booktitle=recomb99,
>> 	month="April 11-14",
>> 	year="1999",
>> 	pages="10-14",
>> 	publisher="ACM Press"
>> 	}
>>
>> That is a useful technique, but different from what I was proposing,
>> which is to combine search results from independent queries (the
>> peptides) so that different queries bringing up the same sequence will
>> strongly reinforce the signal for that sequence.
>>
>> Perhaps the best bet is to do as Joseph Bedell suggests, and
>> concatenate the peptides with XXXXXXXXXX spacers, and use the already
>> written multi-hit functions in BLAST.  Since the order of the peptides
>> is unknown, 6 searches should be done, one for each order of the
>> 3 peptides.
>>
>> I may be misunderstanding the problem, but I was assuming that the
>> problem was to identify a protein from an organism that did NOT have
>> a genomic sequencing project near completion.  Thus the need to look
>> for homologs in other organisms (which may not be very similar).  If
>> there is some genomic data, the full-length putative homologs may be
>> used to search the genome of the organism for a match.  Once a
>> putative homolog is found, an HMM based on its full-length sequence
>> (created using SAM-T2K or PSI-BLAST and HMMer) could be used for the
>> search, and to identify any regions likely to be highly conserved in
>> the protein.  The highly conserved regions may allow designing a
>> primer to fish out the gene itself.
>>
>> Kevin Karplus 	karplus at soe.ucsc.edu  http://www.soe.ucsc.edu/~karplus
>> Professor of Computer Engineering, University of California, Santa Cruz
>> Undergraduate and Graduate Director, Bioinformatics
>> Affiliations for identification only.
>>
>>
>
>
>--
>Tristan J. Fiedler, Ph.D.
>Postdoctoral Research Fellow - Walsh Laboratory
>NIEHS Marine & Freshwater Biomedical Sciences Center
>Rosenstiel School of Marine & Atmospheric Sciences
>University of Miami
>
>tfiedler at rsmas.miami.edu
>t.fiedler at umiami.edu (alias)
>305-361-4626
>_______________________________________________
>ssml-general mailing list
>ssml-general at bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/ssml-general

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lectin_parts.pep
Type: application/octet-stream
Size: 501 bytes
Desc: lectin_parts.pep
Url : http://bioinformatics.org/pipermail/ssml-general/attachments/20031212/d0ecd480/lectin_parts.obj

