[ssml] Finding Matches using N-term & C-term sequences

Wed Dec 10 12:39:30 EST 2003

Dan,

You mentioned the "product of p values" method for combining hits with
one query to different sequences in the same family:
@inproceedings{product-of-p-values,
	title="Classifying proteins by family using the product of correlated p-values",
	author="Bailey, Timothy L. and Grundy, William N.",
	booktitle=recomb99,
	month="April 11-14",
	year="1999",
	pages="10-14",
	publisher="ACM Press"
	}

That is a useful technique, but different from what I was proposing,
which is to combine search results from independent queries (the peptides) 
so that different queries bringing up the same sequence will strongly
reinforce the signal for that sequence.
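
One simple way to sketch this kind of reinforcement is to sum, for each
database sequence, the best bit score it received from each peptide query.
Summing bit scores is an assumption made for illustration, not a method
taken from this thread, and it treats the peptide queries as independent.
The function name and the (query, subject, bit score) input layout are
hypothetical.

```python
from collections import defaultdict

def combine_hits(hits):
    """Combine hits from independent queries, grouped by subject sequence.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    Returns a list of (subject_id, total_bits, n_queries) tuples,
    sorted by total bit score, highest first.
    """
    # For each subject, keep only the best bit score from each query.
    best = defaultdict(dict)
    for query_id, subject_id, bits in hits:
        previous = best[subject_id].get(query_id, 0.0)
        best[subject_id][query_id] = max(previous, bits)

    # Assumption: independent queries, so bit scores are simply added.
    combined = [
        (subject_id, sum(per_query.values()), len(per_query))
        for subject_id, per_query in best.items()
    ]
    combined.sort(key=lambda row: row[1], reverse=True)
    return combined
```

A subject hit weakly by several peptides can then outrank one hit
moderately by a single peptide, which is the reinforcement effect
described above.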

Perhaps the best bet is to do as Joseph Bedell suggests, and
concatenate the peptides with XXXXXXXXXX spacers, and use the already
written multi-hit functions in BLAST.  Since the order of the peptides
is unknown, 6 searches should be done, one for each order of the
peptides.
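
A minimal sketch of building those concatenated queries, assuming three
peptides (which gives the 6 orderings) and a spacer of ten X residues.
The peptide sequences and FASTA names below are made up for illustration.

```python
from itertools import permutations

SPACER = "X" * 10  # ten X residues between adjacent peptides

def concatenated_queries(peptides, spacer=SPACER):
    """Return one concatenated query string per ordering of the peptides."""
    return [spacer.join(order) for order in permutations(peptides)]

if __name__ == "__main__":
    # Hypothetical peptide sequences, for illustration only.
    peptides = ["MKTAYIAK", "GDSLVNQ", "WERFLK"]
    # Print each ordering as a FASTA record, ready to use as a query.
    for i, query in enumerate(concatenated_queries(peptides), start=1):
        print(f">order_{i}")
        print(query)
```

Each record can then be run as a separate search, and the results from
the six runs compared.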

I may be misunderstanding the problem, but I was assuming that the
problem was to identify a protein from an organism that did NOT have a
genomic sequencing project near completion.  Thus the need to look for
homologs in other organisms (which may not be very similar).  If there
is some genomic data, the full-length putative homologs may be used to
search the genome of the organism for a match.  Once a putative homolog
is found, an HMM based on its full-length sequence (created using
SAM-T2K or PSI-BLAST and HMMer) could be used for the search, and to
identify any regions likely to be highly conserved in the protein.  The
highly conserved regions may allow designing a primer to fish out the
gene itself.

Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
Professor of Computer Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.