[ssml] RE: Thanks for the lectin example

Fri Dec 12 22:31:07 EST 2003

Hi Tristan,

>-----Original Message-----
>From: Tristan Fiedler [mailto:tfiedler at rsmas.miami.edu]
>Sent: Friday, December 12, 2003 10:32 AM
>To: Joseph Bedell
>Subject: Thanks for the lectin example
>
>Hi Joey,
>
>1.  If the fragments in the query sequence are in an incorrect order, but
>as you indicated, give a single HSP (presumably connected by hashed lines
>in the graphical output), does this mean that only 1 entry (ie line) will
>exist in the blast output table entitled "Sequences producing significant
>alignments:" ?

Whoops, turns out I was wrong about the order issue. Kevin was right:
you need to get them in the correct order so that they are "consistent"
HSPs. At that point BLAST will do its sum-statistics magic and combine
them for a better E-value. The good news is that you know which one is
N-terminal, so you just have to switch around the other 2 fragments (for
a total of 8? Or do you know the N-term vs. C-term for those tryptic
digests?).

Oh, and they won't give a single HSP. There will be 3 different HSPs,
but their E-values will be the same because the significance will have
been combined.

>2. I have read up in your book on gapped/ungapped alignments, and do not
>understand, (assuming the correct order of sequence fragments in the query
>and also assuming a true hit exists in the database with regions of
>sufficient similarity to all three fragments) how an ungapped (ie -g F )
>alignment will allow for an incorrect number of 'X's in the query
>sequence.

The -g F will just keep BLAST from gapping around the X's. X is an
ambiguity character and gives a negative score (-1 to -3 in BLOSUM80).
With gapping on, BLAST may actually extend across these X's and try to
stitch the fragments together, which is not what I want the X's to do.
Another way to avoid this is to use gapping but put in 50-100 X's. The
idea behind the X's is not to get one alignment for the 3 frags but to
have BLAST think of them as a single query sequence so it will combine
the significance.
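
To make the ordering and spacer ideas concrete, here is a minimal Python
sketch. It assumes the three fragments are the ones in the
lectin_combined query quoted under question 4, that the first fragment
is the known N-terminal one, and that only the order of the other two is
in doubt. It does not try to cover fragment orientation. The function
names and FASTA names are made up for illustration.

```python
from itertools import permutations

# Fragments copied from the lectin_combined query quoted in question 4.
N_TERM = "MASLQTQMISFYAI"                          # known N-terminal fragment
OTHERS = ["KVNSTETTSFLIT", "KPQTGGGYLGVFNSAEYD"]   # order of these is uncertain


def candidate_queries(n_term, others, spacer_len=5):
    """Yield (name, sequence) pairs, one per ordering of the other fragments.

    The N-terminal fragment always stays first; fragments are joined by a
    run of X's so BLAST sees a single query sequence.
    """
    spacer = "X" * spacer_len
    for i, order in enumerate(permutations(others), start=1):
        yield f"lectin_order{i}", spacer.join((n_term, *order))


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    # Short spacers for an ungapped search; pass spacer_len=50 (or more)
    # to try the long-spacer variant with gapping left on.
    print(to_fasta(candidate_queries(N_TERM, OTHERS)), end="")
```

Each record can then be submitted as its own query, and the ordering
whose HSPs are consistent should come back with the better combined
E-value.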

>3.  Is it possible to set the '-g ' option using the NCBI blastp website?

There seems to be something wrong with the website now so I can't
check for sure, but you may be able to set -g F in the "Other Advanced"
box. It doesn't list that as one of the acceptable advanced options but
I'd give it a try.

You should also check "Mask for lookup table only". This is called
soft masking: low-complexity sequence is not used for the initial word
search, but extension is still allowed across the masked region. I
almost always use the soft masking option for protein and especially
DNA searches.
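
For a local run instead of the website, the settings discussed in this
thread could be assembled roughly as below. This is only a sketch: the
flags are the legacy blastall options as I remember them (-g for
gapping, -F "m S" for SEG masking at the lookup stage only), so check
them against your installed version's help output. The query file name
and the helper function are hypothetical, and newer BLAST+ programs use
different option names.

```python
import shlex


def blastall_args(query_path, db="nr", gapped=False, soft_mask=True):
    """Build an argument list for a legacy blastall blastp search.

    Nothing is executed here; the list is only assembled so it can be
    inspected or handed to subprocess.run() later.
    """
    args = [
        "blastall",
        "-p", "blastp",       # program
        "-d", db,             # database
        "-i", query_path,     # query FASTA file (hypothetical path)
        "-M", "BLOSUM80",     # scoring matrix from the thread
        "-W", "2",            # word size from the thread
        "-e", "10000",        # expect threshold from the thread
        "-g", "T" if gapped else "F",   # F = ungapped alignment
    ]
    if soft_mask:
        # "m S": apply SEG masking to the lookup table only (soft masking).
        args += ["-F", "m S"]
    return args


if __name__ == "__main__":
    print(shlex.join(blastall_args("lectin_combined.fa")))
```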

>4.  When I used the NCBI blastp webpage with :
>
>query   >lectin_combined
>MASLQTQMISFYAIXXXXXKVNSTETTSFLITXXXXXKPQTGGGYLGVFNSAEYD
>
>blosum 80, word size of 2, Expect value of 10,000, Gap Exist 11, Extend 1
>
>I did not retrieve the parent sequence (>gi|490035|gb|CAA01149.1| lectin
>[Pisum sativum]) which concerns me?  Could you give any insight on what I
>may be doing incorrectly?

Hmmm. Sounds like you are doing it correctly. Is your first hit
"gi|126148|sp|P02867|LEC_PEA"? If so, that's the same protein. The NR db
combines identical sequences and concatenates their deflines. I think
CAA01149 is probably just several entries down the concatenated defline
and isn't shown in the BLAST report. I pulled that entry out of one of
our custom databases, so it didn't come from NR.
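
If you have a FASTA dump of NR on hand, you can look for the accession
inside a merged defline yourself. The sketch below assumes the
concatenated deflines are separated by Ctrl-A characters, which is how
the NR FASTA files have been distributed as far as I know. The example
defline is illustrative only: it is not copied from a real NR record,
and the description text for LEC_PEA is a placeholder.

```python
CTRL_A = "\x01"  # assumed separator between merged deflines in an NR FASTA file


def split_defline(defline):
    """Split one merged defline into (identifier, description) pairs."""
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        ident, _, desc = part.partition(" ")
        entries.append((ident, desc.strip()))
    return entries


def contains_accession(defline, accession):
    """Return True if any merged entry's identifier mentions the accession."""
    return any(accession in ident for ident, _ in split_defline(defline))


if __name__ == "__main__":
    # Illustrative defline only; the first description is a placeholder.
    merged = (
        ">gi|126148|sp|P02867|LEC_PEA lectin"
        + CTRL_A
        + "gi|490035|gb|CAA01149.1| lectin [Pisum sativum]"
    )
    for ident, desc in split_defline(merged):
        print(ident, "->", desc)
    print(contains_accession(merged, "CAA01149"))
```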

>5. In the graphical portion of the blast output, I have noticed that
>sometimes the black bars are *not* connected by the hashed lines. Further
>inspection of these shows that they are (to my understanding) completely
>unrelated.  For example, two unconnected black bars referred to a UDP
>sugar Hydrolase (S=24, E=393) and a Favin precursor (S=39.2, E=0.011).
>The sequence alignments were in completely different places of the output
>as well.  What's the deal with these unconnected black bars?  I must be
>missing something on this.

The bars in the report represent all of the different hits to your
query. Bars connected by lines are HSPs from the same subject. Bars that
aren't connected by lines belong to different database subjects (as you
saw above).

>6.  I had many nice conversations and meals with your co-author Mark
>Yandell at the recent CSHL bioinformatics courses.  It's nice to interact
>with yet another co-author!
>
Thanks, Tristan. Good talking to you too. I hope I've helped.

Joey