[Bioperl-l] [BiO BB] B, Z, N, X in refseq
boris.steipe at utoronto.ca
Mon May 30 22:46:20 EDT 2005
> I am using RefSeq from GenBank. There are some strange
> characters such as B, Z, N, X in the protein sequence.
These are standard ambiguity codes (except for "N", which is simply
asparagine): B stands for D or N, Z for E or Q, and X for any residue.
See for example
< http://www.ncbi.nlm.nih.gov/blast/html/search.html >
> Can anybody tell me what these "bad" characters mean?
This most likely means that the particular sequence was derived by
chemical sequencing of the polypeptide, not by translation of a nucleic
acid sequence; with that method it may be hard to distinguish N from D,
or Q from E.
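To make this concrete, here is a minimal sketch of the codes as a lookup table, together with a helper that reports where they occur in a sequence. It is written in Python only for illustration; the names AMBIGUITY and find_ambiguous are made up for this example and are not part of BioPerl or any other library.

```python
# Hypothetical helper for illustration; not a BioPerl or library API.
# One-letter ambiguity codes for proteins and the residues each may stand for.
AMBIGUITY = {
    "B": "DN",                    # aspartate or asparagine (Asx)
    "Z": "EQ",                    # glutamate or glutamine (Glx)
    "X": "ACDEFGHIKLMNPQRSTVWY",  # any of the 20 standard residues
}


def find_ambiguous(seq):
    """Return 1-based (position, code) pairs for every ambiguity code in seq."""
    return [
        (pos, aa)
        for pos, aa in enumerate(seq.upper(), start=1)
        if aa in AMBIGUITY
    ]
```

Note that "N" is deliberately absent from the table: in a protein sequence it is plain asparagine, so find_ambiguous("MKBZNX") reports only positions 3, 4 and 6.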
> What should I do if my program complains about these bad characters?
> Remove them or replace them with some specific amino acid?
That depends on what you want to do. For database sequence searches the
BLAST server accepts these codes, and they are represented in the
standard mutation-data matrices used for alignment scores, so you don't
need to worry. For molecular weight calculations you could use an
average of the two possibilities or randomly choose one or the other; I
can't imagine an application where this level of detail would make much
of a difference. However, removing them is always a bad choice: in a
sequence alignment, for example, you would be introducing a gap (bad!).
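The averaging idea for molecular weights might look like the sketch below. Everything in it is an assumption made for illustration: the function names are invented, the residue masses are the commonly tabulated average values (check them against a reference before relying on them), and treating X as the mean of all 20 residues is just one possible convention.

```python
# Hypothetical sketch; names and conventions are assumptions, not a library API.
# Commonly tabulated average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini

# Ambiguity codes and their candidate residues; X = any residue (assumption).
AMBIGUITY = {"B": "DN", "Z": "EQ", "X": "".join(RESIDUE_MASS)}


def residue_mass(aa):
    """Mass of one residue; ambiguity codes get the mean of their candidates."""
    if aa in RESIDUE_MASS:
        return RESIDUE_MASS[aa]
    candidates = AMBIGUITY[aa]  # raises KeyError for unrecognized characters
    return sum(RESIDUE_MASS[c] for c in candidates) / len(candidates)


def molecular_weight(seq):
    """Approximate average molecular weight of an unmodified peptide in Da."""
    return sum(residue_mass(aa) for aa in seq.upper()) + WATER
```

With these values the D and N residue masses differ by about 1 Da, so for a B the average is at most about 0.5 Da away from whichever residue is really there.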
Hope this helps, it's pretty standard textbook knowledge though, and
maybe it would be worthwhile to read up on the Net before you post to
several groups at once :-)
> Do not guess who I am. I am not Bush in BlackHouse