[BiO BB] Query

Fri Dec 5 07:51:30 EST 2008

Jaya Handa wrote:
> Hello Sir
> I wish to know when we read bacterial genomes in ncbi how do you number.
> Suppose the + strand is 5'-3' and - strand is 3'-5'.
> now if we number + strand as [5'] 1,2,3,4 [3'] bases then will the - strand
> be like [3'] 4,3,2,1 [5'].

Hmmm, '+' and '-' strands (and 'Watson' and 'Crick' strands) seem to
have gone out of fashion.

What we now have are sequences deposited in databases in the orientation
of the genome map (for whole genomes) or their coding sequences (for
individual genes) or their original sequence (3' ESTs for example will
have coding sequence on the reverse strand).

> Suppose we read this

This gene is on the forward strand (the transcribed RNA matches the
sequence in the database entry), numbered with 1 as the start of the
sequence in the NCBI entry.

>      CDS             49..288
>                      /gene="cspA"
>                      /locus_tag="BL0001"
>                      /note="CspA; COG family: cold shock proteins; PFAM_ID:
>                      CSD"
>                      /codon_start=1
>                      /transl_table=11
>                      /product="cold shock protein"
>                      /protein_id="AAN23868.1"
>                      /db_xref="GI:23325186"

This gene is on the reverse strand (the transcribed RNA is the reverse
complement of the sequence in the database entry). Again we number from
1 at the start of the sequence but the complement() in the location
tells us to reverse complement the sequence from 2248 to 2538.

The coding sequence starts at base 2538 and reads back along the reverse
strand to base 2248.

For splicing, the feature locations are more complicated: joins of
forward or complemented fragments, or joins of forward fragments that
are then complemented (ugh!) but the numbering principles are the same.

>      CDS             complement(2248..2538)
>                      /locus_tag="BL0003"
>                      /note="COG family: methyl-accepting chemotaxis protein"
>                      /codon_start=1
>                      /transl_table=11
>                      /product="hypothetical protein"
>                      /protein_id="AAN23870.1"
>                      /db_xref="GI:23325188"
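The complement() and join cases can be sketched in a few lines of Python.
This is only an illustration under simple assumptions: plain A/C/G/T
sequences, hypothetical helper names (fragment, reverse_complement) and a
made-up toy sequence rather than the real entry. All coordinates stay
1-based on the deposited sequence; only the reading direction changes.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def fragment(sequence: str, start: int, end: int, complement: bool = False) -> str:
    """Return start..end (1-based, inclusive), reverse complemented on request."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("location lies outside the sequence")
    piece = sequence[start - 1:end]
    return reverse_complement(piece) if complement else piece


# Hypothetical toy sequence, numbered 1..12 from its first base.
toy = "ATGCCGTAAGCT"

# complement(2..5): take TGCC, then read it back along the reverse strand.
print(fragment(toy, 2, 5, complement=True))          # GGCA

# complement(join(2..3,8..9)): join the forward pieces, then complement the lot.
joined_forward = fragment(toy, 2, 3) + fragment(toy, 8, 9)
print(reverse_complement(joined_forward))            # TTCA

# join(complement(8..9),complement(2..3)): complement each piece, then join.
print(fragment(toy, 8, 9, complement=True)
      + fragment(toy, 2, 3, complement=True))        # TTCA
```

The last two forms give the same result, which is why both styles of
location turn up for spliced genes on the reverse strand. For the entry
above, complement(2248..2538) covers 2538 - 2248 + 1 = 291 bases, or 97
codons.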

In the early days (before we even called the field bioinformatics) there
were complaints about numbering sequences from 1, and numbering
promoters from -1 so there was no zero. My answer to those who
complained was that I will do it differently when there is a year AD
zero :-)

However, writing early programs in Fortran had something to do with it.
If we had been using C then sequences would start at zero.

Hope this helps,

Peter Rice