[BiO BB] random sequences

Wed Mar 16 23:10:36 EST 2005

Hi Seema,

On Monday, Mar 14, 2005, at 13:46 Canada/Eastern, Seema Trivedi wrote:

> Dr. Boris Steipe,
> I will be highly obliged if you could kindly let me know if there are 
> any softwares which I can use for generation of random sequences of 
> nucleotides.

Yes , I had posted some teaching code here some time ago, attached 
below. Raja makes a good point about RSAT though: when you do any kind 
of statistics you should think carefully about what kind of "random" 
you want (and RSAT offers some nice options). Equiprobable nucleotides 
(like in the python code) are a poor random model for genetic sequences 
from organisms with high GC content. Independent single nucleotides are 
a poor random model for coding sequence; these would be better modelled 
from randomly chosen codons, weighted by amino acid frequencies, etc.

I like to use the following principle: try to generate your random 
model so that it represents as much of what you already know as 
possible. Then a deviation from the model is due to something which is 
yet to be discovered.

Good luck.

Boris

> I need to do this with large sequences.
> Thank you
> Sorry for the trouble,
> seema
>

Disclaimer: this is *Teaching code* and could be written much
more concisely. But that's not the point  :-)

================================================================

#!/usr/bin/perl
#
# This program defines a number of Events, e.g. nucleotides
# or amino acids or characters. Then frequencies are defined for each
# of the events. Frequencies are converted into probabilities and
# probabilities into cumulative intervals.
#
# Eg: Frequencies    A: 4,   C: 3,   G: 2,   T: 1
#     Probabilities  A: 0.4, C: 0.3, G: 0.2, T: 0.1
#     Intervals      A: 0.4  C: 0.7, G: 0.9, T: 1.0
#
# This can be pictured like
#
#   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
#    +----+----+----+----+----+----+----+----+----+----+
#    |         A         |      C       |    G    | T  |
#    +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This is then the Event we
# have chosen.
# B. Steipe, public domain
# =====================================================================

use strict;
use warnings;

my $OutputLength = 1000;    # set target length
my @Events;                 # Stores the events
my @Probabilities;          # Stores the probability of each event

initializeArrays();

for (my $i=0; $i < $OutputLength; $i++) {
     print getEvent(rand);
}

print "\n";

exit();
# ====== Subroutines =======================
#
sub initializeArrays {

     my $Sum = 0;

     # Initialize the events array - we write this into the code in this 
example here
     # but the frequencies could also be the result of counting amino 
acids in a database
     # or nucleotides in a genome, or dinucleotides etc.

     # In this example we generate a skewed nucleotide composition of 
four
     # different nucleotides, but it could be any number of characters.

     $Events[0] = 'A'; $Probabilities[0] = 9;
     $Events[1] = 'G'; $Probabilities[1] = 21;
     $Events[2] = 'C'; $Probabilities[2] = 32;
     $Events[3] = 'T'; $Probabilities[3] = 45;

     # Convert the values of the probabilities array to the top 
boundaries
     # of intervals having widths proportional to their relative 
frequency.

     # Sum over all frequencies ...
     for (my $i=0; $i <= $#Probabilities; $i++) {
         $Sum += $Probabilities[$i];
     }

     # Divide frequencies by Sum to get probabilities ...
     for (my $i=0; $i <= $#Probabilities; $i++) {
         $Probabilities[$i] /= $Sum;
     }

     # Calculate upper bounds of cumulative intervals
     # going from 0 to 1.0  ...
     for (my $i=1; $i <= $#Probabilities; $i++) {
         $Probabilities[$i] += $Probabilities[$i-1];
     }

     return();
}

# ==========================================
sub getEvent {

     my ($RandomNumber) = @_;
     my $i=0;

     # Test which interval contains the RandomNumber ...
     # The index of the interval points to the Event we should choose.
     for ($i=0; $i <= $#Probabilities; $i++) {
         if ($Probabilities[$i] > $RandomNumber) { last; }
     }

     return($Events[$i]);
}