[BiO BB] generate random motifs
Boris Steipe
boris.steipe at utoronto.ca
Thu Jul 14 11:38:52 EDT 2005
Maya,
You had asked me whether some code I had posted here in March can be
adapted for your purpose
(http://bioinformatics.org/pipermail/bio_bulletin_board/2005-March/
002379.html).
I have attached the adapted version below. Just a little programming
fun on a hot summer Thursday morning in Toronto. I hope this helps.
Just edit the definitions of motif and alphabet for your purposes.
<disclaimer>This is what I consider "teaching code" i.e. with an
emphasis to demonstrate _how_ something can be done, with the newcomer
in mind; especially if you want to go in and change things for your own
purpose. Perl has many constructs to write something like this much
more concisely, but that's not the point here.</disclaimer>
Have fun,
Boris
On Wednesday, Jul 13, 2005, at 13:21 America/Montreal, Gao Zhang wrote:
> Hi all,
>
> I need some help to generate random motifs. The goal is to
> generate a random motif of width w. Each position will
> have a dominant letter with probability around 0.85.
>
> I know there is python module TAMO
> (http://jura.wi.mit.edu/fraenkel/TAMO/documentation/
> TAMO.MotifTools.html#Motif-emit )
> which can do this job.
>
> Can anybody know how to do this using Perl/Bioperl?
>
> Thank you very much and look forward to your reply!
>
> Best Regards,
> Maya
>
#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
# Define consensus motif and alphabet
# For the required number of repetitions
# For each position in motif
# Define an appropriate probability distribution over the
alphabet
# Produce a character according to the distribution
# ...
# ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1
# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# +----+----+----+----+----+----+----+----+----+----+
# | A | C | G | T |
# +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict;
use warnings;
my $N_Sequences = 100; # Number of sequences to be
generated
my @Motif = split(//,'TTGACATATAAT'); # This example is simply a
concatenation
# of the -35 and -10 consensus
elements
# of prokaryotic promoters;
replace with
# whatever ...
my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85; # Replace with whatever ...
# ====== Globals ==========================
my @Probabilities; # Stores the probability of each
character
# ====== Program ==========================
for (my $i=0; $i < $N_Sequences; $i++) {
for (my $j=0; $j < scalar(@Motif); $j++) {
loadConsensusCharacter($Motif[$j]); # Load the character for
this position
addNoiseToDistribution(); # Modify probabilities
according to noise model
convertToIntervals();
print(getRandomCharacter(rand(1.0))); # Output character
}
print "\n";
}
exit();
# ====== Subroutines =======================
#
sub loadConsensusCharacter {
my ($char) = @_;
my $Found = 'FALSE';
for (my $i=0; $i < scalar(@Alphabet); $i++) {
if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
$Probabilities[$i] = 1.0;
$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') {
die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}
return();
}
# ==========================================
sub addNoiseToDistribution {
# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.
my $P_NonConsensus = (1.0-$P_Consensus) / (scalar(@Alphabet) - 1);
for (my $i=0; $i < scalar(@Probabilities); $i++) {
if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus
character
$Probabilities[$i] = $P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus;
}
}
return();
}
# ==========================================
sub convertToIntervals {
my $Sum = 0;
# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0 ...
for (my $i=1; $i < scalar(@Probabilities); $i++) {
$Probabilities[$i] += $Probabilities[$i-1];
}
return();
}
# ==========================================
sub getRandomCharacter {
my ($RandomNumber) = @_;
my $i=0;
# Test which interval contains the RandomNumber ...
# The index of the interval points to the Event we should choose.
for ($i=0; $i < scalar(@Probabilities); $i++) {
if ($Probabilities[$i] > $RandomNumber) { last; }
}
return($Alphabet[$i]);
}
More information about the BBB
mailing list