[BiO BB] Random Sequence Generator
landman at scalableinformatics.com
Wed Oct 6 10:23:54 EDT 2004
Boris Steipe wrote:
> In this kind of simulation, you assume that all nucleotides are
> independent, this does not conserve dinucleotide, trinucleotide
> frequencies etc. If higher order correlations may play a role, it
> would be more appropriate to randomly sample from the original, rather
> than simulate a sequence.
It might be better (if you need multi-letter properties to match some
sequence library set) to sample the distribution of the multi-letter
words and draw randomly from that, rather than from single letters. This
way you can (to an extent) preserve correlations at the di-/tri-/...
orders as required and still sample "randomly", though you will miss
still higher-order patterns (and isn't that what some of the HMM tools
are for anyway?). With all due respect, though, please don't use "rand"
for random numbers. The Mersenne twister and other modern pseudo-random
number generators (PRNG) have superior properties, and decades of work
on the part of folks doing Monte Carlo work in physics and chemistry
have indicated that the quality of the PRNG is quite important.
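As a minimal sketch of the PRNG point, assuming Python rather than the
Perl of the removed code: Python's standard "random" module uses the
Mersenne Twister as its core generator, so a seeded random.Random
instance gives a reproducible stream. The function name random_sequence
and the base weights in the usage note are hypothetical, chosen only for
illustration.

```python
import random


def random_sequence(length, weights, seed=None):
    """Emit independent single letters from a weighted alphabet.

    weights: dict mapping each letter to a relative weight (hypothetical
    values; in practice they would be measured from your own data).
    """
    # random.Random is backed by the Mersenne Twister (MT19937).
    # A private instance avoids sharing state with other code, and an
    # explicit seed makes the run repeatable.
    rng = random.Random(seed)
    letters = list(weights)
    letter_weights = [weights[c] for c in letters]
    return "".join(rng.choices(letters, weights=letter_weights, k=length))
```

Calling random_sequence(50, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
seed=42) twice returns the same 50-letter string both times. This is
still the single-nucleotide model, so it says nothing about di- or
tri-nucleotide frequencies.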
So what I am saying is that if you need to emit "random patterns" with
similar di-nucleotide or tri-nucleotide frequencies, you should emit
di-nucleotides and tri-nucleotides rather than single nucleotides.
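The idea above can be sketched roughly as follows, in Python and under
stated assumptions: the k-letter words are counted as overlapping
windows over a reference sequence, then drawn independently in
proportion to those counts and concatenated. The names kmer_counts and
emit_from_kmers are hypothetical, and this is not the Perl code from the
original post.

```python
import random
from collections import Counter


def kmer_counts(reference, k):
    """Count the overlapping k-letter words in the reference sequence."""
    return Counter(reference[i:i + k] for i in range(len(reference) - k + 1))


def emit_from_kmers(reference, k, length, seed=None):
    """Build a sequence by drawing whole k-mers, not single letters.

    Assumption: k-mers are drawn independently of one another, so the
    correlations inside each drawn word are kept but the ones that span
    the junction between two consecutive words are not.
    """
    counts = kmer_counts(reference, k)
    if not counts:
        raise ValueError("reference must contain at least k letters")
    rng = random.Random(seed)      # Mersenne Twister under the hood
    kmers = list(counts)
    weights = [counts[m] for m in kmers]
    n_draws = -(-length // k)      # ceiling of length / k
    drawn = rng.choices(kmers, weights=weights, k=n_draws)
    # Trim the last word if length is not a multiple of k.
    return "".join(drawn)[:length]
```

For example, emit_from_kmers(reference, 2, 1000, seed=1) emits 500
di-nucleotides drawn from the reference's di-nucleotide distribution.
Because the junctions between drawn words are uncorrelated, the output
matches the reference's di-nucleotide frequencies only to an extent.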
[good/readable perl code removed: ]
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615