[Biodevelopers] Looking for repeat motifs - ideas?
Mike Marchywka
marchywka at hotmail.com
Mon Mar 3 11:48:43 EST 2008
> I did not find your program on the mailing list. Your idea of using regular
> expressions is interesting. Constructing them automatically from a specified
> number of mismatches and then filtering through them automatically is
> tricky.
I have to admit I have a few different features that I'm working with. First, general
regex code like Boost or GRETA didn't seem to be very fast. Since I have 1000's of
regexes ( mostly from PROSITE and my own ideas ) and 10 or 50k-5M sequences,
I needed to specialize the evaluation code. So, I can decide what features are needed
in a given regex or regex list and execute an evaluator which only uses that "stuff",
effectively lifting various tests out of a long loop. Further, I index the sequences, making
many searches faster.
Anyway, I also have diagnostic output that separates each component and reports
the hits as a single line with many words. So, I can easily use awk to take
only those hits with additional, ad hoc properties once the regex code has found
a reasonable candidate list in reasonable time. I haven't actually done this yet
but I do have exact forward and reverse-complement abilities now that can
find fairly short stems or pseudo-knot candidates ( no thermodynamics etc. to
check for competing structures, but I'm hoping that there are easy ways to sort out
candidate structures from Watson-Crick pairing).
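
A minimal sketch of what an exact reverse-complement stem search could look like, assuming plain DNA letters (A, C, G, T) and a simple k-mer position table. The names stem_candidates, stem_len, min_loop and max_loop are made up for illustration and are not taken from the program discussed here; like the text above, it does no thermodynamic checking.

```python
from collections import defaultdict

# Assumption: sequences use only the DNA letters A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stem_candidates(seq, stem_len=4, min_loop=3, max_loop=20):
    """Return (i, j) pairs where seq[i:i+stem_len] is the exact reverse
    complement of seq[j:j+stem_len] and the loop between them has a
    length in [min_loop, max_loop]."""
    # Index every k-mer by its start positions so each lookup is cheap.
    positions = defaultdict(list)
    for i in range(len(seq) - stem_len + 1):
        positions[seq[i:i + stem_len]].append(i)

    hits = []
    for i in range(len(seq) - stem_len + 1):
        partner = reverse_complement(seq[i:i + stem_len])
        for j in positions.get(partner, ()):
            loop = j - (i + stem_len)
            if min_loop <= loop <= max_loop:
                hits.append((i, j))
    return hits


# GGGC, a 4-base loop, then GCCC (the reverse complement of GGGC).
print(stem_candidates("GGGCAAAAGCCC", stem_len=4, min_loop=3))
```

The resulting pairs are only candidates; sorting out competing structures would still need a separate step.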
I'll have to give it some thought but I was generally going to use reference sequences
and exact rules and look for differences. If you compare two sequences and find
an important rule hit in one but not the other, then you could manually check it
and see if a single mismatch is likely to matter.
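
The comparison described above could be sketched roughly as follows, using Python's standard re module rather than the specialized evaluator. The rule names and the two toy sequences are invented for illustration; the first rule is modelled on the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}.

```python
import re


def rule_hits(rules, seq):
    """Return the names of all rules whose regex matches somewhere in seq."""
    return {name for name, pattern in rules.items() if re.search(pattern, seq)}


def differing_rules(rules, reference, other):
    """Return (only_in_reference, only_in_other) as sorted lists of rule names."""
    ref_hits = rule_hits(rules, reference)
    other_hits = rule_hits(rules, other)
    return sorted(ref_hits - other_hits), sorted(other_hits - ref_hits)


# Hypothetical rule list: name -> regex.
RULES = {
    "n_glyc_like": r"N[^P][ST][^P]",  # N-{P}-[ST]-{P} written as a regex
    "starts_mk": r"^MK",              # toy rule that both sequences satisfy
}

# Toy sequences differing by one residue (S -> A at position 4).
only_ref, only_other = differing_rules(RULES, "MKNGSAC", "MKNGAAC")
print(only_ref, only_other)
```

Rules that land in only one of the two lists are the ones worth checking by hand for a single-mismatch explanation.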
Mike Marchywka
586 Saint James Walk
Marietta GA 30067-7165
404-788-1216 (C)<- leave message
989-348-4796 (P)<- emergency only
marchywka at hotmail.com
Note: If I am asking for free stuff, I normally use for hobby/non-profit
information but may use in investment forums, public and private.
Please indicate any concerns if applicable.
Note: Hotmail is possibly blocking my mom's entire
ISP - try me on marchywka at yahoo.com if no reply
here. Thanks.
> From: nuhn at rhrk.uni-kl.de
> To: biodevelopers at bioinformatics.org
> Date: Thu, 28 Feb 2008 14:06:22 +0100
> Subject: Re: [Biodevelopers] Looking for repeat motifs - ideas?
>
> Hi, Mike!
>
> I took a look at rnamot. It looks very similar to rnabob and, just like
> rnabob, it does not look for direct repeats. :-(
>
> I did not find your program on the mailing list. Your idea of using regular
> expressions is interesting. Constructing them automatically from a specified
> number of mismatches and then filtering through them automatically is
> tricky.
>
> Your way of reducing the problem to a different one, brought me to another
> idea. Since rnabob can already find inverted repeats, every search for a
> normal repeat could perhaps be reduced to a search for an inverted repeat
> like so:
>
> Given the sequence: AAA GC AAA
> The repeat search should find the repetition of AAA
>
> The reduction goes like this:
>
> 1. Find a sequence that does not appear in the original sequence; this will
> be a separator (here: XX)
> 2. Reverse complement the original sequence and join the two with the
> separator. This would be
>
> AAA GC AAA XX TTT GC TTT
>
> 3. Now instead of searching for
>
> - Repeat,
> - 2 spaces,
> - Repeat
>
> I search for
>
> - Repeat,
> - 2 + Length of sequence + Length of separator spaces,
> - INVERTED_Repeat
>
> This would find an inverted repeat in the constructed sequence if and only
> if there is a normal repeat in the original sequence. Additionally, all of
> the nice functions of rnabob would be preserved.
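
The construction in steps 1-3 can be tried out on the toy example with a short sketch, assuming plain DNA letters and the separator XX. The helper names are made up for illustration and have nothing to do with rnabob itself. The sketch only builds the joined sequence and computes where the reverse complement of a given repeat copy ends up, so the spacer length can be checked by hand.

```python
# Assumption: sequences use only A, C, G, T; the separator is left unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_with_reverse_complement(seq, separator="XX"):
    """Build: original + separator + reverse complement of the original."""
    if separator in seq:
        raise ValueError("separator must not occur in the sequence")
    return seq + separator + reverse_complement(seq)


def partner_start(seq_len, sep_len, copy_start, motif_len):
    """Start, within the joined string, of the reverse complement of the
    motif copy that begins at copy_start in the original sequence."""
    return seq_len + sep_len + (seq_len - copy_start - motif_len)


seq = "AAAGCAAA"                      # AAA GC AAA without the spaces
joined = join_with_reverse_complement(seq)
print(joined)

# Copies of AAA start at 0 and 5 in the original; the first copy ends at 3.
for copy_start in (0, 5):
    start = partner_start(len(seq), 2, copy_start, 3)
    print(copy_start, start, joined[start:start + 3], "gap:", start - 3)
```

On this toy case the spacer of 2 + 8 + 2 = 12 reaches the reverse complement of the first AAA itself, while the reverse complement of the second AAA sits at a gap of 7. So the spacer seems to depend on where the two copies lie in the sequence, which may be worth checking before relying on a fixed spacer.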
>
> Since this is a bit complicated, I will have to sleep on this a few times.
> ;-) Even if it works in theory, rnabob might run into some memory problems,
> once the sequences get large. I don't know how big the motifs can be but I'm
> fairly certain, rnabob was not designed for something like this.
>
> - @Osnofian and RepeatMasker: I am still going to look into the program. I
> did not get to it yet, but it looks like it searches for a different kind of
> repeats at first glance.
>
> - @Mark: and http://sourceforge.net/projects/pars/ : There is no way of
> downloading your project. :-( I get the message: "This project has not yet
> created any file release packages." when I go to download.
>
> @all: Thanks for sharing your ideas so far.
>
> Cheers,
> Michael.
>
> ----- Original Message -----
> From: "Mike Marchywka"
> To: "Development in Bioinformatics"
> Sent: Wednesday, February 27, 2008 11:58 PM
> Subject: Re: [Biodevelopers] Looking for repeat motifs - ideas?
>
>
>
>> Using RegExes, how would you handle limited mismatches within the repeated
>> motif, esp. when its position is unknown?
>>
>
> Well, first, I was looking for exact things and going with the idea that
> equality is easier than a metric in a high-dimensional space. But, if you
> are looking for short things, and willing to limit yourself to 1 or 2
> mismatches, then you could split up an exact group into a pair of groups.
> For example, instead of looking for a thing 10 long with up to 1 mismatch,
> you could look for a pair of "things" each 1-10 long with a "match
> anything" field of length 0-1:
> [\-1]{1,10}.{0,1}[\-2]{1,10} etc
> and then take things that total to the desired length. This particular
> example may generate a lot of 1-0-1 hits ( two identical bases separated
> by 0 or 1 "X"'s) but, depending on what you are doing, you could filter the
> output with awk or I could make a total length requirement etc.
> ( the "-" is for forward match, NOT the reverse complement that I do by
> default )
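
For comparison, here is a rough sketch of the same split-into-two-groups idea using ordinary backreferences in Python's re module. This is not the made-up syntax shown above. It assumes a direct repeat of fixed length with at most one substituted base and a gap of known range, and it tries every position for the free "match anything" base. All function and parameter names are hypothetical.

```python
import re


def one_mismatch_repeat_patterns(length, min_gap, max_gap):
    """Yield one regex per possible mismatch position: two exact groups
    around a single free base, a gap, then the same two groups again
    around another free base."""
    for i in range(length):
        left, right = i, length - i - 1
        yield rf"(.{{{left}}}).(.{{{right}}}).{{{min_gap},{max_gap}}}\1.\2"


def find_repeats(seq, length, min_gap, max_gap):
    """Return sorted start positions of direct repeats of `length`
    with at most one substitution between the two copies."""
    hits = set()
    for pattern in one_mismatch_repeat_patterns(length, min_gap, max_gap):
        # Wrap in a lookahead so overlapping candidates are all reported.
        for match in re.finditer(f"(?={pattern})", seq):
            hits.add(match.start())
    return sorted(hits)


# Two 10-base copies separated by GG; the copies differ at one position.
print(find_repeats("ACGTACGTAC" + "GG" + "ACGTTCGTAC", 10, 2, 2))
```

This runs one pattern per mismatch position, so it is likely to be slow on large inputs; a total-length requirement or an allowed-mismatch parameter in a specialized evaluator would avoid generating the extra patterns.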
>
> I'd actually have to look; it may even be easier to code an allowed
> "mismatch" parameter if you are going to do this a lot.
>
>
> I'd have to give this some thought and maybe someone on the boost list
> could explain how to do this with a "real" perl regex ( I have a made up
> syntax and set of limitations to meet my needs with best easily achievable
> performance).
>
> Mike Marchywka
> 586 Saint James Walk
> Marietta GA 30067-7165
> 404-788-1216 (C)<- leave message
> 989-348-4796 (P)<- emergency only
> marchywka at hotmail.com
> Note: Hotmail is blocking my mom's entire
> ISP claiming it is to reduce spam but probably
> to force users to use hotmail. Please DON'T
> assume I am ignoring you and try
> me on marchywka at yahoo.com if no reply
> here. Thanks.
>
>> Date: Wed, 27 Feb 2008 16:49:28 -0500
>> From: jeff at bioinformatics.org
>> To: biodevelopers at bioinformatics.org
>> Subject: Re: [Biodevelopers] Looking for repeat motifs - ideas?
>>
>> Mike,
>>
>> Using RegExes, how would you handle limited mismatches within the repeated
>> motif, esp. when its position is unknown?
>>
>> Jeff
>>
>> Mike Marchywka wrote:
>>>
>>> The regex people probably question my syntax but I'm using things like
>>> [\1]{10,20}.{10,20}[\2]{10,20}.{10,20}[\1]{10,20}[\2]{10,20}
>>> to find pseudo knots with distance of 10-20 between reverse-complement
>>> regions.
>>
>> --
>> J.W. Bizzaro
>> Bioinformatics Organization, Inc. (Bioinformatics.Org)
>> E-mail: jeff at bioinformatics.org
>> Phone: +1 508 890 8600
>> --
>>
>> _______________________________________________
>> Biodevelopers mailing list
>> Biodevelopers at bioinformatics.org
>> http://www.bioinformatics.org/mailman/listinfo/biodevelopers
>