[BiO BB] sequence analysis

Mike Marchywka marchywka at hotmail.com
Wed Oct 24 08:27:36 EDT 2007


>Before, you should transform the database file such that

I've taken my local BLAST databases and used their FASTA form for
"grepping" (using my own code, which calls either the GRETA or Boost regex
libraries) against various genome sequences. It turns out to be too slow for
repetitive use, but I would comment as follows.
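
For anyone who wants to try the same kind of "grepping" on FASTA text, here
is a minimal Python sketch (not my actual code, which uses the C++ regex
libraries mentioned above). It joins the wrapped sequence lines of each
record before searching, so a match that straddles a line break is still
found. The names read_fasta and grep_fasta and the toy records are made up
for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def grep_fasta(lines: Iterable[str], pattern: str) -> List[Tuple[str, int, str]]:
    """Return (header, start offset, matched text) for every regex hit."""
    rx = re.compile(pattern)  # compile once, reuse for every record
    return [(h, m.start(), m.group())
            for h, s in read_fasta(lines)
            for m in rx.finditer(s)]


# Toy input: the motif CGG spans the line break inside seq1.
demo = [">seq1 toy record", "ACGTAC", "GGATCC", ">seq2", "TTTT"]
hits = grep_fasta(demo, "CGG")
```

With a real file you would pass an open file handle instead of the demo list.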

The patterns of biological interest tend to be subsets of regex, so you can
implement special-purpose code that is a lot faster when your query isn't
BLAST-friendly.
For example, a "conserved" domain may look like "neutral"-many irrelevant-
cysteine-X-cysteine-many irrelevant-H- etc. (I just made this up, but it is
based on many things I've seen in the literature). You may have a hard time
BLASTing for this, but you can grep for it with something like
[ANCQGILMFPSTWYV].{50,60}C.C.{10,100}H
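
As a quick check of what that made-up pattern accepts, here is a small
Python sketch using the standard re module. The toy peptides are invented
(K is used as filler because it is not in the "neutral" class), and
find_motif is just an illustrative name.

```python
import re

# The made-up "conserved domain" pattern from above, compiled once.
MOTIF = re.compile(r"[ANCQGILMFPSTWYV].{50,60}C.C.{10,100}H")


def find_motif(seq: str):
    """Return (start, end) of the first match, or None if there is none."""
    m = MOTIF.search(seq)
    return None if m is None else (m.start(), m.end())


# Neutral residue, 50 fillers, C-X-C, 10 fillers, H: should match end to end.
toy_hit = "A" + "K" * 50 + "CKC" + "K" * 10 + "H"
# Only 40 fillers before C-X-C: too short for .{50,60}, so no match.
toy_miss = "A" + "K" * 40 + "CKC" + "K" * 10 + "H"

hit_span = find_motif(toy_hit)
miss_span = find_motif(toy_miss)
```

The two variable-length gaps are where a general regex engine spends its
time backtracking, which is why special-purpose code can pay off.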

If you want a real-life example, here are some from PROSITE, produced by my
PROSITE-to-Perl translation scheme (I hate illustrating with real things that
may not be right):

[LIVM][VIC].[^H]G[DENQTA].[GAC][^L].[LIVMFY]{4}.{2}G >rule|16|PEPDTIDE Prosite CNMP_BINDING_1
[EQ][^LNYH].[ATV][FY][^LDAM][^T]W[^PG]N >rule|18|PEPDTIDE Prosite ACTININ_1
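
A translation of this kind can be sketched in a few lines of Python. This is
not my Perl scheme, only an illustration of the idea under the usual PROSITE
conventions: elements separated by "-", x for any residue, [..] for allowed
residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for
the sequence ends, and a trailing period. The two input strings are my
assumption, reconstructed by reversing the regexes above rather than copied
from PROSITE, and the function name is hypothetical. Rarer forms, such as >
inside square brackets, are not handled.

```python
import re

# One PROSITE element: optional '<', a core, optional (n) or (n,m), optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into an equivalent regex string."""
    pattern = pattern.strip().rstrip(".")
    out = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        rep = ""
        if lo is not None:
            rep = "{" + lo + ("," + hi if hi else "") + "}"
        out.append(("^" if start else "") + core + rep + ("$" if end else ""))
    return "".join(out)


# Assumed PROSITE-style inputs (reconstructed, see the note above).
cnmp = prosite_to_regex(
    "[LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G."
)
actinin = prosite_to_regex("[EQ]-{LNYH}-x-[ATV]-[FY]-{LDAM}-{T}-W-{PG}-N.")
```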


From what I've seen, this is too slow for grepping against many genes (or
pre-translated peptides), but you can compile the query and target for much
faster searching (similar to a transient database index). Even literal string
matching can be slow without doing this: I have 500k empirically discovered
repeats (highly redundant, lots of junk) that I can now label against 100
sequences of 60 kb each in "reasonable" time, which I could not do before.
This also works fine for the 600 or so miRNA sequences I finally figured out
how to download from Sanger :)
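
To give a flavor of what a transient index buys you for literal matching,
here is a rough Python sketch. It is one possible approach (a k-mer seed
index followed by verification), not a description of my own code; the
function names, the seed length k and the toy data are all assumptions made
for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(target: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the target to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def label_repeats(target: str, repeats: Dict[str, str],
                  k: int = 8) -> List[Tuple[int, str]]:
    """Return sorted (position, repeat name) for every exact occurrence."""
    index = build_kmer_index(target, k)  # built once per target
    hits = []
    for name, rep in repeats.items():
        if len(rep) < k:
            # Too short to seed: fall back to a plain scan.
            start = target.find(rep)
            while start != -1:
                hits.append((start, name))
                start = target.find(rep, start + 1)
            continue
        # Look up the first k letters, then verify the full repeat.
        for pos in index.get(rep[:k], ()):
            if target.startswith(rep, pos):
                hits.append((pos, name))
    return sorted(hits)


toy_target = "ACGTACGTTTGACA"
toy_repeats = {"r1": "ACGT", "r2": "TTGACA", "r3": "GGGG"}
labels = label_repeats(toy_target, toy_repeats, k=4)
```

Each query then costs a hash lookup plus a verification at the few candidate
positions, instead of a full scan of the target. An Aho-Corasick automaton
over the query set would be another way to get a similar effect.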





Mike Marchywka





>From: "Dr. Christoph Gille" <christoph.gille at charite.de>
>Reply-To: "General Forum at Bioinformatics.Org" 
><bio_bulletin_board at bioinformatics.org>
>To: "General Forum at Bioinformatics.Org" 
><bio_bulletin_board at bioinformatics.org>
>Subject: Re: [BiO BB] sequence analysis
>Date: Wed, 24 Oct 2007 09:17:08 +0200 (CEST)
>
>You could use fgrep. Fgrep is faster than grep.
>
>Before, you should transform the database file such that
>each sequence takes one line without blank and without
>line breaks (using tr and sed)
>
>Database files are optimized for punch cards for
>historical reasons. Lines are wrapped after at least 72
>characters, preventing the use of fgrep.
>




