[BiO BB] help with bacterial protein sequence comparisons
Sterten at aol.com
Sterten at aol.com
Fri May 18 01:50:12 EDT 2012
replying to myself again ... (posts appear with a lag, when the opinion
may have already changed)
ok, that's basically what blast is doing. I didn't know about blast.
_http://en.wikipedia.org/wiki/BLAST_ (http://en.wikipedia.org/wiki/BLAST)
most cited paper in the 90s !
But I wanted a small program that I understand and can easily manipulate
for different formats and additional features and without
copyright,licence,
so I wrote my own.
For the original problem (as I understood it) - I would send the blast
output to a file and filter the
output for representatives. There should be a tool for that, (I don't know)
Mine is again selfwritten. Hmm, sequences that are 99% identical to the
original can't
be only 90% identical to each other, so maybe I misunderstood.
In a message dated 18.05.2012 04:17:35 Westeuropäische Normalzeit,
Sterten at aol.com writes:
thinking more about it ...
task:
---------------------------------------
given two strings of amino-acids, S1 and S2.
Typically the lengths are ~100-~million bytes for S1 and
~100M-~4B bytes for S2, where e.g. S2 is the list of all
bacterial amino-acid sequences from genbank, joined together.
Given threshold D , find all triples(s1,s2,L) of lengths L and
subsequences s1 of length L from S1 and subsequences s2 of
length L from S2 such that p-value(|s1-s2|) < D .
Typically this is done for several Ds simultaneously,
basically trying to find the n best subsequences for some given n.
Solve that task as quickly as possible.
-----------------------------------
algorithm suggestion:
----------------------------------------------------
for all length=(4?) subsequences s1 from S1 make a table T[s1] of
addresses where that subsequence occurs in S1
walk through all length=4 subsequences s2 of S2
(load S2 into a cyclic 256 byte-buffer)
whenever s2 is marked in T (T[s2]>0), check the extended subsequences
x1 bytes forward, x2 bytes backward from s2 ,
enter it into the best(L) lists for the various lengths L=x1+x2+4.
Print those that are <tolerance
---------------------------------------------------
plot the best #matches(L) (=% similarity) in a chart
this misses triples (s1,s2,L) that have no 4-length 100%-match
but are good in total length nevertheless
----------------------------------------------------------------
does this program exist ? What is "blast" doing ?
amino acid frequencies in genbank bacterial files:
(I just calculated those - just in case someone is interested)
L 76 219736963
A 65 212665642
G 71 164123234
V 86 155933244
E 69 132851172
I 73 130723255
S 83 128052651
R 82 123992485
D 68 118503645
T 84 117688310
K 75 105966220
P 80 97716246
F 70 84543510
N 78 81056701
Q 81 79888596
Y 89 64269341
M 77 51573988
H 72 45239258
W 87 26873940
C 67 20419909
X 88 175383
_______________________________________________
BBB mailing list
BBB at bioinformatics.org
http://www.bioinformatics.org/mailman/listinfo/bbb
More information about the BBB
mailing list