# [BiO BB] help with bacterial protein sequence comparisons

Sterten at aol.com Sterten at aol.com
Thu May 17 02:00:19 EDT 2012

```thinking more about it ...

---------------------------------------
given two  strings of amino-acids, S1 and S2.
Typically the lengths are ~100-~million  bytes for S1 and
~100M-~4B bytes for S2, where e.g. S2 is the list of  all
bacterial amino-acid sequences from genbank, joined  together.

Given threshold D , find all triples(s1,s2,L) of lengths L  and
subsequences s1 of length L from S1 and subsequences s2 of
length L  from S2 such that p-value(|s1-s2|) < D .
Typically this is done for  several Ds simultaneously,
basically trying to find the n best subsequences  for some given n.

Solve that task as quickly as  possible.
-----------------------------------

algorithm  suggestion:
----------------------------------------------------
for all  length=(4?) subsequences s1 from S1 make a table T[s1] of
addresses where  that subsequence occurs in S1

walk through all length=4 subsequences s2  of S2
(load S2 into a cyclic 256 byte-buffer)

whenever s2 is marked in  T (T[s2]>0), check the extended subsequences
x1 bytes forward, x2 bytes  backward from s2 ,
enter it into the best(L) lists for the various lengths  L=x1+x2+4.
Print those that are  <tolerance
---------------------------------------------------

plot  the best #matches(L) (=% similarity) in a chart

this misses  triples (s1,s2,L) that have no 4-length 100%-match
but are good in total  length nevertheless

----------------------------------------------------------------

does this program exist ? What is "blast" doing ?

amino acid frequencies in genbank bacterial files:
(I just calculated those - just in case someone is interested)

L 76 219736963
A 65 212665642
G 71 164123234
V 86 155933244
E 69 132851172
I 73 130723255
S 83 128052651
R 82 123992485
D 68 118503645
T 84 117688310
K 75 105966220
P 80  97716246
F 70  84543510
N 78  81056701
Q 81  79888596
Y  89  64269341
M 77  51573988
H 72  45239258
W 87   26873940
C 67  20419909

X 88 175383

```