[BiO BB] Implementing HMM models in Hardware (FPGA)
vkode78 at yahoo.com
Mon Sep 15 16:37:23 EDT 2003
Mark, John, and Val, thank you all for your perspectives. I couldn't have asked for anything better. I am sure one day (soon enough) there will be a happy family of EEs, CSs, and bioinformaticians all on the same page. Compared to you guys, I am a novice on both the hardware design and the bio side. It was just great to get all these perspectives from insiders, and I thank you all for it. But for what it is worth, here is my take on it.
Clusters seem to be the buzzword in the bioinformatics world these days. People want to build bigger and better clusters, and I don't blame them: there are huge loads of data that need to be processed in these massively parallel implementations. The cost involved in these clusters comes down to roughly:
1) Each Node ~ $600
2) Bottlenecks and overhead relating to the parallel implementation
This is almost analogous to brute force: general-purpose processors, parallel implementations, and all of it at huge cost. I am sure there can be a better solution. John mentioned biochips; that sure would be nice, but it certainly won't be a reality until a decade from now (ignorant guess, please pardon me). With the current status quo sticking to computational processors, one of the alternatives could be, ta-da, the FPGA!
FPGAs as Hardware Accelerators:
1) Run-time reconfigurable (RTR)
2) High-density devices that can host highly parallel implementations
There will be a huge cost/performance difference between clusters and an array of FPGAs.
Computations like Viterbi decoding, the forward and backward algorithms, and log-odds scores can all be implemented very efficiently on an FPGA because of its look-up-table architecture. That's just the beginning. These devices are also run-time reconfigurable, so I can have Processing Engines (PEs), which I call fixed logic, that work on the data (HMM profiles), which I call the Reconfigurable Logic (RC). The PEs work on the RC data, and I can independently reconfigure just the RC to load new data (the next HMM profile) using RTR. Of course, this is just one PE and one RC; depending on the size of the FPGA device used, I can have more than one PE and RC all working in parallel.
Having said that, the trick is to make sure that the PEs are always busy, and that the reconfiguration delay and the computing cost on the host processor still justify the cost/performance figure. And since everything is generated on the fly, there has to be an optimal way of scheduling reconfiguration, which can be hard as well.
And of course, this eliminates the need for dedicated on- or off-chip memory, at the cost of the reconfiguration delay. Key to all this is the cost/performance factor.
Please feel free to throw in your comments, suggestions, and questions. I sure would like to know if there are any issues you see with the route I am taking. I hope to have a website soon where I can post my progress.
Thanks again,
val <val at vtek.com> wrote:
Thanx John for an interesting and refreshing post.
Your points sound very reasonable to me, although this is the
CS/CPU side of the story. What about the other side, the biochip
side, a direction that might be a more comfortable one to take than
HW accelerators? In other words, computational acceleration
seems to be a good thing, but it is just a fragment of the
whole cell analysis *pipeline*.
Indeed, the final goal of bioinformatics, and of in-silico cell
analysis generally, is to understand cell mechanisms and processes,
and then, based on that, proceed to drug design/discovery activities.
From that perspective, further evolution and advancement in biochip
design and functionality would be a step in the right direction.
And I mean silicon functionality, when talking about biochips
and related data. Silicon designed for (floating-point)
computing, including multiprocessor and cluster options, is still
very much the silicon designed ~50 years ago, and it has little to do
with analyzing cell mechanisms, understanding the results, and then
using them for biomedical applications.
So when designing silicon functionality, why not start right
from using silicon to implement the whole cell analysis pipeline?
Silicon - but not just a computational one, rather a *biotechnology*
(bt) silicon. That is, silicon directly interfaced electrically
with cells (in culture, 'a real sample'). The interface would
include an "input plane" (sensor plane) and an "output plane"
(driving plane), with recognition and storage logic in between.
This is indeed the quite well-known "system/lab-on-a-chip" approach,
with the lab directly interfaced with a sample, including
(in later phases) electrical driving facilities designed
to move and/or immobilize cells and to perform transfection,
electroporation, and other cell modification operations.
Of course, such an active biochip would be a massively
parallel processor, and it can be called a biotechnology (bt) processor
(vs. a computational processor), since it directly implements
a programmable cell analysis technology pipeline: input, processing,
and modification. Optical fluorescent binding patterns can also
be measured with such a chip. Its obvious advantage is that dynamic
analysis over time can be performed on the same chip; say, yeast
life-cycle dynamics with a fine time resolution (seconds or less).
What seems to be really good news is that such silicon can and needs
to be designed as an *array*: a massively parallel, fine-grained architecture
with a relatively simple microcell (vs. spaghetti-like x86s). If the total
number of transistors on the lab-on-a-chip is ~10B (which is possible now),
a grain (microprocessor) in the mega-array (1000x1000 microprocessors) may
have up to 10K transistors, which is quite enough to implement the basic
input/output processing functionality at the grain level. For a ~1 sq. in.
chip, a grain would be ~250 um in size. The input plane of a grain might
have up to 32x32 sensors, so the linear spatial resolution for cell
analysis would be ~8 um, which is OK for mid-size and large cells.
So, I guess my point is that it does not make a lot of sense to
optimize a fragment of the pipeline without looking at an integrated
cell/tissue analysis pipeline and at where silicon functionality can
best be applied.
----- Original Message -----
From: John Jakson
To: bio_bulletin_board at bioinformatics.org
Sent: Sunday, September 14, 2003 3:00 AM
Subject: Re: [BiO BB] Implementing HMM models in Hardware (FPGA)
Interesting to see others interested in applying FPGAs to Bioinformatics.
FPGAs don't get much mention here.
I am not convinced the bio industry really cares for EE solutions it doesn't
understand. Linux clusters are bad enough, but what the hell are FPGAs? As an
EE VLSI/FPGA hardhat visiting the BioWorld show, held here in Boston not
that long ago, all I saw was disinterest and plenty of tower server racks.
Not one HW company showed up with anything but Linux clusters or the
SGI/IBM/HP/... equivalent. TimeLogic and the one or two other (defunct?)
accelerator companies were no-shows. Talking with the floor folks, I found
no interest in, or basic understanding of, possible HW alternatives.
The issue comes down to how the problem is stated and how it can be
implemented in a solution that most Bio SW types can understand. That means
whatever the engine is, it must just run C code, simple as that, preferably
the free stuff from NCBI. That always leads to the same solution: clusters
of ever faster and ever hotter farms of today's x86s. Any rational computer
scientist knows this is crazy, and that dedicated HW should be built.
TimeLogic says it very well on their web site. In crypto, video, or DSP
processing, it is relatively easy to turn C code into HW, since they are all
math intensive and are likely created by the same EEs.
It may come as a surprise to SW types, but HW is routinely modeled in C;
that code is used only to double-check the design written in a decent HW
description language like Verilog or VHDL, both of which are implicitly
parallel languages. There is usually some formal mathematical model, often
written in Matlab, for the really heavy stuff. It is also interesting that
the Matlab code is usually floating-point intensive, and the final ASIC/FPGA
solutions are not expected to produce identical results, since HW is best
built integer fashion. One might regard the current bio C codes as just
simulations of HW that hasn't been built yet, since few know how to recode
them in an HW language. TimeLogic did a few, but not in a way that can be
easily duplicated.
To turn C code into really fast HW requires understanding what the C code is
really doing and having permission to make subtle but harmless changes to it
to allow the really big speedups. That means eliminating floating point. If
the bio author of such SW is also a HW expert (of which there are probably
only a handful, or even zero, in the whole world), then equivalent algorithms
could be used that are relatively simple to map onto HW structures. I don't
see the bio world hiring too many HW EEs either; we are far too different
culturally, and we usually don't have PhDs, especially not from the right schools.
There are other ways to turn C code into HW; maybe use a C-based HW language
such as HandelC, which is based on Occam and CSP. And there's the clue. If the
SW is broken up into the constituent parallel processes that are naturally
there but impossible to describe in plain C, then it becomes almost trivial
to map those parallel processes onto FPGA fabric, or even onto something like
a Transputer farm. The only difference is the granularity. FPGAs are hot today,
but they can only readily be engineered by HW types, because their most
efficient use requires detailed understanding of pipelines, combinatorial
logic, and basic CPU design. Transputers, if they still existed, would be the
natural way to go, because they are amenable to both SW and HW engineers,
though they still worked best when both SW and HW were understood. Occam was
just a way to describe parallel processes; it described HW in a funny syntax.
Transputers only died out because the implementation fell far behind x86
performance and was single-sourced and underfunded. Most Transputer projects
and users ultimately switched to standard DSPs and FPGAs, leaving the SW user
base behind.
Another approach would be to use one of the CPU-farms-on-a-chip, such as
Clearspeed or PicoChip or BOPS (RIP), who have developed RISC CPUs that can
be instantiated up to 420 times on a chip running at 100MHz-plus. It will be
interesting to see if those devices can escape cell-phone base stations.
So I have taken my passive interest in this subject back to the drawing
board, to recreate a modern FPGA-hosted Transputer that would naturally
execute sequential C code, parallel Occam code, and even Verilog code. That
means that if code can be partially migrated from sequential C to parallel
Occam-style C (i.e., HandelC), and then to Verilog (a C-ish HW language), the
same code still runs on the same CPU (a little more slowly, perhaps). Extra
process-scheduling HW is needed to support very fine-grained concurrency in a
modern Transputer, and also a logic simulator. The big payoff is that
properly parallelized code, once in Verilog form, still runs either as
compiled source code on a farm of CPUs using message passing and links, or it
can be synthesized with industry-standard HW tools back onto the FPGA fabric
for the desired speedups. In effect, sequential procedures in C code can be
morphed into on-chip HW coprocessors using the reconfigurable features of
many FPGAs. Stable FPGA coprocessor engines can then be turned into much
faster and cheaper ASICs in return for nasty upfront NRE. Such solutions
could go much farther than TimeLogic's current products, for many industries.
Xilinx & Intel can give us a clue here. A cluster CPU node based on a P4 at,
say, 3GHz might run to $2K per node, depending on what's there, even though
the fastest P4 chip itself is always, say, $600. An FPGA RISC CPU node based
on MicroBlaze runs at maybe only 125MHz, but it will cost about $1.40 per
node in volume, plus extra support. Now, if the CPU can be farmed by adding
those Transputer extensions, the 24x clock difference doesn't look so bad
compared to the estimated 400+-fold CPU cost difference. Also, a lot of
slower CPUs, each with local RLDRAM, don't have the memory latency that P4s
suffer from, i.e., one DRAM cycle is a few CPU cycles instead of hundreds,
and distributed bandwidth is much easier to manage.
It is also interesting to see the changes at TimeLogic: the departure of Jim,
and the merger with a company that, as far as I can see, has no obvious HW
background.
sorry for long rant
BiO_Bulletin_Board maillist - BiO_Bulletin_Board at bioinformatics.org