[Bioclusters] Nightly updated BLAST databases

16 Dec 2002 20:28:34 -0500

On Mon, 2002-12-16 at 19:36, Jeremy Mann wrote:
> I am implementing a nightly updated and formatdb script for the BLAST
> unformatted databases from NCBI. A researcher today asked a question that
> I could not answer. His question was, if he runs a long BLAST search
> during the time my script is running, what will happen to his returned
> search? Will he get false positive from both databases (the old one and
> the newly created one)? Will the database be locked out during his search
> and my script will fail?

I used to get this question quite a bit when I talked about the previous
scalable BLAST products I had developed.  The short answer is that you
can design your process to fit the way you want to work.

> I was amazed that I didn't think of this sooner. What does everybody here
> use as a script and how do you prevent the database from being newly
> formatted if a current BLAST search is running?

Generally this is not so hard.  You can even incorporate the update into
a queuing system, as long as you use an O(1) data distribution system,
such as the old ccp I had architected, or some newer stuff.  Use a
priority based mechanism to schedule the update to occur between
computing runs.  This requires some tuning/tweaking of the queuing
system, but it is generally not that hard to do.  

If you are going to do this by hand, use the "lsof" command to see if
you have any processes using the particular file.

Right before I did a quick run, I looked at my database indices:

        [root@head run]# lsof db/nr*
        [root@head run]#

Then I started a quick run

        [root@head small]# /big/run/ncbi/build/blastall -i cherry_tomato.fsa -o x -e 0.0001 -d nr -p blastx

and back I went to look at my indices:

        [root@head run]# lsof db/nr*
        COMMAND    PID USER  FD   TYPE DEVICE      SIZE  NODE NAME
        blastall 11443 root mem    REG    9,0   9947488 98310 db/nr.pin
        blastall 11443 root mem    REG    9,0 396957149 98309 db/nr.psq
        blastall 11443 root mem    REG    9,0 272675746 98308 db/nr.phr

You can basically implement 2 lines of Perl to do a "reference count" on
the file:

	# Count lsof's output lines for $filename, skipping the header line.
	$reference_count = `lsof $filename 2>/dev/null | tail -n +2 | wc -l`;
	chomp($reference_count);

Do the update if $reference_count == 0.  
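
When lsof is given a glob such as db/nr*, it prints one line per open
file, so the single blastall above shows up three times.  A minimal shell
sketch that counts distinct processes instead, and polls until the count
reaches zero, could look like this.  The function names and the
POLL_SECONDS variable are made up for illustration; only lsof, awk and
sleep are real commands.

```sh
#!/bin/sh
# Sketch only: function names and POLL_SECONDS are illustrative.

# Read lsof output on stdin, skip the header line, and print the number
# of distinct PIDs (column 2).  Prints 0 when lsof printed nothing.
count_holders() {
    awk 'NR > 1 { pids[$2] = 1 } END { n = 0; for (p in pids) n++; print n }'
}

# Print how many processes have any of the given files open.
db_reference_count() {
    lsof "$@" 2>/dev/null | count_holders
}

# Block until no process holds the given files open.
# POLL_SECONDS is a hypothetical tuning knob; it defaults to 60.
wait_until_idle() {
    while [ "$(db_reference_count "$@")" -gt 0 ]; do
        sleep "${POLL_SECONDS:-60}"
    done
}
```

Something like "wait_until_idle db/nr.*" would then return once the
indices are free.  Note that the check and the update are not atomic: a
search can start just after the count reads zero, which is one reason to
let the queuing system own the scheduling where you can.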

There are other "tricks" you can play.  The one I used to use was to
download the database, append the date/time to its name, wait for the
run to finish (i.e. reference count goes to 0, system quiescent), and
then swap links as Chris indicated.
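
A rough shell sketch of the dated-name and link-swap steps follows.  The
directory layout, the "current" link name and both function names are
assumptions for illustration, and "mv -T" is a GNU coreutils option, so
other systems would need a different way to rename over the old link.

```sh
#!/bin/sh
# Sketch only: the layout and function names are hypothetical.
# Assumed layout: each download lives in its own dated directory, e.g.
#   /data/blast/nr.20021216-2030/   (holding nr.pin, nr.psq, nr.phr)
# and searches use "-d /data/blast/current/nr", where "current" is a
# symlink to one of those directories.

# Print NAME with the current date/time appended, e.g. nr.20021216-2030.
dated_name() {
    printf '%s.%s\n' "$1" "$(date +%Y%m%d-%H%M)"
}

# Repoint LINK at NEW_DIR.  The new link is created under a temporary
# name and then renamed over the old one, so LINK never goes missing.
# NEW_DIR should be an absolute path, or relative to LINK's directory.
swap_link() {
    new_dir=$1
    link=$2
    ln -s "$new_dir" "$link.new.$$" && mv -T "$link.new.$$" "$link"
}
```

The old directory is left in place after the swap, so it can be kept as a
previous version or removed later.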

It is usually advisable to keep a few previous versions of the libraries
available (to check older calculations if need be, which is especially
useful for deciding whether you are looking at a signal or at noise).
These aren't things you want to commit to CVS or other versioning
systems.  All you really need is to keep a few around with versioning
meta-data attached.
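
One way to do that, sketched in shell under some assumptions: each
version sits in a directory named nr.YYYYMMDD-HHMM, and the meta-data is
a small text file called VERSION inside it.  Both conventions and both
function names are hypothetical, not something BLAST or formatdb provide.

```sh
#!/bin/sh
# Sketch only: the nr.YYYYMMDD-HHMM naming, the VERSION file and the
# function names are hypothetical conventions.

# Write a VERSION file into DB_DIR recording the install time, a
# free-form source note, and a cksum line for every other file there.
write_version_file() {
    db_dir=$1
    source_note=$2
    {
        printf 'installed: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf 'source: %s\n' "$source_note"
        for f in "$db_dir"/*; do
            if [ -f "$f" ] && [ "${f##*/}" != VERSION ]; then
                cksum "$f"
            fi
        done
    } > "$db_dir/VERSION"
}

# Delete all but the newest KEEP dated directories under BASE.  Names
# sort chronologically because the timestamp is most-significant-first.
prune_old() {
    base=$1
    keep=$2
    for d in "$base"/nr.*/; do
        if [ -d "$d" ]; then
            printf '%s\n' "${d%/}"
        fi
    done | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r old; do
        rm -rf -- "$old"
    done
}
```

For example, "prune_old /data/blast 3" would keep the three newest
versions.  The sketch assumes the version currently in use is among the
newest KEEP directories; it does not check that before deleting.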

There are many ways to do this.  This is somewhat beyond the scope of
what I can cover in a short message.

> 
> Thanks for any answers.
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615