[Biococoa-dev] BCCachedSequenceFile

Fri Sep 21 15:01:38 EDT 2007

I checked in this code a week or so ago, but never got around to  
posting a message.  I've added a new class, BCCachedSequenceFile, and  
a concrete implementation class, BCCachedFastaFile.  The idea behind  
a cached sequence file is that the sequence file is too large to load  
up into memory, yet you want to be able to access the sequence data  
while it remains on disk.  The design is a factory class,  
BCCachedSequenceFile, that defines the interface and returns a  
concrete implementation class, BCCachedFastaFile, that knows how to  
handle a specific file format.  Currently I only have a FASTA class  
as it seems most genome data is provided that way.  The  
implementation reads the sequence file and collects meta-data about  
each sequence in the file: where it starts and ends, its length, etc.  
Then the data can be accessed by providing a sequence id and a position  
within the sequence.  The class computes the file offset where  
that data resides, reads it from disk and returns it.  I still would like  
to do some optimization to speed up file access and return chunks of  
data instead of just one symbol, but for now it works pretty well.  
It is not perfect, of course: for FASTA it assumes that the line width  
within a sequence is constant, though it can vary from sequence to  
sequence in the file.  I think this is pretty typical for FASTA  
files.
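
To make the offset arithmetic concrete, here is a minimal sketch in Python rather than the Objective-C that BioCocoa uses. It is not the BCCachedFastaFile code: the function names build_index and symbol_at and the dictionary layout are made up for illustration. It relies on the same assumption as above, that the line width is constant within a sequence.

```python
def build_index(path):
    """Scan a FASTA file once and record, per sequence id, where its data
    starts, how many symbols it holds, and its line geometry."""
    index = {}
    current = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The id is taken to be the first word of the header line.
                words = line[1:].split()
                seq_id = words[0].decode("ascii") if words else ""
                current = {
                    "start": fh.tell(),  # offset of the first symbol
                    "length": 0,         # total symbols in the sequence
                    "line_width": 0,     # symbols per full line
                    "line_bytes": 0,     # bytes per full line, newline included
                }
                index[seq_id] = current
            elif current is not None:
                symbols = len(line.rstrip(b"\r\n"))
                if symbols == 0:
                    continue  # skip blank lines
                if current["line_width"] == 0:
                    # Assumption: every full line matches the first one.
                    current["line_width"] = symbols
                    current["line_bytes"] = len(line)
                current["length"] += symbols
    return index


def symbol_at(path, index, seq_id, pos):
    """Return the symbol at 0-based position pos of sequence seq_id,
    reading only one byte from disk."""
    entry = index[seq_id]
    if not 0 <= pos < entry["length"]:
        raise IndexError("position outside sequence")
    row, col = divmod(pos, entry["line_width"])
    offset = entry["start"] + row * entry["line_bytes"] + col
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(1).decode("ascii")
```

The index is built once with build_index(path) and then passed to symbol_at(path, index, "some_id", 12345) for each lookup.  Keeping line_bytes separate from line_width means the same arithmetic works for both "\n" and "\r\n" line endings.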

cheers
Scott