[Biococoa-dev] BCCachedSequenceFile
Scott Christley
schristley at mac.com
Fri Sep 21 15:01:38 EDT 2007
I checked in this code a week or so ago, but never got around to
posting a message. I've added a new class, BCCachedSequenceFile, and
a concrete implementation class, BCCachedFastaFile. The idea behind
a cached sequence file is that the sequence file is too large to load
up into memory, yet you want to be able to access the sequence data
while it remains on disk. The design is a factory class,
BCCachedSequenceFile, that defines the interface and returns a
concrete implementation class, BCCachedFastaFile, that knows how to
handle a specific file format. Currently I only have a FASTA class
as it seems most genome data is provided that way. The
implementation reads the sequence file and collects meta-data about
each sequence in the file: where it starts and ends, its length, etc. Then
the data can be accessed by providing a sequence id and a position
within the sequence. The class computes the file offset where that
data resides, reads it from disk, and returns it. I still would like
to do some optimization to speed up file access and return chunks of
data instead of just one symbol, but for now it works pretty well.
It is not perfect, of course: for FASTA it assumes that the line width
within a sequence is constant, though it can vary from sequence to
sequence in the file, but I think this is pretty typical for FASTA
files.
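
To make the offset arithmetic concrete, here is a rough sketch of the idea in Python rather than Objective-C. It is not the BioCocoa code: the names build_index and symbol_at and the layout of the index entries are made up for illustration. It assumes, as described above, that every line of a given sequence has the same width (except possibly the last one).

```python
def build_index(path):
    """Scan a FASTA file once and record metadata for each sequence.

    Hypothetical layout: each entry stores the byte offset of the first
    symbol, the total number of symbols, and the width of a sequence
    line without ("line_bases") and with ("line_bytes") its newline.
    """
    index = {}
    current = None
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            stripped = line.rstrip(b"\r\n")
            if line.startswith(b">"):
                # The id is taken to be the first word after '>'.
                fields = stripped[1:].split()
                seq_id = fields[0].decode("ascii") if fields else ""
                current = {"start": offset + len(line), "length": 0,
                           "line_bases": 0, "line_bytes": 0}
                index[seq_id] = current
            elif current is not None and stripped:
                if current["line_bases"] == 0:
                    # The first sequence line fixes the width for this record.
                    current["line_bases"] = len(stripped)
                    current["line_bytes"] = len(line)
                current["length"] += len(stripped)
            offset += len(line)
    return index


def symbol_at(path, index, seq_id, position):
    """Return the symbol at a 0-based position, reading one byte from disk."""
    entry = index[seq_id]
    if not 0 <= position < entry["length"]:
        raise IndexError("position outside sequence")
    # Which line the symbol is on, and how far into that line.
    row, col = divmod(position, entry["line_bases"])
    offset = entry["start"] + row * entry["line_bytes"] + col
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(1).decode("ascii")
```

Because the width is stored per sequence, it can differ from one sequence to the next in the same file. Returning a chunk instead of a single symbol would mean reading a longer byte range and stripping the newlines out of it.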
cheers
Scott