[Biococoa-dev] BCCachedSequenceFile

Fri Sep 21 20:30:22 EDT 2007

Hi Scott,

Thanks for adding these files, they seem very useful. Based on how  
you factored out the BCCachedFastaFile class, I was thinking maybe we  
should do the same for BCSequenceReader as well? That would probably  
make it a little easier to maintain and to add other formats.  Just a  
thought.

Also, the way your new class is set up is quite different from  
BCSequenceReader, which returns a BCSequenceArray (even if there's  
only one sequence in the file). Is it possible to use a similar  
approach for BCCachedSequenceFile as well? I think we need to make  
sure that we use a consistent approach throughout the framework, not  
only for the developers, but also for the (possible) users.  
Again, just a thought.

cheers,

- Koen.

On Sep 21, 2007, at 3:01 PM, Scott Christley wrote:

>
> I checked in this code a week or so ago, but never got around to  
> posting a message.  I've added a new class, BCCachedSequenceFile,  
> and a concrete implementation class, BCCachedFastaFile.  The idea  
> behind a cached sequence file is that the sequence file is too  
> large to load up into memory, yet you want to be able to access the  
> sequence data while it remains on disk.  The design is a factory  
> class, BCCachedSequenceFile, that defines the interface and returns  
> a concrete implementation class, BCCachedFastaFile, that knows how  
> to handle a specific file format.  Currently I only have a FASTA  
> class as it seems most genome data is provided that way.  The  
> implementation reads the sequence file and collects meta-data about  
> each sequence in the file, where it starts, ends, length, etc.   
> Then the data can be accessed by providing a sequence id and a  
> position within the sequence.  The class figures out a file offset  
> of where that data resides, reads from disk and returns.  I still  
> would like to do some optimization to speed up file access and  
> return chunks of data instead of just one symbol, but for now it  
> works pretty well.  It is not perfect of course; for FASTA it  
> assumes that the line width within a sequence is constant, though  
> it can vary from sequence to sequence in the file, but I think this  
> is pretty typical for FASTA files.
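
The lookup described above (collect per-sequence metadata once, then turn a sequence id plus position into a file offset) could be sketched roughly as follows. This is Python rather than Objective-C and is not the BCCachedFastaFile code: SequenceEntry, build_index and symbol_at are made-up names for illustration, and the sketch assumes one byte per symbol, 0-based positions, and data lines that start directly after each header line.

```python
from dataclasses import dataclass


@dataclass
class SequenceEntry:
    """Per-sequence metadata (hypothetical layout, not the BioCocoa one)."""
    data_start: int   # byte offset of the first symbol after the header line
    length: int       # number of symbols, line terminators excluded
    line_width: int   # symbols per full line within this sequence
    line_bytes: int   # bytes per full line, line terminator included


def build_index(path):
    """Scan the file once and record where each sequence's data lives."""
    index = {}
    current = None
    offset = 0  # byte offset of the start of the line being examined
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                fields = raw[1:].split()
                seq_id = fields[0].decode("ascii") if fields else ""
                current = SequenceEntry(offset + len(raw), 0, 0, 0)
                index[seq_id] = current
            elif current is not None:
                symbols = len(raw.rstrip(b"\r\n"))
                if current.line_width == 0:
                    # Assumption: the first data line fixes the width
                    # for the rest of this sequence.
                    current.line_width = symbols
                    current.line_bytes = len(raw)
                current.length += symbols
            offset += len(raw)
    return index


def symbol_at(path, entry, position):
    """Return the symbol at a 0-based position by reading one byte from disk."""
    if not 0 <= position < entry.length:
        raise IndexError("position outside sequence")
    # Which line the symbol is on, and how far into that line.
    line, column = divmod(position, entry.line_width)
    with open(path, "rb") as handle:
        handle.seek(entry.data_start + line * entry.line_bytes + column)
        return handle.read(1).decode("ascii")
```

Keeping the line width in symbols separate from the line size in bytes means the same arithmetic works for both LF and CRLF files, and storing both per sequence allows the width to differ from one sequence to the next. Reading a chunk instead of a single symbol would need the embedded line terminators stripped from whatever is read.
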
>
> cheers
> Scott
>
> _______________________________________________
> Biococoa-dev mailing list
> Biococoa-dev at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biococoa-dev