[Biococoa-dev] reading large fasta files
Scott Christley
schristley at mac.com
Mon Oct 22 11:30:49 EDT 2007
On Oct 21, 2007, at 1:30 PM, Koen van der Drift wrote:
>
> On Oct 21, 2007, at 11:22 AM, Charles Parnot wrote:
>
>> In general, it would be best to have the implementation hidden, so
>> that indeed, the framework decides when to use one subclass or
>> another. Just like NSString, NSData, or NSArray use different
>> underlying data structures depending on the size of the data (I
>> think). This is of course all hidden behind the class cluster
>> design...
>>
>> I also don't know how things are already implemented, maybe things
>> are already addressed this way?
>
>
> Yes, I agree with having all that code hidden, so that there's only
> one class for users to implement when reading data, whether it's
> from a path or a string or data. Right now the class to read large
> (fasta) files is a separate class that works with a filePath, but
> is not a subclass of BCSequenceReader. So we need to think about
> how to implement it. The way we use BCSequenceReader right now is
> as follows:
>
> BCSequenceReader *sequenceReader = [[BCSequenceReader alloc] init];
> BCSequenceArray *sequenceArray = [sequenceReader readFileUsingPath: aPath];
> BCSequence *mySequence = [sequenceArray objectAtIndex: i];
>
>
> We could change this (or add the possibility) to use it as follows:
>
> BCSequenceReader *sequenceReader = [[BCSequenceReader alloc] initWithPath: aPath];
> BCSequenceArray *sequenceArray = [sequenceReader readSequenceArray];
> BCSequence *mySequence = [sequenceArray objectAtIndex: i];
>
>
> However, to make it more complicated, BCCachedFastaFile doesn't
> return an array of sequences, IIRC, it is actually a standalone
> object that can be used to access regions of very large files,
> without reading the whole sequence. I can't think of a way right
> now to combine this with BCSequenceReader. Does anyone have a suggestion?

The change to BCSequenceReader sounds reasonable; it follows the
design of the Cocoa collection classes for initializing from a file.
I suppose the conceptual difference, though, is that a
BCSequenceReader can be used to read from multiple files, so it
doesn't represent the collection itself. It is BCSequenceArray that
represents the collection. So one could consider changing
BCSequenceArray instead:

BCSequenceArray *sequenceArray = [[BCSequenceArray alloc] initWithPath: aPath];

BCSequenceArray would then use BCSequenceReader or
BCCacheSequenceFile internally to perform the read. While I like this
design, it is complicated by the many possible sequence file formats.
We can certainly extend the initWithPath: method with an additional
format: parameter, though, as I recently did with BCSequenceReader.
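
As a rough illustration of the idea above, here is a minimal sketch of a
collection that loads itself from a path plus a format: parameter. All
names (DemoSequenceArray, DemoFileFormat, sequenceAtIndex:) are
hypothetical stand-ins, not the BioCocoa API; it assumes ARC, handles
only FASTA, and stores plain strings rather than sequence objects.

```objc
#import <Foundation/Foundation.h>

// Hypothetical format tags for this sketch; not BioCocoa's constants.
typedef enum {
    DemoFileFormatFasta,
    DemoFileFormatOther
} DemoFileFormat;

// Stand-in for a collection that loads itself from a file.
@interface DemoSequenceArray : NSObject
- (id)initWithPath:(NSString *)path format:(DemoFileFormat)format;
- (NSUInteger)count;
- (NSString *)sequenceAtIndex:(NSUInteger)index;
@end

@implementation DemoSequenceArray {
    NSMutableArray *_sequences; // one NSMutableString per record
}

- (id)initWithPath:(NSString *)path format:(DemoFileFormat)format
{
    self = [super init];
    if (self == nil) return nil;

    // Only FASTA is handled here; a fuller version would dispatch on format.
    if (format != DemoFileFormatFasta) return nil;

    NSString *text = [NSString stringWithContentsOfFile:path
                                               encoding:NSUTF8StringEncoding
                                                  error:NULL];
    if (text == nil) return nil;

    _sequences = [[NSMutableArray alloc] init];
    NSMutableString *current = nil;
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    for (NSString *line in [text componentsSeparatedByString:@"\n"]) {
        NSString *trimmed = [line stringByTrimmingCharactersInSet:blank];
        if ([trimmed hasPrefix:@">"]) {
            // A header line starts a new record.
            current = [NSMutableString string];
            [_sequences addObject:current];
        } else if (current != nil) {
            // Residue lines are appended to the current record.
            [current appendString:trimmed];
        }
    }
    return self;
}

- (NSUInteger)count { return [_sequences count]; }

- (NSString *)sequenceAtIndex:(NSUInteger)index
{
    return [_sequences objectAtIndex:index];
}
@end
```

The caller only sees initWithPath:format:, so the collection is free to
pick whichever reader it wants behind that one entry point.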

The point about the BCCachedFastaFile interface is well taken; for
programs to use it with the implementation hidden, we would need a
"cached" version of BCSequence. Conceptually, BCSequenceReader has
data in memory and BCSequence holds pointers to that in-memory data,
while BCCachedFastaFile has data on disk and a "BCCachedSequence"
would hold pointers to the on-disk data. BCSequence and
BCCachedSequence would implement the same interface, so the user
wouldn't know the difference.
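
To make the "same interface" idea concrete, here is a minimal sketch
with one protocol and two implementations, one backed by memory and one
by a file offset. The names (DemoSequenceAccess, DemoMemorySequence,
DemoCachedSequence) are hypothetical, not existing BioCocoa classes. It
assumes ARC and, as a simplification, that the residues sit on disk as
one contiguous run of ASCII bytes with no line breaks.

```objc
#import <Foundation/Foundation.h>

// Hypothetical shared interface for in-memory and on-disk sequences.
@protocol DemoSequenceAccess <NSObject>
- (NSUInteger)length;
- (NSString *)subsequenceInRange:(NSRange)range;
@end

@interface DemoMemorySequence : NSObject <DemoSequenceAccess>
- (id)initWithString:(NSString *)residues;
@end

@interface DemoCachedSequence : NSObject <DemoSequenceAccess>
- (id)initWithPath:(NSString *)path
            offset:(unsigned long long)offset
            length:(NSUInteger)length;
@end

@implementation DemoMemorySequence {
    NSString *_residues; // whole sequence held in memory
}
- (id)initWithString:(NSString *)residues
{
    self = [super init];
    if (self != nil) _residues = [residues copy];
    return self;
}
- (NSUInteger)length { return [_residues length]; }
- (NSString *)subsequenceInRange:(NSRange)range
{
    if (NSMaxRange(range) > [_residues length]) return nil;
    return [_residues substringWithRange:range];
}
@end

@implementation DemoCachedSequence {
    NSString *_path;            // file holding the residues
    unsigned long long _offset; // byte offset of the first residue
    NSUInteger _length;         // number of residues
}
- (id)initWithPath:(NSString *)path
            offset:(unsigned long long)offset
            length:(NSUInteger)length
{
    self = [super init];
    if (self != nil) {
        _path = [path copy];
        _offset = offset;
        _length = length;
    }
    return self;
}
- (NSUInteger)length { return _length; }
- (NSString *)subsequenceInRange:(NSRange)range
{
    if (NSMaxRange(range) > _length) return nil;
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:_path];
    if (handle == nil) return nil;
    // Read only the requested bytes instead of the whole file.
    [handle seekToFileOffset:_offset + range.location];
    NSData *bytes = [handle readDataOfLength:range.length];
    [handle closeFile];
    return [[NSString alloc] initWithData:bytes
                                 encoding:NSASCIIStringEncoding];
}
@end
```

Code written against id<DemoSequenceAccess> can call
subsequenceInRange: without knowing which of the two classes it has
been handed.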

cheers
Scott