[Biococoa-dev] reading large fasta files

Scott Christley schristley at mac.com
Mon Oct 22 11:30:49 EDT 2007


On Oct 21, 2007, at 1:30 PM, Koen van der Drift wrote:

>
> On Oct 21, 2007, at 11:22 AM, Charles Parnot wrote:
>
>> In general, it would be best to have the implementation hidden, so  
>> that indeed, the framework decides when to use one subclass or  
>> another. Just like NSString, NSData, or NSArray use different  
>> underlying data structures depending on the size of the data (I  
>> think). This is of course all hidden behind the class cluster  
>> design...
>>
>> I also don't know how things are already implemented, maybe things  
>> are already addressed this way?
>
>
> Yes, I agree with having all that code hidden, so that there's only  
> one class for users to implement when reading data, whether it's  
> from a path or a string or data. Right now the class to read large  
> (fasta) files is a separate class that works with a filePath, but  
> is not a subclass of BCSequenceReader. So we need to think about  
> how to implement it. The way we use BCSequenceReader right now is  
> as follows:
>
> 	BCSequenceReader *sequenceReader = [[BCSequenceReader alloc] init];
> 	BCSequenceArray *sequenceArray = [sequenceReader readFileUsingPath: aPath];
> 	BCSequence *mySequence = [sequenceArray objectAtIndex: i];
>
>
> We could change this (or add the possibility) to use it as follows:
>
> 	BCSequenceReader *sequenceReader = [[BCSequenceReader alloc] initWithPath: aPath];
> 	BCSequenceArray *sequenceArray = [sequenceReader readSequenceArray];
> 	BCSequence *mySequence = [sequenceArray objectAtIndex: i];
>
>
> However, to complicate matters, BCCachedFastaFile doesn't return an  
> array of sequences. IIRC, it is a standalone object that can be used  
> to access regions of very large files without reading the whole  
> sequence. I can't think of a way right now to combine this with  
> BCSequenceReader. Does anyone have a suggestion?


The change to BCSequenceReader sounds reasonable; it follows the  
design of the Cocoa collection classes for initializing from a file.  
The conceptual difference, though, is that a BCSequenceReader can be  
used to read from multiple files and doesn't represent the collection  
itself; it is BCSequenceArray that represents the collection. So one  
could consider changing BCSequenceArray instead:


BCSequenceArray *sequenceArray = [[BCSequenceArray alloc] initWithPath: aPath];


Then BCSequenceArray would use BCSequenceReader or  
BCCachedSequenceFile internally to do the actual reading. I like this  
design, but it is complicated by the many sequence file formats we  
have. We can certainly extend the initWithPath: method with an  
additional format: parameter, as I recently did with  
BCSequenceReader.
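
To make the idea concrete, here is a minimal sketch of a collection  
class that decides for itself how to load a file. It is not existing  
BioCocoa code: the Demo* names, the format constants, and the 64 MB  
cutoff are all assumptions made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-ins for illustration; none of these names are BioCocoa API.
typedef enum { DemoFormatUnknown = 0, DemoFormatFasta } DemoFormat;

// Assumed cutoff above which the whole file is not read into memory.
static const unsigned long long DemoCacheThreshold = 64ULL * 1024 * 1024;

// Plays the role of BCSequenceArray: the collection picks the loading strategy.
@interface DemoSequenceArray : NSObject {
    BOOL usesDiskCache;
}
+ (BOOL)shouldCacheFileOfSize:(unsigned long long)size format:(DemoFormat)format;
- (id)initWithPath:(NSString *)path format:(DemoFormat)format;
- (BOOL)usesDiskCache;
@end

@implementation DemoSequenceArray

// Only FASTA has a disk-backed reader in this sketch, so every other
// format always takes the in-memory path.
+ (BOOL)shouldCacheFileOfSize:(unsigned long long)size format:(DemoFormat)format
{
    return format == DemoFormatFasta && size > DemoCacheThreshold;
}

- (id)initWithPath:(NSString *)path format:(DemoFormat)format
{
    self = [super init];
    if (self != nil) {
        // A missing file yields nil attributes and therefore a size of 0.
        NSDictionary *attrs = [[NSFileManager defaultManager]
            attributesOfItemAtPath:path error:NULL];
        usesDiskCache = [[self class] shouldCacheFileOfSize:[attrs fileSize]
                                                     format:format];
        // A real implementation would now hand the path to the in-memory
        // reader or to the cached-file class, depending on usesDiskCache.
    }
    return self;
}

- (BOOL)usesDiskCache { return usesDiskCache; }

@end
```

The caller only ever writes [[DemoSequenceArray alloc] initWithPath:  
aPath format: aFormat]; which backend does the work stays an internal  
detail, much like a class cluster.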


The point about the BCCachedFastaFile interface is well taken. For  
programs to use it with the implementation hidden, we would need a  
"cached" version of BCSequence. Conceptually, BCSequenceReader has  
the data in memory and BCSequence holds pointers to that in-memory  
data, while BCCachedFastaFile has the data on disk and a  
"BCCachedSequence" would hold pointers to the on-disk data.  
BCSequence and BCCachedSequence would implement the same interface,  
so the user wouldn't know the difference.
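
A rough sketch of that shared interface, again with made-up names: the  
DemoSequence protocol and both classes are hypothetical, and the  
on-disk variant assumes one byte per symbol with no line breaks inside  
the sequence data, which a real FASTA cache could not assume. The code  
is written for ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical shared interface; the method names are illustrative only.
@protocol DemoSequence <NSObject>
- (unsigned long long)length;
- (NSString *)symbolsInRange:(NSRange)range;
@end

// In-memory variant (the BCSequence role): holds the symbols directly.
@interface DemoMemorySequence : NSObject <DemoSequence> {
    NSString *symbols;
}
- (id)initWithString:(NSString *)aString;
@end

@implementation DemoMemorySequence
- (id)initWithString:(NSString *)aString
{
    self = [super init];
    if (self != nil) symbols = [aString copy];
    return self;
}
- (unsigned long long)length { return [symbols length]; }
- (NSString *)symbolsInRange:(NSRange)range
{
    return [symbols substringWithRange:range];
}
@end

// On-disk variant (the "BCCachedSequence" role): holds only a file path
// plus the byte offset and length of the sequence data within that file.
@interface DemoDiskSequence : NSObject <DemoSequence> {
    NSString *path;
    unsigned long long offset;
    unsigned long long length;
}
- (id)initWithPath:(NSString *)aPath
            offset:(unsigned long long)anOffset
            length:(unsigned long long)aLength;
@end

@implementation DemoDiskSequence
- (id)initWithPath:(NSString *)aPath
            offset:(unsigned long long)anOffset
            length:(unsigned long long)aLength
{
    self = [super init];
    if (self != nil) {
        path = [aPath copy];
        offset = anOffset;
        length = aLength;
    }
    return self;
}
- (unsigned long long)length { return length; }
- (NSString *)symbolsInRange:(NSRange)range
{
    // Seek straight to the requested region; nothing else is read.
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:path];
    [handle seekToFileOffset:offset + range.location];
    NSData *data = [handle readDataOfLength:range.length];
    [handle closeFile];
    return [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];
}
@end
```

Client code would hold an id<DemoSequence> and call symbolsInRange:  
without caring which of the two classes it got back.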



cheers
Scott



