[Biococoa-dev] biococoa svn and everything

Scott Christley schristley at mac.com
Wed Oct 3 20:50:25 EDT 2007


On Oct 3, 2007, at 8:32 PM, Koen van der Drift wrote:

>> I tend to be an iterative programmer, trying out a design then  
>> tweaking it until it converges on a best setup, so I think I'm  
>> getting there and once I do I will document more thoroughly.  The  
>> suffix array is a nifty data structure, it essentially holds all  
>> of the strings in a sequence in sorted order, making it quick to  
>> search for exact string matches.
>
> Ok, ignorant question #2, isn't a sequence just one (1) string?  
> Again, this is from someone who works with proteins ;-)

To be more specific, it maintains a sorted list of suffix strings.   
So if this is your sequence:

ATTGCAGTCCG

Then the suffix array keeps a sorted list of suffix strings:

AGTCCG
ATTGCAGTCCG
CAGTCCG
CCG
CG
G
GCAGTCCG
GTCCG
TCCG
TGCAGTCCG
TTGCAGTCCG
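
For illustration, here is a minimal sketch in plain C of how such a sorted list could be built. It is not the BioCocoa implementation: `build_suffix_array` and `compare_suffixes` are hypothetical names, and sorting whole suffixes with `qsort` and `strcmp` is the simplest approach rather than an efficient one for large sequences.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order two suffix pointers lexicographically. */
static int compare_suffixes(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/*
 * Build a naive suffix array for seq (hypothetical helper, not BioCocoa API).
 * Returns a malloc'ed array of strlen(seq) pointers into seq, sorted
 * lexicographically; the caller frees it. Returns NULL if allocation fails.
 */
const char **build_suffix_array(const char *seq, size_t *count)
{
    size_t n = strlen(seq);
    const char **suffixes = malloc((n ? n : 1) * sizeof *suffixes);
    if (suffixes == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        suffixes[i] = seq + i;   /* suffix i starts at offset i */
    qsort(suffixes, n, sizeof *suffixes, compare_suffixes);
    *count = n;
    return suffixes;
}
```

Calling it on ATTGCAGTCCG gives 11 pointers, and printing them in order reproduces the list above. Each entry points into the original sequence, so no suffix is copied.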


So if you are searching for exact or nearly exact strings in a large  
sequence, you can binary-search the sorted suffixes, which is  
considerably faster than trying to use BLAST, for example.
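
The exact-match case can be sketched as a binary search over the sorted suffixes. This is only an illustration in plain C: `find_exact` is a hypothetical helper, it assumes the suffix array is stored as start offsets into the sequence, and it does not cover the "almost exact" case.

```c
#include <stddef.h>
#include <string.h>

/*
 * Find one occurrence of pattern in seq using a suffix array.
 * sa holds the n start offsets of seq's suffixes in sorted order.
 * Returns the offset of a match in seq, or -1 if there is none.
 * (Hypothetical helper for illustration, not BioCocoa API.)
 */
long find_exact(const char *seq, const size_t *sa, size_t n,
                const char *pattern)
{
    size_t m = strlen(pattern);
    size_t lo = 0, hi = n;            /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* Compare the pattern with the first m characters of this suffix. */
        int cmp = strncmp(pattern, seq + sa[mid], m);
        if (cmp == 0)
            return (long)sa[mid];     /* pattern is a prefix of the suffix */
        if (cmp < 0)
            hi = mid;                 /* pattern sorts before this suffix */
        else
            lo = mid + 1;             /* pattern sorts after this suffix */
    }
    return -1;
}
```

For ATTGCAGTCCG the sorted offsets are 5, 0, 4, 8, 9, 10, 3, 6, 7, 2, 1 (counting from 0), and searching for GCAG looks at only four suffixes before finding the match at offset 3.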



>
>>
>>> Another question I have is why you are using calls such as fopen,  
>>> fread, etc instead of the methods that Obj-C and Cocoa provide  
>>> for I/O. Mind you, I am just trying to understand the code, it's  
>>> no criticism at all.
>>
>> I presume you mean NSFileHandle?  I was actually thinking of using  
>> it, the current code which uses fopen, fread, etc is from the  
>> original code for my standalone programs.  The main reason why I  
>> didn't switch over is that NSFileHandle can only return data with  
>> NSData, and the kinds of programs that use suffix arrays and the  
>> like do a lot of file reading, which would mean lots and lots of NSData  
>> objects being created and released.  If only NSFileHandle could  
>> put the data directly into a buffer provided by the user, or an  
>> existing NSMutableData, that would be perfect.
>
> A quick search in cocoabuilder.com gave the following snippet:
>
>   NSMutableData *data = [NSMutableData data];
>   NSData *someData;
>   NSFileHandle *readHandle = [[aTask standardOutput] fileHandleForReading];
>
>   while ((someData = [readHandle availableData]) && [someData length]) {
>     [data appendData:someData];  // Or, if possible, process the data here
>   }

Unfortunately not. If you look in the while loop, it still creates a  
temporary autoreleased NSData object on every call to [readHandle  
availableData]. Imagine that loop running billions of times to read  
short pieces of data from the file; that's a lot of objects being  
created and released.

What I really want is an interface something like this:

char buffer[1000];
// readBytes:length: is hypothetical; NSFileHandle has no such method
while ([readHandle readBytes: buffer length: sizeof(buffer)] > 0) {
	// do something with data in buffer
}

If you look at BCCachedSequenceFile, you will see that I implemented  
such an interface.
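
For comparison, the same reuse-one-buffer pattern with the plain C calls mentioned earlier looks roughly like the sketch below. It is not the BCCachedSequenceFile code; `count_bytes_chunked` is a hypothetical name, and the loop body just counts bytes where real code would process them.

```c
#include <stdio.h>

/*
 * Read a file in fixed-size chunks into one caller-side buffer, so no
 * per-read objects are allocated. Returns the total number of bytes
 * read, or -1 if the file cannot be opened or a read error occurs.
 * (count_bytes_chunked is a hypothetical name for illustration.)
 */
long count_bytes_chunked(const char *path)
{
    char buffer[1000];              /* reused for every read */
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    long total = 0;
    size_t got;
    while ((got = fread(buffer, 1, sizeof buffer, fp)) > 0) {
        /* do something with the 'got' bytes in buffer */
        total += (long)got;
    }

    int failed = ferror(fp);        /* distinguish a read error from EOF */
    fclose(fp);
    return failed ? -1 : total;
}
```

The buffer lives on the stack and is overwritten on each pass, so the number of allocations stays constant no matter how many reads the loop performs.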


cheers
Scott



