[Biococoa-dev] biococoa svn and everything
Scott Christley
schristley at mac.com
Wed Oct 3 20:50:25 EDT 2007
On Oct 3, 2007, at 8:32 PM, Koen van der Drift wrote:
>> I tend to be an iterative programmer, trying out a design then
>> tweaking it until it converges on a best setup, so I think I'm
>> getting there and once I do I will document more thoroughly. The
>> suffix array is a nifty data structure, it essentially holds all
>> of the strings in a sequence in sorted order, making it quick to
>> search for exact string matches.
>
> Ok, ignorant question #2, isn't a sequence just one (1) string?
> Again, this is from someone who works with proteins ;-)
To be more specific, it maintains a sorted list of suffix strings.
So if this is your sequence:
ATTGCAGTCCG
Then the suffix array keeps a sorted list of suffix strings:
AGTCCG
ATTGCAGTCCG
CAGTCCG
CCG
CG
G
GCAGTCCG
GTCCG
TCCG
TGCAGTCCG
TTGCAGTCCG
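So if you are searching for exact or almost-exact strings in a large
sequence, using a suffix array is considerably faster than trying to
use BLAST, for example: because the suffixes are sorted, an exact
match can be found with a binary search instead of a scan of the
whole sequence.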
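To make that concrete, here is a minimal sketch in plain C. It is an
illustration only, not the BioCocoa code; the names build_suffix_array
and find_exact are made up for this example. It stores the start
position of each suffix rather than copies of the strings, sorts those
positions, and then binary-searches them. The naive qsort-based build
is fine for a toy sequence but far too slow for a genome.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sequence being indexed. qsort's comparator takes no context
   argument, so this sketch keeps the sequence in a file-scope variable. */
static const char *g_seq;

/* Order two suffix start positions by the suffixes they point at. */
static int compare_suffixes(const void *a, const void *b)
{
    return strcmp(g_seq + *(const int *)a, g_seq + *(const int *)b);
}

/* Fill sa[0..n-1] with the suffix start positions in sorted suffix
   order. Naive construction, for illustration only. */
void build_suffix_array(const char *seq, int *sa, int n)
{
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_seq = seq;
    qsort(sa, (size_t)n, sizeof sa[0], compare_suffixes);
}

/* Binary search for an exact match. Returns a start position in seq
   where pattern occurs, or -1 if it does not occur. */
int find_exact(const char *seq, const int *sa, int n, const char *pattern)
{
    size_t m = strlen(pattern);
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        /* Compare the pattern against the first m characters of the suffix. */
        int cmp = strncmp(pattern, seq + sa[mid], m);
        if (cmp == 0)
            return sa[mid];
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

For ATTGCAGTCCG the sorted positions come out as 5, 0, 4, 8, 9, 10, 3,
6, 7, 2, 1, which is the same order as the list above, and a lookup
of GCAG needs only a handful of comparisons to return position 3.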
>
>>
>>> Another question I have is why you are using calls such as fopen,
>>> fread, etc instead of the methods that Obj-C and Cocoa provide
>>> for I/O. Mind you, I am just trying to understand the code, it's
>>> no criticism at all.
>>
>> I presume you mean NSFileHandle? I was actually thinking of using
>> it, the current code which uses fopen, fread, etc is from the
>> original code for my standalone programs. The main reason why I
>> didn't switch over is that NSFileHandle can only return data with
>> NSData, and the type of programs which use suffix arrays, etc.,
>> do a lot of file reading, which would mean lots and lots of NSData
>> objects being created and released. If only NSFileHandle could
>> put the data directly into a buffer provided by the user, or an
>> existing NSMutableData, that would be perfect.
>
> A quick search in cocoabuilder.com gave the following snippet:
>
> NSMutableData *data = [NSMutableData data];
> NSData *someData;
> NSFileHandle *readHandle = [[aTask standardOutput] fileHandleForReading];
>
> while ((someData = [readHandle availableData]) && [someData length]) {
>     [data appendData:someData]; // Or, if possible, process the data here
> }
Unfortunately that doesn't help. If you look at the while loop, it
still creates a temporary autoreleased NSData object on every call to
[readHandle availableData]. Imagine that loop running billions of
times to read short pieces of data from the file: that's a lot of
objects being created and released.
What I really want is an interface something like this:
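char buffer[1000];
// readIntoBuffer:length: is a hypothetical method; NSFileHandle does not provide it
while ([readHandle readIntoBuffer: buffer length: sizeof(buffer)]) {
    // do something with data in buffer
}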
If you look at BCCachedSequenceFile, you will see that I implemented
such an interface.
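For comparison, the plain C version of the same idea looks roughly
like the sketch below. This is not the actual BCCachedSequenceFile
code; read_into_buffer and count_base are hypothetical names used
only to show the pattern of one caller-owned buffer being reused for
every read, so nothing is allocated inside the loop.

```c
#include <stdio.h>

/* Hypothetical helper: read up to `capacity` bytes from `fp` into the
   caller's buffer. Returns the number of bytes read, which is 0 at
   end of file or on error. */
size_t read_into_buffer(FILE *fp, char *buffer, size_t capacity)
{
    return fread(buffer, 1, capacity, fp);
}

/* Example consumer: count occurrences of `base` in a file, reusing a
   single stack buffer for every read. Returns -1 if the file cannot
   be opened. */
long count_base(const char *path, char base)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char buffer[1000];
    size_t n;
    long count = 0;

    while ((n = read_into_buffer(fp, buffer, sizeof buffer)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buffer[i] == base)
                count++;
        }
    }

    fclose(fp);
    return count;
}
```

However many times the loop runs, the only storage involved is the
1000-byte buffer on the stack.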
cheers
Scott