[Biococoa-dev] BCSequence class cluster? [Was Re: Introducing myself]

Fri Feb 27 18:11:11 EST 2009

I non-hypothetically agree with Charles :)

Also, I highly recommend that new developers look at the
mailing-list archives for the lengthy discussions about the
design and structure of the framework. I think in the end we all
agreed that this structure works very well. The two most important
design decisions are the use of sequence alphabets and class clusters.
The first means only the BCSequence class is needed, instead of separate
classes for proteins, nucleotides, etc. By specifying the alphabet,
the right type of sequence will be created. It also takes care of
allowing only those actions on a sequence that are sensible; e.g., you
cannot translate a protein. The alphabet design was 'borrowed'
from the BioJava project. The class cluster (IIRC) allows hiding
the implementation of a group of similar classes, such as
BCSequence and BCCachedSequence, as pointed out by Charles.
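
To make the two ideas concrete, here is a minimal sketch of a class
cluster whose alphabet decides which operations are allowed. All of the
names (SketchSequence, SketchAlphabet, and so on) are hypothetical and
are not the BioCocoa API; the in-memory and file-backed storage simply
follows the NSData vs. NSFileHandle-plus-NSRange idea from Craig's mail
quoted below.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical names throughout; this is not the BioCocoa API.
typedef NS_ENUM(NSInteger, SketchAlphabet) { SketchAlphabetDNA, SketchAlphabetProtein };

// Public abstract class: the only type a caller ever names.
@interface SketchSequence : NSObject
@property (nonatomic, readonly) SketchAlphabet alphabet;
+ (SketchSequence *)sequenceWithString:(NSString *)string
                              alphabet:(SketchAlphabet)alphabet;
+ (SketchSequence *)sequenceWithContentsOfFile:(NSString *)path
                                         range:(NSRange)range
                                      alphabet:(SketchAlphabet)alphabet;
- (NSUInteger)length;
- (NSString *)sequenceString;
- (BOOL)canTranslate; // decided by the alphabet, not by the subclass
@end

// Private parts of the cluster; normally hidden in the .m file.
@interface SketchSequence ()
- (instancetype)initWithAlphabet:(SketchAlphabet)alphabet;
@end
@interface SketchMemorySequence : SketchSequence
- (instancetype)initWithData:(NSData *)data alphabet:(SketchAlphabet)alphabet;
@end
@interface SketchFileSequence : SketchSequence
- (instancetype)initWithPath:(NSString *)path range:(NSRange)range
                    alphabet:(SketchAlphabet)alphabet;
@end

@implementation SketchSequence
- (instancetype)initWithAlphabet:(SketchAlphabet)alphabet {
    if ((self = [super init])) { _alphabet = alphabet; }
    return self;
}
// The factory methods pick the concrete subclass for the caller.
+ (SketchSequence *)sequenceWithString:(NSString *)string
                              alphabet:(SketchAlphabet)alphabet {
    NSData *data = [string dataUsingEncoding:NSASCIIStringEncoding];
    return [[SketchMemorySequence alloc] initWithData:data alphabet:alphabet];
}
+ (SketchSequence *)sequenceWithContentsOfFile:(NSString *)path
                                         range:(NSRange)range
                                      alphabet:(SketchAlphabet)alphabet {
    return [[SketchFileSequence alloc] initWithPath:path range:range alphabet:alphabet];
}
// Primitive methods that every concrete subclass must override.
- (NSUInteger)length { [self doesNotRecognizeSelector:_cmd]; return 0; }
- (NSString *)sequenceString { [self doesNotRecognizeSelector:_cmd]; return nil; }
// Only nucleotide sequences can be translated.
- (BOOL)canTranslate { return self.alphabet == SketchAlphabetDNA; }
@end

@implementation SketchMemorySequence { NSData *_data; }
- (instancetype)initWithData:(NSData *)data alphabet:(SketchAlphabet)alphabet {
    if ((self = [super initWithAlphabet:alphabet])) { _data = [data copy]; }
    return self;
}
- (NSUInteger)length { return _data.length; }
- (NSString *)sequenceString {
    return [[NSString alloc] initWithData:_data encoding:NSASCIIStringEncoding];
}
@end

@implementation SketchFileSequence { NSString *_path; NSRange _range; }
- (instancetype)initWithPath:(NSString *)path range:(NSRange)range
                    alphabet:(SketchAlphabet)alphabet {
    if ((self = [super initWithAlphabet:alphabet])) { _path = [path copy]; _range = range; }
    return self;
}
- (NSUInteger)length { return _range.length; }
- (NSString *)sequenceString {
    // Read only the requested byte range; the file is never loaded whole.
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:_path];
    if (handle == nil) { return nil; }
    [handle seekToFileOffset:_range.location];
    NSData *data = [handle readDataOfLength:_range.length];
    [handle closeFile];
    return [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];
}
@end

int main(void) {
    @autoreleasepool {
        // Write a tiny FASTA file; the 6-byte header line is ">GYS2\n".
        NSString *path = [NSTemporaryDirectory() stringByAppendingPathComponent:@"sketch.fa"];
        [@">GYS2\nACGTACGT\n" writeToFile:path atomically:YES
                                 encoding:NSASCIIStringEncoding error:NULL];
        SketchSequence *inMemory = [SketchSequence sequenceWithString:@"ACGTACGT"
                                                             alphabet:SketchAlphabetDNA];
        SketchSequence *onDisk = [SketchSequence sequenceWithContentsOfFile:path
                                                                      range:NSMakeRange(6, 8)
                                                                   alphabet:SketchAlphabetDNA];
        NSLog(@"%@: %@", [inMemory class], [inMemory sequenceString]);
        NSLog(@"%@: %@", [onDisk class], [onDisk sequenceString]);
    }
    return 0;
}
```

The caller only ever declares SketchSequence; which concrete subclass
comes back is the framework's decision. A real implementation would
probably parse the file to find the range rather than take it as an
argument, and might also choose the backing store based on data size.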

Cheers,

- Koen.

On Feb 27, 2009, at 4:18 PM, Charles Parnot wrote:

> Maybe it's not fair for me to vote, since I don't contribute
> (anymore) to BioCocoa, but as a hypothetical user of
> the framework developing a hypothetical application, I would prefer
> a class cluster that lets me not care about the size of
> the data and lets the framework make the right decision for me
> (and for the hypothetical users of my hypothetical application) :)
>
> charles
>
>
> On Feb 27, 2009, at 12:39 PM, Craig Bateman wrote:
>
>> After looking at this for a while, I agree that a protocol would do  
>> it, and would be consistent with using an Interface in many other  
>> languages, but a BCSequence class cluster (and probably a  
>> BCSequenceArray cluster that included a sequenceWithId: method  
>> since many file formats support multiple sequences) might be a bit  
>> more elegant.  Especially if we're serious about wanting a  
>> BCMutableSequence.
>>
>> This pattern is common in Objective-C when you have multiple
>> classes that all implement the same interface and the actual class  
>> to use is discernible at the time of construction.  It's a little  
>> harder to implement, but then consumers of the library don't need  
>> to worry about which class(es) they need for a given purpose.
>>
>> The pseudo-code to use them would then be something like: (Sorry  
>> about the naming here, I don't have the source in front of me as I  
>> write this)
>>
>> BCSequenceFile *myFile = [BCCachedSequenceFile fileWithContentsOfFile:@"Whatever.fs"];
>> BCSequenceArray *myArray = [BCSequenceArray arrayWithSequenceFile:myFile];
>> BCSequence *first = [myArray sequenceAtIndex:0];
>> or
>> BCSequence *mySeq = [myArray sequenceWithId:@"GYS2"];
>>
>> The end user would be given an instance of BCCachedFastaFile in  
>> myFile, BCCachedSequenceArray in myArray and BCCachedSequence for  
>> the two sequence calls.  This would all happen transparently behind  
>> the scenes and they wouldn't necessarily need to know what class  
>> they were using.  Externally the memory vs file sequences look the  
>> same.  Internally the memory BCSequence utilizes an NSData while  
>> the file-based one utilizes an NSFileHandle with an NSRange over  
>> the sequence (at least that's how FASTA would work; other format
>> implementations would vary significantly).
>>
>> I'm pretty sure this can be done without introducing any breaking  
>> changes.  Does anyone object to me attempting to implement these  
>> this way?
>>
>> On Wed, Feb 25, 2009 at 4:01 PM, Scott Christley  
>> <schristley at mac.com> wrote:
>> Hey Craig,
>>
>> Well that is great, really.  Sounds similar to my experience: I
>> entered my PhD to do core computer science and software engineering,
>> then got involved with a biology project, got hooked, and have been
>> following it ever since.
>>
>> One thing you might want to look at are the two main genome  
>> browsers that exist today, one by UC Santa Cruz and the other by  
>> Ensembl.
>>
>> http://genome.cse.ucsc.edu/
>> http://www.ensembl.org/index.html
>>
>> There is also a project that I'm involved with, VectorBase, which  
>> also uses the Ensembl browser.
>>
>> http://www.vectorbase.org/index.php
>>
>> The reason I point these out is because all of them are web-based,  
>> which is great, but a potential killer app "might be" to have a  
>> local application which would allow researchers to analyze their  
>> local data.  Reproducing the functionality of these genome browsers  
>> isn't the way to go, but there are many potential niches to be  
>> filled.
>>
>> Yes, shotgun sequencing is exactly what it is called.  Humorous  
>> name for sure, but you are exactly right, the "shotgun" blasts the  
>> genome into many smaller bits, which are then assembled together  
>> afterward.  It was quite controversial when Venter's company took  
>> the approach for the human genome project, in defiance of the  
>> public consortium which was doing it the expensive, slow, but more  
>> accurate way.  But now it is the standard way, though it's not
>> perfect, and assembly in general is a difficult problem.
>>
>> So for the BC*Sequence classes, if you look in the BCSequenceIO
>> group then you will find the BCCachedSequenceFile and
>> BCCachedFastaFile classes, which handle the file I/O.  What is
>> missing is a BCCachedSequence class, to correspond to BCSequence.   
>> From a design perspective, the two classes should stay separate  
>> (memory-based versus file-based) but I think a protocol which  
>> defines a common interface is what is needed.
>>
>> cheers
>> Scott
>>
>>
>> On Feb 24, 2009, at 2:31 PM, Craig Bateman wrote:
>>
>>> I accidentally dropped the list in my reply, so Scott was the only
>>> one that got it.
>>>
>>> ---------- Forwarded message ----------
>>> From: Craig Bateman <craig at batemanspace.com>
>>> Date: Mon, Feb 23, 2009 at 2:01 AM
>>> Subject: Re: [Biococoa-dev] Introducing myself
>>> To: Scott Christley <schristley at mac.com>
>>>
>>>
>>> Well, unfortunately I can't state what, in particular, interests me
>>> about genetics, mostly because I know so little.  I read The Blind
>>> Watchmaker and was intrigued by the author's explanation of how
>>> genes work, and since then have read other books about the human  
>>> genome and the effects of certain genes on human development,  
>>> etc.  I guess I'm just vaguely interested in genetics research  
>>> because I want to know.  I certainly can't state that I'm  
>>> interested in any one sub-topic over any other.  In short, I've  
>>> barely scratched the surface, and want to learn so much more...
>>>
>>> I am, however, an avid programmer, and was hoping that my vague  
>>> interest in the domain of genetics coupled with my years of  
>>> writing software (banking analysis software, but software all the  
>>> same) would combine to provide a great developer resource for the  
>>> project.
>>>
>>> As far as a "killer app" goes, I couldn't even guess what  
>>> something like that would look like for BioCocoa...  If you have  
>>> some ideas I can certainly bring something to light, but honestly  
>>> I haven't a clue about how any of this sequence information is  
>>> actually used and/or what features in such an app would be useful.
>>>
>>> Unifying the BC*Sequence classes is a good idea, maybe I'll look  
>>> at that first as a tooth-cutting exercise.  Aside from that, I  
>>> read a bit about "shotgun" sequencing, which may not be what it's  
>>> actually called, but where overlapping bits of a sequence are used  
>>> to assemble an entire sequence.
>>>
>>> So I've got a lot to learn, but anything I can contribute to this  
>>> project or genetics/proteins/cancer/whatever research in general  
>>> is a win in my book.
>>>
>>>
>>>
>>> On Feb 22, 2009, at 11:16 PM, Scott Christley wrote:
>>>
>>> Hello Craig,
>>>
>>> The coding I've been doing lately is primarily related to the  
>>> research I'm doing, so from this sense it doesn't necessarily go  
>>> fast.  My long-term goal is to add some advanced analysis  
>>> techniques into BioCocoa.
>>>
>>> One of the key things I would like to do is make the sequence and
>>> cached sequence classes correspond in their interface.  The cached
>>> sequence class is important for large-scale analysis of large
>>> genomes, because they are too big to load completely into memory.
>>> This is something that BioCocoa can offer above other toolkits
>>> like BioPerl and BioPython: high performance and large-scale
>>> analysis.
>>>
>>> What interests you about genetics?  Many of the algorithms in
>>> genetics, bioinformatics and so on are still being developed; even
>>> something like genome assembly is not a "done" technology.  If
>>> you have a specific interest area, then I can help lay out a
>>> series of tasks that would be both highly useful and
>>> interesting algorithmic work.
>>>
>>> Koen is right, the todo list is still accurate, and those are  
>>> certainly useful enhancements to make.  And the creation of a  
>>> "killer app" is definitely desired, especially to bring these  
>>> advanced analysis techniques together into an easy-to-use GUI
>>> and/or command-line applications that biologists can use.
>>>
>>> cheers
>>> Scott
>>>
>>> On Feb 21, 2009, at 12:43 PM, Craig Bateman wrote:
>>>
>>> I'm an experienced software engineer looking for an open source  
>>> Mac project to contribute to, and I've recently become very interested in
>>> genetics.  So BioCocoa seemed an obvious choice.
>>>
>>> I looked at the To Do list, and fear that 2+ years later it must  
>>> be out of date unless there's just nobody left working on this  
>>> project.  Is it officially dead?  There hasn't been a lot of  
>>> movement on this list in the past few months since the 2.1.0
>>> "non"-release.  I've checked out the source and will start digging now
>>> to get a feel for what's here and how it works.  What/where are  
>>> the primary missing pieces? Has all the 1.x functionality been  
>>> incorporated into 2.1?  Is anything on the todo list still up for
>>> doing?  Should I be looking at the framework itself or the  
>>> applications?
>>>
>>> Anyway, to whoever is still alive on this project, let me know how  
>>> and where I can help and I'll be glad to.
>>>
>>> Thanks,
>>> Craig Bateman
>>>
>>> _______________________________________________
>>> Biococoa-dev mailing list
>>> Biococoa-dev at bioinformatics.org
>>> http://www.bioinformatics.org/mailman/listinfo/biococoa-dev
>>>
>
> --
> OpenMacGrid
> Help science move fast forward:
> http://www.macresearch.org/openmacgrid
>
> Charles Parnot
> charles.parnot at gmail.com
>