From jtimmer at bellatlantic.net Thu Nov 4 20:23:56 2004
From: jtimmer at bellatlantic.net (John Timmer)
Date: Thu, 04 Nov 2004 20:23:56 -0500
Subject: [Biococoa-dev] BCCodon init question
In-Reply-To: <8FA1F618-2B5B-11D9-8785-003065A5FDCC@earthlink.net>
Message-ID:

Hi -

Sorry, the house sold and I started my new job, so I've had zero free time in the last week or so.

Anyway, returning nil from the super's init method was just a leftover that I forgot to fix. I stick nil there until I have the class well defined, and know what's safe to return with a simple "init" (as opposed to something like "initWithSomeInfo:"). I definitely had the code working on my machine, so I'm not sure why that hadn't propagated to the CVS server.

Sorry about the slip up, and thanks for finding and fixing it.

John

PS - the new place of employment seems to have MAC based DHCP authentication, so my laptop wasn't networking. I just set the office's desktop to share its connection over Firewire and I'm good to go. God, I love OS-X.

> Hi,
>
> I was trying to figure out why the current translation demo is not
> working (no protein is displayed in the right panel). So after some
> debugging I found that both BCCodonDNA and BCCodonRNA call self =
> [super init] in their init method. However, the super init method
> (BCCodon) always returns nil, so BCCodonDNA and BCCodonRNA always
> return nil. I commented out the init method of BCCodon, and now indeed
> I see a protein sequence in the right panel. Not sure if this is the
> intended way to make this work, though. John, is there a particular
> reason why BCCodon's init() always returns nil?
>
>
> thanks,
>
> - Koen.
>
> _______________________________________________
> Biococoa-dev mailing list
> Biococoa-dev at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biococoa-dev

_______________________________________________
This mind intentionally left blank

From kvddrift at earthlink.net Thu Nov 4 21:05:14 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Thu, 4 Nov 2004 21:05:14 -0500
Subject: [Biococoa-dev] BCCodon init question
In-Reply-To:
References:
Message-ID: <21E17B14-2ECF-11D9-ADB2-003065A5FDCC@earthlink.net>

On Nov 4, 2004, at 8:23 PM, John Timmer wrote:

> Hi -
>
> Sorry, the house sold and I started my new job, so I've had zero free
> time in the last week or so.
>
> Anyway, returning nil from the super's init method was just a leftover
> that I forgot to fix. I stick nil there until I have the class well
> defined, and know what's safe to return with a simple "init" (as
> opposed to something like "initWithSomeInfo:"). I definitely had the
> code working on my machine, so I'm not sure why that hadn't propagated
> to the CVS server.
>
> Sorry about the slip up, and thanks for finding and fixing it.

Hi John,

It's fixed in CVS now, too (I waited for your reply before I did that).

Congrats on your new job, where are you now?

- Koen.

From jtimmer at bellatlantic.net Fri Nov 5 14:06:02 2004
From: jtimmer at bellatlantic.net (John Timmer)
Date: Fri, 05 Nov 2004 14:06:02 -0500
Subject: [Biococoa-dev] BCCodon init question
In-Reply-To: <21E17B14-2ECF-11D9-ADB2-003065A5FDCC@earthlink.net>
Message-ID:

> It's fixed in CVS now, too (I waited for your reply before I did that).
>
>
> Congrats on your new job, where are you now?
>

Thanks and thanks again. The job's just down the block at Cornell Med, and I've moved up from a post-doc level position to a non-tenure track faculty one. It's only good for 3 years, but it should help me re-establish a project to try to get a job with, after the last one crashed and burned.
Cheers,

JT

From kvddrift at earthlink.net Wed Nov 10 19:42:34 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Wed, 10 Nov 2004 19:42:34 -0500
Subject: [Biococoa-dev] BCSequenceReader
Message-ID: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net>

Hi all,

I have added an initial attempt for a new class BCSequenceReader. I also added some code to the translation demo to test this. I am using the original code from Peter, so the code figures out what the format of the data is. For now I have only added a readFasta method. Fasta files (and other formats as well) can contain DNA sequences or protein sequences. But how do I figure out which of the two I am dealing with, so I can return the proper subclass of BCSequence? Any suggestions how to approach this?

thanks,

- Koen.

From mek at mekentosj.com Thu Nov 11 02:30:09 2004
From: mek at mekentosj.com (Alexander Griekspoor)
Date: Thu, 11 Nov 2004 08:30:09 +0100
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net>
References: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net>
Message-ID: <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com>

Ha Koen,

Very nice, in fact I'll see if I can further expand that as well soon because this is the part that I'm most interested in in the short term. If that works OK, I'll implement it immediately in the update of EnzymeX.

About the distinction between protein and DNA sequences, I think we need some form of a "distinction algorithm" anyway, a small test of whether a sequence is DNA (both with and without ambiguous bases), RNA or protein. For some formats this won't be a problem because they only support one or the other, but in most cases we should first identify the type. These methods would come in handy in many cases, for instance if someone enters text by hand to feed it into some methods, we could quickly check to see what the type is.
The best way to do it is to check for the presence of certain characters or look at the overall % of certain characters. Though you can never distinguish a stretch of 7 Alanines from 7 Adenosines I'm afraid. I would in that case default to either one, although it might be handy to have alternatives ready there. For instance, a read method which has the type you want as an argument, and also an argument that says what to do if the thing fails (i.e. skip or stop). The same holds true for the "checkType" methods, it would be nice if they return nil or a self-defined constant (BCSequenceTypeUnknown or something) if it can't be determined.

Finally, one thing we might already think about a bit. The DNA strider format is a binary one, to test this we need to work with paths instead of strings. Therefore, I suggest passing the path to the readFile method instead of the already read file. Then in that method determine the type, and either read the file to a string and pass it to methods like the one for fasta files, or pass the path to methods that need direct access to the original file. The rest can stay the same because I like the way we could now also pass a string to the readFasta method without the need for a file per se.

One minor thing, after updating from CVS I did see the new files of BCSequenceReader, but not in the project. At first I thought this was an XCode thing, but even after a clean checkout they weren't there. Guess you forgot to update the project file as well, could you still do that?

Cheers,
Alex

On 11 Nov 2004, at 1:42, Koen van der Drift wrote:

> Hi all,
>
> I have added an initial attempt for a new class BCSequenceReader. I
> also added some code to the translation demo to test this. I am using
> the original code from Peter, so the code figures out what the format
> of the data is. For now I have only added a readFasta method. Fasta
> files (and other formats as well) can contain DNA sequences or protein
> sequences.
> But how do I figure out which of the two I am dealing with,
> so I can return the proper subclass of BCSequence? Any suggestions how
> to approach this?
>
> thanks,
>
> - Koen.
>

**************************************************************
** Alexander Griekspoor **
**************************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com

MacOS X: The power of UNIX with the simplicity of the Mac
***************************************************************

From kvddrift at earthlink.net Thu Nov 11 06:29:31 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Thu, 11 Nov 2004 06:29:31 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com>
References: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net> <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com>
Message-ID:

On Nov 11, 2004, at 2:30 AM, Alexander Griekspoor wrote:

> One minor thing, after updating from CVS I did see the new files of
> BCSequenceReader, but not in the project. At first I thought this was
> an XCode thing, but even after a clean checkout they weren't there.
> Guess you forgot to update the project file as well, could you still
> do that?
>

Done - I hope :) Let me know if it worked, I had to do it in the terminal.

Short comment on BCSequenceReader, the use of Alphabets (BCSymbolSet) could also be useful to distinguish between formats. I have done a little work on BCSymbolSet, but it is not ready for use yet. Feel free to work on it ;-)

- Koen.
From jtimmer at bellatlantic.net Thu Nov 11 11:05:42 2004
From: jtimmer at bellatlantic.net (John Timmer)
Date: Thu, 11 Nov 2004 11:05:42 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net>
Message-ID:

A couple of additional thoughts -

I agree with Alex, in that simply calculating the percentage of GCAT should give you a strong sense of what the sequence is in the majority of situations. There might be a very quick way of doing that, though I'm not sure. T vs. U should also be considered - maybe count the Us, then decide whether to count GCAT or GCAU.

Another thing is that FASTA provides a comment line, which often indicates what the sequence is, though I'm pretty sure this isn't standardized (i.e. people are probably free to call a DNA sequence a "protein coding region", so selecting basic terms from the comments will probably fail).

The last thought is that it's most important for there to be a defined order of assumptions. Explicitly state which conditions are tested in which order and what the fallback is, so people know what they're getting into.

The last thing is that I think FASTA defines an alignment format, too - does the existing code account for this?

Cheers,

JT

> I have added an initial attempt for a new class BCSequenceReader. I
> also added some code to the translation demo to test this. I am using
> the original code from Peter, so the code figures out what the format
> of the data is. For now I have only added a readFasta method. Fasta
> files (and other formats as well) can contain DNA sequences or protein
> sequences. But how do I figure out which of the two I am dealing with,
> so I can return the proper subclass of BCSequence? Any suggestions how
> to approach this?
>
> thanks,
>
> - Koen.

From kvddrift at earthlink.net Thu Nov 11 17:13:31 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Thu, 11 Nov 2004 17:13:31 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com>
References: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net> <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com>
Message-ID:

On Nov 11, 2004, at 2:30 AM, Alexander Griekspoor wrote:

>
> Very nice, in fact I'll see if I can further expand that as well soon
> because this is the part that I'm most interested in on the short
> term. If that works OK, I'll implement it immediately in the update of
> EnzymeX.

Before you move on, I think the class should actually return an NSArray with BCSequence objects. This is because, as discussed before, some file formats can contain more than one sequence.

I also suggest that we for now only focus on the sequence itself. But we do need to think of classes to hold features (modifications, helix, b-sheet) and annotations (name, author, organism, etc). A possible way to do this is to have a class that has an NSDictionary with key-value pairs of each feature or annotation. These classes could be members of BCSequence. Another option could be to make a new class BCSequenceHolder which has a BCSequence member, and optionally a BCFeatures and BCAnnotations member. An advantage of the latter approach would be that BCSequence's only responsibility is maintaining the sequence, all additional information would be in the new classes. In that case, BCSequenceReader would return an NSArray of BCSequenceHolder objects.

> The best way to do it is to check for the presence of certain
> characters or look at overall % of certain characters. Though you can
> never distinguish a stretch of 7 Alanines from 7 Adenosines I'm
> afraid.
On the other hand, the user probably knows what the input file contains, so why not pass an identifier:

    mySequence = [mySequenceReader readFile: file ofType: seqType]

where seqType could be one of the BCSequenceTypes or a specific BCSymbolSet. (Of course, we could also always return a BCSequence, and forget about BCSequenceProtein, BCSequenceDNA, etc ;-)

>
> Finally, one thing we might already think about a bit. The DNA strider
> format is a binary one, to test this we need to work with paths
> instead of strings. Therefore, I suggest to pass the path to the
> readFile method instead of the already read file.

Good idea, we can also pass a URL and read from a database.

>
> One minor thing, after updating from CVS I did see the new files of
> BCSequenceReader, but not in the project. At first I thought this was
> an XCode thing, but even after a clean checkout they weren't there.
> Guess you forgot to update the project file as well, could you still
> do that?
>
>

Talking about CVS, is it possible to remove all the obsolete folders at the root level?

- Koen.

From kvddrift at earthlink.net Fri Nov 12 21:18:46 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Fri, 12 Nov 2004 21:18:46 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To:
References: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net> <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com>
Message-ID: <591DC6A6-351A-11D9-9447-003065A5FDCC@earthlink.net>

On Nov 11, 2004, at 5:13 PM, Koen van der Drift wrote:

>
> Before you move on, I think the class should actually return an
> NSArray with BCSequence objects. This is because, as discussed before,
> some file formats can contain more than one sequence.

I have implemented that, and also added some more formats. For the time being I made the readFasta method return a BCSequenceDNA, just to make the example work, but this is still work in progress.

- Koen.
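
The point about returning one object per record, because a file can hold several sequences, might look roughly like the sketch below. It is only an illustration, assuming ARC and a reasonably recent Foundation; SketchReadFasta and the "header"/"sequence" keys are hypothetical names rather than BioCocoa API, and a real reader would presumably wrap each sequence string in a BCSequence subclass (or a BCSequenceHolder) instead of a dictionary.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of BioCocoa: splits FASTA text into records.
// Each record is a dictionary with "header" and "sequence" keys.
static NSArray *SketchReadFasta(NSString *text)
{
    NSMutableArray *records = [NSMutableArray array];
    NSString *header = nil;
    NSMutableString *residues = nil;

    NSArray *lines = [text componentsSeparatedByCharactersInSet:
                         [NSCharacterSet newlineCharacterSet]];
    for (NSString *rawLine in lines) {
        NSString *line = [rawLine stringByTrimmingCharactersInSet:
                             [NSCharacterSet whitespaceCharacterSet]];
        if ([line length] == 0) continue;   // skip blank lines

        if ([line hasPrefix:@">"]) {
            // A new header line closes the previous record, if there was one.
            if (header != nil) {
                [records addObject:@{ @"header": header,
                                      @"sequence": [residues copy] }];
            }
            header = [line substringFromIndex:1];
            residues = [NSMutableString string];
        } else if (residues != nil) {
            // Sequence data may span several lines; join them, uppercased.
            [residues appendString:[line uppercaseString]];
        }
    }
    // Close the final record.
    if (header != nil) {
        [records addObject:@{ @"header": header, @"sequence": [residues copy] }];
    }
    return records;
}
```

Text before the first ">" line is ignored here, which is a design choice of the sketch and not something the thread settles.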
From kvddrift at earthlink.net Sat Nov 13 07:11:15 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat, 13 Nov 2004 07:11:15 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net>
References: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net>
Message-ID: <1E2AF338-356D-11D9-9447-003065A5FDCC@earthlink.net>

On Nov 10, 2004, at 7:42 PM, Koen van der Drift wrote:

> But how do I figure out which of the two I am dealing with, so I can
> return the proper subclass of BCSequence?

I asked the same question on the biojava mailing list, just to get an idea how they have solved this. And guess what? They haven't ;-)

see: http://www.biojava.org/pipermail/biojava-l/2004-November/004696.html

- Koen.

From jtimmer at bellatlantic.net Sat Nov 13 09:44:30 2004
From: jtimmer at bellatlantic.net (John Timmer)
Date: Sat, 13 Nov 2004 09:44:30 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <1E2AF338-356D-11D9-9447-003065A5FDCC@earthlink.net>
Message-ID:

>
> On Nov 10, 2004, at 7:42 PM, Koen van der Drift wrote:
>
>> But how do I figure out which of the two I am dealing with, so I can
>> return the proper subclass of BCSequence?
>
>
> I asked the same question on the biojava mailinglist, just to get an
> idea how they have solved this. And guess what? They haven't ;-)

I wrote everything below (which I'm including just for other ideas if this turns out to be a bad one) when suddenly a simple answer hit me.

All the sequence classes use [symbol undefined] of the appropriate subclass if they hit a character they can't recognize. Koen also put the sequenceCountedSet code in. Simply send the string to each of the three sequence classes, then use the counted set to determine the one which results in the fewest undefined symbols. If the number turns out to be equal, use DNA > RNA > protein to decide which sequence to use so that we can stay within the central dogma.
The code should be very clean and easy to follow, though it may not be as fast as I'd like, given there's three sequence objects created and looped through.

My previous thoughts follow - disregard them unless you think the above is a bad idea:

With the sequence in an uppercased NSString, try:

    "rangeOfCharactersInSet:" using "ATCGN" -> DNA if range.length = string.length
    "rangeOfCharactersInSet:" using "AUCGN" -> RNA if range.length = string.length

That'll handle the easiest cases first and keep things responsive if we have a simple case. After that, I think a loop using that method could be the quickest way to figure out the AT/UCGN percentage - the logic would look like:

    Get range
    Add range.length to total
    Loop {
        Get range starting from range.length + range.location + 1
        Add range.length to total
        make sure you're not at the end of the string
    }
    Percent = total/string length

We could just declare a cutoff - anything over 80% would be made into a nucleotide, otherwise a protein.

Where this is going to work poorly is very short sequences, like restriction sites - I think we should only enter this code if the sequence is over 10bp or so. Maybe we should just treat anything under 10 characters as a protein?

One other thought - I know the nucleotides have a non-base character, and you also have code for

And that guy who answered your email was VERY optimistic in assuming there's an accession number in the comment field....

Cheers,

Jay

From kvddrift at earthlink.net Sat Nov 13 19:49:01 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat, 13 Nov 2004 19:49:01 -0500
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To:
References:
Message-ID:

On Nov 13, 2004, at 9:44 AM, John Timmer wrote:

>
>
> All the sequence classes use [symbol undefined] of the appropriate
> subclass if they hit a character they can't recognize. Koen also put
> the sequenceCountedSet code in.
> Simply send the string to each of the three sequence classes, then use
> the counted set to determine the one which results in the fewest
> undefined symbols. If the number turns out to be equal, use
> DNA > RNA > protein to decide which sequence to use so that we can
> stay within the central dogma.
>
> The code should be very clean and easy to follow, though it may not be
> as fast as I'd like, given there's three sequence objects created and
> looped through.

That's a problem, I agree. But this situation is not going to happen that often, because in most cases the user probably knows what format is used. However, we should be prepared for such cases. I suggest we use a sequencefactory class that takes care of creating sequences in a centralized location, instead of scattered throughout the framework in classes that might encounter such situations. I will have a look at this this weekend, to see if I can get this to work.

>
>
>
> My previous thoughts follow - disregard them unless you think the
> above is a bad idea:

I don't know yet :)

> Where this is going to work poorly is very short sequences, like
> restriction sites - I think we should only enter this code if the
> sequence is over 10bp or so. Maybe we should just treat anything under
> 10 characters as a protein?

I would call it a peptide then ;-)

>
> One other thought - I know the nucleotides have a non-base character,
> and you also have code for ......

Actually, proteins can have ambiguous symbols as well, I still need to update the BCSymbolAminoAcid class. I will post another message on this subject in a new thread.

>
>
> And that guy who answered your email was VERY optimistic in assuming
> there's an accession number in the comment field....

Yeah, that's not going to work.

- Koen.
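
The "fewest undefined symbols, ties broken as DNA > RNA > protein" rule can be sketched without the BioCocoa classes. The version below swaps in plain NSCharacterSet alphabets for the three sequence classes and the counted set, so it is an approximation of the idea and not the framework's code; the SketchSequenceType enum, both function names, and the exact alphabets (ACGTN, ACGUN, and the 20 amino acids plus B, Z and X) are assumptions made for the illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical type codes; the order encodes the DNA > RNA > protein preference.
typedef enum {
    SketchSequenceTypeDNA = 0,
    SketchSequenceTypeRNA = 1,
    SketchSequenceTypeProtein = 2
} SketchSequenceType;

// Counts the characters of `sequence` that are not members of `alphabet`,
// standing in for the number of "undefined" symbols a sequence class would create.
static NSUInteger SketchCountUnrecognized(NSString *sequence, NSCharacterSet *alphabet)
{
    NSUInteger misses = 0;
    for (NSUInteger i = 0; i < [sequence length]; i++) {
        if (![alphabet characterIsMember:[sequence characterAtIndex:i]]) {
            misses++;
        }
    }
    return misses;
}

// Picks the type whose alphabet leaves the fewest unrecognized characters.
static SketchSequenceType SketchGuessType(NSString *sequence)
{
    NSString *upper = [sequence uppercaseString];
    // Assumed alphabets, listed in the same order as the enum values.
    NSArray *alphabets = @[
        [NSCharacterSet characterSetWithCharactersInString:@"ACGTN"],
        [NSCharacterSet characterSetWithCharactersInString:@"ACGUN"],
        [NSCharacterSet characterSetWithCharactersInString:@"ACDEFGHIKLMNPQRSTVWYBZX"]
    ];

    SketchSequenceType best = SketchSequenceTypeDNA;
    NSUInteger fewest = NSUIntegerMax;
    for (NSUInteger i = 0; i < [alphabets count]; i++) {
        NSUInteger misses = SketchCountUnrecognized(upper, alphabets[i]);
        // Strict comparison: on a tie the earlier (preferred) type is kept.
        if (misses < fewest) {
            fewest = misses;
            best = (SketchSequenceType)i;
        }
    }
    return best;
}
```

With these alphabets a run of seven A's scores zero misses everywhere and therefore falls back to DNA, which is the seven-alanines-versus-seven-adenosines case raised earlier in the thread.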
From kvddrift at earthlink.net Sat Nov 13 20:03:59 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat, 13 Nov 2004 20:03:59 -0500
Subject: [Biococoa-dev] ambiguous symbols
Message-ID: <10D921D6-35D9-11D9-B29B-003065A5FDCC@earthlink.net>

Hi,

It turns out that there are some letters defined for ambiguous amino acids, e.g. B for Asp and Asn and Z for Glu and Gln. I have already added this in the amino acid plist including a Represents key for each amino acid. So I need to add some code to BCSymbolAminoAcid that is similar to the series of representsBase code in the nucleotide classes. To make things easier for ourselves, I suggest we rename the following methods:

- (void) initializeBaseRelationships;
- (NSArray *) representedBases;
- (NSArray *) representingBases;
- (BOOL) representsBase: (BCNucleotideRNA *) entry;
- (BOOL) isRepresentedByBase: (BCNucleotideRNA *) entry;
- (NSCharacterSet *) symbolsOfRepresentedBases;

to:

- (void) initializeSymbolRelationships;
- (NSArray *) representedSymbols;
- (NSArray *) representingSymbols;
- (BOOL) representsSymbol: (BCSymbol *) entry;
- (BOOL) isRepresentedBySymbol: (BCSymbol *) entry;
- (NSCharacterSet *) symbolsOfRepresentedSymbols;

And move them to BCSymbol. If needed the subclasses can still override them, for instance in the case of initializeSymbolRelationships. We could even merge representsSymbol with IsEqualToSymbol, using a 'strict' flag.

Let me know what you guys think, and I will implement if y'all agree.

cheers,

- Koen.

From mek at mekentosj.com Sat Nov 13 20:08:13 2004
From: mek at mekentosj.com (Alexander Griekspoor)
Date: Sun, 14 Nov 2004 02:08:13 +0100
Subject: [Biococoa-dev] ambiguous symbols
In-Reply-To: <10D921D6-35D9-11D9-B29B-003065A5FDCC@earthlink.net>
References: <10D921D6-35D9-11D9-B29B-003065A5FDCC@earthlink.net>
Message-ID:

Hi Koen,

I think it makes perfect sense to "upgrade" the code to the symbol class if both nucleotides and amino acids contain ambiguous species...
You've been doing a great job these past few days!

Cheers,
Alex

On 14 Nov 2004, at 2:03, Koen van der Drift wrote:

> Hi,
>
> It turns out that there are some letters defined for ambiguous amino
> acids, e.g. B for Asp and Asn and Z for Glu and Gln. I have already
> added this in the amino acid plist including a Represents key for each
> amino acid. So I need to add some code to BCSymbolAminoAcid that is
> similar to the series of representsBase code in the nucleotide
> classes. To make things easier for ourselves, I suggest we rename the
> following methods:
>
> - (void) initializeBaseRelationships;
> - (NSArray *) representedBases;
> - (NSArray *) representingBases;
> - (BOOL) representsBase: (BCNucleotideRNA *) entry;
> - (BOOL) isRepresentedByBase: (BCNucleotideRNA *) entry;
> - (NSCharacterSet *) symbolsOfRepresentedBases;
>
> to:
>
> - (void) initializeSymbolRelationships;
> - (NSArray *) representedSymbols;
> - (NSArray *) representingSymbols;
> - (BOOL) representsSymbol: (BCSymbol *) entry;
> - (BOOL) isRepresentedBySymbol: (BCSymbol *) entry;
> - (NSCharacterSet *) symbolsOfRepresentedSymbols;
>
>
> And move them to BCSymbol. If needed the subclasses can still override
> them, for instance in the case of initializeSymbolRelationships. We
> could even merge representsSymbol with IsEqualToSymbol, using a
> 'strict' flag.
>
>
> Let me know what you guys think, and I will implement if y'all agree.
>
>
> cheers,
>
> - Koen.
>

From kvddrift at earthlink.net Sat Nov 13 23:37:04 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat, 13 Nov 2004 23:37:04 -0500
Subject: [Biococoa-dev] ambiguous symbols
In-Reply-To:
References: <10D921D6-35D9-11D9-B29B-003065A5FDCC@earthlink.net>
Message-ID:

On Nov 13, 2004, at 8:08 PM, Alexander Griekspoor wrote:

> I think it makes perfectly sense to "upgrade" the code to the symbol
> class if both nucleotides and aminoacids contain ambiguous species...

Now done in CVS, please test.

- Koen.

From kvddrift at earthlink.net Sun Nov 14 07:42:07 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sun, 14 Nov 2004 07:42:07 -0500
Subject: [Biococoa-dev] ambiguous symbols
In-Reply-To: <10D921D6-35D9-11D9-B29B-003065A5FDCC@earthlink.net>
References: <10D921D6-35D9-11D9-B29B-003065A5FDCC@earthlink.net>
Message-ID: <980BB31E-363A-11D9-B29B-003065A5FDCC@earthlink.net>

John,

To make BCAminoAcid work similarly to the nucleotides regarding ambiguous symbols, I copied the code from initializeSymbolRelationships.
The first line for the nucleotides reads:

    NSString *baseReference = [baseInfo objectForKey: @"Complement"];

However, the key "Complement" doesn't exist for amino acids, so I replaced the line with:

    NSString *aaReference = [aaInfo objectForKey: @"Name"];

Is this correct?

Also in the min/max mass calc code, I initially was using

    if (isSingleBase)

to test whether a symbol is unambiguous or not. Again, this doesn't exist for amino acids, so I replaced it by

    if ([represents count] == 1)

Is this also correct?

thanks,

- Koen.

From kvddrift at earthlink.net Sun Nov 14 13:19:26 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sun, 14 Nov 2004 13:19:26 -0500
Subject: [Biococoa-dev] ambiguous symbols
In-Reply-To:
References:
Message-ID:

>
> There's two challenges here - one is that the nucleotides have
> complements and the amino acids don't, as you pointed out in the quote
> just below. A second is that the relationships have to be made to
> members of a specific subclass - ie, a BCNucleotideDNA should only
> complement others from its subclass. I'd guess you could get around
> this by figuring out what kind of class self is, but I don't know
> enough about how ObjC handles inheritance to know how well this would
> work - does an object know what class it is when it's executing
> functions in its super?
>
> Clearly, we could just declare that method in the super and implement
> it in each subclass. Since you're doing the work, how are you hoping
> to handle it, Koen?

For now, I've left the method in the super class empty, and put all the code in the subclasses. But I think we can at least put the code that fills represents and representedBy in the superclass. I will check that later. When similar code gets repeated in the various subclasses, it definitely is a candidate to go into the super. (This was also the reason for my plea a few weeks ago to put all the rangeOfSubsequence methods in BCSequence only.
But now that BCFindSequence is in place, we probably can remove rangeOfSubsequence et al completely).

>
> As far as I can tell, there's no need to access the "Name" value in
> this method. The code should probably start somewhere around the line:
>
> infoArray = [baseInfo objectForKey: @"Represents"];
>

Ah, thanks. Will fix that.

>> if ([represents count] == 1)
>>
>> Is this also correct?
>
> That would work, but if we're adding the code to handle the ambiguity
> in the superclass, we could probably move this method up to the
> superclass, too.

I already did :)

> You may want to reverse the logic, though, as "isAmbiguous" seems
> (pardon me here) less ambiguous than "isSingleSymbol" in terms of
> method names that clearly indicate the function.

You mean to use:

    if ([represents count] > 1)

?

thanks for the input,

- Koen.

From kvddrift at earthlink.net Sun Nov 14 13:41:05 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sun, 14 Nov 2004 13:41:05 -0500
Subject: [Biococoa-dev] ambiguous symbols
In-Reply-To:
References:
Message-ID:

On Nov 14, 2004, at 1:19 PM, Koen van der Drift wrote:

>> Clearly, we could just declare that method in the super and implement
>> it in each subclass. Since you're doing the work, how are you hoping
>> to handle it, Koen?
>
> For now, I've left the method in the super class empty, and put all
> the code in the subclasses. But I think we can at least put the code
> that fills represents and representedBy in the superclass. I will
> check that later.

Hmmm, if we want to move this into the super, we have to rename baseInfo and aaInfo to symbolInfo. Any objections?

- Koen.
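
The "isAmbiguous" test living in the superclass could be sketched like this. It is a minimal stand-in, assuming ARC: SketchSymbol, its initializer and representsSymbolString: are hypothetical names, and the represents array is filled by hand here where BCSymbol would presumably read it from the "Represents" key of the plist.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for a symbol superclass; not the BioCocoa BCSymbol.
@interface SketchSymbol : NSObject
@property (nonatomic, copy, readonly) NSString *symbolString;
@property (nonatomic, copy, readonly) NSArray *represents;   // one-letter codes
- (instancetype)initWithSymbolString:(NSString *)symbolString
                          represents:(NSArray *)represents;
- (BOOL)isAmbiguous;
- (BOOL)representsSymbolString:(NSString *)other;
@end

@implementation SketchSymbol

- (instancetype)initWithSymbolString:(NSString *)symbolString
                          represents:(NSArray *)represents
{
    self = [super init];
    // Only touch the ivars when the superclass init succeeded.
    if (self != nil) {
        _symbolString = [symbolString copy];
        _represents = [represents copy];
    }
    return self;
}

// A symbol is ambiguous when it stands for more than one concrete symbol.
- (BOOL)isAmbiguous
{
    return [self.represents count] > 1;
}

// NSArray compares elements with isEqual:, so strings match by value.
- (BOOL)representsSymbolString:(NSString *)other
{
    return [self.represents containsObject:other];
}

@end
```

Because the check only looks at the represents array, the same method would serve nucleotides and amino acids alike, e.g. B representing D and N.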
From mek at mekentosj.com Sun Nov 14 16:18:22 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Sun, 14 Nov 2004 22:18:22 +0100 Subject: [Biococoa-dev] ambiguous symbols In-Reply-To: <49AEA484-3679-11D9-B29B-003065A5FDCC@earthlink.net> References: <200411141147.AA2309095696@mekentosj.com> <49AEA484-3679-11D9-B29B-003065A5FDCC@earthlink.net> Message-ID: Da's mooi! Sorry I'll translate to english I hadn't seen your post op cocoa-dev, but I remember talking about performSelector and if you could send it to classes a while ago. I understand the problem. Could the following perhaps help? temp = [(BCSymbol *)[self class] performSelector: NSSelectorFromString( ) I wonder, because you never know what class it will be in the end (hence the warning ;-), but who knows. Just for the fun, also try: temp = [(BCAminoAcid *)[self class] performSelector: NSSelectorFromString( ) At least, I think that's the class you are trying to partially "upgrade" to super, doing this would give the compiler just as much knowledge as it has now... Back then we had a discussion about this with Jim, I'll copy a few snippets from those emails, perhaps it helps... Cheers, Alex **** > Well, the problem really is that we get the name of the method from a > plist (the plist would contain all the complementary bases), so as a > NSString. The trick is how to convert the name into a selector that we > can call. Another complicating factor is that the performSelector: > method is a NSObject instance method, so won't work on a class. I have > found an ObjC function that gets you to the method location (it works > because in my simple test it did not return NULL). But how does the > ObjC runtime invokes a class method? My knowledge on the runtime > architecture is clearly insufficient here... > Alex > >> On Aug 13, 2004, at 3:22 PM, Jim Balhoff wrote: >> >>> On Aug 13, 2004, at 3:15 PM, Alexander Griekspoor wrote: >>> >>>>> Not true, all classes are objects too, it seems like it would work. 
>>>> >>>> I'm not so sure about that, but probably you are right. Anyway, the >>>> description says: >>>> >>>> objc_msgSend >>>> >>>> Sends a message with a simple return value to an instance of a >>>> class. >>>> >>>> So I'm afraid it doesn't work on the class itself. I couldn't get >>>> it to work either, but perhaps I'm doing something else wrong >>>> here... >>>> >>>> >>> >>> It works, try this: >>> >> >> Actually you can just send the performSelector: message to the class. >> I wasn't sure if class objects conformed to the NSObject protocol, >> but if you try sending the message [MyClass performSelector:message], >> you get the same result as the objc_msgSend() stuff. >> >> - Jim **** > >> That's even nicer! So should we file a bug against the documentation? >> :-) >> A. >> > > I don't know - maybe the Cocoa-dev list could help with this. There > is a type, Class, and classes are objects, according to the > Objective-C manual. But I can't find anywhere that says it inherits > from NSObject (I guess that would be impossible) or conforms to the > NSObject protocol. So I'm not sure how you know what messages you can > send to it as an object. It would be nice to have a better > understanding before relying too heavily on the result of > performSelector. > ++++++++++++++ Op 14-nov-04 om 21:10 heeft Koen van der Drift het volgende geschreven: > > On Nov 14, 2004, at 2:47 PM, Alexander Griekspoor wrote: > >> Nope ;-) >> >> >> > > Mooi - het werkt allemaal soepeltjes. Lang leve inheritance :D > > Ik denk dat ik alles uit initializeSymbolRelationships verplaats naar > BCSymbol. Aminozuren hebben weliswaar geen complements, dus blijven ze > daar nil, maar dat maakt niet zoveel uit. Ik wacht nog even met een > commit totdat ik een duidelijk antwoord krijg over mijn vraag over > performSelector op cocoa-dev. > > > - Koen. 
>
>
>
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com

Windows is a 32-bit patch to a 16-bit shell for an 8-bit operating system, written for a 4-bit processor by a 2-bit company without 1 bit of sense.
*********************************************************

From mek at mekentosj.com Sun Nov 14 16:29:59 2004
From: mek at mekentosj.com (Alexander Griekspoor)
Date: Sun, 14 Nov 2004 22:29:59 +0100
Subject: [Biococoa-dev] BCSequenceReader
In-Reply-To: <591DC6A6-351A-11D9-9447-003065A5FDCC@earthlink.net>
References: <93CEE41A-337A-11D9-A52E-003065A5FDCC@earthlink.net> <843A2239-33B3-11D9-A0F0-000D93AE89A4@mekentosj.com> <591DC6A6-351A-11D9-9447-003065A5FDCC@earthlink.net>
Message-ID: <5623203D-3684-11D9-BBE7-000D93AE89A4@mekentosj.com>

> I have implemented that, and also added some more formats. For the time being I made the readFasta file return a BCSequenceDNA, just to make the example work, but this is still work in progress.

Very nice! I saw you changed the method to use NSData as well. That allows me to add reading and writing of the DNAStrider binary and ASCII formats from the test app that I created a while ago...
I'll add it one of these days. As for automatic detection, I'm not sure exactly how to test for the binary format, but I'll think about it...

Alex

*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com

Windows vs Mac
65 million years ago, there were more dinosaurs than humans. Where are the dinosaurs now?
*********************************************************

From kvddrift at earthlink.net Sun Nov 14 17:26:01 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sun, 14 Nov 2004 17:26:01 -0500
Subject: [Biococoa-dev] ambiguous symbols
In-Reply-To:
References: <200411141147.AA2309095696@mekentosj.com> <49AEA484-3679-11D9-B29B-003065A5FDCC@earthlink.net>
Message-ID: <2A342587-368C-11D9-B29B-003065A5FDCC@earthlink.net>

On Nov 14, 2004, at 4:18 PM, Alexander Griekspoor wrote:

> Da's mooi (that's great)! Sorry, I'll switch to English.
>
> I hadn't seen your post on cocoa-dev, but I remember us talking a while ago about performSelector and whether you could send it to classes. I understand the problem. Could the following perhaps help?
>
> temp = [(BCSymbol *)[self class] performSelector: NSSelectorFromString( )

Tadaa! Yes that worked, thanks Alex!

I have now committed the updated code. The method initializeSymbolRelationships is only in BCSymbol; I removed it from all subclasses. I also changed the error logic in that method by removing all the "if (foo == nil) return" calls and replacing them with "if (foo != nil) doSomething". At least the method will run to the end this way. I have only tested the translation demo, and that worked for me.

***Please review and test***

cheers,

- Koen.
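For readers following the performSelector discussion, here is a minimal sketch of the idea: a class method is looked up by name (for example a name read from a plist) and invoked on the class object itself. DemoBase and DemoSymbolNamed are hypothetical stand-ins, not BioCocoa code, and the nil return for an unknown name is an assumption that mirrors the "if (foo != nil)" style described above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for a BCSymbol-style class whose class methods
// return shared symbol objects. This is not BioCocoa code.
@interface DemoBase : NSObject
+ (NSString *)adenosine;
+ (NSString *)thymidine;
@end

@implementation DemoBase
+ (NSString *)adenosine { return @"A"; }
+ (NSString *)thymidine { return @"T"; }
@end

// Look up a class method by name and invoke it on the class object.
// Returns nil instead of raising when the class has no such method.
static id DemoSymbolNamed(Class symbolClass, NSString *methodName) {
    SEL selector = NSSelectorFromString(methodName);
    if (selector == NULL || ![symbolClass respondsToSelector:selector]) {
        return nil;
    }
    // Class objects accept performSelector: like any other receiver.
    // (Under ARC the compiler warns here because it cannot see the
    // selector's ownership rules; the call itself is valid.)
    return [symbolClass performSelector:selector];
}
```

A caller would write something like DemoSymbolNamed([DemoBase class], @"adenosine") and check the result for nil before using it.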
From mek at mekentosj.com Sun Nov 14 17:31:15 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Sun, 14 Nov 2004 23:31:15 +0100 Subject: [Biococoa-dev] ambiguous symbols In-Reply-To: <2A342587-368C-11D9-B29B-003065A5FDCC@earthlink.net> References: <200411141147.AA2309095696@mekentosj.com> <49AEA484-3679-11D9-B29B-003065A5FDCC@earthlink.net> <2A342587-368C-11D9-B29B-003065A5FDCC@earthlink.net> Message-ID: Ha! Sometimes I'm amazed how far a simple Biology guy can come ;-) Explicit typecasting is THE solution to most compiler warnings, that's why I guessed that might help. I most certainly will give it a try the moment my paperwork is sent back to the reviewers... Cheers, Alex Op 14-nov-04 om 23:26 heeft Koen van der Drift het volgende geschreven: > > On Nov 14, 2004, at 4:18 PM, Alexander Griekspoor wrote: > >> Da's mooi! Sorry I'll translate to english >> >> I hadn't seen your post op cocoa-dev, but I remember talking about >> performSelector and if you could send it to classes a while ago. I >> understand the problem. Could the following perhaps help? >> >> temp = [(BCSymbol *)[self class] performSelector: >> NSSelectorFromString( ) > > > Tadaa! Yes that worked, thanks Alex! > > I have now commited the updated code. The method > initializeSymbolRelationships is only in BCSymbol, I removed it from > all subclasses. I also changed the error logic in that method by > removing all the if (foo == nil) return calls, and replacing them by > if (foo != nil) doSomething. At least the method will finish this way. > I have only tested the translation demo, and that worked for me. > > ***Please review and test*** > > > cheers, > > - Koen. 
>
>
>
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com

Microsoft is not the answer, Microsoft is the question, NO is the answer
*********************************************************

From kvddrift at earthlink.net Tue Nov 16 20:24:51 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue, 16 Nov 2004 20:24:51 -0500
Subject: [Biococoa-dev] BCSymbolSet problem
Message-ID: <7A4D5B70-3837-11D9-AFCC-003065A5FDCC@earthlink.net>

Hi,

I am having some trouble populating a BCSymbolSet. I am using the code that was originally committed by Alex, and changed it so that BCSymbolSet is now a subclass of NSMutableSet. To populate a set, I put some code into the dnaStrictSymbolSet and initWithString methods. I also added two lines in the translation demo. You need to uncomment them to debug the symbol set code. Every time I reach the line with addSymbol, the program raises an exception. I have no idea why this is happening, so if you see what I am doing wrong, let me know!

There is another problem with the current approach. initWithString is now hardcoded to create a BCNucleotideDNA, but what if the method is called for RNA or a protein? I am not sure yet how to solve this. We might need to take a different approach after all.

cheers,

- Koen.
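One possible cause of the exception, not confirmed anywhere in this thread: NSMutableSet is a class cluster, so a direct subclass that does not override the primitive methods (such as addObject:, removeObject:, count, member: and objectEnumerator) raises as soon as one of them is called. A common alternative is to wrap a set rather than subclass it. The sketch below assumes that approach; DemoSymbolSet and its methods are hypothetical names, and symbols are plain one-letter NSString objects rather than BCSymbol instances.

```objc
#import <Foundation/Foundation.h>

// Hypothetical symbol set that owns an NSMutableSet instead of
// subclassing it. Assumes ARC; under manual retain/release, add a
// -dealloc that releases `symbols`.
@interface DemoSymbolSet : NSObject {
    NSMutableSet *symbols;
}
- (id)initWithString:(NSString *)string;
- (void)addSymbol:(NSString *)symbol;
- (BOOL)containsSymbol:(NSString *)symbol;
- (NSUInteger)count;
@end

@implementation DemoSymbolSet

- (id)init {
    self = [super init];
    if (self != nil) {
        symbols = [[NSMutableSet alloc] init];
    }
    return self;
}

// Adds every character of `string` as a one-letter symbol.
- (id)initWithString:(NSString *)string {
    self = [self init];
    if (self != nil) {
        for (NSUInteger i = 0; i < [string length]; i++) {
            [self addSymbol:[string substringWithRange:NSMakeRange(i, 1)]];
        }
    }
    return self;
}

// Forwards to the wrapped set; nil symbols are ignored, not raised on.
- (void)addSymbol:(NSString *)symbol {
    if (symbol != nil) {
        [symbols addObject:symbol];
    }
}

- (BOOL)containsSymbol:(NSString *)symbol {
    return symbol != nil && [symbols containsObject:symbol];
}

- (NSUInteger)count {
    return [symbols count];
}

@end
```

Because the wrapper never asks the abstract NSMutableSet class to store anything itself, the "method only defined for abstract class" kind of failure cannot occur here. Whether that is what actually happened in BCSymbolSet would need to be checked against the exception message.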
From kvddrift at earthlink.net Tue Nov 16 20:30:47 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue, 16 Nov 2004 20:30:47 -0500
Subject: [Biococoa-dev] more ramblings
Message-ID: <4EB065F1-3838-11D9-AFCC-003065A5FDCC@earthlink.net>

Hi,

In two recent code cleanups I did (rangeOfSubsequence and initializing the symbols) I found that code that was originally in each subclass could be moved either to the superclass or to an external wrapper. I hope you can appreciate that the code became more transparent and also easier to maintain. For example, during the coding of BCFindSequence, I found an error in the rangeOfSubsequence code (see my post of October 30th). Once I found the problem, it was easy to fix in BCFindSequence, because the code is in just one place, instead of in each variation of rangeOfSubsequence in all the subclasses (which I didn't fix yet ;).

I would appreciate it if you could check and try out the code in BCFindSequence. I already put some test code in the translation demo. Here are the relevant lines in the demo:

BCFindSequence *sequenceFinder = [BCFindSequence sequenceFinderWithSequence: theSequence];
[sequenceFinder setStrict: NO];
[sequenceFinder setFirstOnly: NO];
NSArray *foundIt = [sequenceFinder findSequence: [BCSequenceDNA DNASequenceWithString: @"AAT" skippingNonBases: YES]];
NSLog ( @"the found-array is %@", foundIt );

Try changing the setStrict and setFirstOnly values and the @"AAT" search string, and see if the results displayed by NSLog in the console are what you expect. Note that the results in 'foundIt' are stored as NSRanges wrapped in NSValue; we may have to change that. Maybe you can try to put an ambiguous symbol in the search string. Try feeding it a protein, or RNA. If I have done everything right, BCFindSequence should behave like all the variations of rangeOfSubsequence in BCSequence and its subclasses. If not, let me know what went wrong and I will see if I can fix it.
By introducing BCFindSequence, I hope I showed that we don't need all the variations of rangeOfSubsequence in multiple locations. I am confident that the same applies to other sequence manipulations. For instance, code to calculate a complement or reverse complement could also go into a wrapper class. Code to translate a sequence is already in a wrapper class.

You probably can guess where I am going next :-)

Having said all that, again I want to make a case that we don't have to subclass BCSequence. A sequence object IMO should only take care of maintaining the array of symbols, and maybe store additional information about the sequence, such as annotations and features. I don't think this is distorting biology, because in real life, DNA and proteins also use additional proteins to extend their behaviour (translate, get the complement, look for an epitope, digest, transport through the membrane, etc).

Another advantage is the following. Last week I asked for a way to determine if a fasta file contains a DNA or a protein sequence. We don't know in advance, so what should the readFasta method return, BCSequenceProtein or BCSequenceDNA? If we just have readFasta return a BCSequence, the read method doesn't have to worry about that! Of course, when actually creating the sequence, we could either set BCSequenceType or introduce a symbolset/alphabet, so at least we and the user know what we are dealing with. But this is not the responsibility of readFasta, which only extracts the relevant information from a file and passes it on to the code that creates a sequence.

I hope that by showing some concrete examples I can convince you guys this time that we don't have to subclass BCSequence, or at least that we should use wrappers for all additional functionality.

please now go ahead and shoot me ;-)

cheers,

- Koen.
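To make the finder idea concrete, here is a small self-contained sketch of one search routine with a strict flag and a first-only flag that returns NSRange values boxed in NSValue, as the demo output above is described. It is not the BCFindSequence implementation. Symbols are plain one-letter strings, and the rule that a query "N" matches anything in non-strict mode is an assumption made for illustration; all Demo names are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Split a string into one-letter symbols (stand-ins for BCSymbol objects).
static NSArray *DemoSymbolsFromString(NSString *string) {
    NSMutableArray *symbols = [NSMutableArray array];
    for (NSUInteger i = 0; i < [string length]; i++) {
        [symbols addObject:[string substringWithRange:NSMakeRange(i, 1)]];
    }
    return symbols;
}

// Assumed matching rule: identical symbols always match; in non-strict
// mode a query symbol "N" additionally matches any sequence symbol.
static BOOL DemoSymbolsMatch(NSString *querySymbol, NSString *sequenceSymbol, BOOL strict) {
    if ([querySymbol isEqualToString:sequenceSymbol]) {
        return YES;
    }
    return !strict && [querySymbol isEqualToString:@"N"];
}

// Returns an array of NSValue-wrapped NSRanges, one per match of `query`
// in `sequence`. With firstOnly set, the search stops at the first match.
static NSArray *DemoFindSubsequence(NSArray *sequence, NSArray *query,
                                    BOOL strict, BOOL firstOnly) {
    NSMutableArray *found = [NSMutableArray array];
    NSUInteger sequenceLength = [sequence count];
    NSUInteger queryLength = [query count];
    if (queryLength == 0 || queryLength > sequenceLength) {
        return found;
    }
    for (NSUInteger start = 0; start + queryLength <= sequenceLength; start++) {
        BOOL matches = YES;
        for (NSUInteger offset = 0; offset < queryLength && matches; offset++) {
            matches = DemoSymbolsMatch([query objectAtIndex:offset],
                                       [sequence objectAtIndex:start + offset],
                                       strict);
        }
        if (matches) {
            [found addObject:[NSValue valueWithRange:NSMakeRange(start, queryLength)]];
            if (firstOnly) {
                break;
            }
        }
    }
    return found;
}
```

The point of the sketch is the one made in the message: the loop never asks what kind of symbol it is holding, so the same routine serves DNA, RNA and protein, and the comparison rule lives in one place.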
From jtimmer at bellatlantic.net Tue Nov 16 22:24:31 2004 From: jtimmer at bellatlantic.net (John Timmer) Date: Tue, 16 Nov 2004 22:24:31 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: <4EB065F1-3838-11D9-AFCC-003065A5FDCC@earthlink.net> Message-ID: Once again, I have to say I think this is a really bad idea. Let me count the ways... For starters, nucleotide and protein sequences have some things in common, but in general, they're very different. They have different information content. You do different things with them. Why try to squish them together? Treating them as the same object reduces the object's information content without gaining any clear benefit. One of the whole ideas of object oriented programming is to group the data with the methods that act on it. Complement/reverse complement are things that only work with nucleotide sequences - they belong in a class that handles nucleotide sequences. The way you're trying to structure things is by separating methods from data. As a result, we're going to have one of two situations - a TON of small wrapper classes that only perform a limited number of functions, or a few giant conglomerations of utility functions. It's not going to be easier to find and maintain methods. The rangeOfSubsequence isn't the horrible situation you make it out to be. We've got a set of related methods that work on all sequences in the superclass. When I get free time (hah!), I'll move the other set (handling ambiguous sequences) into the superclass to - I'll just have it check for the sequence type at the start. The methods will go with the data they work with. The FASTA situation is also a bad example to support your case. Some file formats contain information regarding the type of sequence, other's don't. Why should we make a sequence object handle that, or create a new class to act as an intermediary - dealing with differences in file format is the job of the object that knows about file formats, not a sequence. 
Given all these things I view as negatives, I still don't understand what advantages a single sequence class would provide. The concrete examples you provide seem to me to be causing more organizational issues than they solve, and not following good OOP design. My first instinct would be to take anything in BCFindSequence and work it back in to BCSequence. Another way to think about this - let's assume that Apple knows what they're doing in designing their classes. The most analogous item in Cocoa's Foundation is NSMutableString. There is only one utility class that's directly related to strings (NSScanner - maybe two with NSCharacterSet). Just about all the methods needed for handling the contents of strings are either in NSMutableString or its superclass. It's good design. No shooting though! At least not unless I ever invest in a copy of Halo... JT > By introducing BCFindSequence, I hope I showed that we don't need all > the variations of rangeOfSubsequence in multiple locations. I am > confident that the same applies for other sequence manipulations. For > instance, code to calculate a complement or reverse complement could > also go into a wrapper class. Code to translate a sequence is already > in a wrapper class. > > > You probably can guess where I am going next :-) > > Having said all that, again I want to make a case that we don't have to > subclass BCSequence. A sequence object IMO should only take care of > maintaining the array of symbols, and maybe store additional > information about the sequence, such as annotations and features. I > don't think this is distorting biology, because in real life, DNA and > proteins also use additional proteins to extend their behaviour > (translate, get the complement, look for a epitope, digest, transport > through the membrane, etc). > > Another advantage is the following. Last week I asked for a way to > determine if a fasta file contains a dna or protein. 
We don't know in > advance, so what should the readFasta method return, BCSequenceProtein > or BCSequenceDNA? If we just have readFasta return a BCSequence the > read-method doesn't have to worry about that! Of course, when actually > creating the sequence, we could either set BCSequenceType or a > introduce a symbolset/alphabet, so at least we and the user knows what > we are dealing with. But this is not the responsibility of readFasta > which only extracts the relevant information from a file, and passes it > on the code that creates a sequence. > > I hope that with showing some concreate examples that this time I can > convince you guys that we don't have to subclass BCSequence, or at > least use wrappers for all additional functionality. > > please now go ahead and shoot me ;-) _______________________________________________ This mind intentionally left blank From kvddrift at earthlink.net Wed Nov 17 17:51:00 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Wed, 17 Nov 2004 17:51:00 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: References: Message-ID: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> On Nov 16, 2004, at 10:24 PM, John Timmer wrote: > Once again, I have to say I think this is a really bad idea. Let me > count > the ways... > > For starters, nucleotide and protein sequences have some things in > common, > but in general, they're very different. They have different > information > content. You do different things with them. Why try to squish them > together? Treating them as the same object reduces the object's > information > content without gaining any clear benefit. > For me, the differences between nucleotide and protein sequences are the building blocks. That's where all the information is. Other than that a sequence is 'just a black box with an array of symbols', regardless of the nature of the symbols. 
Adding, removing, replacing symbols is now indeed handled through BCSequence, and is similar for all types of sequences. When obtaining a complement/reverse complement of a sequence, you actually do this for each individual symbol. The sequence is just a convenient way to iterate over the symbols. This is why we store information about complements, etc in the symbols, not in the sequence. When searching a sequence, you actually compare symbols not the sequence itself. The same for translating, digesting, MW calculations, etc. (And each symbol knows what kind of subclass of BCSymbol it is, even though we use selfSymbol = (BCSymbol *)CFArrayGetValueAtIndex( (CFArrayRef) selfArray, loopCounter), so there is no need to explicitly cast them as BCAminoAcid, etc). > One of the whole ideas of object oriented programming is to group the > data > with the methods that act on it. That's very true, but I don't think that I am trying to separate the methods from the data, because the actual method to get a complement, MW, etc is in BCSymbol and it's subclasses, together with the data. > The rangeOfSubsequence isn't the horrible situation you make it out to > be. I tried to be milder this time :) > We've got a set of related methods that work on all sequences in the > superclass. When I get free time (hah!), I'll move the other set > (handling > ambiguous sequences) into the superclass to - I'll just have it check > for > the sequence type at the start. Right now, you have several very similar methods in BCSequence (and its subclasses). As I said before, this is usually a situation in OOP when one has to rethink the design, and try to find a way to avoid duplicating code. This is what I tried to do by introducing the BCFindSequence class. If you look at that code, you'll see that I even didn't have to check for which BCSymbol I was dealing with. The real comparing is done by the symbol itself. By setting a flag, you can search for ambiguous as well, all with the same method. 
But if you can find a way to simplify the code within BCSequence, I'll be all for it. And you'll see that when the code is only in BCSequence, all sequences are treated equally, the search algorithm is the same for proteins and nucleotide sequences, only the symbols are different. > The FASTA situation is also a bad example to support your case. Some > file > formats contain information regarding the type of sequence, other's > don't. > Why should we make a sequence object handle that, or create a new > class to > act as an intermediary - dealing with differences in file format is > the job > of the object that knows about file formats, not a sequence. > I agree with you not to make a sequence responsible for dealing with the file format, that would complicate things only more :) Dealing with differences in file format is what BCSequenceReader already does. It's first method tests the first line for specific characters and then passes it on the the appropriate method, readFasta, readSwissProt, etc. This method then extracts the necessary information, including a string of symbols. But should the reader methods be concerned with whether it is a protein or nucleotide sequence? I don't think they should. The introduction of a factory class is a well established design pattern in OOP that deals with these sort of situations. An advantage is that when you ever decide to change the way a sequence is created, or introduce a new type of sequence, you only have to modify the code once in the factory, not in each readXXX method. Or maybe later on we decide to implement a new read method, or introduce a class to obtain a sequence from a database. Maybe the user types in a sequence in an NSTextView and wants to make a BCSequence. Should each of these classes then try to figure out whether it's a protein or nucleotide sequence? If we keep that code in one place (factory or whatever you want to call it) it makes it much easier to maintain. 
> Given all these things I view as negatives, I still don't understand > what > advantages a single sequence class would provide. The concrete > examples you > provide seem to me to be causing more organizational issues than they > solve, > and not following good OOP design. I have added some more examples in this reply, and hopefully showed that this is also a good OOP design. I am very guilty of supporting the BCSequence subclasses myself when we just started. But now that BioCocoa is growing, I came to the realization that we may have to shuffle things around to make the code easier to use and maintain. I have enough experience with OOP to know that when the project becomes larger, you're glad that you kept the code modular. If you ever decide you have to change a method, it's much easier to just fix it in one place, instead of to have to remember in which places this code was added. > My first instinct would be to take > anything in BCFindSequence and work it back in to BCSequence. Please do so, but leave the BCFindSequence code as an alternative :) > > Another way to think about this - let's assume that Apple knows what > they're > doing in designing their classes. The most analogous item in Cocoa's > Foundation is NSMutableString. There is only one utility class that's > directly related to strings (NSScanner - maybe two with > NSCharacterSet). > Just about all the methods needed for handling the contents of strings > are > either in NSMutableString or its superclass. It's good design. > NSString indeed maintains a list of characters, and also does some basic character manipulation, and substring searching. But it doesn't translate a string to another language! cheers, - Koen. 
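The factory argument can be sketched as a single function that every reader (file, database, text view) calls to decide the sequence type. This is only an illustration under stated assumptions: the enum, the function names and above all the letter-based guessing rule are hypothetical and are not part of BioCocoa. A reader that knows the type from the file format passes it as a hint; otherwise the guess is used.

```objc
#import <Foundation/Foundation.h>

typedef enum {
    DemoSequenceTypeUnknown = 0,
    DemoSequenceTypeDNA,
    DemoSequenceTypeRNA,
    DemoSequenceTypeProtein
} DemoSequenceType;

// Assumed heuristic: only A, C, G, T, N means DNA; only A, C, G, U, N
// means RNA; anything else is treated as protein. A short peptide made
// up solely of A, C, G and T residues would be misread as DNA, and a
// string without T or U is reported as DNA, so this is a fallback only.
static DemoSequenceType DemoGuessSequenceType(NSString *symbols) {
    if ([symbols length] == 0) {
        return DemoSequenceTypeUnknown;
    }
    NSString *upper = [symbols uppercaseString];
    NSCharacterSet *notDNA =
        [[NSCharacterSet characterSetWithCharactersInString:@"ACGTN"] invertedSet];
    NSCharacterSet *notRNA =
        [[NSCharacterSet characterSetWithCharactersInString:@"ACGUN"] invertedSet];
    if ([upper rangeOfCharacterFromSet:notDNA].location == NSNotFound) {
        return DemoSequenceTypeDNA;
    }
    if ([upper rangeOfCharacterFromSet:notRNA].location == NSNotFound) {
        return DemoSequenceTypeRNA;
    }
    return DemoSequenceTypeProtein;
}

// Single decision point for all callers. `hint` is what the caller
// already knows (for example from a protein-only file format); pass
// DemoSequenceTypeUnknown when the source does not say.
static DemoSequenceType DemoResolveSequenceType(NSString *symbols, DemoSequenceType hint) {
    if (hint != DemoSequenceTypeUnknown) {
        return hint;
    }
    return DemoGuessSequenceType(symbols);
}
```

With this shape, a change to the guessing rule or the addition of a new sequence type touches one function rather than every readXXX method, which is the maintenance benefit claimed in the message.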
From jtimmer at bellatlantic.net Wed Nov 17 22:26:44 2004 From: jtimmer at bellatlantic.net (John Timmer) Date: Wed, 17 Nov 2004 19:26:44 -0800 Subject: [Biococoa-dev] more ramblings In-Reply-To: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> Message-ID: Koen - > For me, the differences between nucleotide and protein sequences are > the building blocks. That's where all the information is. Other than > that a sequence is 'just a black box with an array of symbols', > regardless of the nature of the symbols. Adding, removing, replacing > symbols is now indeed handled through BCSequence, and is similar for > all types of sequences. When obtaining a complement/reverse complement > of a sequence, you actually do this for each individual symbol. The > sequence is just a convenient way to iterate over the symbols. This is > why we store information about complements, etc in the symbols, not in > the sequence. When searching a sequence, you actually compare symbols > not the sequence itself. The same for translating, digesting, MW > calculations, etc. (And each symbol knows what kind of subclass of > BCSymbol it is, even though we use selfSymbol = (BCSymbol > *)CFArrayGetValueAtIndex( (CFArrayRef) selfArray, loopCounter), so > there is no need to explicitly cast them as BCAminoAcid, etc). This is an interesting point, and may come down to style more than anything else. You're absolutely right that things like a complement are handled primarily at the symbol level. There are two reasons I'd argue that they're worth keeping at the sequence level, too though. The first is simply convenience - a lot of people are going to want it, so why make them write their own? The second is that we can optimize it heavily (as we've already done partly) in a way that not everybody is going to be interested in doing, so a lot more people will have access to better code. 
If you accept that the method should exist at all, then it makes the most sense to put it with the data it operates on (ie as part of some sort of sequence class). Otherwise, anybody using the Framework has to figure out what class handles that type of method, and then dig through its docs to find the appropriate method. > Right now, you have several very similar methods in BCSequence (and its > subclasses). As I said before, this is usually a situation in OOP when > one has to rethink the design, and try to find a way to avoid > duplicating code. Right, and last time this came up, I mentioned that I had every intention of fixing it. It's not a fundamental class structure problem - it was a problem with me trying to put something in place first, and fix it later. I don't know how else to possibly say that this situation is temporary, and doesn't say anything informative about the class structure. I'd also like to point out that having 2 methods vs. 1 method with a boolean flag, as yours apparently does, doesn't make any argument about class complexity at all. I went back and forth on which to do for a while, and settled on 2. If people prefer 1, it can be changed. Anyway, it's probably good that I ran out of time. I'm pretty sure that all individual sequence elements now have the notion of ambiguity (if codons don't have it, they should! And I know who to blame if they don't!), it should be easy to implement at the top level class. > But should the reader methods be concerned with whether it > is a protein or nucleotide sequence? I don't think they should. My point was that in some cases, it absolutely has to. If it's a protein specific file format, the file reader has to specifically produce a protein sequence, even if it's got illegal, DNA characters in it. Good design would also dictate that we have a defined way of how things should behave when there is no way of determining what type of sequence it is from the file or metadata. 
Now, I'm not arguing heavily about where this specificity should be provided - a factory type object is fine by me. The point was primarily that this situation doesn't argue for or against having sequence subclasses. >> My first instinct would be to take >> anything in BCFindSequence and work it back in to BCSequence. > > Please do so, but leave the BCFindSequence code as an alternative :) Don't worry, I'd never intentionally delete someone else's work. Where's that located in the CVS directory structure, anyway? It's not showing up as an added file in XCode on my machine, so I'd like to download it at some point. > NSString indeed maintains a list of characters, and also does some > basic character manipulation, and substring searching. But it doesn't > translate a string to another language! Funny you should mention that - it actually does, to a degree. Look at the locale methods. Changing case is also a form of translation. Just to prove that I'm not interested in mindlessly adding stuff to a class, though, if I were left in charge, I'd move all the path methods over to NSFileManager immediately ;). Anyway, my main argument is that there are a lot of things that are going to be specific for one type of sequence or another. Hydrophobicity, charge, etc. that are all protein specific, while complements, melting temperature, haripin possibilities, GC% and such are all DNA/RNA specific. There are also going to be a lot of useful things that are sequence-type specific that none of us here have thought about yet, and will only be revealed to us if we get more developers onto the project who need that feature. We're going to want sequence-type specific methods to do all these things. It comes down to the design decision of whether you want to send the sequence off somewhere else to get information back on it, or whether you want to ask the sequence to tell you something about itself. 
I'd say that for the most part, for someone trying to use this framework, it's much easier to ask the sequence, instead of trying to figure out what object/method they need to send the sequence to. I also don't think that it leads to a painful burden on us developers in terms of organization. I think the individual symbols are great examples of this approach - they are incredibly powerful because, unlike a character, they know things about themselves. You don't have to dig around to find out which class/method are needed to find out what the complement of a base is - the base already knows what its complement is. I'd love to see the same power extended to sequences as a whole. Anyway, I think i've blathered enough on this topic for one day - Cheers, John _______________________________________________ This mind intentionally left blank From kvddrift at earthlink.net Wed Nov 17 19:45:56 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Wed, 17 Nov 2004 19:45:56 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: References: Message-ID: <354D5B5E-38FB-11D9-AFCC-003065A5FDCC@earthlink.net> On Nov 17, 2004, at 10:26 PM, John Timmer wrote: > Where's > that located in the CVS directory structure, anyway? It's not showing > up as > an added file in XCode on my machine, so I'd like to download it at > some > point. Just a quick reply to this question. It should be in BCTools/BCSequenceTools. On the CVSweb site you can find it at: http://bioinformatics.org/cgi-bin/cvsweb.cgi/BioCocoa/BCFoundation/ BCTools/BCSequenceTools/ I'll need some more time to digest (no pun) all your other comments :-) I hope we get some input from the other developers as well, it seems to be a too fundamental issue to discuss between the two of us. - Koen. 
From mek at mekentosj.com Thu Nov 18 08:06:32 2004
From: mek at mekentosj.com (Alexander Griekspoor)
Date: Thu, 18 Nov 2004 14:06:32 +0100
Subject: Fwd: [Biococoa-dev] more ramblings
Message-ID:

Sorry, forgot to include the biococoa list....

*******

Ok guys, this is going really fast now, I can hardly keep up, but it's a good thing ;-) I'm not sure if this is the best way, but I decided to comment on each email individually instead of aggregating everything in one large reply. Unfortunately, that probably means even more reading ;-)

Practically, I haven't had the time to help solve the initWithString method in BCSymbolSet, nor did I have the time to check whether BCFindSequence works fine. The work on BCFindSequence looks really promising though, Koen. Well done!

< snippet of nice work from Koen and suggestions to check it out, which I will certainly do >

>
> By introducing BCFindSequence, I hope I showed that we don't need all the variations of rangeOfSubsequence in multiple locations. I am confident that the same applies for other sequence manipulations. For instance, code to calculate a complement or reverse complement could also go into a wrapper class. Code to translate a sequence is already in a wrapper class.

Yep, I totally agree in this case. I believe I expressed my preference before for keeping BCSequence objects as mere data stores and putting manipulations like these in specialized wrapper classes that each do one task very well. The restriction enzyme / digester thing is another perfect example that has come up a number of times. Still, there's a fairly large grey area here. For example, complementing and reversing sequences are fairly simple things and I'm not sure you should have a wrapper for that. [mySequence complement] is so simple compared to a wrapper solution. NSString illustrates this nicely too; it has a lot of such methods as well.
Two remarks here:

1) I partially see why Koen wants to factor all these methods out of BCSequence, as making BCSequence a general "one-for-all-types" sequence object would no longer allow you to keep these kinds of methods very simple.

2) Of course, one alternative would be to have the best of both worlds by making, as an example, [mySequence complement] a convenience method that internally calls the proper wrapper/helper object. We can then still have a simple interface AND a central place for the code that does the work behind the scenes.

> You probably can guess where I am going next :-)

Let me see... No, not really. Kidding. So the question here is whether we go for one general BCSequence class or multiple ones.

> Having said all that, again I want to make a case that we don't have to subclass BCSequence. A sequence object IMO should only take care of maintaining the array of symbols, and maybe store additional information about the sequence, such as annotations and features. I don't think this is distorting biology, because in real life, DNA and proteins also use additional proteins to extend their behaviour (translate, get the complement, look for a epitope, digest, transport through the membrane, etc).

True, but physically they are different as well, and use different enzymes to be synthesized and degraded, for instance. But in principle you're right, there's a lot to be said for this option.

> Another advantage is the following. Last week I asked for a way to determine if a fasta file contains a dna or protein. We don't know in advance, so what should the readFasta method return, BCSequenceProtein or BCSequenceDNA? If we just have readFasta return a BCSequence the read-method doesn't have to worry about that! Of course, when actually creating the sequence, we could either set BCSequenceType or a introduce a symbolset/alphabet, so at least we and the user knows what we are dealing with.
But this is not the responsibility of readFasta > which only extracts the relevant information from a file, and passes > it on the code that creates a sequence. Yes, but that's just pushing the problem ahead, and has a few more consequences. For instance in the case of the fasta file, say we have "AAAATTT" (worst case scenario I agree). Sure we can instantiate a very general class for the sequence, but then which symbol do you pick to fill it? The A for Alanine, or the A for Adenine? I hope not an "N" or "Unknown". In the end, you MUST choose which type to go for, and once you have made that choice, you can just as well set the BCSequence type, or in our case pick the proper subclass. Unless there is a better alternative that I don't see. But even if you could read a fasta file into an untyped BCSequence with "untyped" symbols, what happens if you feed this one to a "make_complement" wrapper? You get the same problem again and again: what is the complement of an A symbol? Either nothing in the protein world (or perhaps a codon ;-) or a T (I know it doesn't make sense to ask a protein for its complement, but as an example I think it illustrates the problem well). > > I hope that with showing some concrete examples that this time I can > convince you guys that we don't have to subclass BCSequence, or at > least use wrappers for all additional functionality. To a certain point yes, at least I agree with the latter part. I'm strongly in favor of the wrapper classes, I use them a lot myself and think they nicely separate the model from the "controller". Also, with convenience methods one can still keep things "hybrid" I think. In principle, I don't believe in untyped sequences, but of course the biojava way shows one possibility: a general BCSequence that is typed by the BCSymbolSet (Alphabet) you attach to it. I'm not sure, as I can't foresee the many consequences and problems that this strategy has.
Most important, I think that the current solution works nicely, though more methods could be transferred to wrapper classes. Finally, the annotation stuff and the like are indeed very general, but hey, that's why they belong in the BCSequence superclass, right! > > please now go ahead and shoot me ;-) No shooting please ;-) Cheers, Alex ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 AIM: mekentosj at mac.com E-mail: a.griekspoor at nki.nl Web: http://www.mekentosj.com iRNAi, do you? http://www.mekentosj.com/irnai ********************************************************* From jtimmer at bellatlantic.net Thu Nov 18 15:53:38 2004 From: jtimmer at bellatlantic.net (John Timmer) Date: Thu, 18 Nov 2004 12:53:38 -0800 Subject: [Biococoa-dev] more ramblings In-Reply-To: Message-ID: > > To a certain point yes, at least I agree with the latter part. I'm strongly in > favor of the wrapper classes, I use them a lot myself and think they nicely > separate the model from the "controller". Also, with convenience methods one > can still keep things "hybrid" I think. > In principle, I don't believe in untyped sequences, but of course the biojava > way shows one possibility to indeed have a general BCSequence that is typed by > the attached BCSymbolSet (Alphabet) you attach to it. I'm not sure as I can't > overlook many consequences and problems that this strategy has. Most > important, I think that the current solution works nicely, though more methods > could be transferred to wrapper classes. Finally, the annotation stuff and > alike are indeed very general, but hey that's why they belong in the > BCSequence super class right! Yeah, the more I look at BioJava's actual code, the less excited I become about using their progress as a model. Have you ever tried to trace through their process for translation? I never got to the point where I could see anything actually related to an amino acid. It calls through so many methods before it attempts to do anything that it must take about half an hour to accomplish anything. BioJava rant aside - I'm comfortable with the idea mentioned somewhere in Alex's message of shifting the actual code for some of the sequence manipulation/calculation into wrapper classes, but providing call throughs to the methods in the sequence classes. Another alternative would be to have these methods attached as categories on BCSequences.
With either of these, you would get Koen's code separation and I'd be happy about the more direct connection of methods to data. Anyway, to stir up more controversy around here, I had always envisioned something along the following structure:

Sequence bundle
(groups related sequences)
    |
Sequence wrapper
(holds features, notes, etc.)
    |
Sequence

The reason being that I see features as being abstractions, not inherent to any type of sequence. They're mostly a bit of information and a range it's relevant to. There are some exceptions to this - for example, a phosphorylation site changes the MW of a protein - but they are largely exceptions. These exceptions are going to be difficult to handle regardless - how to tell if a site is or isn't glycosylated is going to be very context dependent. The majority of features (ORFs, kinase domains, restriction sites, etc.) don't require that sort of heavy lifting. Just something else to think about.... Cheers, Jay _______________________________________________ This mind intentionally left blank From mek at mekentosj.com Thu Nov 18 15:27:25 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Thu, 18 Nov 2004 21:27:25 +0100 Subject: [Biococoa-dev] more ramblings In-Reply-To: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> References: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> Message-ID: <429780BD-39A0-11D9-9BD4-000D93AE89A4@mekentosj.com> Hi guys, Reading over the rest of the emails I missed, I think I have made my point already, just a few remarks that come up while reading: > But should the reader methods be concerned with whether it is a > protein or nucleotide sequence? I don't think they should. The > introduction of a factory class is a well established design pattern > in OOP that deals with these sort of situations.
An advantage is that > when you ever decide to change the way a sequence is created, or > introduce a new type of sequence, you only have to modify the code > once in the factory, not in each readXXX method. Or maybe later on we > decide to implement a new read method, or introduce a class to obtain > a sequence from a database. Maybe the user types in a sequence in an > NSTextView and wants to make a BCSequence. Should each of these > classes then try to figure out whether it's a protein or nucleotide > sequence? If we keep that code in one place (factory or whatever you > want to call it) it makes it much easier to maintain. Again, although certainly a possibility, sometimes there are alternatives just as good. For example, the thing that I immediately thought after reading: > An advantage is that when you ever decide to change the way a sequence > is created, or introduce a new type of sequence, you only have to > modify the code once in the factory, not in each readXXX method. was "Why not do the type checking/bcsequence subclass creation in one method inside the current implementation?" We don't check the type in each readXXX method either, so one method like "determineSequenceType" or "sequenceObjectForFile" would also allow you to keep all "factory" methods centralized and easy to change right? > I have added some more examples in this reply, and hopefully showed > that this is also a good OOP design. Again, there are many examples that prove both to be good OOP design IMHO. > I am very guilty of supporting the BCSequence subclasses myself when > we just started. But now that BioCocoa is growing, I came to the > realization that we may have to shuffle things around to make the code > easier to use and maintain. That's a good thing Koen, it's never smart to keep going without reflection, still in this case we can end up with a good mix I hope. >> My first instinct would be to take >> anything in BCFindSequence and work it back in to BCSequence. 
> > Please do so, but leave the BCFindSequence code as an alternative :) Nope, let's choose! We should all agree on one way again IMHO, we can provide convenience methods, but not two completely separate things please. >> Another way to think about this - let's assume that Apple knows what >> they're >> doing in designing their classes. The most analogous item in Cocoa's >> Foundation is NSMutableString. There is only one utility class that's >> directly related to strings (NSScanner - maybe two with >> NSCharacterSet). >> Just about all the methods needed for handling the contents of >> strings are >> either in NSMutableString or its superclass. It's good design. >> > > NSString indeed maintains a list of characters, and also does some > basic character manipulation, and substring searching. But it doesn't > translate a string to another language! [myString UTF8String]; [myString fileSystemRepresentation]; [myString cString]; (no spellcheck) seem like nice examples to me (maybe not so complicated, but they are representations in different "languages"). >> Right now, you have several very similar methods in BCSequence (and >> its >> subclasses). As I said before, this is usually a situation in OOP when >> one has to rethink the design, and try to find a way to avoid >> duplicating code. > Right, and last time this came up, I mentioned that I had every > intention of > fixing it. It's not a fundamental class structure problem - it was a > problem with me trying to put something in place first, and fix it > later. I > don't know how else to possibly say that this situation is temporary, > and > doesn't say anything informative about the class structure. Work in progress ;-) > I'd also like to point out that having 2 methods vs. 1 method with a > boolean flag, as > yours apparently does, doesn't make any argument about class > complexity at > all. I went back and forth on which to do for a while, and settled on > 2. > If people prefer 1, it can be changed.
As you might have understood (boy, I almost sound like Steve Ballmer with his developers, developers, developers!!), I'm a great fan of convenience methods (convenience, convenience, convenience!!). Please provide two methods, one detailed, the other simple and convenient: 1) myMethodDoesThis:withThisAsAnArgument:andThisAsAnArgument: etc. 2) myMethodDoesThis (with default arguments). It's one of the things I absolutely love in the Cocoa frameworks. > It comes down to the design decision of whether you want to send the > sequence off somewhere else to get information back on it, or whether > you > want to ask the sequence to tell you something about itself. I'd say > that > for the most part, for someone trying to use this framework, it's much > easier to ask the sequence, instead of trying to figure out what > object/method they need to send the sequence to. I also don't think > that it > leads to a painful burden on us developers in terms of organization. I think it all comes down to how to describe the border, or the guidelines for when to choose internal or external methods. My gut feeling says that hardcoded properties, "one-liner" calculations, and trivial methods can be done internally. Speed is also an issue: if it takes time to calculate things, it's way nicer to provide a wrapper object, because that allows for threading, asynchronous methods, and progress monitoring very elegantly. Something like length is so easy to calculate (a typical one-liner) that it would be ridiculous to have a helper calculate it. Complex calculations with many lines of code, special conditions, many parameters etc. should definitely go outside. (Maybe the guideline for internal methods should be "no arguments in the method name" ;-) For some things I tend to let the gut feeling be determined by biology (strange huh).
Translation needs a complete machinery in the cell, so it should here ;-) I would say properties and representations inside; conversions, calculations, and manipulations outside. > I think the individual symbols are great examples of this approach - > they > are incredibly powerful because, unlike a character, they know things > about > themselves. They have properties, I would say, in the light of the above. > You don't have to dig around to find out which class/method are > needed to find out what the complement of a base is - the base already > knows > what its complement is. I'd love to see the same power extended to > sequences as a whole. Right, so a BCSequence should have a GC% method or MW method or something like that, for example, right? So we add the favorite things that everyone would love to see (in the superclass if it's a general thing, which is the case for MW but not for GC%): [mySequence gcPercentage] or [mySequence molecularWeight] (purely hypothetical). But now comes the point: would the end user or our framework care that the actual method is a convenience one and that there's a helper/wrapper object handling things behind the scenes? I wouldn't think so. Would we care? Absolutely!!! 1) It allows us to keep the codebase of the sequence object clean and lightweight. 2) It can centralize code that works on multiple types: all subclasses can call the same convenience method (so it can go in the superclass) if necessary, and guess what, the wrapper instantly knows the type of sequence it's working on (simply ask the sender for its type). Central code is easier to maintain, change and optimize. 3) Caching: think of sharedHelper objects, which you can keep alive (along with the data they require to work, like enzyme dictionaries) if you want to do batch conversions/processing! For users of our framework it's no problem to understand the code if we document our methods well and say when certain methods make use of wrappers.
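To make the convenience-method idea concrete, here is a minimal sketch of a sequence whose gcPercentage method simply forwards to a shared helper that asks the sequence for its type. Everything in it is an assumption for illustration: the class names (SketchSequence, SketchCompositionHelper), the type enum, and the choice to return nil for proteins are hypothetical and not actual BioCocoa code. It is also written in present-day Objective-C (ARC, properties, dispatch_once), which postdates this thread.

```objc
#import <Foundation/Foundation.h>

// Every name here is a hypothetical stand-in, not an actual BioCocoa class.
typedef enum { SketchTypeDNA, SketchTypeProtein } SketchSequenceType;

@interface SketchSequence : NSObject
@property (nonatomic, copy, readonly) NSString *symbols;
@property (nonatomic, readonly) SketchSequenceType type;
- (instancetype)initWithString:(NSString *)string type:(SketchSequenceType)type;
- (NSNumber *)gcPercentage;   // convenience method; the work happens in the helper
@end

// Helper ("wrapper") object that keeps the calculation in one central place.
@interface SketchCompositionHelper : NSObject
+ (instancetype)sharedHelper;
- (NSNumber *)gcPercentageOfSequence:(SketchSequence *)sequence;
@end

@implementation SketchCompositionHelper

+ (instancetype)sharedHelper {
    // One shared instance, so any cached data could survive batch processing.
    static SketchCompositionHelper *shared = nil;
    static dispatch_once_t once;
    dispatch_once(&once, ^{ shared = [[self alloc] init]; });
    return shared;
}

- (NSNumber *)gcPercentageOfSequence:(SketchSequence *)sequence {
    // The helper asks the sender for its type; GC% makes no sense for a protein.
    if (sequence.type != SketchTypeDNA || sequence.symbols.length == 0) {
        return nil;
    }
    NSUInteger gc = 0;
    for (NSUInteger i = 0; i < sequence.symbols.length; i++) {
        unichar c = [sequence.symbols characterAtIndex:i];
        if (c == 'G' || c == 'C') gc++;   // assumes upper-case symbols
    }
    return @(100.0 * gc / sequence.symbols.length);
}

@end

@implementation SketchSequence

- (instancetype)initWithString:(NSString *)string type:(SketchSequenceType)type {
    if ((self = [super init])) {
        _symbols = [string copy];
        _type = type;
    }
    return self;
}

- (NSNumber *)gcPercentage {
    // The caller never needs to know that a helper is involved.
    return [[SketchCompositionHelper sharedHelper] gcPercentageOfSequence:self];
}

@end

int main(void) {
    @autoreleasepool {
        SketchSequence *dna = [[SketchSequence alloc] initWithString:@"ATGC" type:SketchTypeDNA];
        SketchSequence *protein = [[SketchSequence alloc] initWithString:@"MKVL" type:SketchTypeProtein];
        NSLog(@"DNA GC%%: %@", [dna gcPercentage]);
        NSLog(@"Protein GC%%: %@", [protein gcPercentage]);
    }
    return 0;
}
```

The caller only ever sees [mySequence gcPercentage]; whether the helper stays a singleton or becomes a per-task object is exactly the kind of per-method decision discussed here.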
So then the final question: when to go for helpers or not? In the end we should decide on a per-method basis I guess; it depends on how complicated things are to generate, how much is shared by multiple sequence types, etc... Let's leave that up to further discussions when we're actually getting there ;-) Cheers, Alex From mek at mekentosj.com Thu Nov 18 17:03:11 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Thu, 18 Nov 2004 23:03:11 +0100 Subject: Fwd: [Biococoa-dev] more ramblings Message-ID: Again! Man, I've got to remember to send my emails to the list as well.... Begin forwarded message: From: Alexander Griekspoor Date: 18 November 2004 21:32:44 GMT+01:00 To: John Timmer Subject: Re: [Biococoa-dev] more ramblings > Anyway, to stir up more controversy around here, I had always > envisioned something along the following structure: > > Sequence bundle > (groups related sequences) >     | > Sequence wrapper > (holds features, notes, etc.) >     | > Sequence > Yes, my idea exactly! Imagine a multi-fasta file, wouldn't it be fantastic to initialize such a sequence bundle directly from it? Or write one out to disk in fasta format.... Also, alignments could be a perfect subclass of a sequence bundle object (one that additionally only has to store the interrelated positions)... Awesome!
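As a rough illustration of the bundle / wrapper / sequence layering and of initializing a bundle from a multi-fasta file, a sketch could look like the following. All class and method names (SketchFeature, SketchAnnotatedSequence, SketchSequenceBundle, bundleWithFastaString:) are made up for this example, a plain string stands in for the core sequence object, and the parsing is deliberately naive; it uses present-day Objective-C with ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical names throughout; none of these classes exist in BioCocoa.

// A feature: a bit of information plus the range it is relevant to.
@interface SketchFeature : NSObject
@property (nonatomic, copy) NSString *note;
@property (nonatomic) NSRange range;
@end

@implementation SketchFeature
@end

// Wrapper: holds the bare sequence together with its features and notes.
@interface SketchAnnotatedSequence : NSObject
@property (nonatomic, copy) NSString *name;
@property (nonatomic, strong) NSMutableString *symbols;   // stands in for the core sequence
@property (nonatomic, strong) NSMutableArray<SketchFeature *> *features;
@end

@implementation SketchAnnotatedSequence
- (instancetype)init {
    if ((self = [super init])) {
        _symbols = [NSMutableString string];
        _features = [NSMutableArray array];
    }
    return self;
}
@end

// Bundle: groups related sequences, e.g. the records of one multi-fasta file.
@interface SketchSequenceBundle : NSObject
@property (nonatomic, strong, readonly) NSMutableArray<SketchAnnotatedSequence *> *sequences;
+ (instancetype)bundleWithFastaString:(NSString *)fasta;
@end

@implementation SketchSequenceBundle

- (instancetype)init {
    if ((self = [super init])) {
        _sequences = [NSMutableArray array];
    }
    return self;
}

+ (instancetype)bundleWithFastaString:(NSString *)fasta {
    SketchSequenceBundle *bundle = [[self alloc] init];
    SketchAnnotatedSequence *current = nil;
    NSArray<NSString *> *lines =
        [fasta componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
    for (NSString *rawLine in lines) {
        NSString *line = [rawLine stringByTrimmingCharactersInSet:
                             [NSCharacterSet whitespaceCharacterSet]];
        if (line.length == 0) continue;
        if ([line hasPrefix:@">"]) {
            // A header line starts a new record in the bundle.
            current = [[SketchAnnotatedSequence alloc] init];
            current.name = [line substringFromIndex:1];
            [bundle.sequences addObject:current];
        } else {
            // Sequence data; ignored (message to nil) if no header was seen yet.
            [current.symbols appendString:line];
        }
    }
    return bundle;
}

@end

int main(void) {
    @autoreleasepool {
        NSString *fasta = @">seq1\nATGC\nGGTA\n>seq2\nMKVL\n";
        SketchSequenceBundle *bundle = [SketchSequenceBundle bundleWithFastaString:fasta];
        for (SketchAnnotatedSequence *seq in bundle.sequences) {
            NSLog(@"%@: %@", seq.name, seq.symbols);
        }
    }
    return 0;
}
```

An alignment could then be a bundle subclass that additionally stores how positions in the member sequences relate to each other.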
> The reason being that I see features as being abstractions, not > inherent to any type of sequence. Yep, absolutely agree. We have been a bit afraid to approach this problem in the past, but this should be the underlying idea to keep in mind. > They're mostly a bit of information and a range it's relevant to. > There are some exceptions to this - for example, a phosphorylation > site changes the MW of a protein - but they are largely exceptions. I already see some discussions rapidly popping up on the horizon (you should have stopped with the previous sentence when everything was still perfect ;-) > These exceptions are going to be difficult to handle regardless - how > to tell if a site is or isn't glycosylated is going to be very context > dependent. The majority of features (ORFs, kinase domains, > restriction sites, etc.) don't require that sort of heavy lifting. I guess I'm gonna reread some emails in which we discussed this previously; we have been talking about this. Alex From mek at mekentosj.com Thu Nov 18 17:06:45 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Thu, 18 Nov 2004 23:06:45 +0100 Subject: [Biococoa-dev] more ramblings In-Reply-To: References: Message-ID: <22FC2C7A-39AE-11D9-9BD4-000D93AE89A4@mekentosj.com> That's a very nice idea as well! We discussed storage a few times before as well, and one of the things I still see a future for is a biococoa native file format (to preserve all our "added value") which is represented as a bundle as well. Anyway, that will come later... Alex On 18 Nov 2004, at 22:01, John Timmer wrote: > Alex - was this supposed to go to the list at large? > > I agree about your thoughts on the bundle use. Another way I was > thinking - the bundle could contain a large genomic sequence, an > mRNA, sequences of each exon, the ORF, and the protein sequence. Each > of these sequences can be features of the other. IE - a feature of > the genomic sequence could have a pointer to the exon sequence in the > same bundle. A feature of the mRNA could point to the same exon. The > mRNA could have an ORF feature, which points to the protein sequence > that it encodes. Basically, we can define features in such a way that > they point to another sequence in the same bundle. > > Cheers, > > Jay > > > > Anyway, to stir up more controversy around here, I had always > envisioned something along the following structure: > > Sequence bundle > (groups related sequences) >     | > Sequence wrapper > (holds features, notes, etc.) >     | > Sequence > > > > Yes, my idea exactly! Imagine a multi-fasta file, wouldn't it be > fantastic to initialize such a sequence bundle directly from it?
Or > write one out to disk in fasta format.... Also, alignments could be a > perfect subclass of a sequence bundle object (one that only in > addition has to store the interrelated positions... Awesome! > > > The reason being that I see features as being abstractions, not > inherent to any type of sequence. > > Yep, absolutely agree. We have feared to approach this problem a bit > in the past, but this should be the underlying idea to keep in mind. > > > They're mostly a bit of information and a range it's relevant to. > There are some exceptions to this - for example, a phosphorylation > site changes the MW of a protein - but they are largely exceptions. > > I see some discussions already on the horizons rapidly popping up (you > should have stopped with the previous sentence when everything was > still perfect ;-) > > > These exceptions are going to be difficult to handle regardless - how > to tell if a site is or isn't glycosylated is going to be very context > dependent. The majority of features (ORFs, kinase domains, > restriction sites, etc.) don't require that sort of heavy lifting. > > I guess, I'm gonna read some emails in which we discussed this > previously, we have been talking about this. > > > Alex From kvddrift at earthlink.net Thu Nov 18 18:37:06 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Thu, 18 Nov 2004 18:37:06 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: References: Message-ID: Hi, Some short comments here - too busy to read all the recent postings. > Yeah, the more I look at BioJava's actual code, the less excited I > become about using their progress as a model. Have you ever tried to > trace through their process for translation? I never got to the point > where I could see anything actually related to an amino acid. It > calls through so many methods before it attempts to do anything that > it must take about a half an hour to accomplish anything I like their interface, however the implementation is a twisty little maze of passages. > > BioJava rant aside - I'm comfortable with the idea mentioned > somewhere in Alex's message of shifting the actual code for some of > the sequence manipulation/calculation into wrapper classes, but > providing call throughs to the methods in the sequence classes.
> Another alternative would be to have these methods attached as > categories on BCSequences. With either of these, you would get Koen's > code separation and I'd be happy about the more direct connection of > methods to data. I think we all agree on this approach then. I wouldn't use categories, though, we can leave it in BCSequence. So we put the sequence manipulations in wrapper classes and provide appropriate convenience methods in BCSequence or one of its subclasses. > Anyway, to stir up more controversy around here, I had always > envisioned something along the following structure: > > Sequence bundle > (groups related sequences) >     | > Sequence wrapper > (holds features, notes, etc.) >     | > Sequence Sorry, but I have no idea what you mean by the scheme above. - Koen. From kvddrift at earthlink.net Thu Nov 18 20:33:32 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Thu, 18 Nov 2004 20:33:32 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: <0883A056-3933-11D9-B26D-000D93AE89A4@mekentosj.com> References: <4EB065F1-3838-11D9-AFCC-003065A5FDCC@earthlink.net> <0883A056-3933-11D9-B26D-000D93AE89A4@mekentosj.com> Message-ID: <0617DAF4-39CB-11D9-970E-003065A5FDCC@earthlink.net> On Nov 18, 2004, at 2:25 AM, Alexander Griekspoor wrote: > Yes, but that's just pushing the problem ahead, and has a few more > consequences. For instance in the case of the fasta file, say we have > "AAAATTT" (worst case scenario I agree). Sure we can instantiate a > very general class for the sequence, but then which symbol do you pick > to fill it? The A for Alanine, or the A for Adenine? I hope not a "N" > or "Unknown". In the end, you MUST choose for which type to go, and if > you made that choice, then you can just as well set the BCSequence > type, or in our case pick the proper subclass. Unless I do not see the > better alternative.
But even if you could read a fasta file in an > untyped bcsequence with "untyped" symbols, what happens if you feed > this one to a "make_complement" wrapper? You get the same problem > again and again, what is the complement of an A symbol, either nothing > in the protein world (or perhaps a codon ;-) or a T (I know it doesn't > make sense to ask a protein for its complement, but as an example I > think it illustrates the problem well). You are absolutely right that it is a problem to create an untyped BCSequence; that's not what I was trying to say. My point was that readFasta cannot always know whether it is a protein or nucleotide sequence, so we let it just create a BCSequence. Even if it is clear what the sequence is, we should not have 2 different readFasta methods, one for proteins, and one for dna/rna. If we just create a BCSequence, the readFasta method will always work. Its only task IMO should be to parse the file (which should have a constant structure, independent of the sequence type, so it always works), extract the requested data, and pass it on to the class that actually creates a new BCSequence object. I think it is the responsibility of the user/caller to ask for either protein or dna or rna, by passing the right sequence type or symbol set. Just for fun try the following. I have added two test sequences to the translation demo. Now edit the controller class so it will read the test2 file (a protein). Then start the program, and hit translate. Tadaa ;) To prevent these sorts of situations, we just let the wrapper first test what the sequence type is, and either return the complement or nil/an error if it is a protein. - Koen.
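The type check described above could be sketched roughly as follows: a wrapper that is handed the symbols together with the sequence type the caller asked for, and that returns nil when a complement makes no sense. The names (SketchComplementTool, SketchSequenceType) are hypothetical and this is not the BCComplement implementation; passing unknown symbols through unchanged is likewise an assumption made only for this example, which uses present-day Objective-C with ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-ins for illustration; not the actual BioCocoa types.
typedef enum { SketchTypeDNA, SketchTypeRNA, SketchTypeProtein } SketchSequenceType;

@interface SketchComplementTool : NSObject
// Returns the complement for DNA or RNA, and nil for any other sequence type.
+ (NSString *)complementOfSymbols:(NSString *)symbols type:(SketchSequenceType)type;
@end

@implementation SketchComplementTool

+ (NSString *)complementOfSymbols:(NSString *)symbols type:(SketchSequenceType)type {
    NSDictionary<NSString *, NSString *> *pairs = nil;
    switch (type) {
        case SketchTypeDNA:
            pairs = @{@"A": @"T", @"T": @"A", @"G": @"C", @"C": @"G"};
            break;
        case SketchTypeRNA:
            pairs = @{@"A": @"U", @"U": @"A", @"G": @"C", @"C": @"G"};
            break;
        default:
            return nil;   // e.g. a protein: there is no complement to return
    }
    NSMutableString *result = [NSMutableString stringWithCapacity:symbols.length];
    for (NSUInteger i = 0; i < symbols.length; i++) {
        NSString *symbol = [symbols substringWithRange:NSMakeRange(i, 1)];
        // Assumption: symbols without a known partner are copied through unchanged.
        [result appendString:(pairs[symbol] ?: symbol)];
    }
    return result;
}

@end

int main(void) {
    @autoreleasepool {
        NSLog(@"DNA: %@", [SketchComplementTool complementOfSymbols:@"AAAATTT"
                                                               type:SketchTypeDNA]);
        NSLog(@"Protein: %@", [SketchComplementTool complementOfSymbols:@"AAAATTT"
                                                                   type:SketchTypeProtein]);
    }
    return 0;
}
```

Note that the same "AAAATTT" string gives a complement or nil depending only on the type the caller passed in, which is the point of leaving that decision to the caller rather than to readFasta.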
From kvddrift at earthlink.net Thu Nov 18 20:46:25 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Thu, 18 Nov 2004 20:46:25 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: <429780BD-39A0-11D9-9BD4-000D93AE89A4@mekentosj.com> References: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> <429780BD-39A0-11D9-9BD4-000D93AE89A4@mekentosj.com> Message-ID: > Again, although certainly a possibility, sometimes there are > alternatives just as good. For example, the thing that I immediately > thought after reading: >> An advantage is that when you ever decide to change the way a >> sequence is created, or introduce a new type of sequence, you only >> have to modify the code once in the factory, not in each readXXX >> method. > was "Why not do the type checking/bcsequence subclass creation in one > method inside the current implementation?" > We don't check the type in each readXXX method either, so one method > like "determineSequenceType" or "sequenceObjectForFile" would also > allow you to keep all "factory" methods centralized and easy to change > right? Right. See my other post, IMO a readXXX method should only parse the input, and then pass on the requested info to a class that creates a sequence. >> Please do so, but leave the BCFindSequence code as an alternative :) > Nope, let's choose! We should all agree on one way again IMHO, we can > provide convenience methods, but not two completely separate things > please. No, you're right. I vote for separate wrappers combined with convenience methods. > As you might have understood (boy I almost sound like Steve Ballmer > with his developers, developers, developers!!), I'm a great fan of > convenience methods (convenience, convenience, convenience!!). I hope for your colleagues' sake that you don't sweat that much ;-) > Please provide two methods, one detailed, the other simple and > convenient. > 1 myMethodDoesThis: withThisAsAnArgument: andThisAsAnArgument etc > 2 myMethodDoesThis (with default arguments).
> One of the things I absolutely love in the Cocoa frameworks. That's very nice indeed. > I think it all comes down to how to describe the border or guidelines > of when to choose for internal or external methods. My gut feeling > says, hardcoded properties, "one liner" calculations, and trivial > methods can be done internally. Also speed is an issue, if it takes > time to calculate things it's way nicer to provide a wrapper object > because that allows to go for threading, asynchronous methods, and > progress monitoring very elegantly. Things like length, is so easy to > calculate (a typical one-liner) that it would be ridiculous to have a > helper calculate that. I agree, but let's then focus on having these one-liners in BCSequence only, not in the subclasses. > I would say properties and representations inside, conversions, > calculations, and manipulations outside. Sounds good to me. > Right so a BCSequence should have a GC% method or MW method or > something alike for example right? So we add what the favorite thing > what everyone would love to see (in the superclass if it's a general > thing (in the case for MW, and not for GC%)): > [mySequence gcPercentage] or [mySequence molecularWeight] (purely > hypothetical). I'd say in the superclass only. We can have the wrappers test if it is the appropriate sequencetype. Guys, I'm glad the mailinglist is back alive. We're having some good and very useful discussions. - Koen. From mek at mekentosj.com Fri Nov 19 05:41:25 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Fri, 19 Nov 2004 11:41:25 +0100 Subject: [Biococoa-dev] more ramblings In-Reply-To: References: Message-ID: <9001A071-3A17-11D9-9BD4-000D93AE89A4@mekentosj.com> > I think we all agree on this approach then. I wouldn't use categories, > though, we can leave it in BCSequence. So we put the sequence > manipulations in wrapper classes and provide appropriate convenience > methods in BCSequence or one of its subclasses. 
Yes, categories are nice for developers to quickly add features to classes, but we shouldn't use them in a framework in my opinion. It hopelessly scatters methods and often ends up in one big mess.... > > >> Anyway, to stir up more controversy around here, I had always >> envisioned something along the following structure: >> >> Sequence bundle >> (groups related sequences) >> ????| >> Sequence wrapper >> (holds features, notes, etc.) >> ????| >> Sequence > > > > Sorry, but I have no idea what you mean by the scheme above. It's an idea about how to further organize and extend the sequence classes to a higher level. Especially with regards to annotations. The sequence class which we all know and love (haha) would be the core which contains the actual symbol sequence. Above that would be a wrapper that contains this sequence, but also the accompanying annotations, metadata and notes. Certainly when processing data you don't always need the annotations and this way you can keep things lightweight if needed. Above that you can envision the need for sequence groups or clusters, related sequences for instance (translations/conversions), alignments, genomes etc. So, these is not a class hierarchy rather a wrapper hierarchy (if that exist). How we practically implement and name this is another issue (and probably long debate ;-), but I think it's a nice working model... Hope I did makes clearer, and not even more fussy. Alex ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 AIM: mekentosj at mac.com E-mail: a.griekspoor at nki.nl Web: http://www.mekentosj.com Windows is a 32-bit patch to a 16-bit shell for an 8-bit operating system, written for a 4-bit processor by a 2- bit company without 1 bit of sense. 
********************************************************* From kvddrift at earthlink.net Fri Nov 19 23:03:04 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Fri, 19 Nov 2004 23:03:04 -0500 Subject: [Biococoa-dev] complement Message-ID: <1439C4A2-3AA9-11D9-A7A3-003065A5FDCC@earthlink.net> Hi, I have added a new class BCComplement (in BCTools/SequenceTools) and a convenience method in BCSequence to access it. I haven't removed the code in the subclasses yet, wanted your comments first. cheers, - Koen. From kvddrift at earthlink.net Sat Nov 20 19:56:36 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Sat, 20 Nov 2004 19:56:36 -0500 Subject: [Biococoa-dev] BCSymbolSet problem In-Reply-To: <7A4D5B70-3837-11D9-AFCC-003065A5FDCC@earthlink.net> References: <7A4D5B70-3837-11D9-AFCC-003065A5FDCC@earthlink.net> Message-ID: <31D64861-3B58-11D9-9685-003065A5FDCC@earthlink.net> On Nov 16, 2004, at 8:24 PM, Koen van der Drift wrote: > Hi, > > I am having some trouble getting to populate a BCSymbolSet. I am using > the code that was originally committed by Alex, and changed it so that > BCSymbolSet now is a subclass of NSMutableSet. To populate a set, I > put some code into the dnaStrictSymbolSet and initwithString methods. > I also added two lines in the translation demo. You need to uncomment > them to debug the symbolset code. Everytime I reach the line > addSymbol, the program raises an exception. > > I have no idea why this is happening, so if you see what I am doing > wrong, let me know! > > There is another problem with the current approach. initWithString is > now hardcoded to create a BCNucleotideDNA, but what if the method is > called for RNA or a protein? Not sure yet how to solve this. We might > need to take a different approach after all. 
>

The following solves the second problem:

+ (BCSymbolSet *)dnaStrictSymbolSet
{
    BCSymbolSet *symbolSet = [[BCSymbolSet alloc] init];
    NSMutableArray *symbolArray = [NSMutableArray array];

    [symbolArray addObject: [BCNucleotideDNA baseForSymbol: 'A']];
    [symbolArray addObject: [BCNucleotideDNA baseForSymbol: 'C']];
    [symbolArray addObject: [BCNucleotideDNA baseForSymbol: 'G']];
    [symbolArray addObject: [BCNucleotideDNA baseForSymbol: 'T']];

    [symbolSet addObjectsFromArray: symbolArray];

    return [symbolSet autorelease];
}

But I still crash when I hit the line

    [symbolSet addObjectsFromArray: symbolArray];

From the debugger I see this:

(gdb) po symbolSet

So even if I initialized the object, it causes an error. But it did do something during the initialization:

(gdb) po [symbolSet class]
BCSymbolSet
(gdb) po [symbolSet superclass]
NSMutableSet

A possible solution could be to not subclass NSMutableSet, but let BCSymbolSet have a member NSMutableSet.

Any ideas?

cheers,

- Koen.

From kvddrift at earthlink.net  Sat Nov 20 20:35:26 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat, 20 Nov 2004 20:35:26 -0500
Subject: [Biococoa-dev] BCSymbolSet problem
In-Reply-To: <31D64861-3B58-11D9-9685-003065A5FDCC@earthlink.net>
References: <7A4D5B70-3837-11D9-AFCC-003065A5FDCC@earthlink.net> <31D64861-3B58-11D9-9685-003065A5FDCC@earthlink.net>
Message-ID: <9EE8C21C-3B5D-11D9-9685-003065A5FDCC@earthlink.net>

On Nov 20, 2004, at 7:56 PM, Koen van der Drift wrote:

> A possible solution could be to not subclass NSMutableSet, but let
> BCSymbolSet have a member NSMutableSet.
>

Well, that turned out to be the solution! NSMutableSet is a part of a class cluster and it takes some additional work to subclass. So I chose to make the mutableset a member - works just fine. It's now in CVS with some symbol sets already populated. Feel free to add the right symbols to the other sets.

cheers,

- Koen.
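For readers following along, here is a minimal sketch of the "member instead of subclass" approach described above. It is not the code that went into CVS: the class name SymbolSetSketch and its method names are hypothetical, and the sketch assumes ARC. The background is that NSMutableSet is the abstract public class of a class cluster, so a plain subclass inherits no storage and has to override the primitive methods itself; wrapping a set avoids that work.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for BCSymbolSet. It owns an NSMutableSet instead of
// subclassing it, so the class cluster's own storage is used unchanged.
@interface SymbolSetSketch : NSObject {
    NSMutableSet *symbols;   // backing store for the members
}
- (void)addSymbol:(id)symbol;
- (void)addSymbolsFromArray:(NSArray *)array;
- (BOOL)containsSymbol:(id)symbol;
- (NSUInteger)count;
@end

@implementation SymbolSetSketch

- (id)init
{
    self = [super init];
    if (self != nil) {
        // Assumes ARC; under manual reference counting, release this in -dealloc.
        symbols = [[NSMutableSet alloc] init];
    }
    return self;
}

// Each method simply forwards to the wrapped set.
- (void)addSymbol:(id)symbol
{
    [symbols addObject:symbol];
}

- (void)addSymbolsFromArray:(NSArray *)array
{
    [symbols addObjectsFromArray:array];
}

- (BOOL)containsSymbol:(id)symbol
{
    return [symbols containsObject:symbol];
}

- (NSUInteger)count
{
    return [symbols count];
}

@end
```

Any object can be a member; in BioCocoa these would be the BCNucleotideDNA symbols returned by baseForSymbol:, while a quick experiment can use NSString objects. As with any set, adding an equal object twice leaves a single member.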
From kvddrift at earthlink.net Sat Nov 20 22:53:18 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Sat, 20 Nov 2004 22:53:18 -0500 Subject: [Biococoa-dev] finding characters in an NSString Message-ID: Hi, Just spend over an hour trying to figure out how to determine if an NSString contains one or more characters from an NSCharacterSet. Aaaarrrggghhhh. There is no such thing in the Foundation Kit (at least, I couldn't find it). How do I do this? cheers, - Koen. From kvddrift at earthlink.net Sun Nov 21 07:19:45 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Sun, 21 Nov 2004 07:19:45 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: <9001A071-3A17-11D9-9BD4-000D93AE89A4@mekentosj.com> References: <9001A071-3A17-11D9-9BD4-000D93AE89A4@mekentosj.com> Message-ID: On Nov 19, 2004, at 5:41 AM, Alexander Griekspoor wrote: > It's an idea about how to further organize and extend the sequence > classes to a higher level. Especially with regards to annotations. The > sequence class which we all know and love (haha) would be the core > which contains the actual symbol sequence. Above that would be a > wrapper that contains this sequence, but also the accompanying > annotations, metadata and notes. Certainly when processing data you > don't always need the annotations and this way you can keep things > lightweight if needed. Above that you can envision the need for > sequence groups or clusters, related sequences for instance > (translations/conversions), alignments, genomes etc. > So, these is not a class hierarchy rather a wrapper hierarchy (if that > exist). How we practically implement and name this is another issue > (and probably long debate ;-), but I think it's a nice working > model... > Hope I did makes clearer, and not even more fussy. > Thanks - that indeed sounds like a good plan. 
I was working a bit on readClustalFile and it would indeed make a lot of sense to keep the files together, including the line that indicates the alignment. Also so far readFile ignores all information but the sequence. Right now, readFile returns an NSArray with sequences from the same file, but that should be not too difficult change, maybe to an NSDictionary? Or are you guys talking about a complete wrapper class? cheers, - Koen. From james.balhoff at duke.edu Sun Nov 21 12:08:27 2004 From: james.balhoff at duke.edu (Jim Balhoff) Date: Sun, 21 Nov 2004 12:08:27 -0500 Subject: [Biococoa-dev] finding characters in an NSString In-Reply-To: References: Message-ID: On Nov 20, 2004, at 10:53 PM, Koen van der Drift wrote: > Hi, > > Just spend over an hour trying to figure out how to determine if an > NSString contains one or more characters from an NSCharacterSet. > Aaaarrrggghhhh. There is no such thing in the Foundation Kit (at > least, I couldn't find it). > > How do I do this? How about: NSCharacterSet *set = [NSCharacterSet characterSetWithCharactersInString:theString]; BOOL characterIsInString = [set characterIsMember:someUnichar]; I guess you would have to loop through the unichar's in your set. Or maybe [theString rangeOfCharacterFromSet:aSet] would be helpful. Just some guesses. Jim ____________________________________________ James P. Balhoff Dept. of Biology Duke University Durham, NC 27708-0338 USA From kvddrift at earthlink.net Sun Nov 21 12:43:30 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Sun, 21 Nov 2004 12:43:30 -0500 Subject: [Biococoa-dev] finding characters in an NSString In-Reply-To: References: Message-ID: > How about: > > NSCharacterSet *set = [NSCharacterSet > characterSetWithCharactersInString:theString]; > > BOOL characterIsInString = [set characterIsMember:someUnichar]; > > I guess you would have to loop through the unichar's in your set. > > Or maybe [theString rangeOfCharacterFromSet:aSet] would be helpful. 
> > Just some guesses. Hi Jim, Welcome back :) This seems to work: if ( [theString rangeOfCharacterFromSet: aSet].location != NSNotFound ) which will return YES if one or more of the characters are in the string. cheers, - Koen. From mek at mekentosj.com Thu Nov 25 17:04:31 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Thu, 25 Nov 2004 23:04:31 +0100 Subject: [Biococoa-dev] more ramblings In-Reply-To: <0617DAF4-39CB-11D9-970E-003065A5FDCC@earthlink.net> References: <4EB065F1-3838-11D9-AFCC-003065A5FDCC@earthlink.net> <0883A056-3933-11D9-B26D-000D93AE89A4@mekentosj.com> <0617DAF4-39CB-11D9-970E-003065A5FDCC@earthlink.net> Message-ID: Koen, >> Yes, but that's just pushing the problem ahead, and has a few more >> consequences. For instance in the case of the fasta file, say we have >> "AAAATTT" (worst case scenario I agree). Sure we can instantiate a >> very general class for the sequence, but then which symbol do you >> pick to fill it? The A for Alanine, or the A for Adenine? I hope not >> a "N" or "Unknown". In the end, you MUST choose for which type to go, >> and if you made that choice, then you can just as well set the >> BCSequence type, or in our case pick the proper subclass. Unless I do >> not see the better alternative. But even if you could read a fasta >> file in an untyped bcsequence with "untyped" symbols, what happens >> if you feed this one to a "make_complement" wrapper? You get the same >> problem again and again, what is the complement of an A symbol, >> either nothing in the protein world (or perhaps a codon ;-) or a T (I >> know it doesn't make sense to ask a protein for its complement, but >> as an example I think it illustrates the problem well). > > You are absolutely right that it is a problem to create an untyped > BCSequence, that's not what I was trying to say. My point was that > readFasta cannot always know if it is a protein or nucleotide > sequence, so we let it just create a BCSequence. 
Right, but what I try to make clear is that that is only shifting ahead the problem... The question is whether we want to make all methods compatible with untyped sequences as a consequence. I don't think so, but perhaps you guys think differently. > Even if it is clear what the sequence is, we should not have 2 > different readFasta methods, one for proteins, and one for dna/rna. Totally agree! But this should be possible with typed BCSequences as well. > If we just create a BCSequence, the readFasta method will always work. Sure, but I still haven't heard a solution of the most important problem. Those characters that have an equivalent BCSymbol in multiple types, like A (Alanine and Adenosine). You can only solve this problem if you also introduce untyped BCSymbols, but as you can't add MW's and other properties (because you don't know what it represents) to them, they are merely replacements for characters. Also, what in the world would you return if you feed such a thing to an object that calculates it molecular weight? Get the problems we will get ourselves into? > It's only task IMO should be to parse the file (which should have a > constant structure, independent of the sequence type, so it works > always), extract the requested data, and pass it on to the class that > actually creates a new BCSequence object. Hmm, ok, if you see it that way that's a possibility yes. Still it sounds more complicated than necessary. If you read a fasta file, you want a BCSequence (or a group of them) right? Why do it in two steps? I think there's plenty to distill in general METHODS within the sequenceIO class that all readXXX methods can use. It would keep things limited to one class though. > I think it is the responsibility of the user/caller to ask for either > protein or dna or rna, by passing the right sequence type or symbol > set. So, then you can just as well ask him to tell us right away, and instantiate the right BCSequence type immediately! 
> Cheers, Alex > ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 AIM: mekentosj at mac.com E-mail: a.griekspoor at nki.nl Web: http://www.mekentosj.com Claiming that the Macintosh is inferior to Windows because most people use Windows, is like saying that all other restaurants serve food that is inferior to McDonalds ********************************************************* From mek at mekentosj.com Thu Nov 25 17:13:15 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Thu, 25 Nov 2004 23:13:15 +0100 Subject: [Biococoa-dev] more ramblings In-Reply-To: References: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> <429780BD-39A0-11D9-9BD4-000D93AE89A4@mekentosj.com> Message-ID: <344EC4FA-3F2F-11D9-94C9-000D93AE89A4@mekentosj.com> Op 19-nov-04 om 2:46 heeft Koen van der Drift het volgende geschreven: > >> Again, although certainly a possibility, sometimes there are >> alternatives just as good. For example, the thing that I immediately >> thought after reading: >>> An advantage is that when you ever decide to change the way a >>> sequence is created, or introduce a new type of sequence, you only >>> have to modify the code once in the factory, not in each readXXX >>> method. >> was "Why not do the type checking/bcsequence subclass creation in one >> method inside the current implementation?" >> We don't check the type in each readXXX method either, so one method >> like "determineSequenceType" or "sequenceObjectForFile" would also >> allow you to keep all "factory" methods centralized and easy to >> change right? > > Right. See my other post, IMO a readXXX file should only parse the > input, and then pass on the requested info to a class that creates a > sequence. 
Yes, but again, I don't see why that would require different classes. The readXXXFile METHOD should indeed only parse the input, but ANOTHER METHOD in the same class could help return the proper sequence. An idea of methods to implement: - a general method to determine (guess) the filetype - a general method to determine (guess) the sequence type (protein, dna, rna) - methods that do the parsing - based on the sequence type determination - based on the type as a (user set) argument >> I think it all comes down to how to describe the border or guidelines >> of when to choose for internal or external methods. My gut feeling >> says, hardcoded properties, "one liner" calculations, and trivial >> methods can be done internally. Also speed is an issue, if it takes >> time to calculate things it's way nicer to provide a wrapper object >> because that allows to go for threading, asynchronous methods, and >> progress monitoring very elegantly. Things like length, is so easy to >> calculate (a typical one-liner) that it would be ridiculous to have a >> helper calculate that. > > I agree, but let's then focus on having these one-liners in BCSequence > only, not in the subclasses. Why? There will certainly be cases where you can have dna sequence characteristics/calculations in one line? Why not add them? > >> >> Right so a BCSequence should have a GC% method or MW method or >> something alike for example right? So we add what the favorite thing >> what everyone would love to see (in the superclass if it's a general >> thing (in the case for MW, and not for GC%)): >> [mySequence gcPercentage] or [mySequence molecularWeight] (purely >> hypothetical). > > I'd say in the superclass only. We can have the wrappers test if it is > the appropriate sequencetype. Same question here... > > Guys, I'm glad the mailinglist is back alive. We're having some good > and very useful discussions. Copy that! 
Alex ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 E-mail: a.griekspoor at nki.nl AIM: mekentosj at mac.com Web: http://www.mekentosj.com EnzymeX - To cut or not to cut http://www.mekentosj.com/enzymex ********************************************************* From mek at mekentosj.com Thu Nov 25 17:17:48 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Thu, 25 Nov 2004 23:17:48 +0100 Subject: Fwd: [Biococoa-dev] more ramblings Message-ID: Op 21-nov-04 om 13:19 heeft Koen van der Drift het volgende geschreven: > > On Nov 19, 2004, at 5:41 AM, Alexander Griekspoor wrote: > >> It's an idea about how to further organize and extend the sequence >> classes to a higher level. Especially with regards to annotations. >> The sequence class which we all know and love (haha) would be the >> core which contains the actual symbol sequence. Above that would be a >> wrapper that contains this sequence, but also the accompanying >> annotations, metadata and notes. Certainly when processing data you >> don't always need the annotations and this way you can keep things >> lightweight if needed. Above that you can envision the need for >> sequence groups or clusters, related sequences for instance >> (translations/conversions), alignments, genomes etc. >> So, these is not a class hierarchy rather a wrapper hierarchy (if >> that exist). How we practically implement and name this is another >> issue (and probably long debate ;-), but I think it's a nice working >> model... >> Hope I did makes clearer, and not even more fussy. >> > > Thanks - that indeed sounds like a good plan. 
I was working a bit on > readClustalFile and it would indeed make a lot of sense to keep the > files together, including the line that indicates the alignment. Also > so far readFile ignores all information but the sequence. Right now, > readFile returns an NSArray with sequences from the same file, but > that should be not too difficult change, maybe to an NSDictionary? Or > are you guys talking about a complete wrapper class? Yes, because it would not only be a simple container of multiple sequences, but would also contain "Metadata" about them, like annotations, notes, relation parameters etc... A dictionary would be a property of the wrapper class. But for the moment, if you return an array of sequences it should be easily converted to use the wrappers once they are there... Alex ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 E-mail: a.griekspoor at nki.nl AIM: mekentosj at mac.com Web: http://www.mekentosj.com EnzymeX - To cut or not to cut http://www.mekentosj.com/enzymex ********************************************************* ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 AIM: mekentosj at mac.com E-mail: a.griekspoor at nki.nl Web: http://www.mekentosj.com LabAssistant - Get your life organized! 
http://www.mekentosj.com/labassistant
*********************************************************

From kvddrift at earthlink.net  Fri Nov 26 08:16:03 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Fri, 26 Nov 2004 08:16:03 -0500
Subject: [Biococoa-dev] more ramblings
In-Reply-To: 
References: 
Message-ID: <5300705A-3FAD-11D9-AD65-003065A5FDCC@earthlink.net>

On Nov 25, 2004, at 5:17 PM, Alexander Griekspoor wrote:

>> Thanks - that indeed sounds like a good plan. I was working a bit on
>> readClustalFile and it would indeed make a lot of sense to keep the
>> files together, including the line that indicates the alignment. Also
>> so far readFile ignores all information but the sequence. Right now,
>> readFile returns an NSArray with sequences from the same file, but
>> that should be not too difficult change, maybe to an NSDictionary? Or
>> are you guys talking about a complete wrapper class?
>
> Yes, because it would not only be a simple container of multiple
> sequences, but would also contain "Metadata" about them, like
> annotations, notes, relation parameters etc... A dictionary would be a
> property of the wrapper class. But for the moment, if you return an
> array of sequences it should be easily converted to use the wrappers
> once they are there...
>

Right now, I am working on the clustal reading method. I could store the BCSequences in an NSArray, but then I lose the info about which sequence is which. So one thing I could do is make an NSDictionary with a key for the ID, and a BCSequence for the value. However, the 'ID' is different for each file format. Sometimes it is easier to use the file name (e.g. in SwissProt). Peter's original solution, by storing an NSArray in the NSDictionary that only contains the IDs for lookup, is also a possibility.

I would like to move this forward, so we need to come up with a definitive structure for the wrapper. John, Alex, you proposed this, do you have a more concrete idea of what it should look like?
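One possible reading of the dictionary idea, sketched under loud assumptions: an ordered array of IDs is kept next to an ID-to-sequence dictionary, so both the file order and the lookup by ID survive. The function name AlignmentTable and the keys @"order" and @"sequences" are placeholders invented for this sketch, and the values can be any objects (BCSequence instances in BioCocoa, plain strings in a quick test). It is an interim structure only, not the wrapper class under discussion.

```objc
#import <Foundation/Foundation.h>

// Build a lookup table from parallel arrays of IDs and sequences.
// The keys @"order" and @"sequences" are hypothetical names for this sketch.
static NSDictionary *AlignmentTable(NSArray *ids, NSArray *sequences)
{
    NSMutableDictionary *byID = [NSMutableDictionary dictionary];
    // Guard against arrays of unequal length by using the shorter one.
    NSUInteger i, count = MIN([ids count], [sequences count]);

    for (i = 0; i < count; i++) {
        [byID setObject:[sequences objectAtIndex:i]
                 forKey:[ids objectAtIndex:i]];
    }

    // @"order" preserves the order in the file; @"sequences" gives lookup by ID.
    return [NSDictionary dictionaryWithObjectsAndKeys:
                [ids subarrayWithRange:NSMakeRange(0, count)], @"order",
                byID, @"sequences",
                nil];
}
```

A caller would walk the @"order" array to present sequences as they appeared in the file and use the @"sequences" dictionary when it already knows the ID. Because the ID differs per file format, the reader method would still have to decide what to pass in as the ID.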
cheers, - Koen. From kvddrift at earthlink.net Fri Nov 26 08:39:18 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Fri, 26 Nov 2004 08:39:18 -0500 Subject: [Biococoa-dev] more ramblings In-Reply-To: <344EC4FA-3F2F-11D9-94C9-000D93AE89A4@mekentosj.com> References: <26A34B90-38EB-11D9-AFCC-003065A5FDCC@earthlink.net> <429780BD-39A0-11D9-9BD4-000D93AE89A4@mekentosj.com> <344EC4FA-3F2F-11D9-94C9-000D93AE89A4@mekentosj.com> Message-ID: <921348F6-3FB0-11D9-AD65-003065A5FDCC@earthlink.net> On Nov 25, 2004, at 5:13 PM, Alexander Griekspoor wrote: >> Right. See my other post, IMO a readXXX file should only parse the >> input, and then pass on the requested info to a class that creates a >> sequence. > > Yes, but again, I don't see why that would require different classes. > The readXXXFile METHOD should indeed only parse the input, but ANOTHER > METHOD in the same class could help return the proper sequence. An > idea of methods to implement: > - a general method to determine (guess) the filetype > - a general method to determine (guess) the sequence type (protein, > dna, rna) > - methods that do the parsing > - based on the sequence type determination > - based on the type as a (user set) argument Yes, it definitely should be a separate method. My point is that there are (or will be) more places in BioCocoa where a sequence is created. So then we should have the guess methods in that class too. If we use an intermediate factory class, *all* sequence creation code goes through one central location. Again, this is easier to maintain, allows to add different file/sequence types, etc. Another solution instead of the factory could be to have a separate guess-the-type class. *** (merging two emails here, for conveniece ;) *** >> You are absolutely right that it is a problem to create an untyped >> BCSequence, that's not what I was trying to say. 
My point was that >> readFasta cannot always know if it is a protein or nucleotide >> sequence, so we let it just create a BCSequence. > Right, but what I try to make clear is that that is only shifting > ahead the problem... The question is whether we want to make all > methods compatible with untyped sequences as a consequence. I don't > think so, but perhaps you guys think differently. Again, I don't want to create untyped sequences, sorry if I was unclear about that. My point was to have readFasta (and all the others) return a BCSequence so we don't have to hardcode the return type. But we should *always* add an identifier whether it is dna, protein, etc. This is where the guess-the-type code come is place. Come to think of it, symbolsets could be really useful here, and will cover almost any situation, except for the hypothetical AAAAAAAA or CCCCCCCC sequences. That is never solvable, and needs input from the user. >> If we just create a BCSequence, the readFasta method will always work. > Sure, but I still haven't heard a solution of the most important > problem. Those characters that have an equivalent BCSymbol in multiple > types, like A (Alanine and Adenosine). You can only solve this problem > if you also introduce untyped BCSymbols, but as you can't add MW's and > other properties (because you don't know what it represents) to them, > they are merely replacements for characters. Also, what in the world > would you return if you feed such a thing to an object that calculates > it molecular weight? Get the problems we will get ourselves into? I do, don't worry ;-) I still vote to use either a symbolset, or use BCSequenceType to differentiate. Once that is known, we know which subclass of BCSymbol to use, and the untyped BCSymbol problem as you describe above is non existent. 
>> It's only task IMO should be to parse the file (which should have a >> constant structure, independent of the sequence type, so it works >> always), extract the requested data, and pass it on to the class that >> actually creates a new BCSequence object. > Hmm, ok, if you see it that way that's a possibility yes. Still it > sounds more complicated than necessary. If you read a fasta file, you > want a BCSequence (or a group of them) right? Why do it in two steps? > I think there's plenty to distill in general METHODS within the > sequenceIO class that all readXXX methods can use. It would keep > things limited to one class though. See my argument above, readFile is not the only class that creates sequences > >> I think it is the responsibility of the user/caller to ask for either >> protein or dna or rna, by passing the right sequence type or symbol >> set. > So, then you can just as well ask him to tell us right away, and > instantiate the right BCSequence type immediately! Yes, that would be the first choice (asking the user, I mean) , but stillI would either use BCSequenceType or the appropriate symbol set, not subclass BCSequence. cheers, - Koen. From kvddrift at earthlink.net Sat Nov 27 09:13:09 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Sat, 27 Nov 2004 09:13:09 -0500 Subject: [Biococoa-dev] features and annotations Message-ID: <7716632A-407E-11D9-81AC-003065A5FDCC@earthlink.net> Hi, Just came across this page, could be interesting to adapt for BioCocoa: http://bioperl.org/HOWTOs/Feature-Annotation/index.html cheers, - Koen. From kvddrift at earthlink.net Sun Nov 28 19:04:08 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Sun, 28 Nov 2004 19:04:08 -0500 Subject: [Biococoa-dev] first non-whitespace character Message-ID: <31244BE5-419A-11D9-9F46-003065A5FDCC@earthlink.net> Hi, Anyone knows how to get the location in a string where the first non-whitespace character is at? 
I am trying to parse a clustal file (turn on monospaced font): ACT1_FUGRU -----------------------MEDEIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGR ACT2_FUGRU -----------------------MDDEIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGR ACT3_FUGRU -----------------------MEDEVASLVVDNGSGMCKAGFAGDDAPRAVFPSIVGR 5H1A_FUGRU MDLRATSSNDSNATSGYSDTAAVDWDEGENATGSGSLPDPELSYQIITSLFLGALILCSI 5H1B_FUGRU -------MEGTNNTTGWT-----HFDSTSNRTSKSFDEEVKLSYQVVTSFLLGALILCSI 5H1D_FUGRU -------MELDNNSLDYFSSN--FTDIPSNTTVAHWTEATLLGLQISVSVVLAIVTLATM * . . : : The first 6 lines are easy to parse. However, the last line which contains the alignment, starts at the same location as the other lines, not at the asterisk. So, I need to figure out where the first character after the name starts in the first lines. Once I have that number, all subsequent lines start at the same number. Unfortunately that number can vary for different clustal files. I already commited a readClustal method earlier today, so you can see what I have sofar. cheers, - Koen. From mek at mekentosj.com Mon Nov 29 08:12:43 2004 From: mek at mekentosj.com (Alexander Griekspoor) Date: Mon, 29 Nov 2004 14:12:43 +0100 Subject: [Biococoa-dev] first non-whitespace character In-Reply-To: <31244BE5-419A-11D9-9F46-003065A5FDCC@earthlink.net> References: <31244BE5-419A-11D9-9F46-003065A5FDCC@earthlink.net> Message-ID: <5A8A46E8-4208-11D9-A769-000D93AE89A4@mekentosj.com> Hi Koen, I believe NSString's rangeOfCharacterFromSet: should get you there (there are more other options if needed in two extended methods). Use [NSCharacterSet alphanumericCharacterSet] as the set, and I believe you should get the range of the first character that it will meet. Check out the other "default sets" for sets that might proove a better or faster (smaller) option. Alex Op 29-nov-04 om 1:04 heeft Koen van der Drift het volgende geschreven: > Hi, > > Anyone knows how to get the location in a string where the first > non-whitespace character is at? 
I am trying to parse a clustal file > (turn on monospaced font): > > > ACT1_FUGRU > -----------------------MEDEIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGR > ACT2_FUGRU > -----------------------MDDEIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGR > ACT3_FUGRU > -----------------------MEDEVASLVVDNGSGMCKAGFAGDDAPRAVFPSIVGR > 5H1A_FUGRU > MDLRATSSNDSNATSGYSDTAAVDWDEGENATGSGSLPDPELSYQIITSLFLGALILCSI > 5H1B_FUGRU > -------MEGTNNTTGWT-----HFDSTSNRTSKSFDEEVKLSYQVVTSFLLGALILCSI > 5H1D_FUGRU > -------MELDNNSLDYFSSN--FTDIPSNTTVAHWTEATLLGLQISVSVVLAIVTLATM > * . . : > : > > The first 6 lines are easy to parse. However, the last line which > contains the alignment, starts at the same location as the other > lines, not at the asterisk. So, I need to figure out where the first > character after the name starts in the first lines. Once I have that > number, all subsequent lines start at the same number. Unfortunately > that number can vary for different clustal files. I already commited a > readClustal method earlier today, so you can see what I have sofar. > > > cheers, > > - Koen. > > _______________________________________________ > Biococoa-dev mailing list > Biococoa-dev at bioinformatics.org > https://bioinformatics.org/mailman/listinfo/biococoa-dev > > ********************************************************* ** Alexander Griekspoor ** ********************************************************* The Netherlands Cancer Institute Department of Tumorbiology (H4) Plesmanlaan 121, 1066 CX, Amsterdam Tel: + 31 20 - 512 2023 Fax: + 31 20 - 512 2029 AIM: mekentosj at mac.com E-mail: a.griekspoor at nki.nl Web: http://www.mekentosj.com LabAssistant - Get your life organized! 
http://www.mekentosj.com/labassistant
*********************************************************

From jtimmer at bellatlantic.net Mon Nov 29 10:59:06 2004
From: jtimmer at bellatlantic.net (John Timmer)
Date: Mon, 29 Nov 2004 10:59:06 -0500
Subject: [Biococoa-dev] more ramblings
In-Reply-To: <921348F6-3FB0-11D9-AD65-003065A5FDCC@earthlink.net>
Message-ID:

Okay, I'm cutting out a ton of quotations, because I was beginning to
lose track of the discussion (I blame my cold for lack of focus ;).

There are a couple of sets of ramblings going on, which I'll try to
summarize and include my thoughts on -

The first is the issue of how to handle untyped sequence files. Koen
suggests that the method for each untyped file go through a factory
object that handles its lack of clarity, an idea which I like.

The question then becomes how to determine which type of sequence to
return. The way I would imagine is to have a flag to determine whether
to ask for user input - this could put up a standard dialog box. If the
flag is false, the factory method could create each type of possible
sequence, then use the sequence's counted set to look for undefined
symbols. Compare the results, and take the one with the fewest undefined
symbols. In case of a tie, default to DNA>RNA>protein.

Ramble #2 is about the sequence wrapper/bundle, and how to implement
that to handle the multiple sequences in an alignment file. I had
envisioned the wrapper as holding features, and a bundle as linking
related sequences. If this is the way we go, we'd have to implement both
in order to handle this circumstance.

A short summary of how I expected a bundle to work -
Each wrapper would have a unique bundle ID, and a reference to its
bundle. Features within the wrapper could include a bundle ID.
Basically, if code wanted to look at a feature, it would check to make
sure that the bundle reference was not nil - if it was not nil, it would
take the feature's bundle ID, and ask the bundle for the sequence
corresponding to that ID. Given that a feature should have an NSRange,
this would allow the two sequences to be aligned.

For an alignment, I guess we'd have to define a key sequence, which
would be the root level - all other sequences would have to be features
of this sequence. Otherwise, it seems like coding it would be very
complex - though maybe someone else could see a better way.

The last issue seems to be around the quote from Koen:
> I agree, but let's then focus on having these one-liners in BCSequence
> only, not in the subclasses.
I remember this quote as bothering me when I first read it, because
there are some one-liners that clearly belong in a specific sequence
subclass (i.e., finding the longest open reading frame should not be
available to a protein sequence, and finding the hydrophobicity should
not be available to nucleotides or codons). I seem to remember that
reading further alleviated my concerns on this, but I can't remember
how. Since Alex and I share this concern, could you clarify what you
meant here, Koen?
I think that's everything.

Cheers,

JT

_______________________________________________
This mind intentionally left blank

From kvddrift at earthlink.net Mon Nov 29 20:00:38 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Mon, 29 Nov 2004 20:00:38 -0500
Subject: [Biococoa-dev] first non-whitespace character
In-Reply-To: <5A8A46E8-4208-11D9-A769-000D93AE89A4@mekentosj.com>
References: <31244BE5-419A-11D9-9F46-003065A5FDCC@earthlink.net> <5A8A46E8-4208-11D9-A769-000D93AE89A4@mekentosj.com>
Message-ID: <3FA0D0E1-426B-11D9-A1B6-003065A5FDCC@earthlink.net>

On Nov 29, 2004, at 8:12 AM, Alexander Griekspoor wrote:

>
> I believe NSString's rangeOfCharacterFromSet: should get you there
> (there are further options, if needed, in two extended methods). Use
> [NSCharacterSet alphanumericCharacterSet] as the set, and I believe
> you should get the range of the first character in the string that
> matches it. Check out the other "default sets" for sets that might
> prove a better or faster (smaller) option.
>

Of course! I was looking at NSScanner, but didn't see anything there
that would work.

thanks,

- Koen.

From kvddrift at earthlink.net Mon Nov 29 20:41:24 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Mon, 29 Nov 2004 20:41:24 -0500
Subject: [Biococoa-dev] more ramblings
In-Reply-To:
References:
Message-ID:

On Nov 29, 2004, at 10:59 AM, John Timmer wrote:

> Okay, I'm cutting out a ton of quotations, because I was beginning to
> lose track of the discussion (I blame my cold for lack of focus ;).

Eat some more turkey to cure your cold ;-)

> The first is the issue of how to handle untyped sequence files. Koen
> suggests that the method for each untyped file go through a factory
> object that handles its lack of clarity, an idea which I like.

Actually, what I suggested is to have a factory that handles the
creation of *every* sequence.
You feed the factory with a string, array, etc., and a BCSequenceType
and/or BCSymbolSet, and the factory returns the right BCSequence. If the
type is not specified, then the guess code comes into action. The point
I am trying to make is that we should always use BCSequence as a *return
type* for the factory as well as within BCSequenceReader. Otherwise we
need a factory class for each BCSequence subclass. Internally the
factory creates the right subclass, and even though the return type is
BCSequence, the actual type will be the created subclass. That's the
nice thing about inheritance!

So maybe:

    BCSequenceFactory *myFactory = [[BCSequenceFactory alloc] init];
    BCSequence *newSequence = [myFactory createSequenceUsingString: @"AACCTTGG"
                                                         usingType: BCDNASequence];

    - (BCSequence *) createSequenceUsingString: (NSString *) string
                                     usingType: (BCSequenceType) type
    {
        switch (type) {
            case BCDNASequence: {
                // the declared return type is BCSequence, the object is a BCSequenceDNA
                return [BCSequenceDNA DNASequenceWithString: string];
                break;
            }
            .....

and so on. Note that in the snippet I am actually using BCSequenceDNA
;-). If you guys really want it, it's fine with me if we keep those
around for convenience. But I still think that we should put most code
in BCSequence, except maybe for the init methods. Because we are using a
sequence type or symbol set, we know that the sequence is using the
right type of symbols. So there is also no need to do type checking,
such as in setSequenceArray and other methods.

> The question then becomes how to determine which type of sequence to
> return. The way I would imagine is to have a flag to determine whether
> to ask for user input - this could put up a standard dialog box. If
> the flag is false, the factory method could create each type of
> possible sequence, then use the sequence's counted set to look for
> undefined symbols. Compare the results, and take the one with the
> fewest undefined symbols. In case of a tie, default to
> DNA>RNA>protein.

Sounds good, this code can also go in the factory class.
However, I don't think we should use a dialog box in the framework. That
is the sole responsibility of the developer who uses BioCocoa.

>
> Ramble #2 is about the sequence wrapper/bundle, and how to implement
> that to handle the multiple sequences in an alignment file. I had
> envisioned the wrapper as holding features, and a bundle as linking
> related sequences. If this is the way we go, we'd have to implement
> both in order to handle this circumstance.
>
> A short summary of how I expected a bundle to work -
> Each wrapper would have a unique bundle ID, and a reference to its
> bundle. Features within the wrapper could include a bundle ID.
> Basically, if code wanted to look at a feature, it would check to make
> sure that the bundle reference was not nil - if it was not nil, it
> would take the feature's bundle ID, and ask the bundle for the
> sequence corresponding to that ID. Given that a feature should have an
> NSRange, this would allow the two sequences to be aligned.

Could you show a more concrete interface? It's still kinda vague to me
:(

>
> The last issue seems to be around the quote from Koen:
>> I agree, but let's then focus on having these one-liners in BCSequence
>> only, not in the subclasses.
> I remember this quote as bothering me when I first read it, because
> there are some one-liners that clearly belong in a specific sequence
> subclass (i.e., finding the longest open reading frame should not be
> available to a protein sequence, and finding the hydrophobicity should
> not be available to nucleotides or codons). I seem to remember that
> reading further alleviated my concerns on this, but I can't remember
> how. Since Alex and I share this concern, could you clarify what you
> meant here, Koen?

If we add code to a wrapper that checks the type of sequence, then I
don't see any problem.
If the sequence type by accident is the wrong one (which I really don't
think is going to happen), the wrapper should return nil, return an
error, or post an NSNotification.

Hope that's more clear.

cheers,

- Koen.

From kvddrift at earthlink.net Tue Nov 30 20:32:05 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue, 30 Nov 2004 20:32:05 -0500
Subject: [Biococoa-dev] BCSequenceFactory
Message-ID:

Hi,

I added a new class BCSequenceFactory (in BCTools/BCSequenceTools). For
now it can create DNA, RNA and proteins from a string, but the other
methods should be fairly easy to fill in. I have not yet added the
"guess-the-type" code. To get an idea of how to use it, I have added the
factory code in the readSwissProt file.

cheers,

- Koen.

From kvddrift at earthlink.net Tue Nov 30 20:49:24 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue, 30 Nov 2004 20:49:24 -0500
Subject: [Biococoa-dev] first non-whitespace character
In-Reply-To: <5A8A46E8-4208-11D9-A769-000D93AE89A4@mekentosj.com>
References: <31244BE5-419A-11D9-9F46-003065A5FDCC@earthlink.net> <5A8A46E8-4208-11D9-A769-000D93AE89A4@mekentosj.com>
Message-ID: <3A4EB628-433B-11D9-AE33-003065A5FDCC@earthlink.net>

On Nov 29, 2004, at 8:12 AM, Alexander Griekspoor wrote:

> [NSCharacterSet alphanumericCharacterSet] as the set

It took some trial and error, but eventually I came up with
[[NSCharacterSet whitespaceCharacterSet] invertedSet], which matches
everything that's not whitespace. Another solution of course could be to
use the union of all BCSymbolSets. However, not all have been filled in
so far, so right now I cannot use that.

cheers,

- Koen.
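
For reference, the inverted-whitespace approach described above could be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the readClustal code in CVS: the helper name
BCFirstSequenceColumn is hypothetical, and it assumes every name line
has the form NAME, then a run of whitespace, then the sequence data. It
first finds the end of the name with the whitespace set, then searches
from there with the inverted set to find where the sequence begins.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of BioCocoa: returns the column at which
// the sequence data starts in a Clustal name line ("NAME   SEQUENCE"),
// or NSNotFound if the line has no whitespace gap or nothing after it.
static NSUInteger BCFirstSequenceColumn(NSString *line)
{
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];
    NSCharacterSet *nonWhitespace = [whitespace invertedSet];

    // The name ends at the first whitespace character in the line.
    NSRange gap = [line rangeOfCharacterFromSet:whitespace];
    if (gap.location == NSNotFound) {
        return NSNotFound;
    }

    // The sequence starts at the first non-whitespace character after the gap.
    NSRange rest = NSMakeRange(gap.location, [line length] - gap.location);
    NSRange start = [line rangeOfCharacterFromSet:nonWhitespace
                                          options:0
                                            range:rest];
    return start.location;
}
```

The column found on the first name line could then be reused for the
consensus line, for example with substringFromIndex:, after checking
that the consensus line is at least that long.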