Konrad Hinsen wrote:
> - Strings are compact and benefit from a large range of string operations
>   (in module "string"). However, elements can only be characters,
>   and strings are immutable, i.e. cannot be changed once created.
>   So any modification requires constructing a new string. But being
>   immutable can be an advantage as well, e.g. you can use strings as
>   keys in dictionaries.

What are the limits on string sizes in Python (too lazy to look it up right
now)? If it is 256, as with some languages, I imagine this presents a little
problem. String immutability also makes sequence manipulation a bit awkward.

> - Arrays don't seem to be very useful for non-numerical data, with two
>   exceptions: they can most easily be accessed from C modules, and
>   they facilitate certain structural operations.

I have used arrays of characters in the past. Using parallel arrays can be a
convenient way to index or "markup" sequences, i.e. the second array can be
used to indicate where features start and stop.

Another thought: Many analysis programs are limited by having to put
everything into RAM, all in one shot. I tend to prefer keeping the sequence
file open and reading in chunks at a time. BTW, some simple database features
of Python allow you to keep and work from a data structure stored as a file,
correct?

On the same note, system resources are growing enough that they can handle
large sequences in RAM. But on the other hand, the sequencing projects are
turning out larger sequence files. The human genome will be one of the largest
sequences (how big? 100 Gb?), and I think the frog genome is several times
larger (go figure). Imagine, seriously because this will be hot stuff in a few
years, that someone using Loci/Tulip will want to manipulate parts of the human
genome like they can now with BioWish and E. coli.

>
> In terms of performance, there is not so much difference for basic
> operations (creation, indexing, etc.).
> The main concern should be to use
> as many built-in operations as possible for typical manipulations;
> any piece of Python code is much slower than a simple call to a
> built-in function implemented in C! So the first thing to do is to
> find out which operations are to be performed on nucleotide sequences,
> and which of them occur most frequently.
>

Right, and just because I keep harping on Python doesn't mean we can't turn to
compiled C when we really need it... and we may, with sequences ranging in the
millions and billions (I sound like Carl Sagan).

Jeff

--
J.W. Bizzaro                  Phone: 617-552-3905
Boston College                mailto:bizzaro at bc.edu
Department of Chemistry       http://www.uml.edu/Dept/Chem/Bizzaro/
--
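
P.S. Here is a rough sketch of the "keep the file open and read chunks" idea, combined with Konrad's advice to lean on built-in operations. It assumes the sequence is plain text containing only base letters and newlines. The names iter_chunks and count_bases are made up for illustration and are not part of any existing module. The per-chunk counting is done by the built-in string method count(), so the Python-level loop runs once per chunk rather than once per base.

```python
from collections import Counter


def iter_chunks(handle, chunk_size=1 << 20):
    """Yield successive pieces of an open text file, at most chunk_size characters each."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # an empty string means end of file
            break
        yield chunk


def count_bases(handle, chunk_size=1 << 20, alphabet="ACGT"):
    """Tally each base in `alphabet` without holding the whole sequence in memory.

    Assumption: the file is plain sequence text; anything outside `alphabet`
    (newlines, ambiguity codes) is simply not counted.
    """
    totals = Counter()
    for chunk in iter_chunks(handle, chunk_size):
        chunk = chunk.upper()
        for base in alphabet:
            # str.count is implemented in C, so the per-character work
            # never passes through the Python interpreter loop.
            totals[base] += chunk.count(base)
    return totals


if __name__ == "__main__":
    import io

    # Stand-in for a real sequence file; a tiny chunk size forces several reads.
    demo = io.StringIO("ACGTACGTAA\nGGCC\n")
    print(count_bases(demo, chunk_size=4))
```

Single-base counts are unaffected by where the chunk boundaries fall. Searching for longer motifs this way would need the chunks to overlap by the motif length minus one, which this sketch does not attempt.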