[Pipet Devel] python data structure

Tue Jan 12 07:02:02 EST 1999

Konrad Hinsen wrote:

> - Strings are compact and benefit from a large range of string operations
>   (in module "string"). However, elements can only be characters,
>   and strings are immutable, i.e. cannot be changed once created.
>   So any modification requires constructing a new string. But being
>   immutable can be an advantage as well, e.g. you can use strings as
>   keys in dictionaries.

What are the limits on string sizes in Python (too lazy to look it up right
now)?  If it is 256, as with some languages, I imagine this presents a little
problem.  String immutability also makes sequence manipulation a bit awkward.
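
A minimal sketch of the string points above, assuming a current Python 3
interpreter.  The sequence and the two-entry codon table are made-up
illustrations, not project data.  It shows that item assignment fails, that an
"edit" means building a new string, that strings work as dictionary keys, and
that length is bounded by memory rather than by a fixed 256-character cap.

```python
# Strings are immutable: item assignment raises TypeError.
seq = "ATGCCGTA"
try:
    seq[2] = "A"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# "Modify" position 2 by slicing and concatenating into a new string.
mutated = seq[:2] + "A" + seq[3:]

# Immutability makes strings hashable, so they can serve as dictionary keys.
codon_table = {"ATG": "Met", "CCG": "Pro"}  # tiny illustrative subset
first_residue = codon_table[seq[:3]]

# No 256-character limit: a one-million-character string is fine.
long_seq = "ACGT" * 250000
```

For sequences that need many in-place edits, a mutable container such as a list
or bytearray may be less awkward than rebuilding strings.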

> - Arrays don't seem to be very useful for non-numerical data, with two
>   exceptions: they can most easily be accessed from C modules, and
>   they facilitate certain structural operations.

I have used arrays of characters in the past.  Using parallel arrays can be a
convenient way to index or "mark up" sequences, i.e. the second array can be
used to indicate where features start and stop.
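
The parallel-array idea could be sketched roughly as follows, using the
standard "array" module.  The feature codes START and STOP, the helper
features(), and the example sequence are all hypothetical names chosen for
illustration.

```python
from array import array

seq = array("B", b"ATGAAACCCTAGGG")   # bases stored as byte codes
marks = array("B", [0] * len(seq))    # parallel array, 0 = no feature

START, STOP = 1, 2   # hypothetical feature codes
marks[0] = START     # feature begins at the A of ATG
marks[11] = STOP     # feature ends at the G of TAG

def features(seq, marks):
    """Return (start, stop, bases) for each marked START..STOP span."""
    found = []
    start = None
    for i, mark in enumerate(marks):
        if mark == START:
            start = i
        elif mark == STOP and start is not None:
            bases = seq[start:i + 1].tobytes().decode("ascii")
            found.append((start, i, bases))
            start = None
    return found

spans = features(seq, marks)
```

Because both arrays share the same indices, the markup stays aligned with the
sequence only as long as insertions and deletions are applied to both.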

Another thought: Many analysis programs are limited by having to put everything
into RAM, all in one shot.  I tend to prefer keeping the sequence file open and
reading in chunks at a time.  BTW, some simple database features of Python allow
you to keep and work from a data structure stored as a file, correct?
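
A rough sketch of both ideas, assuming plain-text sequence files.  The function
count_bases(), the chunk size, and the file names are hypothetical.  The first
part reads a file one chunk at a time; the second uses the standard "shelve"
module, which stores a dictionary-like object on disk.

```python
import os
import shelve
import tempfile
from collections import Counter

def count_bases(path, chunk_size=64 * 1024):
    """Tally characters in a sequence file, holding one chunk in RAM at a time."""
    counts = Counter()
    with open(path, "r") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            counts.update(chunk.replace("\n", ""))
    return counts

# Usage on a small throwaway file, with a deliberately tiny chunk size.
with tempfile.NamedTemporaryFile("w", suffix=".seq", delete=False) as tmp:
    tmp.write("ACGT\n" * 1000)
    demo_path = tmp.name
demo_counts = count_bases(demo_path, chunk_size=37)
os.remove(demo_path)

# A shelf keeps a dictionary-like structure in a file between runs.
with tempfile.TemporaryDirectory() as tmpdir:
    db_path = os.path.join(tmpdir, "features")
    with shelve.open(db_path) as db:
        db["gene1"] = (0, 11)      # hypothetical feature coordinates
    with shelve.open(db_path) as db:
        stored = db["gene1"]
```

Operations that need context across a chunk boundary (motif searches, for
instance) would have to carry over an overlap between chunks, which this sketch
does not attempt.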

On the same note, system resources are growing enough that they can handle large
sequences in RAM.  But on the other hand, the sequencing projects are turning
out larger sequence files.  The human genome will be one of the largest
sequences (how big? 100 Gb?), and I think the frog genome is several times
larger (go figure).  Seriously, imagine (because this will be hot stuff in a
few years) that someone using Loci/Tulip will want to manipulate parts of the
human genome the way they can now with BioWish and E. coli.

> 
> In terms of performance, there is not so much difference for basic
> operations (creation, indexing, etc.). The main concern should be to use
> as many built-in operations as possible for typical manipulations;
> any piece of Python code is much slower than a simple call to a
> built-in function implemented in C! So the first thing to do is to
> find out which operations are to be performed on nucleotide sequences,
> and which of them occur most frequently.
> 

Right, and just because I keep harping on Python doesn't mean we can't turn to
compiled C when we really need it... and we may, with sequences ranging in the
millions and billions (I sound like Carl Sagan).
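
To illustrate Konrad's point about leaning on built-ins, here is a small
sketch, assuming DNA written with the uppercase letters A, C, G and T only.
The function names are hypothetical.  Both versions compute a reverse
complement; the first pushes the per-character work into str.translate, which
is implemented in C, while the second does the same work in a Python-level
loop and would typically be slower on long sequences.

```python
# Translation table built once and reused for every call.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp_builtin(seq):
    """Reverse complement using built-in string operations."""
    return seq.translate(COMPLEMENT)[::-1]

def revcomp_loop(seq):
    """Same result, but with the per-base work done in Python code."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    out = []
    for base in reversed(seq):
        out.append(pairs[base])
    return "".join(out)

def gc_count(seq):
    """Count G and C bases with the built-in str.count."""
    return seq.count("G") + seq.count("C")
```

Timing both functions on realistic sequence lengths (with the "timeit" module,
say) would be the way to decide whether dropping down to C is worth it for a
given operation.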

Jeff
-- 
J.W. Bizzaro                  Phone: 617-552-3905
Boston College                mailto:bizzaro at bc.edu
Department of Chemistry       http://www.uml.edu/Dept/Chem/Bizzaro/
--