[Pipet Devel] python data structure

Konrad Hinsen hinsen at cnrs-orleans.fr
Tue Jan 12 10:07:19 EST 1999


> What are the limits on string sizes in Python (too lazy to look it up right
> now)?  If it is 256, as with some languages, I imagine this presents a little
> problem.  String immutabilty does also make sequence manipulation a bit awkward.

The length of a string must fit into an int variable. So in practice
you shouldn't rely on having strings larger than 2**31 character if
you want your program to be portable. In other words, there is no
serious limitations.

> I have used arrays of characters in the past.  Using parallel arrays can be a
> covenient way to index or "markup" sequences, i.e. the second array can be used
> to indicated where features start and stop.

But you could also use two lists for that, or lists of lists,
depending on requirements. Of course there is nothing wrong with
character arrays, except that you give up many useful string
operations.

> Another thought: Many analysis programs are limited by having to put
> everything into RAM, all in one shot. I tend to prefer keeping the
> sequence file open and reading in chunks at a time. BTW, some simple
> database features of Python allow you to keep and work from a data
> structure stored as a file, correct?

I don't see what you refer to. Python's file handling works much like
C's stdio library; you can read arbitrary parts out of a file. There
are also database interfaces (dbm and variants), which make it easy to
store data in large files, but these are special-format files that are
hardly useable with general programs like editors.

Assuming a modern OS, you can also use memory mapping for large files,
but I am not sure that we can already afford to ignore OS without
memory mapping support.

> Right, and just because I keep harping Python, doesn't mean we can't turn to
> compiled C when we really need it...and we may with sequences ranging in the
> millions and billions (I sound like Carl Sagan).

Of course. But even in that case, all that has to be implemented in C
is one rather small module.

Example: suppose we use strings for nucleotide sequences now, and then
find out next year that we must be able to treat sequences that are
longer than available memory. Then we'll just write a small C module
that implements a special "nucleotide sequence" type. This can look
like a drop-in replacement for strings to Python, and all that will
have to be changed in the Python code is the place where nucleotide
sequence types are created. There are some advantages to a language
without static type checking!

Konrad.
-- 
-------------------------------------------------------------------------------
Konrad Hinsen                            | E-Mail: hinsen at cnrs-orleans.fr
Centre de Biophysique Moleculaire (CNRS) | Tel.: +33-2.38.25.55.69
Rue Charles Sadron                       | Fax:  +33-2.38.63.15.17
45071 Orleans Cedex 2                    | Deutsch/Esperanto/English/
France                                   | Nederlands/Francais
-------------------------------------------------------------------------------



More information about the Pipet-Devel mailing list