[Pipet Devel] Tree Data Structure
J.W. Bizzaro
bizzaro at bc.edu
Tue Mar 23 17:43:47 EST 1999
Thomas,
The B+Tree module for Python (bplustree.py), written by Aaron Watters, is
attached.
I'm not sure if this is Berkeley DB version that Carlos was referring to, but it
is not a binding. It's all Python.
Carlos, have you seen this module before? Is it any good?
There appears to be no license for this module other than this:
This code is provided for arbitrary use, but without warrantee of
any kind. At present it seems to work, but I'll call it an beta
until it's better tested.
Jeff
bizzaro at bc.edu
Carlos Maltzahn wrote:
>
> I agree with Konrad. If you are using the tree data structure for fast
> access to a large amount of data, use the B-tree portion of bsddb
> (www.sleepycat.com). Python has bindings to the (old) 1.85 version
> (somebody might have swigged the newer versions, too). It's faster than
> any Python program and you get persistency for free.
>
> Carlos
>
> On Tue, 23 Mar 1999, Konrad Hinsen wrote:
>
> Thomas.Sicheritz at molbio.uu.se writes:
>
> > What is the best way to implement a tree data structure in pyhton ?
>
> That depends on the operations that are to be performed on the
> data!
>
> Konrad.
> --
> -------------------------------------------------------------------------------
> Konrad Hinsen | E-Mail: hinsen at cnrs-orleans.fr
> Centre de Biophysique Moleculaire (CNRS) | Tel.: +33-2.38.25.55.69
> Rue Charles Sadron | Fax: +33-2.38.63.15.17
> 45071 Orleans Cedex 2 | Deutsch/Esperanto/English/
> France | Nederlands/Francais
> -------------------------------------------------------------------------------
>
--
J.W. Bizzaro Phone: 617-552-3905
Boston College mailto:bizzaro at bc.edu
Department of Chemistry http://www.uml.edu/Dept/Chem/Bizzaro/
--
-------------- next part --------------
"""
B+tree implementation.
======================
B+ trees are an efficient index structure for mapping
a dictionary type object into a disk file. All keys for
these dictionary structures are strings with a fixed
maximum length. The values can be strings or
integers (often representing seek positions in a secondary
file) depending on the implementation.
B+ trees can be useful for storing large mappings on disk
in such a way that a small number of keys/values can be
retrieved very quickly (with very few disk accesses).
B+ trees can also be useful for sorting a very large number
(millions) of records by unique string key values.
In this implementation all keys must
not exceed the maximum length for a
given tree. For string values there is no limitation on
size of content. Note that in my tests updates are
2-3 times slower than retrieves, except for walking
which is much faster than normal retrieves.
As an add-on this module also provides a dbm compatible
interface that permits arbitrary length keys and values.
See below.
Provided here are several implementations:
BplusTree():
defines a mapping from strings to integers.
caching_BPT():
subclass of BplusTree that caches key,value
pairs already seen. This one cannot be updated.
Construct a compatible index file using BplusTree
and for read only access that touches a manageable
number of keys, reopen the file using caching_BPT.
SBplusTree():
defines a mapping from strings to strings.
Updatable, but overwrites or deletions will
leave "unreachable garbage" in the "value space"
of the index file. Use recopy_sbplus() to
recopy the file, eliminating the garbage.
caching_SBPT():
analogous to caching_BPT, but mapping to strings.
File creation:
==============
To create an index file do the following:
file = open(filename, "w+b")
B = SBplusTree(file, seek_position, nodesize, keymax)
B.startup()
where seek_position is the seek_position where to "start"
the tree (usually the start of file, 0), nodesize is the
number of keys to keep at each node of the tree (pick an
even number between 2 and 255), and keymax is the maximum
size for the string keys in the mapping.
When choosing nodesize remember that larger nodesizes
make Python do more work and the file system do less work.
I think 212 is probably a pretty good number. Of course
choose keymax to be as large as you will need. A too large
key size, however, may waste considerable space in the file.
Now that you have a tree you can populate it with values
just like a dictionary.
B["this"] = "that"
B["willy"] = "wonka"
x = B["this"]
del B["this"]
print len(B)
...
f.close()
The supported dictionary operations are indexed retrieval
B[k], indexed assignment B[k] = v, key deletion del B[k] and
length len(B). Retrieval and deletion will raise KeyError
on absent key. Assignment will raise ValueError if the key
is too large.
B.keys(), B.values(), B.items() are not directly
supported, but see "Walking" below.
Note that the "basic" B-plus tree implementations only accept and
return integers as values. The SB-plus implementation will
accept anything as values, but will use the str(x) function
to convert them to a string before storing the value in the
file. The value returned will always be the string value
stored. IE
B["okeydoke"] = 23
print `B["okeydoke"]`
prints "'23'", with the quotes. The controlling
application must control the
serialization/deserialization of values if it needs to store
something other than strings.
Read only file access:
======================
Once an index file exists it can be re-opened in "read only"
mode.
f = open(filename, "rb")
B = caching_SBPT(f)
B.open()
print B["willy"]
Note that the configuration parameters for the tree are
determined from a "file header". Note however that a file
written to store integers using BplusTree should not be opened
for strings using SBplusTree or undefined and undesirable
behaviour will result. Opening an SBplusTree as a BplusTree
is not advisable either.
If the seek position for the start of the tree is anything
other than 0, it must be specified:
B = caching_SBPT(f, position)
or undefined behaviour will result.
In this mode, retrieval and walking are permitted, but attempts
to modify the structure will cause an exception. In this mode the
programmer may prefer to use the "caching" versions if they expect
to retrieve the same keys many times and if the number of keys to
touch is not huge (say, in the millions).
Re-open for modification:
=========================
An existing index file can also be reopened for modification.
f = open(filename, "r+b")
B = SBplusTree(f)
B.open()
B["this"] = "is fun!"
...
f.close()
Again, modifications are disallowed for cached trees.
Walking:
========
One of the neat features of B-plus trees is that they keep
their keys in sorted order. Hence it is easy and efficient
to retrieve the keys/values sorted by the keys, and also to
do range queries.
To support this feature the tree implementations provide
a "walker" interface.
walker = tree.walker(lowerkey, includelower,
upperkey, includeupper)
while walker.valid:
print (walker.current_key(), walker.current_value())
walker.next()
walker.first()
Or to traverse all pairs in key-sorted order
walker = tree.walker()
while walker.valid:
print (walker.current_key(), walker.current_value())
walker.next()
walker.first()
The lowerkey/upperkey parameters indicate where to start/end
walking (interpreted as the beginning/end if they are
omitted or set to None) and includelower indicates whether
to include the lower value if it is present in the tree,
if not the next greater key will be the start position.
For example to walk from key "m" (or just past it if absent)
to the end:
w = tree.walker("m", 1)
or to walk between "mzzz" and "nzzz" not inclusive:
w = tree.walker("mzzz", 0, "nzzz", 0)
or walk from the beginning to "m", not inclusive
w = tree.walker(None, None, "m", 0)
Here w.current_key() and w.current_value() retrieve the current
key and value respectively, w.next() moves to the next pair, if there is one
and w.valid indicates whether there is a current pair, and
w.first() resets the walker to the first pair, if there is one.
At initialization the walker is already at the first pair, if
it exists.
Multiaccess optimizations:
==========================
To make updates and retrievals run faster you can enable/disable
a tree-global least-recently-used fifo mechanism which reduces
reads and writes, but be *sure* to disable it before closing any
BTree file that has been modified, or the tree may well become
corrupt
try:
B.enable_fifo()
do_updates(B)
finally:
B.disable_fifo()
The fifo may also improve performance for read only access,
but it is not important to disable the mechanism later.
The optimizations help most when key accesses are localized.
(ie, a bunch of inserts with keys starting "abc..."
or 10000 inserts in [almost] key-sorted order).
For only one access, it's no help at all! The fifo mechanism
will not help for walking, so don't do it if you will only walk
a portion of the tree once. You might want to try putting
various values as the optional argument to enable_fifo, eg,
B.enable_fifo(1000) (but that's probably past the diminishing returns
point...). Large fifos will consume lots of "core" memory.
Trash compacting
================
The functions recopy_bplus(f1, f2) and recopy_sbplus(f1, f2)
recopy open "rb" file f1 to (open "w+b")
file f2 for BplusTrees and SBplusTrees respectively. The
copy f2 will have no "garbage" and almost all leaf nodes will be
full. This can result in reducing file size by about 1/3.
Both files must have headers at seek 0 and hold nothing but
the tree nodes and tree data. Also look at recopy_tree(t1, t2).
DBM compatibility
=================
As an application of SBplusTree this module also provides
a plug-compatible implementation of the standard python dbm
style functionality, except that the "mode" parameter is not
supported on initialization. See the Python Lib manual entry
on dbm. Both keys and values may be of *arbitrary* length in
this case, but keys are not kept in key-sorted order and
overwrites and key collisions will result in unused garbage
in the file (keys and values occur as SBplustree "values"
using a PORTABLE bucket hashing scheme).
d = dbm(filename, flag)
creates a dictionary like structure with d[key]=value, x=d[key],
d.has_key(key), del d[key], len(d), and d.keys(). Also
after any modification be sure that d gets explicitly
closed d.close() or the file *may* become corrupt.
Also, d.copy(otherfilename, "c") will create a more
compact copy of d in another file with garbage discarded.
The dbm implementation uses a very large fifo, so many accesses
may consume a lot of "core" memory.
DBM comparison
==============
An alternative to this module is gdbm or dbm for file
indexing -- both supported by available Python extension
modules.
Expect dbm to be generally faster than this module, but
remember:
- dbm doesn't do key-sorted walking.
- dbm often isn't portable across machines.
- dbm isn't written in Python (ie, requires an extension
module).
- dbm sometimes doesn't allow arbitrary value lengths
(but gdbm allows arbitrary length keys and values...)
whereas this module does/is. I don't know precisely how
much faster dbm is, but for some types of use it may turn
out to actually be slower, for all I know. Please let
me know! Probably the most compelling advantage is that
the index files generated by this module are portable across
platforms.
Fun
===
For fun or debugging try tree.dump().
There is also a test suite for the module at the
bottom (test() and retest()) which create a test index
called "test" in the current directory. Also testdbm().
Caveats:
========
NOTE: only the standard string ordering is supported for
walking at present. This could be fixed...
WARNING: Never modify a tree while it is being walked. Always
recreate all walkers after a tree modification.
NEVER open the same tree for modification twice!
ALWAYS make sure a modified tree has disabled the fifo and
the file has been closed before reopening the tree.
WARNING: This implementation has no support for concurrent
modification. It is designed for "write once by one process",
"read many by (possibly) several processes, but not with
concurrent modification."
WARNING: If during modification any exception other than a KeyError/ValueError
is not caught, the indexed file structure *may* become corrupt (because
some operations completed and others didn't). Walking all values
of an index or B.dump() may detect some corrupt states (***Note I should write
a sanity-check routine***)
WARNING: As noted above an overwrite or delete for a SBTree (mapping
to strings) will leave unreachable junk in the "value space" of
the index. See above.
This code is provided for arbitrary use, but without warrantee of
any kind. At present it seems to work, but I'll call it an beta
until it's better tested.
Aaron Watters, arw at pythonpros.com
http://starship.skyport.net/crew/aaron_watters
http://www.pythonpros.com
"""
import string
nilseek = -1
from marshal import dumps
sequence_overhead = len(dumps(""))
intsize = len(dumps(1))
# bisect algorithm with bounds (in 1.5 this is in /Lib)
# Insert item x in list a, and keep it sorted assuming a is sorted
def insort(a, x, lo=0, hi=None):
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)/2
if x < a[mid]: hi = mid
else: lo = mid+1
a.insert(lo, x)
# Find the index where to insert item x in list a, assuming a is sorted
def bisect(a, x, lo=0, hi=None):
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)/2
if x < a[mid]: hi = mid
else: lo = mid+1
return lo
NOROOMERROR = "NOROOMERROR"
Rootflag = 1
Interiorflag = 2
Freeflag = 3
Leafflag = 4
LeafandRootflag = 5
Leafflags = (Leafflag, LeafandRootflag)
Interiorflags = (Interiorflag, Rootflag)
class Node_Fifo:
"""fifo of nodes for locality access optimization"""
def __init__(self, size=30):
self.fifo = [] # fifo of active nodes, if used.
self.fifosize = size
self.fifo_dict = {}
def flush_fifo(self):
for node in self.fifo:
if node.dirty:
node.store(1)
self.fifo = []
self.fifo_dict = {}
class Node:
"""B+ tree node.
follows Silberchatz & Korth database intro book closely.
Each node has a number self.validkeys> 1 of valid keys (except for
a tree with only 0 or 1 entries. For leaves each
self.key[i] that is valid is associated with int value
self.indices[i]
For nonleaves nextnode integer reference is at
self.indices[i+1] and
self.indices[0]
is for entries with keys<self.keys[0]
for leaves self.indices[self.size] is "pointer" to
next sequential leaf.
"""
# for update optimization
dirty = 0
fifo = None
def __init__(self, flag, size, keylen, position, infile, cloner=None):
self.flag = flag # one of Rootflag...
self.size = size # num of pointers from this Node
#if size>255: raise ValueError, "size too large: "+`size`
if size<0: #or size%2==1:
raise ValueError, "size must be positive <= 255"
self.position = position # seek position in file
self.infile = infile # open file for storage
self.keylen = keylen # maximum key length (no nulls!)
# seek pointers for descendents (root/interior)
# all but last is a value for a leaf, last is successor seek
self.indices = [-1] * (size+1)
# key storage
# for leaves value for key[i] is at indices[i]
# for others keys[i] is at indices[i+1],
# indices[0] points to keys preceding keys[0].
# for freelist nodes, nodes are stored on
# linked list with indices[0] forward
self.keys = [""] * size
# linearized storage length in file
#self.intstorage = intsize * (size+1)
#self.keystorage = keylen * size
# in debug mode the seek position is prepended
#if debug:
# self.intstorage = self.intstorage + intsize
#self.storage = (2 + # flag, valid
# self.intstorage + self.keystorage)
if cloner is None:
self.storage = (sequence_overhead + # list overhead
2*intsize + # flag, valid
(size+1)*intsize + # indices
size*(sequence_overhead + keylen) # keys
)
else:
self.storage = cloner.storage
self.fifo = cloner.fifo
# note, for interior nodes
# validkey of 0 means one valid pointer, -1 means none
# for leaves validkeys should be positive
if flag in [Interiorflag, Rootflag]:
self.validkeys = -1 # number of valid entries
else:
self.validkeys = 0
def clear(self):
# reinitialize keys, indices for self.
size = self.size
self.keys = [""] * size
self.validkeys = 0
if self.flag in Interiorflags:
# reinit all indices
self.indices = [-1] * (size+1)
self.validkeys = -1
else:
# don't clobber forward pointer
self.indices[:size] = [-1] * size
# interior node operation.
def putnode(self, key, node):
"""place a node for key into self. Raise NOROOMERROR if no room."""
from types import StringType
if type(key)!=StringType:
raise TypeError, "bad key "+`key`
position = node.position
self.putposition(key, position)
def putfirstindex(self, index):
#print "putfirstindex", index
if self.validkeys>=0:
raise ValueError, "putfirstindex on full node"
self.indices[0] = index
self.validkeys = 0
def putposition(self, key, position):
#print "putposition", (key, position), self.indices, self.keys
if self.flag not in Interiorflags:
raise ValueError, "cannot insert into leaf node"
validkeys = self.validkeys
last = validkeys + 1
if self.validkeys>=self.size: raise NOROOMERROR, "no room"
# store the key
if validkeys<0: # no nodes currently
#print "no keys"
self.validkeys = 0
self.indices[0] = position
else:
# yes nodes
keys = self.keys
# is the key there already?
if key in keys:
if keys.index(key)<validkeys:
raise ValueError, "reinsert of node for existing key"
place = bisect(keys, key, 0, validkeys)
keys.insert(place, key)
del keys[last]
# store the index
indices = self.indices
#print "inserting", position, "before", indices
indices.insert( place+1, position)
del indices[last+1]
#print "after", indices
self.validkeys = last
def delnode(self, key):
"""delete node for key, (key==None means "start" node)
key must match exactly."""
if self.flag not in Interiorflags:
raise ValueError, "cannot delete node from leaf node"
if self.validkeys<0: raise KeyError, "no such key (empty)"
validkeys = self.validkeys
indices = self.indices
keys = self.keys
#print "delnode before", key, keys, indices, validkeys
if key is None:
# delete first node (shouldn't happen?
place = 0
indexplace = 0
else:
# delete non-first node
place = keys.index(key)
indexplace = place+1
del indices[indexplace]
indices.append(nilseek)
del keys[place]
keys.append("")
#keys[validkeys-1] = ""
#print "delnode after", keys, indices
self.validkeys = validkeys-1
def get_keys(self):
"""return a list of valid keys for self."""
validkeys = self.validkeys
if validkeys<=0: return []
else: return self.keys[0:validkeys]
def keys_indices(self, leftmost):
"""return [(leftmost, firstindex), (nodekey, nodeindex), ...]"""
keys = self.get_keys()
if self.flag in Interiorflags:
# nonleaf, must add leftmost to complete keys
keys = [leftmost] + keys
indices = self.indices[:len(keys)]
# return pairing
return map(None, keys, indices)
def getnode(self, key):
"""get node that exactly matches key (None for first)"""
if self.flag not in Interiorflags:
raise ValueError, "cannot getnode from leaf node"
if key is None: index = 0
else: index = self.keys.index(key) + 1
place = self.indices[index]
if place<0: raise IndexError, "invalid position! "+`(place, key)`
# short-circuit optimization: check fifo
fifo = self.fifo
if fifo:
ff = fifo.fifo
fd = fifo.fifo_dict
if fd.has_key(place):
node = fd[place]
ff.remove(node)
ff.insert(0, node)
return node
node = self.clone(place)
node = node.materialize()
return node
# leaf mode operations
def next(self):
"""get next node from self in linear leaf sequence, or return None."""
if self.flag not in Leafflags:
raise ValueError, "cannot get next for non-leaf."
place = self.indices[self.size]
if place == nilseek: return None
else:
node = self.clone(place)
node = node.materialize()
return node
def putvalue(self, key, value):
"""put key->value mapping into leaf node.
"""
from types import StringType, IntType
if type(key)!=StringType and type(value)!=IntType:
raise ValueError, "bad key, value"+ `(key,value)`
if self.flag not in Leafflags:
raise ValueError, "cannot get next for non-leaf."
validkeys = self.validkeys
indices = self.indices
keys = self.keys
if validkeys<=0: # empty
# "first entry", (key, value)
indices[0] = value
keys[0] = key
self.validkeys = 1
else:
place=None
if key in keys:
place = keys.index(key)
if place>=validkeys: place=None
if place is not None:
keys[place] = key
indices[place] = value
else:
if validkeys>=self.size:
#print "node out of room"
#for x in self.__dict__.items(): print x
raise NOROOMERROR, "no room"
place = bisect(keys, key, 0, validkeys)
#print "next entry at", place
#next = place+1
last = validkeys+1
del keys[validkeys]
del indices[validkeys]
keys.insert(place, key)
indices.insert(place, value)
self.validkeys = last
def put_all_values(self, keys_indices):
"""optimization for node restructuring."""
self.clear()
indices = self.indices
keys = self.keys
length = self.validkeys = len(keys_indices)
if length>self.size:
raise IndexError, "bad length "+`length`
#if length<self.size/2-1: # not valid for delete (?)
# raise IndexError, "not enough keys"+`length`
for i in xrange(length):
(keys[i], indices[i]) = keys_indices[i]
def put_all_positions(self, first_position, keys_positions):
"""optimization for restructuring."""
self.clear()
indices = self.indices
keys = self.keys
length = self.validkeys = len(keys_positions)
if length>self.size:
raise IndexError, "bad length "+`length`
#if length<self.size/2: # not valid for delete (?)
# raise IndexError, "not enough keys"+`length`
indices[0] = first_position
for i in xrange(length):
(keys[i], indices[i+1]) = keys_positions[i]
def delvalue(self,key):
keys = self.keys
indices = self.indices
if key not in keys:
raise KeyError, "missing key, can't delete"
place = keys.index(key)
validkeys = self.validkeys
#next = place + 1
prev = validkeys -1
#keys[place:prev] = keys[next:validkeys]
#indices[place:prev] = indices[next:validkeys]
del keys[place]
del indices[place]
keys.insert(prev, "")
indices.insert(prev, nilseek)
self.validkeys = validkeys-1
#keys[prev] = ""
#indices[prev] = nilseek
def getvalue(self, key):
"""get value exactly matching key."""
try:
place = self.keys.index(key)
except ValueError:
raise KeyError, "key not found: " + `key`
else:
return self.indices[place]
def newneighbor(self, position):
"""make a new leaf adjacent to self"""
if self.flag not in Leafflags:
raise ValueError, "cannot make leaf neighbor for non-leaf."
neighbor = self.clone(position)
size = self.size
indices = self.indices
neighbor.indices[size] = indices[size]
indices[size] = position
return neighbor
def nextneighbor(self):
"""return next leaf in tree, or None."""
if self.flag not in Leafflags:
raise ValueError, "cannot get leaf neighbor for non-leaf."
size = self.size
position = self.indices[size]
if position==nilseek:
return None
else:
neighbor = self.clone(position)
neighbor = neighbor.materialize()
return neighbor
def delnext(self, next, free):
#print "delnext"
#print self.indices, self.position
#print next.indices, next.position
size = self.size
if self.indices[size]!=next.position:
raise ValueError, "invalid next pointer"
self.indices[size] = next.indices[size]
return next.free(free)
# free list mode operations
def free(self, freenodeposition):
"""add self to free list, return position as new
free position."""
self.flag = Freeflag
self.indices[0] = freenodeposition
self.store()
return self.position
def unfree(self, flag):
"""Assuming self is head of free list,
pop self off freelist, return next free elt position
DOES NOT STORE.
"""
next = self.indices[0]
self.flag = flag
self.validkeys = 0
self.indices[0] = nilseek
self.clear()
return next
def clone(self, position):
"""return a Node of same shape as self."""
if self.fifo:
dict = self.fifo.fifo_dict
if dict.has_key(position):
return dict[position]
return Node(self.flag, self.size, self.keylen,
position, self.infile, self)
def getfreenode(self, freeposition, freenode_callback=None):
"""get free node of same shape as self from self.file
make one if none exists. Assume freeposition is
seek position of next free node.
returns (node, newfreeposition)
if freenode_callback is specified, it is a function to call
with a new free list head, if needed freenode_callback(int)
"""
file = self.infile
if freeposition==nilseek:
# add at last position in file
#save = file.tell()
file.seek(0, 2) # goto eof
position = file.tell()
thenode = self.clone(position)
thenode.store() # write new record
#file.seek(save)
return (thenode, nilseek)
else:
# get node at position
position = freeposition
thenode = self.clone(position)
thenode = thenode.materialize() # get old node
next = thenode.indices[0]
if freenode_callback is not None:
freenode_callback(next)
thenode.__init__(self.flag, self.size,
self.keylen, position, self.infile)
thenode.store() # save reinitialized node
return (thenode, next)
def materialize(self):
"""read self from file."""
#print "materialize", self.position
position = self.position
if self.fifo:
fifo = self.fifo
# look in fifo
dict = fifo.fifo_dict
ff = fifo.fifo
if dict.has_key(position):
#print "using fifo", position
node = dict[position]
if node is not ff[0]:
ff.remove(node)
ff.insert(0, node)
#if len(ff)!=len(dict): raise "whoops"
return node
f = self.infile
#f.flush() # ? needed ?
#save = f.tell()
f.seek(position)
data = f.read(self.storage)
self.delinearize(data)
#f.seek(save) # go back
if self.fifo:
self.add_to_fifo()
return self
def add_to_fifo(self):
fifo = self.fifo
ff = fifo.fifo
dict = fifo.fifo_dict
#if len(dict)!=len(ff): raise "whoops before"
position = self.position
if dict.has_key(position):
old = dict[position]
del dict[position]
ff.remove(old)
dict[self.position] = self
#if self in ff: ff.remove(self)
ff.insert(0, self)
if len(ff)>self.fifo.fifosize:
last = ff[-1]
del ff[-1]
del dict[last.position]
#print "storing", last.position
if last.dirty:
last.store(1)
#if len(dict)!=len(fifo): raise "whoops"
def enable_fifo(self, size = 33):
"you better disable it later!"
if size<5 or size>1000000:
raise ValueError, "size not valid: "+`size`
self.fifo = Node_Fifo(size)
def disable_fifo(self):
#print "disabling fifo", self.fifo_dict.keys()
#global fifo_on
if self.fifo:
self.fifo.flush_fifo()
self.fifo = None
def store(self, force=0):
"""write self to file at self.position
return end of record seek position."""
#print "store", self.position
position = self.position
fifo = self.fifo
if not force and fifo:
fd = fifo.fifo_dict
if fd.has_key(self.position) and fd[position] is self:
self.dirty = 1
return # defer
f = self.infile
#save = f.tell()
f.seek(position)
data = self.linearize()
f.write(data)
last = f.tell()
#f.seek(save)
self.dirty = 0
if not force and self.fifo:
self.add_to_fifo()
return last
def linearize(self):
"""create record format for self."""
from marshal import dumps
all = [self.flag, self.validkeys] + self.indices + self.keys
s = dumps(all)
ls = len(s)
storage = self.storage
if (ls > storage):
raise ValueError, "bad storage: " + `s`
s = s + "X" * (storage-ls)
return s
#indices = self.indices
# in debug prepend seek position
#if debug: indices = [self.position] + indices
#ints = encodeints(indices)
#keys = encodestrs(self.keys, self.keylen)
#validkeys = self.validkeys
#if validkeys<0: v = "*" # dummy purposes only (prewrites)
#else: v = chr(self.validkeys ^ CMASK) # try to make v readable
#return "%s%s%s%s%s" % (self.flag, v, ints, keys, SEPARATOR)
__print__ = linearize
def delinearize(self, str):
"""parse, store from record format from self."""
from marshal import loads
all = loads(str)
[self.flag, self.validkeys] = all[:2]
#self.flag = chr(ordflag)
s = self.size
next = 2+s+1
indices = self.indices = all[2:next]
keys = self.keys = all[next:]
if len(keys) != s:
raise ValueError, "bad keys: " + `keys` + `len(keys)`
def dump(self, indent=""):
flag = self.flag
if flag==Freeflag:
print 'free->', self.position,
nextp = self.indices[0]
if nextp!=nilseek:
next = self.clone(nextp)
next = next.materialize()
next.dump()
else:
print "!last"
return
nextindent = indent + " "
print indent,
if flag == Rootflag: print "root",
elif flag == Interiorflag: print "interior",
elif flag == Leafflag: print "leaf",
elif flag == LeafandRootflag: print "root and leaf",
else: print "invalid flag???", flag,
print self.position, "valid=", self.validkeys
print indent, "keys", self.keys
print indent, "seeks", self.indices
if flag in [Rootflag, Interiorflag]:
# interior
for i in self.indices:
if i != nilseek:
n = self.clone(i)
n = n.materialize()
n.dump(nextindent)
else:
# leaf
pass
print indent, "*****"
class BplusTree:
"""Basic B+tree maps fixed length strings to integers
(could be seek positions)"""
length = None # fill in later
dirty = 0 # default
# length keylen, nodesize, root_seek, free
header_format = "%10d %10d %10d %10d %10d\n"
def __init__(self, infile, position=None, nodesize=None, keylen=None):
"""infile should be open file in "rb" or "w+b" mode.
if optional args are not given they are determined
from first line in file.
"""
#print "BPlusTree(%s, %s, %s)" % (position, nodesize, keylen)
if keylen is not None and keylen<=2:
raise ValueError, "keylen must be greater than 2"
self.root_seek = nilseek # dummy
self.free = nilseek
self.root = None
self.file = infile
self.nodesize = nodesize
self.keylen = keylen
if position is None:
position = 0
self.position = position
#if nodesize is None:
# self.get_parameters()
def walker(self,
keylower=None, includelower=None,
keyupper=None, includeupper=None):
return BplusWalker(self, keylower, includelower,
keyupper, includeupper)
def init_params(self):
return (self.file, self.position, self.nodesize, self.keylen)
def getfile(self):
return self.file
def getroot(self):
return self.root
def update_freelist(self, position):
if self.free!= position:
self.free = position
self.reset_header()
def startup(self):
"""startup the file, write header, set root"""
if self.nodesize is None or self.keylen is None:
raise ValueError, \
"cannot initialize without nodesize, keylen specified"
self.length = 0
self.reset_header()
file = self.file
file.seek(0,2) # goto eof
self.root_seek = file.tell()
self.reset_header()
root = self.root = Node(LeafandRootflag, self.nodesize, self.keylen,
self.root_seek, file)
root.store()
def open(self):
"""get info on existing file."""
file = self.file
self.get_parameters()
self.root = Node(LeafandRootflag, self.nodesize, self.keylen,
self.root_seek, file)
self.root = self.root.materialize()
fifo_enabled = 0
def enable_fifo(self,size=33):
#print "fifo enabled"
self.fifo_enabled = 1
self.root.enable_fifo(size)
def disable_fifo(self):
#print "fifo disabled"
self.fifo_enabled = 0
if self.dirty:
self.reset_header()
self.dirty = 0
self.root.disable_fifo()
def reset_header(self):
"""reset the header of the file"""
if self.fifo_enabled:
self.dirty = 1
return # defer
file = self.file
file.seek(self.position)
#file.write( self.header_format %
# (self.length, self.keylen, self.nodesize, self.root_seek, self.free) )
from marshal import dump
dump( (self.length, self.keylen, self.nodesize, self.root_seek, self.free),
file)
def get_parameters(self):
file = self.file
#save = file.tell()
file.seek(self.position)
from marshal import load
temp = load(file)
#print temp, self.position
(self.length, self.keylen, self.nodesize, self.root_seek, self.free)=\
temp
#file.seek(save)
def __len__(self):
if self.length is None:
self.get_parameters()
return self.length
def __getitem__(self, key):
"""self[key] -- get item associated with key"""
if self.root is None: raise ValueError, "not open!"
return self.find(key, self.root)
def has_key(self, key):
try:
test = self[key]
except KeyError:
return 0
else:
return 1
def __setitem__(self, key, value):
"""self[key]=value -- set map for key to value"""
from types import StringType, IntType
if type(key)!=StringType: raise ValueError, "key must be string"
if type(value)!=IntType: raise ValueError, "value must be int"
if len(key)>self.keylen: raise ValueError, "key too long"
if value<0: raise ValueError, "value must be positive"
current_length = self.length
#if FORBIDDEN in key:
# raise ValueError, "key cannot contain "+`FORBIDDEN`
root = self.root
if root is None: raise ValueError, "not open!"
#global test1 #debug
test1 = self.set(key, value, self.root)
# do we need to split root?
if test1 is not None:
#print "splitting root", `test1`
(leftmost, node) = test1
#print "leftmost", leftmost, node
# make a non-leaf root
(newroot, self.free) = root.getfreenode(self.free)
newroot.flag = Rootflag
if root.flag is LeafandRootflag:
root.flag = Leafflag
else:
root.flag = Interiorflag
newroot.clear()
newroot.putfirstindex(root.position)
newroot.putnode(leftmost, node)
self.root = newroot
self.root_seek = newroot.position
newroot.store()
root.store()
self.reset_header()
else:
if self.length!=current_length:
self.reset_header()
def __delitem__(self, key):
"""del self[key] -- remove map for key to value"""
root = self.root
currentlength = self.length
self.remove(key, root)
if root.flag==Rootflag:
validkeys = root.validkeys
if validkeys<1:
if validkeys<0:
raise ValueError, "invalid empty non-leaf root"
newroot = self.root = root.getnode(None)
self.root_seek = newroot.position
self.free = root.free(self.free)
self.reset_header()
if newroot.flag==Leafflag:
newroot.flag = LeafandRootflag
else:
newroot.flag = Rootflag
newroot.store()
elif self.length!=currentlength:
self.reset_header()
elif root.flag!=LeafandRootflag:
raise ValueError, "invalid flag for root"
elif self.length!=currentlength:
self.reset_header()
def set(self, key, value, node):
"""insert key-->value starting at node.
return None if no split, else return
(leftmostkey, newnode)
"""
keys = node.keys
validkeys = node.validkeys
if node.flag in Interiorflags:
# non leaf
# find the descendent to insert in
place = bisect(keys, key, 0, validkeys)
#print place, key, validkeys, keys
if place>=validkeys or keys[place]>=key:
# insert at previous node
index = place
else:
# index at node
index = place+1
if index==0: nodekey=None
else: nodekey=keys[place-1]
#print "nodekey", nodekey, node.indices
nextnode = node.getnode(nodekey)
test = self.set(key, value, nextnode)
# split?
if test is not None:
(leftmost, insertnode) = test
try:
# insert if room
node.putnode(leftmost, insertnode)
except NOROOMERROR:
# no room, split
insertindex = insertnode.position
(newnode, self.free) = node.getfreenode(
self.free, self.update_freelist)
newnode.flag = Interiorflag
ki = node.keys_indices("dummy")
(dummy, firstindex) = ki[0]
# remove dummy
ki = ki[1:]
# insert new pair
insort(ki, (leftmost, insertindex))
newleftmost = self.divide_entries(firstindex, node, newnode, ki)
node.store()
newnode.store()
return (newleftmost, newnode)
else:
node.store()
return None # no split
else:
# leaf
if key not in keys or keys.index(key)>=validkeys:
newlength = self.length+1
else:
newlength = self.length
try:
# insert if room
node.putvalue(key, value)
except NOROOMERROR:
# no room: split
# get entries (dummy is ignored for leaves)
ki = node.keys_indices("dummy")
insort(ki, (key, value))
(newnode, self.free) = node.getfreenode(
self.free, self.update_freelist)
newnode = node.newneighbor(newnode.position)
newnode.flag = Leafflag
# 0 is dummy firstindex, ignored for leaves
newleftmost = self.divide_entries(0, node, newnode, ki)
node.store()
newnode.store()
self.length = newlength
return (newleftmost, newnode)
else:
node.store()
self.length = newlength
return None
def remove(self, key, node):
"""remove key from tree at node.
raise KeyError if absent.
return (leftmost, size) if leftmost changes.
otherwise return (None, size).
Caller is responsible for restructuring node, if needed.
"""
newnodekey = None
if node.flag in Interiorflags:
# nonleaf
keys = node.keys
validkeys = node.validkeys
place = bisect(keys, key, 0, validkeys)
if place>=validkeys or keys[place]>=key:
# delete at tree before place
index = place
else:
# delete at tree for place
index = place+1
if index==0: nodekey=None
else: nodekey=keys[place-1]
nextnode = node.getnode(nodekey)
# recursively remove from nextnode
(lm, size) = self.remove(key, nextnode)
# is nextnode now too small?
nodesize = self.nodesize
half = nodesize/2
if (size<half):
# restructure, ugly!
# find another node for redistribution
if nodekey is None and validkeys==0:
raise IndexError, "invalid node, only one child!"
if place>=validkeys:
# final node, get previous
rightnode = nextnode
rightkey = nodekey
if validkeys<=1: leftkey = None
else: leftkey = keys[place-2]
leftnode = node.getnode(leftkey)
else:
# non-final, get next
leftnode = nextnode
leftkey = nodekey
if index==0: rightkey=keys[0]
else: rightkey = keys[place]
rightnode = node.getnode(rightkey)
# get all keys, indices
rightki = rightnode.keys_indices(rightkey)
leftki = leftnode.keys_indices(leftkey)
ki = leftki + rightki
# redistribute or merge?
#print "ki, nodesize", ki, nodesize
lki = len(ki)
if lki>nodesize or (leftnode.flag!=Leafflag and lki>=nodesize):
# redistribute
(newleftkey, firstindex) = ki[0]
if leftkey==None:
newleftkey = lm
if leftnode.flag!=Leafflag:
# nuke first ki
ki = ki[1:]
newrightkey = self.divide_entries(
firstindex, leftnode, rightnode, ki)
# delete, reinsert right
node.delnode(rightkey)
node.putnode(newrightkey, rightnode)
# ditto for left if first changed
if (leftkey!=None and leftkey!=newleftkey):
node.delnode(leftkey)
node.putnode(newleftkey, leftnode)
node.store()
leftnode.store()
rightnode.store()
else:
# merge into left, free right
(newleftkey, firstindex) = ki[0]
#leftnode.clear()
if leftnode.flag!=Leafflag:
#leftnode.putfirstindex(firstindex)
#del ki[0]
#for (k,i) in ki:
# leftnode.putposition(k,i)
leftnode.put_all_positions(firstindex, ki[1:])
else:
#for (k,i) in ki:
# leftnode.putvalue(k,i)
leftnode.put_all_values(ki)
if rightnode.flag==Leafflag:
self.free = leftnode.delnext(rightnode, self.free)
else:
self.free = rightnode.free(self.free)
if leftkey is not None and newleftkey!=leftkey:
node.delnode(leftkey)
node.putnode(newleftkey, leftnode)
node.delnode(rightkey)
node.store()
leftnode.store()
self.reset_header()
if leftkey is None: newnodekey = lm
else:
# no restructure
# update leftmost, if needed
if nodekey is None: newnodekey = lm
elif lm is not None:
node.delnode(nodekey)
node.putnode(lm, nextnode)
# end of restructure if
else:
# leaf, base case: just delete it
if node.validkeys<1:
# should only happen for empty root
raise KeyError, "no such key"
first = node.keys[0]
node.delvalue(key)
rest = node.keys[0]
if first!=rest:
newnodekey = rest
node.store()
self.length = self.length - 1
return (newnodekey, node.validkeys)
def divide_entries(self, firstindex, node1, node2, entries):
"""divide presorted entries evenly among node1, node2
return leftmost of node2.
firstindex is ignored for leaves
"""
middle = len(entries)/2 + 1
#node1.clear()
#node2.clear()
if node1.flag in Interiorflags:
#middle = middle+1
left = entries[:middle]
right = entries[middle:]
#print "left, right", left, right
# nonleaf
#node1.putfirstindex(firstindex)
#for (k,i) in left:
# node1.putposition(k,i)
(leftmost, midindex) = right[0]
#node2.putfirstindex(midindex)
#for (k,i) in right[1:]:
# node2.putposition(k, i)
node1.put_all_positions(firstindex, left)
node2.put_all_positions(midindex, right[1:])
return leftmost
else:
# leaf
left = entries[:middle]
right = entries[middle:]
#for (k,i) in left:
# node1.putvalue(k,i)
#for (k,i) in right:
# node2.putvalue(k,i)
node1.put_all_values(left)
node2.put_all_values(right)
return right[0][0]
def find(self, key, node):
"""find key starting at node."""
while node.flag in Interiorflags:
# non-leaf
thesekeys = node.keys
validkeys = node.validkeys
# find place at or just beyond key
place = bisect(thesekeys, key, 0, validkeys)
if place>=validkeys or thesekeys[place]>key:
if place==0: nodekey=None
else: nodekey=thesekeys[place-1]
else:
nodekey = key
node = node.getnode(nodekey)
return node.getvalue(key)
def dump(self):
self.root.dump()
if self.free!=nilseek:
free = self.root.clone(self.free)
free = free.materialize()
free.dump()
def __del__(self):
if self.fifo_enabled:
self.disable_fifo()
class BplusWalker:
"""iterative walker for bplustree leaf nodes."""
def __init__(self, tree,
keylower=None, includelower=None,
keyupper=None, includeupper=None):
"""initialize a walker for tree with key values bounded
by upper/lower, if given, included or excluded as specified.
Tree should never be updated while walker is active,
otherwise behaviour of walker is undefined."""
self.tree = tree
self.keylower = keylower
self.includelower = includelower
self.keyupper = keyupper
self.includeupper = includeupper
if self.tree.getroot() == None:
self.tree.open()
# get the first pertinent leaf in tree
node = self.tree.getroot()
while node.flag in Interiorflags:
# interior node, seek a leaf
if keylower is None:
nkey = None
else:
keys = node.get_keys()
place = bisect(keys, keylower)
if place==0: nkey = None
elif place>len(keys): nkey = keys[-1]
else: nkey = keys[place-1]
node = node.getnode(nkey)
self.node = self.startnode = node
# preinit
self.node_index = None
self.valid = 0 # pessimism
self.first()
def first(self):
"""reset walker to first position, or raise IndexError
if keyrange is empty."""
node = self.node = self.startnode
# is the key in the node?
keys = node.keys
#print "first at", keys
keylower = self.keylower
keyupper = self.keyupper
validkeys = node.validkeys
self.valid = 0
if keylower==None:
self.node_index = 0
self.valid = 1
elif keylower in keys and self.includelower:
index = self.node_index = keys.index(keylower)
if index<validkeys:
self.valid = 1 # hurrah!
if not self.valid:
# look for next
place = bisect(keys, keylower, 0, validkeys)
if place<validkeys:
index = self.node_index = place
testk = keys[index]
if (testk>keylower or
(self.includelower and testk==keylower)):
self.valid = 1
else:
self.valid = 0
else:
# advance to the next node
next = node.nextneighbor()
if next is not None:
self.startnode = next
self.first()
return
else:
self.valid = 0
# test keyupper
if self.valid and keyupper is not None:
key = self.current_key()
if key<keyupper or (self.includeupper and key==keyupper):
self.valid = 1
else:
self.valid = 0
def current_key(self):
"""key the walker currently "points at"."""
if self.valid: return self.node.keys[self.node_index]
else: raise IndexError, "not at valid index"
def current_value(self):
"""value the walker currently "points at"."""
#print "current at", self.node_index, self.node.indices
if self.valid: return self.node.indices[self.node_index]
else: raise IndexError, "not at valid index"
def next(self):
"""advance to next position, or set to invalid."""
nextp = self.node_index+1
node = self.node
if nextp>=node.validkeys:
# goto next node
next = node.nextneighbor()
if next is None:
self.valid = 0
return
node = self.node = next
nextp = 0
#print "next at", node.keys, node.indices, nextp, node.validkeys
if node.validkeys<=nextp:
self.valid = 0
else:
testkey = node.keys[nextp]
keyupper = self.keyupper
if (keyupper is None or
testkey<keyupper or
(self.includeupper and testkey==keyupper)):
self.node_index = nextp
self.valid = 1
else:
self.valid = 0
class caching_BPT(BplusTree):
"""simple caching. No updates allowed."""
def __init__(self, infile, position=None, nodesize=None, keylen=None):
BplusTree.__init__(self, infile, position, nodesize, keylen)
self.cache = {}
def __getitem__(self, key):
try:
return self.cache[key]
except KeyError:
r = self.cache[key] = BplusTree.__getitem__(self, key)
return r
def reset_cache(self):
self.cache = {}
def nope(self, *args):
raise ValueError, "op not permitted for caching_BPT"
__setitem__ = __delitem__ = nope
class SBplusTree:
"""Wrapper for BPlusTree, maps strings-->strings.
Key strings are fixed length as in BPlusTree.
Value strings are arbitrary length but space for
overwritten or deleted values will be wasted in
the file (the aren't GC'd, unlike tree nodes which are.
"""
# can be overridden.
treeclass = BplusTree
def __init__(self, infile, position=None, nodesize=None, keylen=None):
self.infile = infile
self.tree = self.treeclass(infile, position, nodesize, keylen)
def walker(self,
keylower=None, includelower=None,
keyupper=None, includeupper=None):
return SBplusWalker(self, keylower, includelower,
keyupper, includeupper)
def __len__(self):
return len(self.tree)
def init_params(self):
return self.tree.init_params()
def getroot(self):
return self.tree.getroot()
def getfile(self):
return self.infile
def enable_fifo(self, size=33):
self.tree.enable_fifo(size)
def disable_fifo(self):
self.tree.disable_fifo()
def dump(self):
"""ignore real values here, should fix."""
self.tree.dump()
def startup(self):
self.tree.startup()
def open(self):
self.tree.open()
def __getitem__(self, key):
seek = self.tree[key]
return getstring(self.infile, seek)
def __setitem__(self, key, value):
"""Warning: overwrite "loses" old value space."""
#try:
# test = self[key]
#except KeyError:
# go = 1
#else:
# go = (test != key)
#if go:
# assume overwrite (optimize)
seek = putstring(self.infile, value)
self.tree[key] = seek
def __delitem__(self, key):
"""Warning: loses old value storage."""
del self.tree[key]
def has_key(self):
return self.tree.has_key(self)
class caching_SBPT(SBplusTree):
"""string-->string caching b-plus tree."""
treeclass = caching_BPT
class SBplusWalker:
"""iterator for string-->string Bplus tree."""
# can be overridden
walkerclass = BplusWalker
def __init__(self, tree,
keylower=None, includelower=None,
keyupper=None, includeupper=None):
self.walker = self.walkerclass(tree, keylower, includelower,
keyupper, includeupper)
self.file = tree.getfile()
self.valid = self.walker.valid
def first(self):
self.walker.first()
self.valid = self.walker.valid
def current_key(self):
return self.walker.current_key()
def current_value(self):
seek = self.walker.current_value()
return getstring(self.file, seek)
def next(self):
self.walker.next()
self.valid = self.walker.valid
def putstring(infile, s):
"""Add a new string record to eof. return start seek."""
#save = infile.tell()
# seek to eof
infile.seek(0,2)
last = infile.tell()
from marshal import dump
dump(s, infile)
#infile.seek(save)
return last
def getstring(infile, i):
"""get an old string record at i"""
#save = infile.tell()
infile.seek(i)
from marshal import load
s = load(infile)
#infile.seek(save)
return s
def recopy_bplus(fromfile, tofile,
treeclass=BplusTree):
"""copy BplusTree from fromfile to tofile.
from file should be open "rb", tofile "w+b"."""
fromtree = treeclass(fromfile)
fromtree.open()
(f, p, n, k) = fromtree.init_params()
totree = treeclass(tofile, p, n, k)
totree.startup()
return recopy_tree(fromtree, totree)
def recopy_tree(fromtree, totree):
"""copy fromtree contents to totree.
trees must be compatible.
copy attempts to "compactize" totree."""
(f,p,n,k) = totree.init_params()
try:
totree.enable_fifo()
walker = fromtree.walker()
# fill up first node in totree
part1 = n/2 +1
part2 = part1-2
defer = []
while walker.valid:
# pseudooptimization: defer n/2-1 tail elements
# for n even this makes all leaves full (in tests)
for i in xrange(part1):
if not walker.valid: break
totree[ walker.current_key() ] = walker.current_value()
walker.next()
for (k,v) in defer:
totree[k]=v
defer = []
for i in xrange(part2):
if not walker.valid: break
defer.append( (walker.current_key(), walker.current_value()) )
walker.next()
for (k,v) in defer:
totree[k] = v
return (fromtree, totree)
finally:
#print "disabling fifo"
totree.disable_fifo()
def recopy_sbplus(fromfile, tofile,
treeclass=SBplusTree):
"""copy SBplusTree from fromfile to tofile.
from file should be open "rb", tofile "w+b".
this will create a new file without "lost garbage"."""
return recopy_bplus(fromfile, tofile, treeclass)
##### simple dbm compatibility
bignum = 0x7efe77 # 8 million buckets
def myhash(s):
"""portable string hash function.
(because builtin hash isn't portable)."""
o = ord
B = bignum
result = 775 + len(s)*1001
for c in s:
#print result
result = (result*253 + o(c)*113) % B
return result
class dbm:
"""dbm compatible index file with unlimited key/value size.
overwrites, dels and hash collisions leave "junk" in index.
Alternate implementations left to reader, or to future.
Hash indexed into buckets in an SBplusTree.
buckets with marshalled dict of {key: value}
for elements in this bucket.
"""
flagmap = {"r": "rb", "w": "r+b", "c": "w+b"}
openmodes = ("r", "w")
treeclass = SBplusTree
nodesize = 202
def __init__(self, filename, flag="r", mode=None):
#print "init", filename, flag, mode
if mode is not None:
raise ValueError, "sorry mode not supported (portability)"
self.fileflag = flag
rf = self.realflag = self.flagmap[flag]
self.filename = filename
f = self.file = open(filename, rf)
# length record at start of file
if flag in self.openmodes:
from marshal import load
from string import atoi
self.length = load(f)
# parameters determined from header
#print "reopening", self.length, f.tell()
t = self.tree = self.treeclass(f, f.tell())
t.open()
else:
# put length record
from marshal import dump
dump(0, f)
self.length = 0
#print "creating", self.length, f.tell()
t = self.tree = self.treeclass(f, f.tell(), self.nodesize, intsize-1)
t.startup()
self.tree.enable_fifo(self.nodesize+3)
closed = 0
def close(self):
if self.closed: return
self.tree.disable_fifo()
# put length record
if self.length<0:
raise ValueError, "negative len?"+`(self.length, self.filename)`
f = self.file
if self.fileflag in ("c", "w"):
f.seek(0)
from marshal import dump
dump(self.length, f)
f.close()
self.closed = 1
def __del__(self):
self.close()
def __len__(self):
return self.length
def hash(self, key):
from marshal import dumps
h = myhash(key)
hs = dumps(h)[1:] # nuke indicator
return hs
def pairs(self, hash):
try:
spairs = self.tree[hash]
except KeyError:
return {}
from marshal import loads
return loads(spairs)
def setpairs(self, hash, pairs):
from marshal import dumps
spairs = dumps(pairs)
self.tree[hash] = spairs
def __getitem__(self, item):
h = self.hash(item)
pairs = self.pairs(h)
return pairs[item]
def __setitem__(self, item, value):
h = self.hash(item)
pairs = self.pairs(h)
if not pairs.has_key(item):
self.length = self.length+1
pairs[item] = value
self.setpairs(h, pairs)
#print self.length
def __delitem__(self, item):
h = self.hash(item)
pairs = self.pairs(h)
del pairs[item]
if pairs:
self.setpairs(h, pairs)
else:
del self.tree[h]
self.length = self.length-1
#print self.length
def has_key(self, item):
try:
test = self[item]
except KeyError:
return 0
else:
return 1
def keys(self):
"""not terribly efficient! (should optimize?)"""
result = []
w = self.tree.walker()
from marshal import loads
while w.valid:
spairs = w.current_value()
pairs = loads(spairs)
for k in pairs.keys():
result.append(k)
w.next()
if len(result)!=self.length:
raise IndexError, "bad tree length:"+`(len(result), self.length)`
return result
def copy(self, tofilename, flag, mode=None):
if flag=="r":
raise ValueError, "nonsense! can't copy into read only index"
#print "copy", tofilename, flag
other = dbm(tofilename, flag, mode)
if flag=="c":
# create: make optimal
recopy_tree(self.tree, other.tree)
other.length = self.length
other.tree.enable_fifo(other.nodesize+3)
elif flag=="w":
# insert-into: simple traversal (collisions waste space)
w = self.tree.walker()
from marshal import loads
while w.valid:
spairs = w.current_value()
pairs = loads(spairs)
for (k,v) in pairs.items():
other[k] = v
w.next()
return other
def testdbm():
print "creating files test1, 2, 3 for dbm test"
d1 = dbm("test1", "c")
for x in range(10):
key = "hello"*x
d1[key] = "01234567890"[:-x]
print key, d1[key]
print d1.keys()
for x in range(300):
d1[oct(x)] = hex(x)
del d1['']
print len(d1), d1.keys()
print "should be 0:", d1.has_key(""), d1.has_key("abd")
print "copying"
d2 = d1.copy("test2", "c")
beforedel = len(d1)
del d2["hello"]
print len(d2), d2.keys()
d2.close()
d2 = dbm("test2", "r")
print "should be equal", beforedel-1, len(d2)
print "keys", d2.keys()
print "testing copy-into"
d3 = dbm("test3", "c")
d3["willy"] = "wally"
d3.close()
d3 = d2.copy("test3", "w")
print "should be equal", beforedel, len(d3)
print "keys", d3.keys()
### test
def test():
"""test program: creates a bplustree file "test".
try messing with the node size.
"""
print "creating file 'test' in current directory for test data."
f = open("test", "w+b")
B = SBplusTree(f, 0, 202, 10)
B.startup()
B.enable_fifo()
#return B
B["this"] = 0xdad
from string import letters, digits
for x in letters+digits: B[x] = ord(x)
for x in "13579finalmopq": del B[x]
print "final pass"
from time import time
s = time()
for x in range(1000): B[hex(x)] = x; #print x
print "one thousand assigns", time()-s
#B.dump()
B.disable_fifo()
return (B, f)
def retest():
from time import time
f = open("test", "rb")
B = caching_SBPT(f)
B.open()
B.enable_fifo()
print "retesting"
for x in "abcdefghi012345":
try:
print x, "-->", B[x]
except KeyError:
print x, "absent"
print "entering torture chamber"
s = time()
for x in range(1000): l = B[hex(x)]
print "1 thousand retrieves: ", time()-s
return B
print "keys, values between 4 and C (including C)"
W = SBplusWalker(B, "4", 0, "C", 1)
while W.valid:
print (W.current_key(), W.current_value()),
W.next()
print
print "keys, values between 4 (including 4) and C (excluding C)"
W = SBplusWalker(B, "4", 1, "C", 0)
while W.valid:
print (W.current_key(), W.current_value()),
W.next()
print
print "all keys"
W = SBplusWalker(B)
while W.valid:
print W.current_key(),
W.next()
print
print "A to A inclusive (1 elt)"
W = SBplusWalker(B, "A", 1, "A", 1)
while W.valid:
print W.current_key(),
W.next()
print
print "A to A exclusive (0 elt)"
W = SBplusWalker(B, "A", 1, "A", 0)
while W.valid:
print W.current_key(),
W.next()
print
print "AA to AA inclusive (0 elt)"
W = SBplusWalker(B, "AA", 1, "AA", 0)
while W.valid:
print W.current_key(),
W.next()
print
print
B.disable_fifo()
return (W, B, f)
if __name__=="__main__":
(B,f) = test()
B=None
f.close()
retest()
More information about the Pipet-Devel
mailing list