Martel
version 0.8

XML parsing of flat-files

Life's too short to (re)write parsers

SYNOPSIS:

Martel uses regular expression format definition to generate a parser for flat- and semi-structured files. The parse tree information is passed back to the caller using XML SAX events. In other words, Martel lets you parse flat files as if they are in XML.

INSTALLATION:

Martel uses the standard Python distutils installation method. To install it:

python setup.py install

You can install into somewhere other than the default location by using the --prefix option. For example, to install under your home directory you can use:

python setup.py install --prefix $HOME

Make sure your PYTHONPATH includes the correct site-packages under the prefix directory. For example, with Python 2.2 and a prefix of "/usr/local", your PYTHONPATH needs to include

/usr/local/lib/python2.2/site-packages

(The setup.py program will warn you if the correct directory is not in your path.)
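If you want to double-check by hand, here is a small sketch; "/usr/local" and "python2.2" are just the example values from above, so substitute your own prefix and Python version:

==========================
# Verify that the site-packages directory is on sys.path and that
# Martel can be imported from it.
import sys
print "/usr/local/lib/python2.2/site-packages" in sys.path
import Martel
print Martel.__file__
==========================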

To build, then test, then install, you can do:

python setup.py build
python setup.py test
python setup.py install

REQUIREMENTS:

Martel should work with Python 2.0 or newer. It has been tested with 2.1 and 2.2. Python is available from http://www.python.org/ .

You will need mxTextTools, which is available from http://www.lemburg.com/files/python/mxTextTools.html

Martel should work fine with versions 1.1.0 and later. If you use eGenix 2.0.2 with Python 2.2 you will need to change mx/Tools/mxTools/xmap.c so that #include "config.h" becomes #include "pyconfig.h", and you will need to comment out the __debug__ = 0 and __debug__ = 1 statements in some of the Python files. (Assignment to __debug__ is illegal, and Python 2.2 is strict about that.)

Martel could be made to work under Python 1.5.2. First, why? Python 2.x has been available for a while and there are very few problems in upgrading. Still, with a bit of work you could make Martel work with that version. Besides modifying some Python 2 specific syntax, you will need to install the xml package distributed by the Python XML-SIG.

TESTING:

Martel includes a pretty complete regression test. To run the tests on the module without installing the package:

python setup.py test

To test out the installed package:

python setup.py installtest

EXAMPLES:

Here are some examples to get you started. More documentation is being written.

Example 1: Parsing /etc/passwd

Suppose you want to parse the /etc/passwd file. This is a line-oriented file with one record on each line. A record contains fields separated by a ":", and the field names and order are

account:password:UID:GID:GECOS:directory:shell

For example, one record may look like this

dalke:oPA4ic39X3:14488:100::/home/dalke:/bin/tcsh

which says that

  account   = dalke
  password  = oPA4ic39X3  (and no, this isn't a real password)
  UID       = 14488
  GID       = 100
  GECOS     =
  directory = /home/dalke
  shell     = /bin/tcsh

Suppose you want to print the shell used by each account. Here's one way to do it.

==========================
import Martel

def ToColon(name):
    return Martel.ToSep(name, ":")

format = Martel.Rep(
    Martel.Group("record",
                 ToColon("account") + ToColon("password") + ToColon("UID") + \
                 ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \
                 Martel.ToEol("shell")))

for record in format.make_iterator("record").iterateFile(open("/etc/passwd")):
    print record["account"][0], "uses", record["shell"][0]
==========================

When run on my machine, this produces output including

root uses /bin/bash
bin uses
sync uses /bin/sync
shutdown uses /sbin/shutdown
halt uses /sbin/halt
xfs uses /bin/false
dalke uses /bin/tcsh
apache uses /bin/bash

Or you could use this variation

==========================
import Martel

format = Martel.Rep(
    Martel.Group("record",
                 Martel.Re(r"(?P<field>[^:\R]*)(:(?P<field>[^:\R]*))*\R")))

for record in format.make_iterator("record").iterateFile(open("/etc/passwd")):
    print record["field"][0], "uses", record["field"][-1]
==========================

Compare these to the standard way to parse that file in Python 2.0, which is

==========================
infile = open("/etc/passwd")
while 1:
    line = infile.readline()
    if not line:
        break

    line = line[:-1]          # remove the newline character
    fields = line.split(":")  # get the fields
    assert len(fields) == 7, "wrong field count"

    print fields[0], "uses", fields[-1]
==========================

Using the features in Python 2.2, and removing the need for magic values like 0, 7, and -1:

==========================
names = ("account", "password", "UID", "GID", "GECOS", "directory", "shell")
for line in open("/etc/passwd"):
    line = line[:-1]
    fields = line.split(":")
    assert len(fields) == len(names), "wrong field count"
    d = dict(zip(names, fields))

    print d["account"], "uses", d["shell"]
==========================

Finally, since delimited fields are very common, Martel has a built-in function to parse them:

==========================
import Martel

format = Martel.Rep(Martel.Group("record",
                                 Martel.DelimitedFields("field", ":")))

for record in format.make_iterator("record").iterateFile(open("/etc/passwd")):
    print record["field"][0], "uses", record["field"][-1]
==========================

(Note: the first Martel solution allows a ":" in the directory name, which the other solutions do not allow.)

Where does the XML come in? The examples above use "make_iterator", which builds on the underlying parser (which generates SAX events) and on LAX, the default SAX ContentHandler used by make_iterator. To see how things work at the SAX level, here's how to convert the SAX events into XML text.

==========================
import Martel
from xml.sax import saxutils

def ToColon(name):
    return Martel.ToSep(name, ":")

format = Martel.Rep(
    Martel.Group("record",
                 ToColon("account") + ToColon("password") + ToColon("UID") + \
                 ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \
                 Martel.ToEol("shell")))

parser = format.make_parser()
parser.setContentHandler(saxutils.XMLGenerator())

parser.parseFile(open("/etc/passwd"))
==========================

When run, the code above produces output like this (some newlines were added to make the results fold nicely).

<record><account>root</account>:<password>x</password>:<UID>0</UID>:
<GID>0</GID>:<GECOS>root</GECOS>:<directory>/root</directory>:<shell>/bin/bash</shell></record>
<record><account>bin</account>:<password>x</password>:<UID>1</UID>:
<GID>1</GID>:<GECOS>bin</GECOS>:<directory>/bin</directory>:<shell></shell></record>
<record><account>dalke</account>:<password>x</password>:<UID>14488</UID>:
<GID>100</GID>:<GECOS></GECOS>:<directory>/home/dalke</directory>:<shell>/bin/tcsh</shell></record>

Here is the same account/shell report, written directly as a SAX ContentHandler:

==========================
from xml.sax import handler

class PrintShell(handler.ContentHandler):
    def startDocument(self):
        self.capture = 0
    def startElement(self, name, attrs):
        if name == "record":
            self.account = ""
            self.shell = ""
            self.s = ""
        elif name in ("account", "shell"):
            self.s = ""        # reset the text buffer for this field
            self.capture = 1
    def characters(self, s):
        if self.capture:
            self.s = self.s + s
    def endElement(self, name):
        if name == "account":
            self.account = self.s
        elif name == "shell":
            self.shell = self.s
            print self.account, "uses", self.shell
        self.capture = 0

# assuming parser is defined as above
parser.setContentHandler(PrintShell())
parser.parseFile(open("/etc/passwd"))
==========================

This prints exactly the same results - it's just harder to write because it uses a lower-level interface.

However, Python includes a way to convert this low-level interface into a higher-level XML data structure called the DOM. Suppose you want to print the account name and shell.

==========================
# Hmmm, DOM requires there be only a single top-level element.
# XXX fixup documentation to reflect this requirement
format = Martel.Group("passwd",
           Martel.Rep(
             Martel.Group("record",
                          ToColon("account") + ToColon("password") + ToColon("UID") + \
                          ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \
                          Martel.ToEol("shell"))))

parser = format.make_parser()

from xml.dom import pulldom
sax2dom = pulldom.SAX2DOM()
parser.setContentHandler(sax2dom)
parser.parseFile(open("/etc/passwd"))   # populate the DOM by running the parser
doc = sax2dom.document
doc.normalize()
for record in doc.getElementsByTagName("record"):
    account = record.getElementsByTagName("account")[0].firstChild.nodeValue
    shell = record.getElementsByTagName("shell")[0].firstChild.nodeValue
    print account, "uses", shell
==========================

(There are probably better ways to do this, but I have little experience with DOM. I would like to have an example where fields are changed, if I can find an easy way to write out a minidom tree without the tags.)
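For what it's worth, here is one possible variation, sketched on the assumption that the "passwd" format defined just above is available: capture the generated XML text in a StringIO and hand it to minidom. It isn't necessarily better, just another way to get a DOM tree.

==========================
from StringIO import StringIO
from xml.dom import minidom
from xml.sax import saxutils

out = StringIO()
parser = format.make_parser()            # the "passwd" format defined above
parser.setContentHandler(saxutils.XMLGenerator(out))
parser.parseFile(open("/etc/passwd"))

doc = minidom.parseString(out.getvalue())
for record in doc.getElementsByTagName("record"):
    account = record.getElementsByTagName("account")[0].firstChild.nodeValue
    shell_node = record.getElementsByTagName("shell")[0].firstChild
    shell = shell_node and shell_node.nodeValue or ""   # the shell may be empty
    print account, "uses", shell
==========================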

Example 2: Parsing an INI file

Martel's strength starts to show when parsing multi-line formats. Consider parsing a ".ini" file, which is a configuration file format popular under MS Windows. There are multiple section definitions in a file, and each section can have 0 or more key=value lines. Comments start with a ;.

For example, here is part of my system.ini definition, with a few changes to show some of the variations allowed.

==========================
; Keyboard driver
[keyboard]
keyboard.dll=
oemansi.bin=
subtype=
type=4

[boot.description]
system.drv=Standard PC
keyboard.typ=Standard 101/102-Key or Microsoft Natural Keyboard
mouse.drv=Standard mouse
aspect=100,96,96
display.drv=NeoMagic MagicMedia 256AV

[Password Lists]
UNKNOWN USER=C:\WINDOWS\UNKNOWNU.PWL
; See the space in the key name?
DALKE=C:\WINDOWS\DALKE.PWL
==========================

==========================
import Martel

comment = Martel.Str(";") + Martel.ToEol()
blank = Martel.AnyEol()

# [Password Lists]
section_title = Martel.Re(r"\[(?P<section_title>[^\]\R]+)\]\R")

# keyboard.typ=Standard 101/102-Key or Microsoft Natural Keyboard
key_value = Martel.Re(r"\s*(?P<key>[^;=\R]+)\s*=\s*(?P<value>[^;\R]*)") + \
            (Martel.AnyEol() | comment)

# A section can include blank and comment lines, and can be empty
section = Martel.Group("section",
                       section_title + \
                       Martel.Rep(Martel.Group("key_value", key_value) |
                                  blank | comment))

format = Martel.Rep(comment | blank | section)

infile = open("/mnt/windows/windows/system.ini")
for record in format.make_iterator("section").iterateFile(infile):
    print "[%s]" % record["section_title"][0]
    for name, value in zip(record.get("key", []), record.get("value", [])):
        print "%s=%s" % (name, value)
    print
==========================

#### XXX Need to fix up things beyond here so they aren't as
#### bioinformatics specific. Perhaps for 0.9 release.

BACKGROUND:

There are a lot of bioinformatics file formats. It's annoying to have to write parsers for them all the time, and when you do, the parsers rarely read all the data elements. Even when they do, they don't preserve enough physical layout information for things like marking up a file as HTML.

It would be nice to give the computer a description of a file format and have it generate a parser. Indeed, there are tools for that, like lex and yacc. Unfortunately, the bioinformatics formats aren't easy to parse with those tools. It isn't that they aren't powerful enough. The problem is the formats are very stateful.

That is, a lexer likes to be able to recognize a word just by looking at it. But consider the word "GENE": it could be a keyword, the submitter's first name, or part of a sequence, depending on where you are in the file. For the lexer to handle it, the parser (yacc) has to tell the lexer the current position in the format. This requires explicit communication between the two components, which gets complex and tedious.

Instead, I'm taking advantage of some nice properties of these formats; they are generally:

  - ASCII
  - line-oriented
  - relatively easy to write a parser for
      - written by hand
      - and in FORTRAN
  - almost no look-ahead needed
      - nearly always only one line
  - regular (!)

The last point is the most interesting: because of Perl's history in bioinformatics, a lot of people in this field know how to use regular expressions. Why not use a regular expression for parsing these files?

Regular expressions as available in Perl and Python aren't good enough. Consider the following SwissProt line:

AC P93209; P42651; P12345;

where there are one or more accession ids. A regular expression for this might be "AC (\w+);(\s+(\w+);)*", but when you do the match, $1 is set to P93209 and $3 is set to P12345. There's no way to get the middle id.
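To make the problem concrete, here is a short demonstration using Python's standard re module (not Martel); a repeated group only remembers its last match:

==========================
import re

m = re.match(r"AC (\w+);(\s+(\w+);)*", "AC P93209; P42651; P12345;")
print m.group(1)   # P93209
print m.group(3)   # P12345 -- the middle id, P42651, is unreachable
==========================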

The regular expression engines don't report anything until the whole match is done, and a group that matches several times only keeps its last match, which is why new matches override older ones. Another way to do this is to pass in a callback object and have it called for each match. In the above case there would be 5 calls:

"$1", "P93209" "$2", " P42651;" "$3", "P42651" "$2", " P12345;" "$3", "P12345"

Python's regular expressions have a nice feature, named groups, which makes this easier to understand. Suppose the pattern was:

"AC (?P\w+);(\s+(?P\w+);)*"

and suppose unnamed groups are ignored. Then the calls would be:

"ac", "P93209" "ac", "P42651" "ac", "P12345"

Since the current engines don't support callbacks like this, I wrote one which does. But how should the callbacks work?

A hot buzzword for the last few years has been XML. There are a couple of common ways to parse XML; one of them is SAX, which is callback based. If the interface for my regular expression scanner generator is the same as SAX, then there is a slew of tools available to work with the resulting data (e.g., putting it into a DOM).

Even better, XML is based on SGML, where work has gone into making sure both the semantic and the physical information can be preserved, so you can still produce converters to HTML without losing formatting information.

The mapping from regular expressions to SAX events is actually pretty simple since I'm using named groups. It looks like:

(?P<name>pattern)
                ^-- callback.endElement("name")
         ^-- callback.characters(contents of group)
^-- callback.startElement("name")

And parsing the "AC" line above gets turned into the following method calls of the callback object (which SAX calls a handler):

characters("AC ") startElement("ac_number") characters("P93209" endElment("ac_number") characters("; ") startElement("ac_number") characters("P42651") endElement("ac_number") startElement("ac_number") characters("P12345") endElement("ac_number") characters(";")

Debugging:

Debugging large patterns is very hard. It's best to break things down into more manageable chunks. Take a look at my test code in the "test/" subdirectory for how I do it. Especially useful is the "dump" method which lets you see roughly how far the parsing progressed.

The implementation has a downside. It requires a named regular expression group to match fully before the startElement and any of the internal characters are sent to the callback. If there is a failure somewhere inside the group, you won't be told how far it went, but only that it stopped at the start of the group.

When debugging a failure at a named group, try commenting the group out so you can see what happened inside it. You might also use Martel.select_names to create a new expression with no named groups, or copy the text that failed and try just that subpattern. Also, the make_parser method has an optional debug_level parameter, which is 0 by default. When set to 1 it reports better error locations; when set to 2 it also prints a large amount of debugging information to stdout.
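For example, a small sketch reusing the passwd format and XMLGenerator from Example 1:

==========================
from xml.sax import saxutils

parser = format.make_parser(debug_level = 1)   # 1 = report better error locations
parser.setContentHandler(saxutils.XMLGenerator())
parser.parseFile(open("/etc/passwd"))
==========================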

Performance:

(Note: Timings are pre-0.4, which has a faster RecordReader and should trim about a minute off of the timings.)

I have a 233 MHz PII laptop. I did some tests with SWISS-PROT. With a dataset of 10,023 lines (491,659 bytes), it took about 0.82 seconds to parse and used about 5.4 MB of memory. With 50,000 lines (2,498,664 bytes) it took 7.9 seconds and 19,488 KB.

Extrapolating linearly, SWISS-PROT is 4,521,693 lines and 227,311,276 bytes (80,000 records), so it should take about 12 minutes to parse and about 1.7 GB of RAM.
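Spelling out the extrapolation from the 50,000-line measurements above:

==========================
scale = 4521693 / 50000.0               # about 90x larger
print scale * 7.9 / 60.0                # roughly 12 minutes of parse time
print scale * 19488 / (1024.0 * 1024)   # roughly 1.7 GB of memory
==========================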

Following Jeff Chang's lead, I wrote a RecordReader class which is smart enough to read a file a record at a time. It takes advantage of the fact that large files are composed of lots of small records. I used this to read each of the records in SWISS-PROT 38. That took 6.9 minutes to parse the records, counting just the time around the parse call, and 10.5 minutes according to the shell's built-in time command. It was running 99.9% of the time, so there was almost no system overhead and there were no page faults. If I use a callback, the parsing time is 16.6 minutes (10 extra minutes in method call overhead!) and the total time is 20 minutes 20 seconds, with only three seconds spent in system time.

BUGS:

Backtracking does not always work the same way as in standard regular expression engines. Don't write "\s*\n", since the \s* consumes the \n but doesn't backtrack.
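A hedged workaround sketch, using only constructs that appear in the examples above: match the trailing spaces and tabs explicitly, then match the end of line, rather than letting \s swallow the newline.

==========================
import Martel

# instead of embedding "\s*\n" in a pattern:
end_of_line = Martel.Re(r"[ \t]*") + Martel.AnyEol()
==========================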

There's a bug in Alt where (x|y|z) doesn't work but (x|z|y) does. It only occurs when the first part of x is the same as the first part of y (?). Use negative lookahead assertions as a workaround, or force the differing parts to be in a named group.

The conversion from an expression to a regular expression doesn't always work. It's a debugging tool and isn't yet needed for normal operation.

Modules and Packages

Martel/
    LICENSE.html   -- Biopython License Agreement
    README.html    -- Martel, version 0.8 (this document)
    TODO.html      -- More formats.
    Dispatch
    Expression     -- Classes for nodes in the Expression tree
    Generate
    Iterator       -- Iterate over records of a XML parse tree
    LAX            -- LAX.py: a simple way to read lists of fields from flat XML records
    Parser         -- implement Martel parsers
    RecordReader
    Time           -- Time.py: parse a time or date string
    __init__
    convert_re     -- Converts a regular expression pattern string into an Expression tree
    formats
    msre_constants
    msre_parse
    optimize       -- Optimize an expression tree
    setup          -- Distutils based setup script for Martel.
    test


This document was automatically generated on Mon Jul 1 12:03:20 2002 by HappyDoc version 2.0.1