Biopython API documentation: Martel | |||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SYNOPSIS: Martel uses regular expression format definition to generate a parser for flat- and semi-structured files. The parse tree information is passed back to the caller using XML SAX events. In other words, Martel lets you parse flat files as if they are in XML. INSTALLATION: Martel uses the standard Python python setup.py install You can install into somewhere other than the default location by using the --prefix option. For example, to install under your home directory you can use: python setup.py install --prefix $HOME Make sure your PYTHONPATH includes the correct site-packages under the prefix directory. For example, with Python 2.2 and a prefix of "/usr/local", your PYTHONPATH needs to include /usr/local/lib/python2.2/site-packages (The setup.py program will warn you if the correct directory is not in your path.) To build then test then install you can dopython setup.py build python setup.py test python setup.py install REQUIREMENTS: Martel should work with Python 2.0 or newer. It has been tested with 2.1 and 2.2. Python is available from http://www.python.org/ . You will need mxTextTools, which is available from http://www.lemburg.com/files/python/mxTextTools.html Martel should work find with versions 1.1.0 and later. If you use
eGenix 2.0.2 with Python 2.2 you will need to change
mx/Tools/mxTools/xmap.c
so the Martel could be made to work under Python 1.5.2. First, why? Python
2.x has been available for a while and there are very few problems in
upgrading. Still, with a bit of work you could make Martel work with
that version. Besided modifying some Python 2 specific syntax, you
will need to install the TESTING: Martel includes a pretty complete regression test. To run the tests on the module without installing the package: python setup.py test To test out the installed package:python setup.py installtest EXAMPLES: Here are some examples to get you started. More documentation is being written. Example 1: Parsing /etc/passwd Suppose you want to parse the /etc/password file. This is a line oriented file with one record on each line. A record contains fields seperated by a ":", and the field names and order is account:password:UID:GID:GECOS:directory:shell For example, one record may look like thisdalke:oPA4ic39X3:14488:100::/home/dalke:/bin/tcsh which says that account = dalke password = oPA4ic39X3 (and no, this isn't a real password) UID = 14488 GID = 100 GECOS = directory = /home/dalke shell = /bin/tcsh Suppose you want to print the shell used by each account. Here's one way to do it. ========================== import Martel def ToColon(name): return Martel.ToSep(name, ":") format = Martel.Rep( Martel.Group("record", ToColon("account") + ToColon("password") + ToColon("UID") + \ ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \ Martel.ToEol("shell"))) for record in format.make_iterator("record").iterateFile(open("/etc/passwd")): print record["account"][0], "uses", record["shell"][0] ========================== When run on my machine, this produces output including root uses /bin/bash bin uses sync uses /bin/sync shutdown uses /sbin/shutdown halt uses /sbin/halt xfs uses /bin/false dalke uses /bin/tcsh apache uses /bin/bash Or you could use this variation ========================== import Martel format = Martel.Rep(
Martel.Group("record",
Martel.Re(r"(?P for record in format.make_iterator("record").iterateFile(open("/etc/passwd")): print record["field"][0], "uses", record["field"][-1] ========================== Compare these to the standard way to parse that file in Python 2.0, which is ========================== infile = open("/etc/passwd") while 1: line = infile.readline() if not line: break line = line[:-1] # remove the newline character fields = line.split(":") # get the fields assert len(fields) == 7, "wrong field count" print fields[0], "uses", fields[-1] ========================== Using the features in Python 2.2, and removing the need to use special variables like "0", "7", and "-1" ========================== names = ("account", "password", "UID", "GID", "GECOS", "directory", "shell") for line in open("/etc/passwd"): line = line[:-1] fields = line.split(":") assert len(fields) == len(names), "wrong field count" d = dict(zip(names, fields)) print d["account"], "uses", d["shell"] ========================== Finally, since delimited fields are very common, Martel has a built-in function to parse them: ========================== import Martel format = Martel.Rep(Martel.Group("record", Martel.DelimitedFields("field", ":"))) for record in format.make_iterator("record").iterateFile(open("/etc/passwd")): print record["field"][0], "uses", record["field"][-1] ========================== (Note: the first Martel solution allows a ":" in the directory name, which the other solutions do not allow.) Where does the XML come in? The examples above use "make_iterator", which builds on the underlying parser, which generates SAX events, and LAX, the default SAX ContentHandler used by make_iterator. To see how things work at the SAX level, here's how to convert the SAX events into XML text. ========================== import Martel from xml.sax import saxutils def ToColon(name): return Martel.ToSep(name, ":") format = Martel.Rep( Martel.Group("record", ToColon("account") + ToColon("password") + ToColon("UID") + \ ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \ Martel.ToEol("shell"))) parser = format.make_parser() parser.setContentHandler(saxutils.XMLGenerator()) parser.parseFile(open("/etc/passwd")) ========================== When run, the code above produces output like this (some newlines added to make it the results fold nicely).
========================== from xml.sax import handler class PrintShell(handler.ContentHandler): def startDocument(self): self.capture = 0 def startElement(self, name, attrs): if name == "record": self.account = "" self.shell = "" self.s = "" elif name in ("account", "shell"): self.capture = 1 def characters(self, s): if self.capture: self.s = self.s + s def endElement(self, name): if name == "account": self.account = self.s elif name == "shell": self.shell = self.s print self.account, "uses", self.shell self.capture = 0 # assuming This prints exactly the same results - it's just harder to write because it's uses a lower level interface. However, Python includes a way to convert this low-level interface to a higher level data XML structure called DOM. Suppose you wanted to print the account name and shell. ========================== # Hmmm, DOM requires there be only a single top-level element. # XXX fixup documentation to reflect this requirement format = Martel.Group("passwd", Martel.Rep( Martel.Group("record", ToColon("account") + ToColon("password") + ToColon("UID") + \ ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \ Martel.ToEol("shell")))) parser = format.make_parser() from xml.dom import pulldom sax2dom = pulldom.SAX2DOM() parser.setContentHandler(sax2dom) doc = sax2dom.document doc.normalize() for record in doc.getElementsByTagName("record"): account = record.getElementsByTagName("account")[0].firstChild.nodeValue shell = record.getElementsByTagName("shell")[0].firstChild.nodeValue print account, "uses", shell ========================== (There's probably better ways to do this, but I have little experience with DOM. I would like to have an example where fields are changed, if I can find an easy way to write a minidom without the tags.) Example 2: parse INI file Martel's strength starts to show when parsing multi-line formats.
Consider parsing a ".ini" file, which is a configuration file format
popular under MS Windows. There are multiple section definitions in a
file, and each section can have 0 or more key=value lines. Comments
start with a For example, here is part of my ========================== ; Keyboard driver [keyboard] keyboard.dll= oemansi.bin= subtype= type=4 [boot.description] system.drv=Standard PC keyboard.typ=Standard 101/102-Key or Microsoft Natural Keyboard mouse.drv=Standard mouse aspect=100,96,96 display.drv=NeoMagic MagicMedia 256AV [Password Lists] UNKNOWN USER=C:\WINDOWS\UNKNOWNU.PWL ; See the space in the key name? DALKE=C:\WINDOWS\DALKE.PWL ========================== ========================== import Martel comment = Martel.Str(";") + Martel.ToEol() blank = Martel.AnyEol() # [Password Lists]
section_title = Martel.Re(r"\[(?P # keyboard.typ=Standard 101/102-Key or Microsoft Natural Keyboard
key_value = Martel.Re(r"\s*(?P # Section can include blank and comment lines, and can be empty section = Martel.Group("section", section_title + \ Martel.Rep(Martel.Group("key_value", key_value) | blank | comment)) format = Martel.Rep(comment | blank | section) infile = open("/mnt/windows/windows/system.ini") for record in format.make_iterator("section").iterateFile(infile): print "[%s]" % record["section_title"][0] for name, value in zip(record.get("key", []), record.get("value", [])): print "%s=%s" % (name, value) print ========================== #### XXX Need to fix up things beyond here so they aren't as #### bioinformatics specific. Perhaps for 0.9 release. BACKGROUND:There are a lot of bioinformatics file formats. It's annoying to have to write parsers for them all the time, and when you do they rarely read all the data elements. And even in those cases, it doesn't preserve enough physical layout information for things like marking up a file for HTML. It would be nice to give the computer a description of a file format and have it generate a parser. Indeed, there are tools for that, like lex and yacc. Unfortunately, the bioinformatics formats aren't easy to parse with those tools. It isn't that they aren't powerful enough. The problem is the formats are very stateful. That is, a lexer likes being able to recognize a word just by looking at it. But consider the word "GENE". It could be a keyword, the submitter's first name or part of a sequence, depending on where you are in the file. For the lexer to handle it, the parser (yacc) has to tell flex the current position. This requires explicit communications between the two components, which gets complex and tedious. Instead, I'm taking advantage of some nice properties of these formats; they are generally - ASCII line-oriented relatively easy to write a parser - written by hand - and in FORTAN almost no look-ahead needed - nearly always only one line regular (!) The last is the most interesting since because of Perl's history in bioinformatics, a lot of people in this field know how to use regular expressions. Why not use a regular expression for parsing these files? Regular expressions as available in Perl and Python aren't good enough. Consider the following SwissProt line: AC P93209; P42651; P12345; where there is one or more accession id. The regular expression of this might be "AC (\w+);(\s+(\w+);)*" but when you do the match, $1 is set to P93209 and $3 is set to P12345. There's no way to get the middle id. The regular expression engines store all the matches until the end, which is why new matches override older ones. Another way to do this is to pass in a callback object, and have it called for each match. In the above case there would be 5 calls: "$1", "P93209" "$2", " P42651;" "$3", "P42651" "$2", " P12345;" "$3", "P12345" Python has a nice variant on regular expressions, named groups, which makes this easier to understand. Suppose the pattern was: "AC (?P and suppose non-named fields are ignored. Then the calls would be:"ac", "P93209" "ac", "P42651" "ac", "P12345" Since the current engines don't support callbacks like, I wrote one which does. But how should the callbacks work? A hot buzzword for the last few years has been XML. There are a couple of common ways to parse XML. One of them is SAX, which is callback based. If the interface for my regular expression scanner generator is the same as SAX, then there are a slew of tools available to work with the resultant data (eg, put it into a DOM). Even better, XML is based off of SGML, where they've put work into making sure both the semantic and the physical information can be preserved, so you still can produce converters for HTML without loosing formatting information. The mapping from regular expressions to SAX events is actually pretty simple since I'm using named groups. It looks like: (?P And parsing the "AC" line above gets turned into the following method calls of the callback object (which SAX calls a handler): characters("AC ") startElement("ac_number") characters("P93209" endElment("ac_number") characters("; ") startElement("ac_number") characters("P42651") endElement("ac_number") startElement("ac_number") characters("P12345") endElement("ac_number") characters(";") Debugging:Debugging large patterns is very hard. It's best to break things down into more manageable chunks. Take a look at my test code in the "test/" subdirectory for how I do it. Especially useful is the "dump" method which lets you see roughly how far the parsing progressed. The implementation has a downside. It requires a named regular expression group to match fully before the startElement and any of the internal characters are sent to the callback. If there is a failure somewhere inside the group, you won't be told how far it went, but only that it stopped at the start of the group. When debugging fails at a named group, try commenting it out so you
can see what happened inside. You might also use Martel.select_names
to create a new expression with no named groups. Or copy the text that
failed and try just that subpattern. Also, the make_parser method has
an optional parameter Performance:(Note: Timings are pre-0.4, which has a faster RecordReader and should trim about a minute off of the timings.) I have a 233MHz PII laptop. I did some tests with swissprot. With a dataset of 10,023 lines (491,659 bytes), it took about 0.82 seconds to parse and about 5.4MB. With 50,000 lines (2,498,664 bytes) it took 7.9 seconds and 19,488K. Extroplating linearly, Swissprot is 4,521,693 lines and 227,311,276 bytes (80,000 records), so should take about 12 minutes to parse and 1.7GB of RAM. Following Jeff Chang's lead, I wrote a BUGS: Backtracking does not always work the same way as regular expressions. Don't do "\s*\n" since the \s eats the \n but doesn't backtrack. There's a bug in Alt where (x|y|z) doesn't work but (x|z|y) does work. Only occurs when the first part of x is same as the first part of y (?). Use negative lookahead assertions as a work around or force the different parts to be in a named group The conversion from an expression to a regular expression doesn't always work. It's a debugging tool and isn't yet needed for normal operation.
|