[Biodevelopers] XML for huge DB?

Alex Milowski alex at milowski.com
Thu Jul 31 12:39:28 EDT 2003


On Thursday, July 31, 2003, at 09:02 AM, Dan Bolser wrote:

> Hello,
>
> How can I use XML efficiently to parse multiple blast results
> files?
>
> I want to parse them on a multi processor environment, without
> hitting the system memory limit.
>
> This is likely to happen, as big files take the most time, so the
> processes tend to work on big files at the same time, leading
> to a system memory outage....

You need to parse your XML in a "streaming" fashion, so that each
process holds only the part of the document it is currently reading
rather than the whole tree.  If you are using Java, for most people
that means using SAX.  You should write a ContentHandler (org.xml.sax
package) that gathers your results.  The SAX ContentHandler is a
call-back style API, so the programming can get complicated, but it
doesn't have to be.

Many C/C++ parsers have similar call-back style APIs.  Basically, you
want to interface with the parser directly and extract the essential
information as efficiently as possible.

If you plan to use Java 2, check out version 1.4.x and the
javax.xml.parsers and org.xml.sax packages.
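
As a rough illustration of the streaming approach described above, here
is a minimal sketch of a SAX handler.  It assumes the blast results are
in NCBI-style XML where each hit is a <Hit> element containing a
<Hit_id> child; adjust the element names to whatever your files
actually use.  The class name BlastHitCounter and its parse() helper
are made up for this example.  It extends DefaultHandler (a
convenience implementation of ContentHandler) and keeps only a counter
and one small text buffer, so memory use should not grow with the size
of the file.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts <Hit> elements and remembers the
// text of the most recent <Hit_id>. Element names are assumptions.
public class BlastHitCounter extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inHitId = false;
    private int hitCount = 0;
    private String lastHitId = null;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Start buffering character data only for the element we need.
        if ("Hit_id".equals(qName)) {
            inHitId = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so append.
        if (inHitId) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Hit_id".equals(qName)) {
            inHitId = false;
            lastHitId = text.toString().trim();
        } else if ("Hit".equals(qName)) {
            hitCount++;
        }
    }

    public int getHitCount() {
        return hitCount;
    }

    public String getLastHitId() {
        return lastHitId;
    }

    // Parses one stream and returns the handler holding the results.
    public static BlastHitCounter parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BlastHitCounter handler = new BlastHitCounter();
        parser.parse(in, handler);
        return handler;
    }
}
```

To use it, open each results file as an InputStream (for example a
java.io.FileInputStream), pass it to BlastHitCounter.parse(), and read
the results from the returned handler.  Because nothing but the
current hit id is retained, several of these can run side by side on
large files without each one loading a whole document.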

Alex Milowski                FAX: (707) 598-7649                        
  alex at milowski.com

"The excellence of grammar as a guide is proportional to the paucity of
the inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics




