[Biodevelopers] XML for huge DB?

Michael Gruenberger mgruenb at gmx.net
Thu Jul 31 13:52:28 EDT 2003


I agree with the other posters, but if you want to keep using the
XML::Simple package, a quick 'hack' might be to have each process check
whether one of your other processes is already parsing a large file, and
to parse files above a certain size only when there is enough free memory
and no other process is parsing a large file.
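
A minimal sketch of that check, written in Python rather than Perl purely
for illustration. It covers only the "no other process is parsing a large
file" half, using an advisory file lock; it does not measure free memory.
The threshold, the lock path and the names large_file_slot / parse_guarded
are all hypothetical, and fcntl is Unix-only.

```python
import fcntl
import os
from contextlib import contextmanager

LARGE_FILE_BYTES = 50 * 1024 * 1024        # hypothetical threshold; tune to your RAM
LOCK_PATH = "/tmp/blast_large_parse.lock"  # hypothetical lock file shared by all workers


@contextmanager
def large_file_slot(lock_path=LOCK_PATH):
    """Hold an exclusive advisory lock; blocks while another process holds it."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def parse_guarded(path, parse_fn, threshold=LARGE_FILE_BYTES, lock_path=LOCK_PATH):
    """Run parse_fn(path), serialising only files at or above the threshold."""
    if os.path.getsize(path) < threshold:
        return parse_fn(path)      # small file: parse immediately
    with large_file_slot(lock_path):
        return parse_fn(path)      # large file: one process at a time
```

Each worker would call parse_guarded(filename, its_own_parser); small files
still run in parallel, and only the large ones queue behind the lock.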

As you have a .cam.ac.uk address ... is there anything you could use on
mole.bio.cam.ac.uk? Maybe they would be willing to share some code.

Michael.

On Thu, 2003-07-31 at 16:39, Alex Milowski wrote:
> On Thursday, July 31, 2003, at 09:02 AM, Dan Bolser wrote:
> 
> > Hello,
> >
> > How can I use XML efficiently to parse multiple blast results
> > files?
> >
> > I want to parse them on a multi processor environment, without
> > hitting the system memory limit.
> >
> > This is likely to happen, as big files take the most time, so the
> > processes tend to work on big files at the same time, leading
> > to a system memory outage....
> 
> You need to parse your XML in a "streaming" fashion.  If you are using
> Java, for most people, that means using SAX.  You should write a
> ContentHandler (org.xml.sax package) that gathers your results.  The SAX
> ContentHandler is a call-back style API, so programming against it can
> get complicated, but that isn't necessarily the case.
> 
> Many C/C++ APIs have a similar call-back style.  Basically, you want to
> interface with the parser directly and get the essential information as
> efficiently as possible.
> 
> If you plan to use Java 2, check out version 1.4.x and the
> javax.xml.parsers and org.xml.sax packages.
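
To make the call-back style concrete, here is a rough sketch using
Python's xml.sax instead of Java's org.xml.sax (swapped in for
illustration; the handler interface is analogous). It assumes the BLAST
XML uses Hit_def and Hsp_evalue elements; HitCollector and collect_hits
are made-up names. Only the text of those two elements is kept, so memory
use does not grow with the rest of the file.

```python
import io
import xml.sax


class HitCollector(xml.sax.ContentHandler):
    """Collect (Hit_def, Hsp_evalue) pairs without building a document tree."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self._buffer = None    # text chunks of the element being captured, if any
        self._hit_def = None   # description of the hit currently being read

    def startElement(self, name, attrs):
        # Assumed element names; adjust to whatever your BLAST output uses.
        if name in ("Hit_def", "Hsp_evalue"):
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if self._buffer is None:
            return
        text = "".join(self._buffer)
        self._buffer = None
        if name == "Hit_def":
            self._hit_def = text
        elif name == "Hsp_evalue":
            self.hits.append((self._hit_def, float(text)))


def collect_hits(source):
    """Stream-parse a file name or file object and return the collected pairs."""
    handler = HitCollector()
    xml.sax.parse(source, handler)
    return handler.hits
```

For example, collect_hits("results.xml") would return a list of
(description, e-value) tuples; for very large result sets the handler
could write each pair out instead of appending it to a list.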
> 
> Alex Milowski                FAX: (707) 598-7649
> alex at milowski.com
> 
> "The excellence of grammar as a guide is proportional to the paucity of
> the inflexions, i.e. to the degree of analysis effected by the language
> considered."
> 
> Bertrand Russell in a footnote of Principles of Mathematics
-- 
Michael Gruenberger
Computer Officer, University of Cambridge
Developer, Pathbase, http://www.pathbase.net
PGP-Public Key ID: 278E1DFF