[Biodevelopers] XML for huge DB?

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Thu Jul 31 18:42:15 EDT 2003


Wow, thanks for the ideas.

On 31 Jul 2003, Joseph Landman wrote:

> On Thu, 2003-07-31 at 12:26, Dan Bolser wrote:
> > No, the problem is that a big results file can grab 50% of the 4GB
> > memory on the system. When I run 4 processes (and a file of this
> > size takes about 1 hour to process with XML::Simple) then as soon
> > as more than one process encounters a big file I am scuppered.
> 
> Have a look at XML::Twig
> 
> "XML::Twig - A perl module for processing huge XML documents in tree
> mode."
> 
> http://search.cpan.org/author/MIROD/XML-Twig-3.10/Twig.pm


Cheers, I will have a look. 


> > I am looking for a memory-light way of parsing the BLAST results
> > files from XML, i.e. one HSP at a time with a print event
> > for each, rather than whole-file-at-a-time processing with
> > XML::Simple.
> 
> You might also look at Bioperl to handle this.  They have a neat
> interface to exactly this.

Yup, I saw a neat interface with optional 'html plugins' which is 
exactly the kind of thing that I love. 

I would like to see an integrated bioinformatics database based
around this principle of data / display independence.

Once you derive complex enough queries, analysis becomes essential;
we use custom software and (maybe) eventually implement our findings
back into a web page. I would love to see a seamless approach to
this whole business, with a 'modular' but integrated database with
web-API access and pluggable 'display/analysis' modules.

How much of your day to day 'research' is actually data integration?

The problem with pure CS approaches is that the data modeling must be
based on biological concepts, and thus is best left to distributed
experts.

> 
> XML::Simple slurps the entire file into memory for parsing.  This is not
> a good idea for big documents.  XML::SAX is possible, but you have to
> work harder to write your callbacks and parsers.  The callbacks under
> Twig are easy to write as closures.
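
To make the contrast with slurping concrete, here is a minimal sketch of
the "one element at a time" style. It is not XML::Twig: it is a Python
standard-library analogue (xml.etree.ElementTree.iterparse) of a twig
handler that processes each <Hsp> as it completes and then frees it. The
element names (Hsp, Hsp_num, Hsp_bit-score) are assumed to follow NCBI
BLAST XML output, and stream_hsps, make_collector and SAMPLE are
hypothetical names made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a BLAST XML report (assumed layout, illustration only).
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <Hit>
    <Hit_hsps>
      <Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>52.8</Hsp_bit-score></Hsp>
      <Hsp><Hsp_num>2</Hsp_num><Hsp_bit-score>31.2</Hsp_bit-score></Hsp>
    </Hit_hsps>
  </Hit>
</BlastOutput>
"""


def stream_hsps(source, on_hsp):
    """Call on_hsp(fields) once per completed <Hsp>; return how many were seen."""
    count = 0
    # "end" events fire when an element's closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Hsp":
            continue
        # Map each child tag to its text, e.g. {"Hsp_num": "1", ...}.
        on_hsp({child.tag: child.text for child in elem})
        # Drop the children and text of this <Hsp>. The emptied element
        # itself stays attached to its parent, so memory still grows a
        # little per HSP, but far less than holding the whole tree.
        elem.clear()
        count += 1
    return count


def make_collector(rows):
    """Return a closure that appends (hsp number, bit score) to rows."""
    def on_hsp(fields):
        rows.append((int(fields["Hsp_num"]), float(fields["Hsp_bit-score"])))
    return on_hsp


rows = []
count = stream_hsps(io.BytesIO(SAMPLE), make_collector(rows))
for num, score in rows:
    print(num, score)
```

In real use the first argument would be a file name or an open file
rather than an in-memory buffer, and the callback could print each HSP
directly instead of collecting rows.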

Yup, I was planning to implement event handler subroutines with Perl's
XML::Parser,

> 
> The XML::Twig::Elt next_sibling() method may be useful for this.

But I will give this a go. 
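
For comparison, the event-handler approach mentioned above might look
roughly like the sketch below. XML::Parser is built on the expat
library, so this uses Python's xml.parsers.expat binding as a stand-in;
it is not the XML::Parser API itself. The handler layout, the BLAST
element names, and the names parse_hsps and SAMPLE are assumptions made
for illustration.

```python
import xml.parsers.expat

# Tiny stand-in for a BLAST XML report (assumed layout, illustration only).
SAMPLE = b"""<?xml version="1.0"?>
<Hit_hsps>
  <Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>52.8</Hsp_bit-score></Hsp>
  <Hsp><Hsp_num>2</Hsp_num><Hsp_bit-score>31.2</Hsp_bit-score></Hsp>
</Hit_hsps>
"""


def parse_hsps(data, on_hsp):
    """Feed XML bytes to expat and call on_hsp(fields) at each </Hsp>."""
    # Parser state shared by the three handler closures below.
    state = {"hsp": None, "text": []}

    def start(name, attrs):
        if name == "Hsp":
            state["hsp"] = {}      # begin collecting a new HSP
        state["text"] = []         # reset the text buffer for this element

    def chars(text):
        # expat may deliver one text node in several pieces.
        state["text"].append(text)

    def end(name):
        if state["hsp"] is None:
            return                 # not inside an <Hsp>, nothing to record
        if name == "Hsp":
            on_hsp(state["hsp"])   # the "print event": one HSP is complete
            state["hsp"] = None
        else:
            state["hsp"][name] = "".join(state["text"]).strip()
        state["text"] = []

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)       # True marks this as the final chunk


hits = []
parse_hsps(SAMPLE, hits.append)
for hsp in hits:
    print(hsp["Hsp_num"], hsp["Hsp_bit-score"])
```

Only one HSP's fields are held at a time, at the cost of tracking parser
state by hand. For a large file, parser.ParseFile(fh) on a file opened
in binary mode reads the input in chunks instead of taking it all as one
bytes object.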

I am so nearly finished I am reluctant to look at Bioperl right
now, but I know I will need to display results sooner or later. 


Thanks very much, 
Dan.

> 
> 



