[Biodevelopers] XML for huge DB?

Jason Stajich jason at cgt.duhs.duke.edu
Thu Jul 31 12:53:49 EDT 2003


You probably don't want to use XML::Simple for this: it reads the whole
document into memory before you can touch any of it.

We use a SAX parser (XML::Parser::PerlSAX) to parse these in Bioperl.

(install Bioperl first)
use Bio::SearchIO;
my $in = Bio::SearchIO->new(-file   => 'blastresult.xml',
                            -format => 'blastxml');

while( my $result = $in->next_result ) {
  # each $result is a Bio::Search::Result::ResultI object, one per query
}
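
As a rough sketch of the one-line-per-HSP output asked about below, the
loop above could be extended along these lines. The accessors
(query_name, next_hit, name, next_hsp, evalue, bits) are the usual
Bioperl search object methods; the print_hsps helper name and the
tab-separated layout are only illustrative assumptions. How much memory
this really saves depends on how your Bioperl version's blastxml parser
buffers results, so treat it as a starting point rather than a guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): write one tab-separated
# line per HSP to $out and return the number of HSPs seen.
sub print_hsps {
    my ($in, $out) = @_;
    my $count = 0;
    while ( my $result = $in->next_result ) {       # one query at a time
        while ( my $hit = $result->next_hit ) {     # each subject sequence
            while ( my $hsp = $hit->next_hsp ) {    # each alignment
                print {$out} join("\t",
                    $result->query_name, $hit->name,
                    $hsp->evalue, $hsp->bits), "\n";
                $count++;
            }
        }
    }
    return $count;
}

# Run only when a BLAST XML file is named on the command line.
if (@ARGV) {
    require Bio::SearchIO;    # needs Bioperl installed
    my $in = Bio::SearchIO->new(-file   => $ARGV[0],
                                -format => 'blastxml');
    print_hsps($in, \*STDOUT);
}
```

Each $result goes out of scope at the end of its loop iteration, so
nothing here keeps earlier queries around once their lines are printed.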

There are some subtle problems with later versions of BLAST from NCBI:
XML::Parser cannot parse the output unless the DTD line is fixed so that
it has no trailing newline...  I've not had the patience to try and
diagnose it any further than that, but we'll be happy to put the
preprocessing fix in the SearchIO code if someone does.

-jason

On Thu, 31 Jul 2003, Dan Bolser wrote:

> No, the problem is that a big results file can grab 50% of the 4GB
> memory on the system. When I run 4 processes (and a file of this
> size takes about 1 hour to process with XML::Simple) then as soon
> as more than one process encounters a big file I am scuppered.
>
> I am looking for a memory-light way of parsing the BLAST results
> files from XML, i.e. one HSP at a time with a print event
> for each, rather than the whole-file-at-a-time processing of
> XML::Simple....
>
> Dan.
>
> Michael Gruenberger wrote:
>
> >Hello,
> >
> >if you are on Unix, does 'nice' do what you are asking?!
> >See:
> >http://www.phys.ksu.edu/~esry/Computing/Nicing.html
> >And:
> >man nice
> >
> >I don't know if it affects memory usage, but you can give your parsing
> >process a lower priority so it shouldn't take your whole system down...
> >
> >Cheers,
> >
> >Michael
> >On Thu, 2003-07-31 at 16:02, Dan Bolser wrote:
> >
> >
> >>Hello,
> >>
> >>How can I use XML efficiently to parse multiple blast results
> >>files?
> >>
> >>I want to parse them in a multiprocessor environment, without
> >>hitting the system memory limit.
> >>
> >>This is likely to happen, as big files take the most time, so the
> >>processes tend to work on big files at the same time, leading
> >>to a system memory outage....
> >>
> >>Cheers,
> >>Dan.
> >>
> >>_______________________________________________
> >>Biodevelopers mailing list
> >>Biodevelopers at bioinformatics.org
> >>https://bioinformatics.org/mailman/listinfo/biodevelopers
> >>
> >>
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu


