[Biophp-dev] XML parser
Dan Bolser
biophp-dev@bioinformatics.org
Mon, 25 Aug 2003 13:21:26 +0100 (BST)
I had exactly that problem!
My bug was not unsetting the $currentTag pointer at the
'endOfTag' event. Without that, any text arriving after a
close tag (such as the newline before the next tag) would have
been appended to the previous tag's data, so I couldn't safely do
$data{$currentTag} .= $characterData
on the 'characterData' event. Instead I was doing something
like
$data{$currentTag} = $characterData if !$data{$currentTag};
The parser does not always deliver all the characterData for
a tag in one event (something to do with the characterData
buffer), so the above approach sometimes gave truncated data.
Once I properly unset $currentTag, I could append all the
characterData for each tag properly!
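A minimal sketch of that append-and-unset pattern. The handlers
are called by hand here to stand in for the parser splitting one
element's text over several Char events; the tag names are BLAST
XML ones, but the text values and the %data hash are made up for
illustration:

```perl
use strict;
use warnings;

my ( $pos, %data );    # $pos plays the role of $currentTag

sub startEvent {       # <open_tag>
    my ( $expat, $elem ) = @_;
    $pos = $elem;      # remember which tag we are inside
}

sub charEvent {        # text between tags, possibly in pieces
    my ( $expat, $text ) = @_;
    return unless defined $pos;    # ignore text outside any open tag
    $data{$pos} .= $text;          # append, never overwrite
}

sub endEvent {         # </close_tag>
    my ( $expat, $elem ) = @_;
    $pos = undef;      # unset, so stray text is not appended
}

# Simulated event stream (hypothetical data): the text of
# <Hit_def> arrives in three separate Char events.
startEvent( undef, 'Hit_def' );
charEvent( undef, 'protein kinase ' );
charEvent( undef, '&' );
charEvent( undef, ' related' );
endEvent( undef, 'Hit_def' );
charEvent( undef, "\n" );          # newline between tags: ignored
startEvent( undef, 'Hit_len' );
charEvent( undef, '297' );
endEvent( undef, 'Hit_len' );

print "$data{Hit_def}\n";
print "$data{Hit_len}\n";
```

With the 'assign only if empty' version, Hit_def would have kept
just the first piece.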
Here is my script. It uses a few optimizations and some
specific tricks for my needs ($group), but most of this
is behind the scenes...
__SKIP__
Skipping preamble,
in brief...
PIPE = named pipe (fifo) for 'load data infile'.
$group = custom HSP grouping object.
@file = list of results files to parse.
$DIR = results files directory.
use PDB_ISL; = custom data / the group object.
back to the action...
__RESUME__
use XML::Parser;
#------------------------------------------------
#
# Initialise parser.
#
my $p = XML::Parser->new(
Handlers => {
Start => \&startEvent,
Char => \&charEvent,
End => \&endEvent,
}
);
#------------------------------------------------
#
# Set Globals for event handler communication.
#
my ( $pos, %que, %itr, %hit, %hsp ); # NB: $pos == $currentTag
# Here I decide which fields I want data from...
my %QUE = %PDB_ISL::QUE; # Query sequence data fields
my %ITR = %PDB_ISL::ITR; # Iteration data fields
my %HIT = %PDB_ISL::HIT; # Hit sequence data fields
my %HSP = %PDB_ISL::HSP; # High Scoring Segment Pair data fields.
my @SCHEMA = @PDB_ISL::SCHEMA; # TABLE SCHEMA
#------------------------------------------------
#
# Begin.
#
foreach ( @file ){
warn "Processing $DIR/$_\n";
unless (-s "$DIR/$_"){
warn "No such file (or file is empty)\n"; # -s is false for both
next;
}
$group = PDB_ISL->group(); # Get new HSP group object.
$p->parsefile( "$DIR/$_" ); # For details, see Event handlers.
}
print "OK\n";
#------------------------------------------------
#
# Event handlers.
#
sub startEvent{ # <open_tag>
my ( $self, $elem, %attr ) = @_;
$pos = $elem; # Set currentTag!
# NB: CASE order = frequency of tag occurrence!
#print "OPEN $elem\n";
if ($pos eq 'Hsp'){ # CASE <HSP>
#print "\nNEW HSP\n";
%hsp = %HSP; # Reset HSP data
}
elsif ($pos eq 'Hit'){ # CASE <HIT>
#print "NEW HIT\n";
%hit = %HIT; # Reset HIT data
}
elsif ($pos eq 'Iteration'){ # CASE <ITERATION>
#print "NEW ITR\n";
%itr = %ITR; # Reset ITR data
}
elsif ($pos eq 'BlastOutput'){# CASE <OUTPUT> (one query per file)
#print "NEW OUT\n";
%que = %QUE; # Reset QUE data
}
}
sub charEvent{ # <>between tags</>
my ( $expat, $text ) = @_;
return unless $pos; # Very important!
# NB: Only parse given fields. Ignore other data!
# NB: CASE order as above!
if ( exists $hsp{$pos} ){
$hsp{$pos} .= $text; # Save HSP field data
#print "HSP:$pos:$text\n";
}
elsif ( exists $hit{$pos} ){
$hit{$pos} .= $text; # Save HIT field data
#print "HIT:$pos:$text\n";
}
elsif ( exists $itr{$pos} ){
$itr{$pos} .= $text; # Save ITR field data
#print "ITR:$pos:$text\n";
}
elsif ( exists $que{$pos} ){
$que{$pos} .= $text; # Save QUE field data
#print "QUE:$pos:$text\n";
}
}
sub endEvent{ # </close_tag>
my ( $self, $elem ) = @_;
$pos = undef; # Unset currentTag. Very important!
#print "CLOSE $elem\n";
if ($elem eq 'Hsp'){ # CASE </HSP>
# TAKE A COPY!
my %data = ( %que, %itr, %hit, %hsp );
#print join("\t", map { $data{$_} } @SCHEMA),"\n";
$group->add( \%data ); # ADD TO GROUP!
}
elsif($elem eq 'Hit'){ # CASE </HIT>
# Hello Mum!
}
elsif($elem eq 'Iteration'){ # CASE </ITR>
print "ITR:",
$itr{'Iteration_iter-num'},"\n";
print "MSG:",
$itr{'Iteration_message'}, "\n" if $itr{'Iteration_message'};
}
elsif ($elem eq 'BlastOutput'){ # CASE </OUTPUT>
my $data = $group->getBest;
flock(PIPE, 2) or die "$!:Can't lock pipe $PIPE\n"; # 2 == LOCK_EX
for(my $i=0; $i<@$data; $i++){
print PIPE join("\t", map { $data->[$i]->{$_} } @SCHEMA),"\n";
}
flock(PIPE, 8) or die "$!:Can't free pipe $PIPE\n"; # 8 == LOCK_UN
#exit;
}
}
__END__
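On the 'TAKE A COPY!' comment in endEvent: a small sketch of why
the copy matters. %hsp is reset and refilled for every <Hsp>, so
storing \%hsp itself would leave every stored entry pointing at
the same hash. The values and the @copies / @aliases arrays below
are made up for illustration; they are not part of the script
above:

```perl
use strict;
use warnings;

my %hit = ( Hit_id => 'hit1' );    # hypothetical hit-level data
my %hsp;                           # reused for every HSP
my ( @copies, @aliases );

for my $score ( 10, 20 ) {
    %hsp = ( Hsp_score => $score );    # reset and refill, as in startEvent

    my %data = ( %hit, %hsp );     # fresh flat hash; later keys win
    push @copies,  \%data;         # reference to this HSP's own copy
    push @aliases, \%hsp;          # reference to the shared hash
}

print join( ',', map { $_->{Hsp_score} } @copies ),  "\n";   # 10,20
print join( ',', map { $_->{Hsp_score} } @aliases ), "\n";   # 20,20
```

The copy is shallow, which is enough here because all the values
are plain strings.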
yvan said:
> Hi all,
>
> I am finishing up a parser for the XML output format of BLAST using the expat
> library. When I collect the data returned by the dataHandler function, some of
> it is truncated, or an end of line is added, causing a duplication. Have you
> already observed something similar? As it doesn't always happen, I don't
> suspect a script error. I am using version 1.95.1 of expat; will an upgrade
> solve this problem?
>
> cheers
>
> yvan