[Biophp-dev] XML parser

Dan Bolser biophp-dev@bioinformatics.org
Mon, 25 Aug 2003 13:21:26 +0100 (BST)


I had exactly that problem!

My bug was that I was not 'unsetting' the $currentTag pointer
at the 'endOfTag' event. Because of that, I couldn't do

$data{$currentTag} .= $characterData

on the 'characterData' event. Instead I was doing something
like

$data{$currentTag} = $characterData if !$data{$currentTag};

The characterData for a tag does not always arrive in one event
(something to do with the characterData buffer), so the above
approach sometimes gave truncated data.

Once I properly unset $currentTag, I could append all the
characterData for each tag properly!
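
To make the difference concrete, here is a minimal sketch that is
separate from the script below. It does not call XML::Parser; it
replays a hand-written, hypothetical list of events of the kind a
parser might deliver, with the text of one tag split over two
'char' events. The tag names (seq, len) and the collect() helper
are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical event stream for:  <seq>ACGTACGT</seq>\n<len>8</len>
# The text of <seq> is deliberately split over two 'char' events.
my @events = (
  [ start => 'seq'  ],
  [ char  => 'ACGT' ],
  [ char  => 'ACGT' ],
  [ end   => 'seq'  ],
  [ char  => "\n"   ],          # whitespace between the two tags
  [ start => 'len'  ],
  [ char  => '8'    ],
  [ end   => 'len'  ],
);

# Replay the events.
#   $unset  : clear $currentTag on the 'end' event?
#   $append : append every chunk (.=), or keep only the first chunk?
sub collect {
  my ( $unset, $append ) = @_;
  my ( $currentTag, %data );

  foreach my $event ( @events ){
    my ( $type, $value ) = @$event;

    if    ( $type eq 'start' ){ $currentTag = $value }
    elsif ( $type eq 'end'   ){ $currentTag = undef if $unset }
    elsif ( defined $currentTag ){
      if ( $append ){ $data{$currentTag} .= $value }
      else          { $data{$currentTag}  = $value if !$data{$currentTag} }
    }
  }
  return %data;
}

my %fixed     = collect( 1, 1 );  # unset + append
my %truncated = collect( 0, 0 );  # no unset, first chunk only
my %polluted  = collect( 0, 1 );  # no unset, but append anyway

print "fixed     seq length: ", length $fixed{seq},     "\n";
print "truncated seq length: ", length $truncated{seq}, "\n";
print "polluted  seq length: ", length $polluted{seq},  "\n";
```

In this sketch, only the first combination collects the whole
string and nothing else. Keeping the first chunk loses the second
half, and appending without unsetting picks up the newline that
follows the closing tag.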

Here is my script. It uses a few optimizations and some
specific tricks for my needs ($group), but most of this
is behind the scenes...


__SKIP__

Skipping preamble,
in brief...

PIPE   = named pipe (fifo) for 'load data infile'.
$group = custom HSP grouping object.
@file  = list of results files to parse.
$DIR   = results files directory.

use PDB_ISL;  = custom data / the group object.

back to the action...

__RESUME__

use XML::Parser;

#------------------------------------------------
#
# Initialise parser.
#

my $p = XML::Parser->new(
  Handlers => {
    Start =>    \&startEvent,
    Char  =>    \&charEvent,
    End   =>    \&endEvent,
  }
);

#------------------------------------------------
#
# Set Globals for event handler communication.
#

my ( $pos, %que, %itr, %hit, %hsp );    # NB: $pos == $currentTag

# Here I decide which fields I want data from...

my %QUE = %PDB_ISL::QUE;         # Query sequence data fields
my %ITR = %PDB_ISL::ITR;         # Iteration data fields
my %HIT = %PDB_ISL::HIT;         # Hit sequence data fields
my %HSP = %PDB_ISL::HSP;         # High Scoring Segment Pair data fields.


my @SCHEMA = @PDB_ISL::SCHEMA;   # TABLE SCHEMA

#------------------------------------------------
#
# Begin.
#

foreach ( @file ){
  warn "Processing $DIR/$_\n";
  unless (-s "$DIR/$_"){
    warn "Missing or empty file\n";
    next;
  }
  $group = PDB_ISL->group();    # Get new HSP group object.

  $p->parsefile( "$DIR/$_" );   # For details, see Event handlers.
}

print "OK\n";

#------------------------------------------------
#
# Event handlers.
#

sub startEvent{                 # <open_tag>
  my ( $self, $elem, %attr ) = @_;

  $pos = $elem;                 # Set currentTag!

  # NB: CASE order = frequency of tag occurrence!

  #print "OPEN $elem\n";

  if    ($pos eq 'Hsp'){        # CASE <HSP>
    #print "\nNEW HSP\n";
    %hsp = %HSP;                # Reset HSP data
  }
  elsif ($pos eq 'Hit'){        # CASE <HIT>
    #print "NEW HIT\n";
    %hit = %HIT;                # Reset HIT data
  }
  elsif ($pos eq 'Iteration'){  # CASE <ITERATION>
    #print "NEW ITR\n";
    %itr = %ITR;                # Reset ITR data
  }
  elsif ($pos eq 'BlastOutput'){# CASE <OUTPUT> (one query per file)
    #print "NEW OUT\n";
    %que = %QUE;                # Reset QUE data
  }
}

sub charEvent{                  # <>between tags</>
  my ( $expat, $text ) = @_;

  return unless $pos;           # Very important!

  # NB: Only parse given fields. Ignore other data!

  # NB: CASE order as above!

  if    ( exists $hsp{$pos} ){
    $hsp{$pos} .= $text;        # Save HSP field data
    #print "HSP:$pos:$text\n";
  }
  elsif ( exists $hit{$pos} ){
    $hit{$pos} .= $text;        # Save HIT field data
    #print "HIT:$pos:$text\n";
  }
  elsif ( exists $itr{$pos} ){
    $itr{$pos} .= $text;        # Save ITR field data
    #print "ITR:$pos:$text\n";
  }
  elsif ( exists $que{$pos} ){
    $que{$pos} .= $text;        # Save QUE field data
    #print "QUE:$pos:$text\n";
  }
}

sub endEvent{                   # </close_tag>
  my ( $self, $elem ) = @_;

  $pos = undef;                 # Unset currentTag. Very important!

  #print "CLOSE $elem\n";

  if   ($elem eq 'Hsp'){        # CASE </HSP>

    # TAKE A COPY!
    my %data = ( %que, %itr, %hit, %hsp );

    #print join("\t", map { $data{$_} } @SCHEMA),"\n";

    $group->add( \%data );      # ADD TO GROUP!
  }

  elsif($elem eq 'Hit'){        # CASE </HIT>
    # Hello Mum!

  }

  elsif($elem eq 'Iteration'){  # CASE </ITR>

    print "ITR:",
      $itr{'Iteration_iter-num'},"\n";
    print "MSG:",
      $itr{'Iteration_message'}, "\n" if $itr{'Iteration_message'};

  }

  elsif ($elem eq 'BlastOutput'){ # CASE </OUTPUT>

    my $data = $group->getBest;

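    # Lock the pipe while writing (2 == LOCK_EX).
    flock(PIPE, 2)              or die "$!:Can't lock pipe $PIPE\n";

    foreach my $row ( @$data ){ # One tab-separated line per row, in @SCHEMA order
      print PIPE join("\t", map { $row->{$_} } @SCHEMA),"\n";
    }

    # Release the lock (8 == LOCK_UN).
    flock(PIPE, 8)              or die "$!:Can't free pipe $PIPE\n";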

    #exit;
  }
}

__END__

yvan said:
> Hi all,
>
> I am finishing up a parser for the xml output format of blast using the expat
> library. When I collect the data returned by the dataHandler function, some of
> them are truncated or an end of line is added, inducing a duplication. Have you
> already observed something similar? As it doesn't always happen, I don't
> suspect a script error. I am using the 1.95.1 version of expat; would an upgrade
> solve this problem?
>
> cheers
>
> yvan