[Biophp-dev] XML parser

yvan biophp-dev@bioinformatics.org
Wed, 27 Aug 2003 09:39:28 +1000


Thanks Dan, just using concatenation and emptying the array at the
right moment fixed the problem.
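
A minimal sketch of that fix, for anyone who finds this thread with the
same symptom. It does not call expat at all: it replays a hand-written
event list in which the text of one element arrives in two pieces, which
is the situation Dan describes below. The event list, the replay()
helper and its append/unset options are made up for the illustration,
and the sequences are dummy values.

```perl
use strict;
use warnings;

# Hypothetical event stream: the text of <Hsp_qseq> is delivered in two
# character-data events, and a newline follows the closing tag.
my @events = (
    [ start => 'Hsp_qseq' ],
    [ char  => 'MKTAYIAK' ],
    [ char  => 'QRQISFVK' ],
    [ end   => 'Hsp_qseq' ],
    [ char  => "\n" ],
    [ start => 'Hsp_hseq' ],
    [ char  => 'MKTAYIAR' ],
    [ end   => 'Hsp_hseq' ],
);

# Replay the events with two switches:
#   append - append every chunk instead of keeping only the first one
#   unset  - forget the current tag at the end event
sub replay {
    my (%opt) = @_;
    my ( $current, %data );
    for my $event (@events) {
        my ( $type, $value ) = @$event;
        if ( $type eq 'start' ) {
            $current = $value;
        }
        elsif ( $type eq 'char' ) {
            next unless defined $current;
            if    ( $opt{append} )     { $data{$current} .= $value }
            elsif ( !$data{$current} ) { $data{$current} = $value }
        }
        elsif ( $type eq 'end' ) {
            $current = undef if $opt{unset};
        }
    }
    return %data;
}

my %first_only = replay();                            # keeps first chunk only
my %no_unset   = replay( append => 1 );               # picks up the stray newline
my %fixed      = replay( append => 1, unset => 1 );   # append and unset

printf "first chunk only: %s\n", $first_only{Hsp_qseq};
printf "append, no unset: %s", $no_unset{Hsp_qseq};
printf "append and unset: %s\n", $fixed{Hsp_qseq};
```

The first run gives the truncated value, the second gives the value with
an extra end of line, and only the third gives the whole string. Those
are the two symptoms from my original message.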



Dan Bolser wrote:

>I had exactly that problem!
>
>My bug was not unsetting the $currentTag pointer at the
>'endOfTag' event. Because of that I couldn't do
>
>$data{$currentTag} .= $characterData
>
>on the 'characterData' event; instead I was doing something
>like
>
>$data{$currentTag} = $characterData if !$data{$currentTag};
>
>The parser does not always deliver all of an element's
>characterData in one event (something to do with the
>characterData buffer), so the above approach sometimes gave
>truncated data.
>
>Once I properly unset $currentTag, I could append all the
>characterData for each tag properly!
>
>Here is my script, it uses a few optimizations and some
>specific tricks for my needs ($group), but most of this
>is behind the scenes...
>
>
>__SKIP__
>
>Skipping preamble,
>in brief...
>
>PIPE   = named pipe (fifo) for 'load data infile'.
>$group = custom HSP grouping object.
>@file  = list of results files to parse.
>$DIR   = results files directory.
>
>use PDB_ISL;  = custom data / the group object.
>
>back to the action...
>
>__RESUME__
>
>use XML::Parser;
>
>#------------------------------------------------
>#
># Initialise parser.
>#
>
>my $p = XML::Parser->new(
>  Handlers => {
>    Start =>    \&startEvent,
>    Char  =>    \&charEvent,
>    End   =>    \&endEvent,
>  }
>);
>
>#------------------------------------------------
>#
># Set Globals for event handler communication.
>#
>
>my ( $pos, %que, %itr, %hit, %hsp );    # NB: $pos == $currentTag
>
># Here I decide which fields I want data from...
>
>my %QUE = %PDB_ISL::QUE;         # Query sequence data fields
>my %ITR = %PDB_ISL::ITR;         # Iteration data fields
>my %HIT = %PDB_ISL::HIT;         # Hit sequence data fields
>my %HSP = %PDB_ISL::HSP;         # High Scoring Segment Pair data fields.
>
>
>my @SCHEMA = @PDB_ISL::SCHEMA;   # TABLE SCHEMA
>
>#------------------------------------------------
>#
># Begin.
>#
>
>foreach ( @file ){
>  warn "Processing $DIR/$_\n";
>  unless (-s "$DIR/$_"){
>    warn "No such file\n";
>    next;
>  }
>  $group = PDB_ISL->group();    # Get new HSP group object.
>
>  $p->parsefile( "$DIR/$_" );   # For details, see Event handlers.
>}
>
>print "OK\n";
>
>#------------------------------------------------
>#
># Event handlers.
>#
>
>sub startEvent{                 # <open_tag>
>  my ( $self, $elem, %attr ) = @_;
>
>  $pos = $elem;                 # Set currentTag!
>
>  # NB: CASE order = frequency of tag occurrence!
>
>  #print "OPEN $elem\n";
>
>  if    ($pos eq 'Hsp'){        # CASE <HSP>
>    #print "\nNEW HSP\n";
>    %hsp = %HSP;                # Reset HSP data
>  }
>  elsif ($pos eq 'Hit'){        # CASE <HIT>
>    #print "NEW HIT\n";
>    %hit = %HIT;                # Reset HIT data
>  }
>  elsif ($pos eq 'Iteration'){  # CASE <ITERATION>
>    #print "NEW ITR\n";
>    %itr = %ITR;                # Reset ITR data
>  }
>  elsif ($pos eq 'BlastOutput'){# CASE <OUTPUT> (one query per file)
>    #print "NEW OUT\n";
>    %que = %QUE;                # Reset QUE data
>  }
>}
>
>sub charEvent{                  # <>between tags</>
>  my ( $expat, $text ) = @_;
>
>  return unless $pos;           # Very important!
>
>  # NB: Only parse given fields. Ignore other data!
>
>  # NB: CASE order as above!
>
>  if    ( exists $hsp{$pos} ){
>    $hsp{$pos} .= $text;        # Save HSP field data
>    #print "HSP:$pos:$text\n";
>  }
>  elsif ( exists $hit{$pos} ){
>    $hit{$pos} .= $text;        # Save HIT field data
>    #print "HIT:$pos:$text\n";
>  }
>  elsif ( exists $itr{$pos} ){
>    $itr{$pos} .= $text;        # Save ITR field data
>    #print "ITR:$pos:$text\n";
>  }
>  elsif ( exists $que{$pos} ){
>    $que{$pos} .= $text;        # Save QUE field data
>    #print "QUE:$pos:$text\n";
>  }
>}
>
>sub endEvent{                   # </close_tag>
>  my ( $self, $elem ) = @_;
>
>  $pos = undef;                 # Unset currentTag. Very important!
>
>  #print "CLOSE $elem\n";
>
>  if   ($elem eq 'Hsp'){        # CASE </HSP>
>
>    # TAKE A COPY!
>    my %data = ( %que, %itr, %hit, %hsp );
>
>    #print join("\t", map { $data{$_} } @SCHEMA),"\n";
>
>    $group->add( \%data );      # ADD TO GROUP!
>  }
>
>  elsif($elem eq 'Hit'){        # CASE </HIT>
>    # Hello Mum!
>
>  }
>
>  elsif($elem eq 'Iteration'){  # CASE </ITR>
>
>    print "ITR:",
>      $itr{'Iteration_iter-num'},"\n";
>    print "MSG:",
>      $itr{'Iteration_message'}, "\n" if $itr{'Iteration_message'};
>
>  }
>
>  elsif ($elem eq 'BlastOutput'){ # CASE </OUTPUT>
>
>    my $data = $group->getBest;
>
>    # 2 == LOCK_EX: take an exclusive lock before writing.
>    flock(PIPE, 2)              or die "$!:Can't lock pipe $PIPE\n";
>
>    for(my $i=0; $i<@$data; $i++){
>      print PIPE join("\t", map { $data->[$i]->{$_} } @SCHEMA),"\n";
>    }
>
>    # 8 == LOCK_UN: release the lock.
>    flock(PIPE, 8)              or die "$!:Can't free pipe $PIPE\n";
>
>    #exit;
>  }
>}
>
>__END__
>
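
One part of the script above that is easy to miss is how charEvent
decides what to keep: it only stores text for a tag whose key already
exists in the working hash, so the template hashes act as a whitelist of
fields. A small sketch of that idea follows. The contents of %HIT and
%HSP here are assumptions, since the real %PDB_ISL::HIT and
%PDB_ISL::HSP are not shown in this thread, and store() is a
hypothetical stand-in for the character-data handler.

```perl
use strict;
use warnings;

# Assumed template contents: one key per field we want to keep.
my %HIT = ( Hit_def    => '', Hit_len         => '' );
my %HSP = ( Hsp_evalue => '', 'Hsp_bit-score' => '' );

# Copying a template resets the working record to empty fields.
my %hit = %HIT;
my %hsp = %HSP;

# Store text only for fields whose key exists in a working record;
# anything else is silently ignored.
sub store {
    my ( $field, $text ) = @_;
    if    ( exists $hsp{$field} ) { $hsp{$field} .= $text }
    elsif ( exists $hit{$field} ) { $hit{$field} .= $text }
    return;
}

store( Hit_def     => 'example protein' );
store( Hsp_evalue  => '1e-' );      # text split across two events
store( Hsp_evalue  => '30' );
store( Hsp_midline => 'ignored' );  # not in any template

# Flatten the records into one row, as the Hsp end handler does.
my %row = ( %hit, %hsp );

print join( "\t", map { "$_=$row{$_}" } sort keys %row ), "\n";
```

Because the row is built as a new hash, resetting %hsp on the next Hsp
start tag does not change rows that were already handed on.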
>yvan said:
>  
>
>>Hi all,
>>
>>I am finishing up a parser for the XML output format of BLAST using the expat
>>library. When I collect the data returned by the dataHandler function, some of
>>it is truncated or an end of line is added, causing a duplication. Have you
>>already observed something similar? As it doesn't always happen, I don't
>>suspect a script error. I am using version 1.95.1 of expat; would an upgrade
>>solve this problem?
>>
>>cheers
>>
>>yvan
>>
>>
>>_______________________________________________
>>Biophp-dev mailing list
>>Biophp-dev@bioinformatics.org
>>https://bioinformatics.org/mailman/listinfo/biophp-dev

