This is a multi-part message in MIME format.

--Boundary_(ID_Fs4osyRc+qetxgTEVrbYwA)
Content-type: text/plain; format=flowed; charset=us-ascii
Content-transfer-encoding: 7BIT

Thanks Dan, just using concatenation and emptying the array at the
right moment fixed the problem.

Dan Bolser wrote:

>I had exactly that problem!
>
>My bug was not 'unsetting' the $currentTag pointer
>at the 'endOfTag' event. Because of that I couldn't do
>
>$data{$currentTag} .= $characterData
>
>on the 'characterData' event; instead I was doing something
>like
>
>$data{$currentTag} = $characterData if !$data{$currentTag};
>
>Sometimes not all the characterData comes in one event
>(something to do with the characterData buffer), so the above
>approach sometimes gave truncated data.
>
>Once I properly unset $currentTag, I could append all the
>characterData for each tag properly!
>
>Here is my script. It uses a few optimizations and some
>specific tricks for my needs ($group), but most of this
>is behind the scenes...
>
>
>__SKIP__
>
>Skipping the preamble,
>in brief...
>
>PIPE   = named pipe (fifo) for 'load data infile'.
>$group = custom HSP grouping object.
>@file  = list of results files to parse.
>$DIR   = results files directory.
>
>use PDB_ISL; = custom data / the group object.
>
>back to the action...
>
>__RESUME__
>
>use XML::Parser;
>
>#------------------------------------------------
>#
># Initialise parser.
>#
>
>my $p = XML::Parser->new(
>    Handlers => {
>        Start => \&startEvent,
>        Char  => \&charEvent,
>        End   => \&endEvent,
>    }
>);
>
>#------------------------------------------------
>#
># Set Globals for event handler communication.
>#
>
>my ( $pos, %que, %itr, %hit, %hsp ); # NB: $pos == $currentTag
>
># Here I decide which fields I want data from...
>
>my %QUE = %PDB_ISL::QUE; # Query sequence data fields
>my %ITR = %PDB_ISL::ITR; # Iteration data fields
>my %HIT = %PDB_ISL::HIT; # Hit sequence data fields
>my %HSP = %PDB_ISL::HSP; # High Scoring Segment Pair data fields.
>
>my @SCHEMA = @PDB_ISL::SCHEMA; # TABLE SCHEMA
>
>#------------------------------------------------
>#
># Begin.
>#
>
>foreach ( @file ){
>    warn "Processing $DIR/$_\n";
>    unless (-s "$DIR/$_"){ # -s is false for missing AND for empty files
>        warn "No such file (or file is empty)\n";
>        next;
>    }
>    $group = PDB_ISL->group(); # Get new HSP group object.
>
>    $p->parsefile( "$DIR/$_" ); # For details, see Event handlers.
>}
>
>print "OK\n";
>
>#------------------------------------------------
>#
># Event handlers.
>#
>
>sub startEvent{ # <open_tag>
>    my ( $self, $elem, %attr ) = @_;
>
>    $pos = $elem; # Set currentTag!
>
>    # NB: CASE order = frequency of tag occurrence!
>
>    #print "OPEN $elem\n";
>
>    if ($pos eq 'Hsp'){ # CASE <HSP>
>        #print "\nNEW HSP\n";
>        %hsp = %HSP; # Reset HSP data
>    }
>    elsif ($pos eq 'Hit'){ # CASE <HIT>
>        #print "NEW HIT\n";
>        %hit = %HIT; # Reset HIT data
>    }
>    elsif ($pos eq 'Iteration'){ # CASE <ITERATION>
>        #print "NEW ITR\n";
>        %itr = %ITR; # Reset ITR data
>    }
>    elsif ($pos eq 'BlastOutput'){ # CASE <OUTPUT> (one query per file)
>        #print "NEW OUT\n";
>        %que = %QUE; # Reset QUE data
>    }
>}
>
>sub charEvent{ # <>between tags</>
>    my ( $expat, $text ) = @_;
>
>    return unless $pos; # Very important!
>
>    # NB: Only parse given fields. Ignore other data!
>
>    # NB: CASE order as above!
>
>    if ( exists $hsp{$pos} ){
>        $hsp{$pos} .= $text; # Save HSP field data
>        #print "HSP:$pos:$text\n";
>    }
>    elsif ( exists $hit{$pos} ){
>        $hit{$pos} .= $text; # Save HIT field data
>        #print "HIT:$pos:$text\n";
>    }
>    elsif ( exists $itr{$pos} ){
>        $itr{$pos} .= $text; # Save ITR field data
>        #print "ITR:$pos:$text\n";
>    }
>    elsif ( exists $que{$pos} ){
>        $que{$pos} .= $text; # Save QUE field data
>        #print "QUE:$pos:$text\n";
>    }
>}
>
>sub endEvent{ # </close_tag>
>    my ( $self, $elem ) = @_;
>
>    $pos = undef; # Unset currentTag. Very important!
>
>    #print "CLOSE $elem\n";
>
>    if ($elem eq 'Hsp'){ # CASE </HSP>
>
>        # TAKE A COPY!
>        my %data = ( %que, %itr, %hit, %hsp );
>
>        #print join("\t", map { $data{$_} } @SCHEMA),"\n";
>
>        $group->add( \%data ); # ADD TO GROUP!
>    }
>
>    elsif($elem eq 'Hit'){ # CASE </HIT>
>        # Hello Mum!
>
>    }
>
>    elsif($elem eq 'Iteration'){ # CASE </ITR>
>
>        print "ITR:",
>            $itr{'Iteration_iter-num'},"\n";
>        print "MSG:",
>            $itr{'Iteration_message'}, "\n" if $itr{'Iteration_message'};
>
>    }
>
>    elsif ($elem eq 'BlastOutput'){ # CASE </OUTPUT>
>
>        my $data = $group->getBest;
>
>        flock(PIPE, 2) or die "$!:Can't lock pipe $PIPE\n"; # 2 == LOCK_EX
>
>        for(my $i=0; $i<@$data; $i++){
>            print PIPE join("\t", map { $data->[$i]->{$_} } @SCHEMA),"\n";
>        }
>
>        flock(PIPE, 8) or die "$!:Can't free pipe $PIPE\n"; # 8 == LOCK_UN
>
>        #exit;
>    }
>}
>
>__END__
>
>yvan said:
>
>
>>Hi all,
>>
>>I am finishing up a parser for the XML output format of BLAST using the expat
>>library. When I collect the data returned by the dataHandler function, some of
>>it is truncated, or an end of line is added, inducing a duplication. Have you
>>already observed something similar? As it doesn't always happen, I don't
>>suspect a script error. I am using version 1.95.1 of expat; will an upgrade
>>solve this problem?
>>
>>cheers
>>
>>yvan
>>
>>
>>_______________________________________________
>>Biophp-dev mailing list
>>Biophp-dev@bioinformatics.org
>>https://bioinformatics.org/mailman/listinfo/biophp-dev
>>
>>
>
--Boundary_(ID_Fs4osyRc+qetxgTEVrbYwA)--
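
For reference, the fix discussed in this thread can be boiled down to a
small standalone sketch. It is only an illustration, not Dan's script:
collect_fields, the sample XML and the $events counter are hypothetical
names made up here, and it assumes the XML::Parser module is installed.
The idea is the one described above: empty the buffer at the start tag,
append on every Char event, and unset the current tag at the end tag.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (not part of the original script): collect the text
# of the elements listed in @$wanted, appending on every Char event.
sub collect_fields {
    my ( $xml, $wanted ) = @_;

    my %want = map { $_ => 1 } @$wanted;
    my %data;          # element name => accumulated text
    my $current;       # wanted element we are currently inside, if any
    my $events = 0;    # Char events seen inside wanted elements

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ( $expat, $elem ) = @_;
                if ( $want{$elem} ) {
                    $current = $elem;
                    $data{$elem} = '';    # empty the buffer at the start tag
                }
                else {
                    $current = undef;
                }
            },
            Char => sub {
                my ( $expat, $text ) = @_;
                return unless defined $current;
                $data{$current} .= $text;    # append, never assign
                $events++;
            },
            End => sub {
                $current = undef;            # unset at every end tag
            },
        },
    );

    $parser->parse($xml);
    return ( \%data, $events );
}

# Made-up sample input; the entity reference is there because expat may
# report the text around it in more than one Char event.
my $xml = '<Hit><Hit_def>kinase &amp; phosphatase</Hit_def>'
        . '<Hit_len>42</Hit_len></Hit>';

my ( $data, $events ) = collect_fields( $xml, [ 'Hit_def', 'Hit_len' ] );

print "Hit_def = $data->{Hit_def}\n";
print "Hit_len = $data->{Hit_len}\n";
print "Char events seen: $events\n";
```

How many Char events arrive for one element depends on the expat version
and on the input buffering, so the sketch only relies on the concatenated
result and never on the data arriving in a single call.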