I had exactly that problem!

My bug was that I did not 'unset' the $currentTag pointer at the 'endOfTag'
event. Without that, I couldn't safely do
$data{$currentTag} .= $characterData
on the 'characterData' event, because any text arriving after a closing tag
would also have been appended to the tag that was opened last. Instead I was
doing something like
$data{$currentTag} = $characterData if !$data{$currentTag};

Sometimes not all the characterData for a tag comes in one event (something to
do with the characterData buffer), so the above approach sometimes gave
truncated data. Once I properly unset $currentTag, I could append all the
characterData for each tag properly!

Here is my script. It uses a few optimizations and some tricks specific to my
needs ($group), but most of that happens behind the scenes...

__SKIP__

Skipping the preamble; in brief...

PIPE         = named pipe (fifo) for 'load data infile'.
$group       = custom HSP grouping object.
@file        = list of results files to parse.
$DIR         = results files directory.
use PDB_ISL; = custom data / the group object.

Back to the action...

__RESUME__

use XML::Parser;

#------------------------------------------------
#
# Initialise parser.
#
my $p = XML::Parser->new(
    Handlers => {
        Start => \&startEvent,
        Char  => \&charEvent,
        End   => \&endEvent,
    }
);

#------------------------------------------------
#
# Set Globals for event handler communication.
#
my ( $pos, %que, %itr, %hit, %hsp );    # NB: $pos == $currentTag

# Here I decide which fields I want data from...
my %QUE = %PDB_ISL::QUE;        # Query sequence data fields
my %ITR = %PDB_ISL::ITR;        # Iteration data fields
my %HIT = %PDB_ISL::HIT;        # Hit sequence data fields
my %HSP = %PDB_ISL::HSP;        # High Scoring Segment Pair data fields.

my @SCHEMA = @PDB_ISL::SCHEMA;  # TABLE SCHEMA

#------------------------------------------------
#
# Begin.
#
foreach ( @file ){
    warn "Processing $DIR/$_\n";
    unless (-s "$DIR/$_"){      # -s: file exists and is not empty
        warn "No such file (or file is empty)\n";
        next;
    }
    $group = PDB_ISL->group();      # Get new HSP group object.
    $p->parsefile( "$DIR/$_" );     # For details, see Event handlers.
}
print "OK\n";

#------------------------------------------------
#
# Event handlers.
#
sub startEvent{     # <open_tag>
    my ( $self, $elem, %attr ) = @_;
    $pos = $elem;       # Set currentTag!
    # NB: CASE order = frequency of tag occurrence!
    #print "OPEN $elem\n";
    if ($pos eq 'Hsp'){                 # CASE <HSP>
        #print "\nNEW HSP\n";
        %hsp = %HSP;                    # Reset HSP data
    } elsif ($pos eq 'Hit'){            # CASE <HIT>
        #print "NEW HIT\n";
        %hit = %HIT;                    # Reset HIT data
    } elsif ($pos eq 'Iteration'){      # CASE <ITERATION>
        #print "NEW ITR\n";
        %itr = %ITR;                    # Reset ITR data
    } elsif ($pos eq 'BlastOutput'){    # CASE <OUTPUT> (one query per file)
        #print "NEW OUT\n";
        %que = %QUE;                    # Reset QUE data
    }
}

sub charEvent{      # <>between tags</>
    my ( $expat, $text ) = @_;
    return unless $pos;     # Very important!
    # NB: Only parse given fields. Ignore other data!
    # NB: CASE order as above!
    if ( exists $hsp{$pos} ){
        $hsp{$pos} .= $text;            # Save HSP field data
        #print "HSP:$pos:$text\n";
    } elsif ( exists $hit{$pos} ){
        $hit{$pos} .= $text;            # Save HIT field data
        #print "HIT:$pos:$text\n";
    } elsif ( exists $itr{$pos} ){
        $itr{$pos} .= $text;            # Save ITR field data
        #print "ITR:$pos:$text\n";
    } elsif ( exists $que{$pos} ){
        $que{$pos} .= $text;            # Save QUE field data
        #print "QUE:$pos:$text\n";
    }
}

sub endEvent{       # </close_tag>
    my ( $self, $elem ) = @_;
    $pos = undef;       # Unset currentTag. Very important!
    #print "CLOSE $elem\n";
    if ($elem eq 'Hsp'){                # CASE </HSP>
        # TAKE A COPY!
        my %data = ( %que, %itr, %hit, %hsp );
        #print join("\t", map { $data{$_} } @SCHEMA),"\n";
        $group->add( \%data );          # ADD TO GROUP!
    } elsif($elem eq 'Hit'){            # CASE </HIT>
        # Hello Mum!
    } elsif($elem eq 'Iteration'){      # CASE </ITR>
        print "ITR:", $itr{'Iteration_iter-num'},"\n";
        print "MSG:", $itr{'Iteration_message'}, "\n" if $itr{'Iteration_message'};
    } elsif ($elem eq 'BlastOutput'){   # CASE </OUTPUT>
        my $data = $group->getBest;
        flock(PIPE, 2) or die "$!:Can't lock pipe $PIPE\n";
        for(my $i=0; $i<@$data; $i++){
            print PIPE join("\t", map { $data->[$i]->{$_} } @SCHEMA),"\n";
        }
        flock(PIPE, 8) or die "$!:Can't free pipe $PIPE\n";
        #exit;
    }
}

__END__

yvan said:
> Hi all,
>
> I am finishing up a parser for the xml output format of blast using the expat
> library. When i collect the data returned by the dataHandler function, some of
> them are truncated or a end of line is added, inducing a duplication. Did you
> have already observed a something similar? As it doesn't happen always, I don't
> suspect a script error. I am using the 1.95.1 version of expat, does a upgrade
> will solve this problem?
>
> cheers
>
> yvan
>
>
> _______________________________________________
> Biophp-dev mailing list
> Biophp-dev@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biophp-dev
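
For reference, the append-and-unset fix described at the top of this message can be boiled down to a minimal sketch. It is not part of the script above. The handler names (on_start, on_char, on_end), the %data and %naive hashes, and the sample sequence chunks are all made up for illustration. The expat events are simulated by calling the handlers by hand, on the assumption that the parser may deliver one element's text in several pieces, so the sketch runs without XML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pos;      # current tag (plays the role of $pos / $currentTag above)
my %data;     # hypothetical store: tag => text collected by appending
my %naive;    # hypothetical store: tag => text from the old "assign if empty" way

sub on_start {                      # <open_tag>
    my ($elem) = @_;
    $pos = $elem;                   # remember which tag we are inside
}

sub on_char {                       # text event, may fire several times per tag
    my ($text) = @_;
    return unless defined $pos;     # text between tags: ignore it
    $data{$pos} .= $text;           # append every chunk
    $naive{$pos} = $text if !$naive{$pos};   # old approach: keeps first chunk only
}

sub on_end {                        # </close_tag>
    $pos = undef;                   # unset, so stray text is not appended
}

# Simulated event stream: the text of Hsp_qseq arrives in two chunks,
# followed by the whitespace that sits between the two elements.
my @events = (
    [ start => 'Hsp_qseq' ],
    [ char  => 'MKVLA' ],
    [ char  => 'AGIVG' ],
    [ end   => 'Hsp_qseq' ],
    [ char  => "\n      " ],
    [ start => 'Hsp_hseq' ],
    [ char  => 'MKVLS' ],
    [ end   => 'Hsp_hseq' ],
);

my %dispatch = ( start => \&on_start, char => \&on_char, end => \&on_end );
for my $ev (@events) {
    my ( $type, $arg ) = @$ev;
    $dispatch{$type}->($arg);
}

print "appended: $data{Hsp_qseq}\n";
print "naive:    $naive{Hsp_qseq}\n";
```

With this event stream the appended value holds both chunks, while the naive value stops after the first one. The whitespace between the two elements is dropped because the current tag was unset by the end handler.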