[Bioclusters] FYI: minimal XML output fixer for NCBI BLAST

Wed May 11 23:10:32 EDT 2005

Simple problem:  take NCBI BLAST XML output and parse it.  It is an XML 
document after all, so it should be easy ... right?

Sort of ...

The NCBI XML output file is really a container of XML documents.  You 
cannot hand the container to an XML parser, as it (the container) is 
not a well-formed XML document: per the XML specification at w3.org, a 
document may carry at most one <?xml version="1.0"?> declaration, and 
it must sit at the very start.

So here is my (perl based) "solution" (read as hack).

	# assume entire document in $all, though this is Bad(TM)
	# for huge documents, very wasteful of memory resources.
	#
	my @sub_documents = split(/<\?xml version="1\.0"\?>/, $all);
	# split leaves an empty first element (the text before the
	# first declaration), so drop it
	shift @sub_documents;

Now each element of @sub_documents is in fact a well-formed XML document 
(the XML declaration is optional, so losing it in the split is harmless) 
that you can happily and easily parse.

	foreach (@sub_documents)
	 {
	  # do stuff with $_ which is now a well-formed XML document
	 }
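
To avoid the Bad(TM) part (slurping everything into $all), one option is 
to let Perl split the input while reading it, by setting the input record 
separator $/ to the XML declaration.  What follows is only a minimal 
sketch: the name each_blast_document and its callback interface are made 
up for illustration, and it assumes the declaration is spelled exactly 
<?xml version="1.0"?> throughout the file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BLAST or any module): reads $fh one
# embedded XML document at a time and passes each to $callback.
# Assumes every document starts with exactly <?xml version="1.0"?>.
sub each_blast_document {
    my ($fh, $callback) = @_;
    my $decl = '<?xml version="1.0"?>';
    local $/ = $decl;      # each read now ends at the next declaration
    my $count = 0;
    while (my $chunk = <$fh>) {
        chomp $chunk;                  # chomp strips the trailing $/ value
        next unless $chunk =~ /\S/;    # skip the empty text before the first one
        $callback->($decl . $chunk);   # put the declaration back on the front
        $count++;
    }
    return $count;
}
```

Usage would look something like this, with only one sub-document held 
in memory at a time:

	open(my $fh, '<', 'tomato_test1.1') or die "cannot open: $!";
	each_blast_document($fh, sub { my ($doc) = @_; # parse $doc here
	                             });
	close($fh);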

If there are any NCBI folks lurking here, is there a nice way to make 
the -m 7 output generate a single large valid XML document so we can use 
the huge-document parsers, rather than using hacks like the above?

As XML documents can be containers themselves, it seems to make sense to 
make the entire output parseable without giving xmllint (and other XML 
parsers) fits:

[landman at crunch-r.scalableinformatics.com:/big] 137 >xmllint tomato_test1.1
tomato_test1.1:7365: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0"?>
      ^
tomato_test1.1:7366: parser error : Extra content at the end of the document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
^
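
In the meantime, the container idea can be faked on the user side by 
stripping the repeated declarations and DOCTYPE lines and wrapping what 
is left in a single root element.  This is a rough sketch, not anything 
NCBI provides: the function name and the root element BlastOutputSet are 
invented here, the whole file is still held in memory, and the result is 
well-formed but will not validate against the NCBI DTD.

```perl
use strict;
use warnings;

# Hypothetical helper: turns concatenated XML documents into one
# well-formed document.  'BlastOutputSet' is an invented root name,
# not an element defined by NCBI.
sub wrap_blast_documents {
    my ($all, $root) = @_;
    $root = 'BlastOutputSet' unless defined $root;
    $all =~ s/<\?xml[^>]*\?>\s*//g;     # drop every XML declaration
    $all =~ s/<!DOCTYPE[^>]*>\s*//g;    # drop every DOCTYPE (assumes no internal subset)
    return qq{<?xml version="1.0"?>\n<$root>\n$all</$root>\n};
}
```

Writing the return value of wrap_blast_documents($all) to a new file 
should give something xmllint accepts, with each original report as a 
child of the invented root element.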

Thanks.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615