[Bioclusters] FYI: minimal XML output fixer for NCBI BLAST
Joe Landman
landman at scalableinformatics.com
Wed May 11 23:10:32 EDT 2005
Simple problem: take NCBI BLAST XML output and parse it. It is an XML
document after all, so it should be easy ... right?
Sort of ...
The NCBI XML output file is really a container of XML documents. You
cannot hand the container to an XML parser, as it (the container) is
not a well-formed XML document (an XML document may have at most one
<?xml version="..."?> declaration, and only at the very start,
according to the XML standard at w3.org).
So here is my (perl based) "solution" (read as hack).
# assume entire document in $all, though this is Bad(TM)
# for huge documents: very wasteful of memory resources.
#
my @sub_documents = split(/<\?xml version="1\.0"\?>/, $all);
shift @sub_documents;   # discard the empty chunk before the first declaration
Now each element of @sub_documents is in fact a valid XML document (the
declaration itself is optional), which you can happily and easily parse.
foreach (@sub_documents)
{
    # do stuff with $_ which is now a valid XML document
}
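For huge outputs, the memory problem noted above can probably be avoided by letting Perl do the splitting while it reads. This is only a minimal sketch, assuming the declaration always appears exactly as <?xml version="1.0"?>; the helper name each_blast_document is made up for illustration and is not part of any NCBI or CPAN package.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback once per sub-document read from $fh,
# holding only one sub-document in memory at a time.
sub each_blast_document {
    my ($fh, $callback) = @_;

    # Treat the XML declaration as the input record separator, so each
    # read returns everything up to and including the next declaration.
    local $/ = '<?xml version="1.0"?>';

    my $count = 0;
    while (my $chunk = <$fh>) {
        chomp $chunk;                 # strips the trailing declaration, if present
        next unless $chunk =~ /\S/;   # skip the empty chunk before the first declaration
        $callback->($chunk);
        $count++;
    }
    return $count;
}

# Usage (file name is only an example):
#   open my $fh, '<', 'tomato_test1.1' or die "cannot open: $!";
#   each_blast_document($fh, sub { my ($doc) = @_; ... parse $doc ... });
```

As with the split version, each chunk comes back without its declaration, which is fine because the declaration is optional.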
If there are any NCBI folks lurking here, is there a nice way to make
the -m 7 output generate a single large valid XML document so we can use
the huge document parsers, rather than using hacks like the above?
As XML documents can be containers themselves, it seems to make sense to
make the entire output parseable without giving xmllint (and other XML
parsers) fits:
[landman at crunch-r.scalableinformatics.com:/big]
137 >xmllint tomato_test1.1
tomato_test1.1:7365: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0"?>
^
tomato_test1.1:7366: parser error : Extra content at the end of the document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN"
"NCBI_BlastOutput.dt
^
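Until something like that exists, one possible stopgap is to build the single container document ourselves. The sketch below is an assumption-laden hack, not anything NCBI provides: the root element name BlastOutputSet and the helper wrap_blast_output are invented here, it assumes the DOCTYPE lines carry no internal subset, and it still holds the whole output in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: turns concatenated "-m 7" output into one well-formed
# document by keeping a single XML declaration and wrapping everything else
# in a made-up root element (BlastOutputSet is NOT an NCBI name).
sub wrap_blast_output {
    my ($all, $root) = @_;
    $root = 'BlastOutputSet' unless defined $root;

    # Drop every XML declaration; only one is allowed, at the very start.
    $all =~ s/<\?xml\s[^>]*\?>\s*//g;

    # Drop the DOCTYPE lines too: a DOCTYPE may not appear inside an element.
    # Assumes no internal subset (no '[' ... ']' block) in the DOCTYPE.
    $all =~ s/<!DOCTYPE[^>]*>\s*//g;

    return qq{<?xml version="1.0"?>\n<$root>\n$all</$root>\n};
}
```

The result should get past the two xmllint complaints shown above, at the cost of losing the DOCTYPE, so it can no longer be validated against the NCBI DTD.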
Thanks.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615