[Pipet Devel] XML

Fri May 28 15:25:55 EDT 1999

Locians,

The following is a reply I got from Guy Hulbert on the xml-mol mailing list.  I
would strongly suggest the Loci XML X-perts subscribe to this list, since this
is a legitimate attempt to come up with some standards for XML and
bioinformatics.  It would be nice if we could participate and improve Loci's
compatibility with future projects.

    http://ala.vsms.nottingham.ac.uk/biodom/xml-mol/

Jeff
-----------------8<-------------------

On Fri, 28 May 1999, J.W. Bizzaro wrote:

        <snip>

JWB> Anyway, I am coordinating an ambitious GNU project for UNIX-type systems

I'll have to check out your website then.

        <snip>

JWB> some ideas:
JWB> 
JWB>     Sequence definition:  BioML + BSML
JWB>     Structure def:        mmCIF/XML + some CML
JWB>     Phylogeny def:        ??? probably make our own
JWB>     Database query def:   maybe from BLAST/XML      
JWB>     Workflow def:         make our own
JWB>     GUI def:              from GLADE/XML
JWB>     Graphics def:         maybe from some KDE programs

        <snip>

Previously Visual Genomics had some restrictions on use of BSML.  They still
regard this as their intellectual property.  See:
        http://www.visualgenomics.com/bsml/index.html
for their current ideas.  I don't think this is suitable for a "GNU project".

I don't like either BioML or BSML.  It seems to me that they are much too
large --- trying to provide a complete 'bio-html'.   With XML namespaces 
      [ see:  http://www.xml.com/xml/pub/1999/01/namespaces.html ]
one ought to be able to put together small DTDs for specialized data.

Consider DNA sequences.  All one needs is <dna> which is a string of the
characters CTAG.  One might allow ignorable whitespace and base-numbers:
<dna>
  1 tcgattcca gca...
 51 gcctacaac acg...
 ...
</dna>
which is understood by many present applications (without the tags).  There is
also a standard alphabet which allows ambiguous bases to be included, e.g. 'N'
stands for any of A,T,C,G etc.  It may be desirable to represent these
sequences as <dna-X> where X is the alphabet name.  However, to manage DNA
sequences, one doesn't need much more than this.
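
As a rough illustration of how little is needed to read such a block, here is
a minimal Python sketch.  The function name parse_dna and the alphabet names
"strict" and "iupac" are hypothetical; the message proposes <dna-X> without
naming any particular X.  The "..." elisions in the example above are left out
of the sample input.

```python
import re

# Hypothetical alphabet names (assumption: the message names no specific X).
ALPHABETS = {
    "strict": set("ACGT"),
    # IUPAC nucleotide codes, including ambiguity symbols such as N (any base).
    "iupac": set("ACGTRYSWKMBDHVN"),
}

def parse_dna(text, alphabet="strict"):
    """Return the bare sequence from a numbered, whitespace-separated block."""
    # Base numbers and whitespace are treated as ignorable, so drop them.
    seq = re.sub(r"[\d\s]+", "", text).upper()
    # Reject any symbol that is not part of the chosen alphabet.
    bad = set(seq) - ALPHABETS[alphabet]
    if bad:
        raise ValueError(f"symbols not in {alphabet!r} alphabet: {sorted(bad)}")
    return seq

block = """
  1 tcgattcca gca
 51 gcctacaac acg
"""
print(parse_dna(block))
```

Passing alphabet="iupac" would accept ambiguous bases such as 'N', which the
default setting rejects.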

Now, this is a bit too small but it would be really nice to have a standard
Nucleic acid DTD --- or perhaps "Sequence" DTD.  It would have <dna>, <rna>,
<protein>, and perhaps variations for generalized sequence alphabets.  If
everyone would use this then the problem of data-interchange between databases
is much simplified.

Suppose bio??? is some mythical organization which coordinates the standard
DTDs, and everyone agreed to use them.  Then XML namespaces would allow us to
represent (for example) Genbank data like this:

   [I stole a bit of this from Tim Bray's page on namespaces referenced above]

  <?xml ... ?>
  <h:html xmlns:s="http://www.bio???.org/DTD/sequence"
          xmlns:g="http://www.ncbi.nlm.nih.gov/DTD/genbank"
          xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>My Sequence</h:title></h:head>
  <h:body>
        <g:LOCUS>blah blah blah ... </g:LOCUS>
        ...
        <s:dna>
          1 tcgattcca gca...
         51 gcctacaac acg...
         ...
        </s:dna>
  </h:body>
  </h:html>

and with appropriate style sheets, Mozilla and Internet Exploit^H^Hrer
would be able to display them.
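
To show how a program (rather than a browser) might pull the pieces apart by
namespace, here is a small sketch using Python's standard
xml.etree.ElementTree.  It is only an illustration under assumptions: the
example.org URI is a stand-in for the mythical bio??? sequence namespace, the
"..." elisions are dropped so the document parses, and every element is given
a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for lookups.  The "s" URI is a placeholder
# (assumption) for the mythical bio??? sequence DTD.
NS = {
    "s": "http://example.org/DTD/sequence",
    "g": "http://www.ncbi.nlm.nih.gov/DTD/genbank",
    "h": "http://www.w3.org/HTML/1998/html4",
}

DOC = """<?xml version="1.0"?>
<h:html xmlns:s="http://example.org/DTD/sequence"
        xmlns:g="http://www.ncbi.nlm.nih.gov/DTD/genbank"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>My Sequence</h:title></h:head>
  <h:body>
    <g:LOCUS>blah blah blah</g:LOCUS>
    <s:dna>
      1 tcgattcca gca
     51 gcctacaac acg
    </s:dna>
  </h:body>
</h:html>
"""

root = ET.fromstring(DOC)

# Each vocabulary is addressed through its own namespace prefix.
title = root.findtext("h:head/h:title", namespaces=NS)
locus = root.findtext(".//g:LOCUS", namespaces=NS)
dna = root.find(".//s:dna", NS)

# Keep only the base letters; numbers and whitespace are ignorable.
sequence = "".join(ch for ch in dna.text if ch.isalpha())

print(title)
print(locus)
print(sequence)
```

A tool that only cares about sequences could look for s:dna and ignore the
Genbank and HTML elements entirely, which is the appeal of keeping each DTD
small.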

I'm keen to work with anyone on getting these small things set up.  As
an experiment,  I'm planning to play with the Genbank data to create the
basic facility to create documents like the above.

----
Guy Hulbert, Systems Manager            Bioinformatics Supercomputing Centre
(416) 813-8876                          555 University Avenue
email: guy at bioinfo.sickkids.on.ca       The Hospital for Sick Children
http:  www.bioinfo.sickkids.on.ca       Toronto, ON, M5G 1X8, CANADA.