root/Genquire/tigrxml.dtd
Revision: 1.1
Committed: Fri Jun 25 16:40:53 2004 UTC (11 years, 11 months ago) by skchan
Branch: MAIN
CVS Tags: HEAD
Log Message:
first version

Line File contents
1 <!--
2 ######################################################################################################################
3
4 tigrxml.dtd
5 DTD for XML format presented by TIGR to release the genome annotation data
6 to the scientific community.
7
8 Brian Haas 02/13/2001
9
10 This DTD continues to be under development. The current formatting described will be retained in future versions of this DTD.
11 Future additions to the DTD will be made to extend the information content and will be specified below.
12
13 -update 08/09/2001
14 -ALT_LOCUS added under GENE_INFO to identify alternate bac-based identifiers for a gene.
15 -FROM_OVERLAP_TYPE and TO_OVERLAP_TYPE added under TILING_PATH to describe the quality of the overlaps between adjacent assemblies in the tiling path.
16 -SEQ_LAST_TOUCHED added as an element under HEADER to identify the date on which the assembly sequence was last manipulated.
17
18 -update 06/11/2001 (began tracking updates).
19 -CLONE_NAME added as an attribute of ASMBL_ID, LEFT_ASMBL, RIGHT_ASMBL. It preexists as an element under HEADER, and that will remain.
20 -GENE_SYNONYM and CHROMO_LINK added as an element under TU and MODEL.
21 -TRANSCRIPT_SEQUENCE (as well as CDS_SEQUENCE and PROTEIN_SEQUENCE) are optional.
22 ######################################################################################################################
23 -->
24
25
26 <!--
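27 Root element for XML is TIGR. A TIGR document is expected to contain at least one ASSEMBLY element, either directly or inside a PSEUDOCHROMOSOME, although the content model below also permits an empty TIGR element.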
28 -->
29
30 <!ELEMENT TIGR (PSEUDOCHROMOSOME | ASSEMBLY)* >
31
32 <!ELEMENT PSEUDOCHROMOSOME (SCAFFOLD, ASSEMBLY) >
33
34 <!--
35 The ASSEMBLY element is the parent element referring to an individual nucleotide assembly.
36 Often, the nucleotide assembly represents a single BAC (bacterial artificial chromosome) sequence.
37 This element houses the annotation for the sequence unit.
38
39 The unique index to the TIGR annotation database is the ASMBL_ID.
40 CLONE_ID is for TIGR's tracking purposes only.
41 DATABASE references the TIGR annotation database name, e.g. ATH1: Arabidopsis, OSA1: Rice.
42 CURRENT_DATE: the date the XML was created.
43 COORDSET: represents the coordinates for which information is provided for the assembly. If the
44 entire assembly is described, then the coordset will be from position 1 to the length of the assembly.
45
46 -->
47
48 <!ELEMENT ASSEMBLY ( ASMBL_ID, COORDSET, HEADER, TILING_PATH?, GENE_LIST, MISC_INFO?, REPEAT_LIST?, ASSEMBLY_SEQUENCE ) >
49
50
51 <!ATTLIST ASSEMBLY CLONE_ID NMTOKEN #REQUIRED >
52 <!ATTLIST ASSEMBLY DATABASE NMTOKEN #REQUIRED >
53 <!ATTLIST ASSEMBLY CHROMOSOME NMTOKEN #IMPLIED >
54 <!ATTLIST ASSEMBLY CURRENT_DATE CDATA #REQUIRED >
55
56 <!ELEMENT ASMBL_ID (#PCDATA) >
57 <!ATTLIST ASMBL_ID CLONE_NAME CDATA #IMPLIED>
58
59
60 <!--
61 GENE_LIST contains all gene features broken down into two parent nodes: the protein coding
62 genes and the RNA genes.
63 -->
64
65 <!ELEMENT GENE_LIST (PROTEIN_CODING, RNA_GENES)>
66
67
68
69
70 <!--
71 Element RNA_GENES contains each of the non-protein coding genes that TIGR may provide annotation for.
72 These include tRNAs (see PRE-TRNA), small nuclear RNAs (see SNRNA), small nucleolar RNAs (see SNORNA),
73 and ribosomal RNAs (see RRNA).
74 -->
75
76 <!ELEMENT RNA_GENES (PRE-TRNA*, SNRNA*, SNORNA*, RRNA*) >
77
78
79
80 <!--
81 FEAT_NAME represents a temporary identifier assigned to each gene component. The only stable reference
82 to a gene is the LOCUS or PUB_LOCUS (see GENE_INFO).
83 -->
84
85 <!ELEMENT FEAT_NAME (#PCDATA) >
86
87
88
89 <!--
90 DATE represents the date on which a feature was created or modified. This element is useful for synchronization
91 of the annotation data with external databases.
92 -->
93
94 <!ELEMENT DATE (#PCDATA) >
95
96
97
98 <!--
99 PROTEIN_CODING genes are represented by at least four components: TU, MODEL, EXON, CDS.
100 The TU represents the transcriptional unit and is the highest order component of the gene.
101 A TU can encode multiple gene MODELs only in cases where alternative splicing exists.
102 A gene MODEL encapsulates all of the coding and non-coding structures of an individual splicing isoform.
103 Each gene MODEL can contain several mRNA EXONs, which represent the spliced, intronless portions of the gene.
104 An mRNA EXON may only partially code for a protein; this is exactly the case where upstream or downstream untranslated
105 regions exist. The protein coding portion of an individual EXON is represented by the CDS element. The CDS element
106 also includes the stop codon. The gene components are not ordered based on their coordinates.
107 Where untranslated regions exist, UTR(s) will be present. UTR(s) represent the non-protein-coding portions
108 of the mRNA EXON(s). UTRs are not currently supported TIGR data types outside of this DTD and they exist here only
109 to facilitate external data analysis.
110
111 Each gene component has a coordinate set associated with it (see COORDS). The following illustration should clarify
112 the role of each element and its coordinates:
113
114 TU        {=============================================================================}
115           |                                                                             |
116 MODEL     |       {============================================================}        |
117           |       |                                                            |        |
118 EXON(s)   {=============}      {========================}      {========================}
119           |      ||     |      |                        |      |               ||       |
120 CDS(s)    |      |{=====}      {========================}      {===============}|       |
121           |      |                                                              |       |
122 UTR(s)    {======}                                                              {=======}
123
124 -->
125
126
127 <!ELEMENT PROTEIN_CODING (TU*) >
128
129 <!ELEMENT TU (FEAT_NAME, GENE_SYNONYM*, CHROMO_LINK*, DATE, GENE_INFO, COORDSET, MODEL+, TRANSCRIPT_SEQUENCE?) >
130
131 <!ELEMENT MODEL (FEAT_NAME, GENE_SYNONYM*, CHROMO_LINK*, DATE, COORDSET, EXON+, CDS_SEQUENCE?, PROTEIN_SEQUENCE?) >
132 <!ATTLIST MODEL COMMENT CDATA #IMPLIED >
133
134 <!ELEMENT EXON (FEAT_NAME, DATE, COORDSET, CDS?, UTRS?) >
135
136 <!ELEMENT CDS (FEAT_NAME, DATE, COORDSET) >
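<!--
The TU / MODEL / EXON / CDS nesting declared above can be walked with a few lines of Python.
This is only a sketch: the fragment is hypothetical and abbreviated (feature names, dates and
coordinates are made up), and the helper names coords and summarize_tu are not part of this DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical TU fragment; every value below is invented for illustration.
SAMPLE_TU = """
<TU>
  <FEAT_NAME>tu.1</FEAT_NAME><DATE>2001-06-11</DATE>
  <GENE_INFO><LOCUS>A</LOCUS><COM_NAME>hypothetical protein</COM_NAME>
    <IS_PSEUDOGENE>0</IS_PSEUDOGENE><DATE>2001-06-11</DATE></GENE_INFO>
  <COORDSET><END5>100</END5><END3>900</END3></COORDSET>
  <MODEL>
    <FEAT_NAME>model.1</FEAT_NAME><DATE>2001-06-11</DATE>
    <COORDSET><END5>150</END5><END3>850</END3></COORDSET>
    <EXON>
      <FEAT_NAME>exon.1</FEAT_NAME><DATE>2001-06-11</DATE>
      <COORDSET><END5>100</END5><END3>300</END3></COORDSET>
      <CDS><FEAT_NAME>cds.1</FEAT_NAME><DATE>2001-06-11</DATE>
        <COORDSET><END5>150</END5><END3>300</END3></COORDSET></CDS>
    </EXON>
    <EXON>
      <FEAT_NAME>exon.2</FEAT_NAME><DATE>2001-06-11</DATE>
      <COORDSET><END5>500</END5><END3>900</END3></COORDSET>
      <CDS><FEAT_NAME>cds.2</FEAT_NAME><DATE>2001-06-11</DATE>
        <COORDSET><END5>500</END5><END3>850</END3></COORDSET></CDS>
    </EXON>
  </MODEL>
</TU>
"""


def coords(elem):
    """Return (END5, END3) from the COORDSET that is a direct child of elem."""
    coordset = elem.find("COORDSET")
    return int(coordset.findtext("END5")), int(coordset.findtext("END3"))


def summarize_tu(tu):
    """Flatten one TU into (component, FEAT_NAME, (END5, END3)) rows, in document order."""
    rows = [("TU", tu.findtext("FEAT_NAME"), coords(tu))]
    for model in tu.findall("MODEL"):
        rows.append(("MODEL", model.findtext("FEAT_NAME"), coords(model)))
        for exon in model.findall("EXON"):
            rows.append(("EXON", exon.findtext("FEAT_NAME"), coords(exon)))
            cds = exon.find("CDS")  # CDS is optional: fully untranslated exons have none
            if cds is not None:
                rows.append(("CDS", cds.findtext("FEAT_NAME"), coords(cds)))
    return rows


rows = summarize_tu(ET.fromstring(SAMPLE_TU.strip()))
for component, name, span in rows:
    print(component, name, span)
```

Because the components are not ordered by coordinate, a consumer that needs positional order
would sort the EXON rows itself.
-->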
137
138
139 <!-- GENE_SYNONYM explained:
140 Within the overlapping regions of bacs, a gene may exist either completely or partially on both bacs,
141 and represent the single canonical gene only after a merging operation has taken place. For the purpose
142 of generating a non-redundant gene set, all gene synonyms are merged with their sibling gene synonyms.
143 -->
144
145 <!ELEMENT GENE_SYNONYM (#PCDATA) >
146
147
148 <!-- CHROMO_LINK explained:
149 Pseudo-chromosomes are constructed based on the tiling path containing individual bacs. The annotation along the
150 bac sequences is propagated from the bac to the pseudochromosome. The genes on the pseudo-chromosome are given
151 new temporary feat_name identifiers that differ from the bac genes from which they were derived. The CHROMO_LINK
152 provides a link between the temporary feat_name of the pseudochromosome gene set to the feat_names of the genes
153 along the bacs.
154 -->
155
156
157 <!ELEMENT CHROMO_LINK (#PCDATA) >
158
159
160
161
162 <!--
163 UTRS specify each UTR or untranslated region.
164 There can be more than one if it's a single exon gene, e.g.:
165            5'                                             3'
166 EXON:      {===============================================}
167 CDS :      |   |{============================}|            |
168 LEFT_UTR:  {===}                              |            |
169 RIGHT_UTR:                                    {============}
170
171
172
173 If no portion of the EXON is translated, then we have an EXTENDED_UTR, which
174 includes the full length of the EXON.
175
176
177
178 -->
179
180 <!ELEMENT UTRS (LEFT_UTR | RIGHT_UTR | EXTENDED_UTR)* >
181
182 <!ELEMENT LEFT_UTR (COORDSET) >
183
184 <!ELEMENT RIGHT_UTR (COORDSET) >
185
186 <!ELEMENT EXTENDED_UTR (COORDSET) >
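<!--
As a sketch of how the three UTR types relate to an EXON and its CDS, the function below
derives the untranslated intervals from the two coordinate pairs. It is not TIGR's method.
Assumptions: intervals are given as (low, high) in 1-based inclusive assembly coordinates,
and "LEFT" is taken to mean the lower-coordinate side. The DTD does not state how LEFT and
RIGHT map onto negative-strand features, so that reading is a guess. The name
untranslated_parts is hypothetical.

```python
def untranslated_parts(exon, cds=None):
    """Return a dict of UTR intervals for one exon.

    exon and cds are (low, high) tuples, 1-based and inclusive; cds is None
    when no portion of the exon is translated.
    """
    exon_low, exon_high = exon
    if cds is None:
        # Nothing is translated: the whole exon is a single EXTENDED_UTR.
        return {"EXTENDED_UTR": (exon_low, exon_high)}

    cds_low, cds_high = cds
    if cds_low < exon_low or cds_high > exon_high:
        raise ValueError("CDS must lie within its exon")

    parts = {}
    if cds_low > exon_low:
        parts["LEFT_UTR"] = (exon_low, cds_low - 1)
    if cds_high < exon_high:
        parts["RIGHT_UTR"] = (cds_high + 1, exon_high)
    return parts


# Single-exon gene: 5 untranslated bases, 30 coding bases, 14 untranslated bases.
single_exon = untranslated_parts((1, 49), (6, 35))
print(single_exon)
```
-->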
187
188
189 <!--
190 Gene Sequences Described:
191 TRANSCRIPT_SEQUENCE: provides the unspliced genomic nucleotide sequence representing the entire transcribed
192 region of the gene.
193 CDS_SEQUENCE: The nucleotide sequence which encodes the protein sequence directly.
194 PROTEIN_SEQUENCE: the peptide sequence representing the translation of the CDS_SEQUENCE.
195 -->
196
197 <!ELEMENT TRANSCRIPT_SEQUENCE (#PCDATA) >
198
199 <!ELEMENT CDS_SEQUENCE (#PCDATA) >
200
201 <!ELEMENT PROTEIN_SEQUENCE (#PCDATA) >
202
203
204 <!--
205 COORDSET contains child elements END5 and END3 and provides the sequence-based (see ASSEMBLY_SEQUENCE) coordinates for all elements
206 containing it. The sequence begins at position 1. END5 and END3 represent the exact coordinates of the feature within the
207 sequence provided (positive orientation). If END5 < END3, the feature lies on the positive strand; conversely,
208 if END5 > END3, the feature lies on the negative strand.
209
210 -->
211
212 <!ELEMENT COORDSET (END5, END3) >
213
214 <!ELEMENT END5 (#PCDATA)>
215
216 <!ELEMENT END3 (#PCDATA)>
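<!--
The strand rule for COORDSET can be expressed directly in code. A minimal sketch; the
function name strand_and_span is hypothetical. The DTD does not say how a feature with
END5 equal to END3 should be read, so treating it as positive strand here is an assumption.

```python
def strand_and_span(end5, end3):
    """Return (strand, low, high) for a COORDSET, with low <= high (1-based, inclusive).

    END5 < END3 means positive strand and END5 > END3 means negative strand.
    END5 == END3 is reported as "+" (an assumption, not stated by the DTD).
    """
    if end5 <= end3:
        return "+", end5, end3
    return "-", end3, end5


def feature_length(end5, end3):
    """Number of nucleotides covered by the feature, counting both end positions."""
    return abs(end3 - end5) + 1


print(strand_and_span(100, 250))
print(strand_and_span(250, 100))
print(feature_length(250, 100))
```
-->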
217
218
219
220
221
222 <!--
223 GENE_INFO contains the gene name, locus, and functional category role assignment information. The LOCUS in many
224 instances represents the assembly (e.g. BAC)-based gene identifier. The PUB_LOCUS represents a publication-based
225 locus, possibly a chromosomal locus identifier. In overlapping regions of assembly tiling paths, the
226 same gene may be represented on both overlapping assemblies with different LOCUS values but identical PUB_LOCUS values.
227 In chromosome-based data sets, such a gene will be represented singly under the PUB_LOCUS identifier;
228 the LOCUS identifiers from the derived genes are presented in the LOCUS and ALT_LOCUS fields for the chromosomal gene.
229
230 This is confusing, so here's an illustration:
231
232  (gene on BAC-1)                                                                 Chromosome
233  pub_locus = At1g1                                                               pub_locus = At1g1
234  locus = A                                                                       locus = A
235                 {====}                                                           alt_locus = B
236  ________________________                                                               {====}
237            ______________________________   In the assembled chromosome sequence:   _________________________________
238                 {====}
239            (gene on BAC-2)
240            pub_locus = At1g1
241            locus = B
242
243
244
245 EC_NUM provides an enzyme commission number.
246 GENE_SYM provides the gene symbol conventionally given by experimentalists, e.g. alcohol dehydrogenase: ADH.
247 COM_NAME represents the gene name. IS_PSEUDOGENE is a toggle to indicate whether or not the gene is a pseudogene {1=true|0=false}.
248
249 -->
250
251 <!ELEMENT GENE_INFO (LOCUS, ALT_LOCUS?, PUB_LOCUS?, COM_NAME, PUB_COMMENT?, EC_NUM?, GENE_SYM?, IS_PSEUDOGENE, ASSIGN_ACC?, DATE, ROLE_LIST?, EVIDENCE?) >
252
253 <!ELEMENT LOCUS (#PCDATA) >
254
255 <!ELEMENT PUB_LOCUS (#PCDATA) >
256
257 <!ELEMENT ALT_LOCUS (#PCDATA) >
258
259 <!ELEMENT COM_NAME (#PCDATA) >
260
261 <!ELEMENT PUB_COMMENT (#PCDATA) >
262
263 <!ELEMENT EC_NUM (#PCDATA) >
264
265 <!ELEMENT GENE_SYM (#PCDATA) >
266
267 <!ELEMENT IS_PSEUDOGENE (#PCDATA) >
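<!--
The LOCUS / ALT_LOCUS / PUB_LOCUS relationship illustrated above amounts to grouping
BAC-based genes by PUB_LOCUS. The sketch below shows that grouping under stated assumptions:
the input is a plain list of (PUB_LOCUS, LOCUS) pairs already extracted from GENE_INFO
elements, the first LOCUS seen becomes the chromosomal LOCUS, and later ones are collected
as alternates. The function name merge_by_pub_locus is hypothetical, and this is not a
description of TIGR's actual merging procedure. Note that the content model allows at most
one ALT_LOCUS per GENE_INFO; a list is used here only to keep the sketch general.

```python
def merge_by_pub_locus(bac_genes):
    """Collapse BAC-based genes that share a PUB_LOCUS into one chromosomal record.

    bac_genes: iterable of (pub_locus, locus) string pairs.
    Returns {pub_locus: {"LOCUS": first_locus, "ALT_LOCUS": [other loci...]}}.
    """
    merged = {}
    for pub_locus, locus in bac_genes:
        record = merged.setdefault(pub_locus, {"LOCUS": locus, "ALT_LOCUS": []})
        if locus != record["LOCUS"] and locus not in record["ALT_LOCUS"]:
            record["ALT_LOCUS"].append(locus)
    return merged


# The situation from the illustration: the same gene annotated on two overlapping BACs.
chromosome_genes = merge_by_pub_locus([("At1g1", "A"), ("At1g1", "B")])
print(chromosome_genes)
```
-->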
268
269 <!--
270 ASSIGN_ACC is the database accession of a characterized protein on which the functional assignment is based.
271 Attribute ASSIGN_TYPE indicates the type of assignment made: manually curated (CURATED) or automatically assigned (AUTO)
272 -->
273
274
275 <!ELEMENT ASSIGN_ACC (#PCDATA) >
276 <!ATTLIST ASSIGN_ACC ASSIGN_TYPE NMTOKEN #REQUIRED >
277
278
279 <!--
280 ROLE_LIST contains each of the functional role category assignments for the gene.
281 COMPARTMENT indicates the role assignment class being used; examples include microbial, plant, GO (gene ontology), etc.
282 The roles are classifications that become more specific via the SUBROLE_* elements.
283 -->
284
285 <!ELEMENT ROLE_LIST (ROLE_INFO+) >
286
287 <!ELEMENT ROLE_INFO (COMPARTMENT, DATE, MAIN_ROLE, SUBROLE_1?, SUBROLE_2?, SUBROLE_3?, SUBROLE_4?) >
288
289 <!ELEMENT COMPARTMENT (#PCDATA) >
290
291 <!ELEMENT MAIN_ROLE (#PCDATA) >
292
293 <!ELEMENT SUBROLE_1 (#PCDATA) >
294
295 <!ELEMENT SUBROLE_2 (#PCDATA) >
296
297 <!ELEMENT SUBROLE_3 (#PCDATA) >
298
299 <!ELEMENT SUBROLE_4 (#PCDATA) >
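<!--
Since MAIN_ROLE and the optional SUBROLE_1 to SUBROLE_4 elements form a chain of increasingly
specific classifications, a consumer may want them as one ordered path. A minimal sketch; the
sample values are invented and the name role_path is hypothetical.

```python
import xml.etree.ElementTree as ET

ROLE_TAGS = ("MAIN_ROLE", "SUBROLE_1", "SUBROLE_2", "SUBROLE_3", "SUBROLE_4")

# Hypothetical ROLE_INFO element; the role names are made up for illustration.
SAMPLE_ROLE_INFO = """
<ROLE_INFO>
  <COMPARTMENT>plant</COMPARTMENT>
  <DATE>2001-06-11</DATE>
  <MAIN_ROLE>Metabolism</MAIN_ROLE>
  <SUBROLE_1>Amino acid metabolism</SUBROLE_1>
</ROLE_INFO>
"""


def role_path(role_info):
    """Return (compartment, [main role, subrole 1, ...]) for one ROLE_INFO element."""
    path = []
    for tag in ROLE_TAGS:
        text = role_info.findtext(tag)
        if text is None:
            break  # stop at the first missing level
        path.append(text.strip())
    return role_info.findtext("COMPARTMENT").strip(), path


compartment, path = role_path(ET.fromstring(SAMPLE_ROLE_INFO.strip()))
print(compartment, " > ".join(path))
```
-->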
300
301
302
303
304 <!--
305 EVIDENCE simply provides data indicating the type of evidence that is available
306 that may support the existence of the corresponding gene.
307 The attributes are toggles set to 0 or 1 to indicate the presence of that
308 evidence type.
309 -->
310
311 <!ELEMENT EVIDENCE EMPTY >
312
313
314 <!ATTLIST EVIDENCE GENE_PREDICTIONS NMTOKEN #REQUIRED>
315
316 <!ATTLIST EVIDENCE PROTEIN_MATCHES NMTOKEN #REQUIRED >
317
318 <!ATTLIST EVIDENCE GENE_INDEX_MATCHES NMTOKEN #REQUIRED >
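<!--
The three EVIDENCE attributes are declared as NMTOKEN, so a DTD validator will not enforce
the 0/1 convention described above. A consumer could check it while converting to booleans,
as in this sketch (the name evidence_flags is hypothetical).

```python
import xml.etree.ElementTree as ET

EVIDENCE_ATTRS = ("GENE_PREDICTIONS", "PROTEIN_MATCHES", "GENE_INDEX_MATCHES")


def evidence_flags(evidence):
    """Convert the 0/1 toggles on an EVIDENCE element into a dict of booleans."""
    flags = {}
    for name in EVIDENCE_ATTRS:
        value = evidence.attrib[name]  # all three are #REQUIRED; KeyError means invalid input
        if value not in ("0", "1"):
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
        flags[name] = value == "1"
    return flags


element = ET.fromstring(
    '<EVIDENCE GENE_PREDICTIONS="1" PROTEIN_MATCHES="0" GENE_INDEX_MATCHES="1"/>'
)
flags = evidence_flags(element)
print(flags)
```
-->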
319
320
321
322
323 <!--
324 REPEAT_LIST contains REPEAT elements. A repeat is a repetitive nucleotide sequence and can range from
325 simple repeats (AT-rich regions) to complex repeats (retroelements, rRNA sequences). Currently, rRNA
326 sequences are being specified here. Eventually, they will be specified in the RRNA element (see RRNA).
327 -->
328
329 <!ELEMENT REPEAT_LIST (REPEAT*) >
330
331 <!ELEMENT REPEAT (FEAT_NAME, DATE, COORDSET, REPEAT_TYPE) >
332
333 <!ELEMENT REPEAT_TYPE (#PCDATA) >
334
335
336
337 <!--
338 RRNA encompasses ribosomal RNA genes.
339 -->
340
341 <!ELEMENT RRNA (FEAT_NAME, DATE, COORDSET, COM_NAME) >
342
343
344 <!--
345 SNRNA encompasses small nuclear RNA genes.
346 -->
347
348 <!ELEMENT SNRNA (FEAT_NAME, DATE, COORDSET, COM_NAME) >
349
350
351 <!--
352 SNORNA encompasses small nucleolar RNA genes.
353 -->
354
355 <!ELEMENT SNORNA (FEAT_NAME, DATE, COORDSET, COM_NAME) >
356
357
358
359 <!--
360 TRNA genes are represented by multiple components. The structure is analogous to that
361 provided for protein coding genes (see PROTEIN_CODING). The major difference is the lack
362 of a CDS, since no protein is encoded by tRNA genes.
363 The analogies are presented as follows:
364 PRE-TRNA ~ TU
365 TRNA ~ MODEL
366 RNA-EXON ~ EXON
367
368 -->
369
370 <!ELEMENT PRE-TRNA (FEAT_NAME, DATE, COORDSET, TRNA) >
371
372 <!ELEMENT TRNA (FEAT_NAME, DATE, COORDSET, COM_NAME, RNA-EXON+) >
373 <!ATTLIST TRNA ANTICODON NMTOKEN #REQUIRED >
374
375
376 <!ELEMENT RNA-EXON (FEAT_NAME, DATE, COORDSET)>
377
378
379
380 <!--
381 TILING_PATH provides all of the information required to position the current ASSEMBLY in the
382 context of its neighboring ASSEMBLY(s). Each element and attribute is described as follows:
383 ORIENTATION : [+|-] the strand orientation in the pseudo-chromosome.
384 LEFT_ASMBL : identifies the ASMBL_ID (see ASSEMBLY) of the preceding neighbor in the tiling path.
385 RIGHT_ASMBL : identifies the ASMBL_ID of the succeeding neighbor in the tiling path.
386 FROM_CONNECT : [1|0] toggle which identifies whether there is a sequence join between the preceding sequence and the current sequence.
387 TO_CONNECT : [1|0] toggle analogous to FROM_CONNECT, except that it refers to the join between the current assembly and the succeeding one.
388 FROM_OVERLAP_SIZE : indicates the number of nucleotides by which the current bac overlaps the preceding bac.
389 FROM_OVERHANG_SIZE : indicates the length of non-overlapping sequence at the end of the current sequence that faces the preceding sequence.
390 TO_OVERHANG_SIZE : indicates the length of non-overlapping sequence at the end of the current bac that faces the succeeding bac.
391 FROM_OVERLAP_TYPE and TO_OVERLAP_TYPE : describe the quality of the overlap between the adjacent assemblies.
392
393
394
395 Interpretation of this data: The data presented above is essentially a single node in a linked list. To build the
396 pseudo-chromosome, the first step is to identify the head ASMBL_ID, which should have FROM_CONNECT = 0 (no preceding
397 neighbor, hence no overlap on that side). From that element, you can identify the RIGHT_ASMBL that overlaps it.
398 Once an overlapping assembly is identified, prior to doing anything else, you must flip the assembly to the proper orientation (+).
399 Then, you can align the assembly to the previous one via the overlap information.
400
401 Here's an illustration:
402                        ASMBL_ID 23
403
404           \__________________________________/
405  ___________________
406                     \
407     ASMBL_ID 1
408
409
410 properties of ASMBL_ID 1 (ORIENTATION = '+', FROM_CONNECT = 0, TO_CONNECT = 1, RIGHT_ASMBL = 23, TO_OVERHANG_SIZE = 50).
411 properties of ASMBL_ID 23 (ORIENTATION = '+', FROM_CONNECT = 1, FROM_OVERHANG_SIZE = 120, FROM_OVERLAP_SIZE = 1000, TO_OVERHANG_SIZE = 140)
412
413 The FROM_OVERHANG_SIZE indicates the (\) portion of ASMBL_ID 23 in which non-overlapping sequence exists (e.g. untrimmed vector).
414 The FROM_OVERLAP_SIZE indicates that 1000 nt overlap ASMBL_ID 1. Summing up both pieces of information for ASMBL_ID 23, coordinates 1 to 120
415 do not overlap, and coordinates 121 to 1120 do overlap ASMBL_ID 1.
416 If size N = length (ASMBL_ID 1), then ASMBL_ID 23 overlaps ASMBL_ID 1 between (N-50-1000+1) and (N-50), taking into account
417 the non-overlapping sequence of ASMBL_ID 1.
418
419 If either assembly was in the reverse orientation (ORIENTATION = '-'), then the first step would be to reverse complement the sequence. The
420 remainder of the protocol remains identical.
421
422 MOST OF THE TIME, the non-overlapping end sequences (the OVERHANG_SIZE values) will be 0 because the assemblies should be trimmed of vector prior
423 to entering either genbank or TIGR's annotation database. There may be some exceptions, however, and this specification allows for them.
424
425 -->
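<!--
The overlap arithmetic in the example above can be checked with a short calculation. This is
a sketch assuming 1-based coordinates that count both end positions, so 1000 overlapping
nucleotides that start at position 121 end at position 1120. The function names are
hypothetical, and the length used for ASMBL_ID 1 (5000) is invented, since the example only
calls it N.

```python
def overlap_in_current(from_overhang_size, from_overlap_size):
    """Coordinates, within the current assembly, of the region overlapping its predecessor."""
    start = from_overhang_size + 1
    end = from_overhang_size + from_overlap_size
    return start, end


def overlap_in_preceding(preceding_length, to_overhang_size, from_overlap_size):
    """Coordinates of the same overlap, expressed within the preceding assembly.

    to_overhang_size is the TO_OVERHANG_SIZE of the preceding assembly;
    from_overlap_size is the FROM_OVERLAP_SIZE of the current one.
    """
    end = preceding_length - to_overhang_size
    start = end - from_overlap_size + 1
    return start, end


# ASMBL_ID 23: FROM_OVERHANG_SIZE = 120, FROM_OVERLAP_SIZE = 1000.
in_23 = overlap_in_current(120, 1000)
# ASMBL_ID 1: TO_OVERHANG_SIZE = 50, with a made-up length N = 5000.
in_1 = overlap_in_preceding(5000, 50, 1000)
print(in_23, in_1)
```

Both intervals cover exactly 1000 positions, which is a quick sanity check on any tiling path
record.
-->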
426
427
428 <!ELEMENT TILING_PATH (LEFT_ASMBL, RIGHT_ASMBL, FROM_CONNECT, TO_CONNECT, ORIENTATION, FROM_OVERLAP_SIZE, FROM_OVERHANG_SIZE, TO_OVERHANG_SIZE, FROM_OVERLAP_TYPE, TO_OVERLAP_TYPE, DATE) >
429
430
431
432 <!ELEMENT LEFT_ASMBL (#PCDATA) >
433 <!ATTLIST LEFT_ASMBL CLONE_NAME CDATA #IMPLIED>
434
435 <!ELEMENT RIGHT_ASMBL (#PCDATA) >
436 <!ATTLIST RIGHT_ASMBL CLONE_NAME CDATA #IMPLIED>
437
438 <!ELEMENT FROM_CONNECT (#PCDATA)>
439
440 <!ELEMENT ORIENTATION (#PCDATA) >
441
442 <!ELEMENT TO_CONNECT (#PCDATA) >
443
444 <!ELEMENT FROM_OVERLAP_SIZE (#PCDATA) >
445
446 <!ELEMENT FROM_OVERHANG_SIZE (#PCDATA) >
447
448 <!ELEMENT TO_OVERHANG_SIZE (#PCDATA) >
449
450 <!ELEMENT FROM_OVERLAP_TYPE (#PCDATA) >
451
452 <!ELEMENT TO_OVERLAP_TYPE (#PCDATA) >
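<!--
Treating each TILING_PATH as a node in a linked list, the assemblies can be ordered by
starting at the head and following RIGHT_ASMBL. The sketch below assumes the relevant values
have already been read into plain dicts keyed by ASMBL_ID, that the toggles were converted to
integers, and that the data describe a single connected path with exactly one head. None of
these assumptions is guaranteed by the DTD, and the name order_tiling_path is hypothetical.

```python
def order_tiling_path(nodes):
    """Return ASMBL_IDs in tiling-path order.

    nodes: {asmbl_id: {"FROM_CONNECT": 0 or 1, "TO_CONNECT": 0 or 1, "RIGHT_ASMBL": id or None}}
    """
    heads = [asmbl_id for asmbl_id, node in nodes.items() if node["FROM_CONNECT"] == 0]
    if len(heads) != 1:
        raise ValueError(f"expected exactly one head assembly, found {len(heads)}")

    order = []
    seen = set()
    current = heads[0]
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at assembly {current}")
        seen.add(current)
        order.append(current)
        node = nodes[current]
        # Only follow the link when the node says a succeeding assembly is joined to it.
        current = node["RIGHT_ASMBL"] if node["TO_CONNECT"] == 1 else None
    return order


# The two assemblies from the illustration in the TILING_PATH comment.
path_nodes = {
    "23": {"FROM_CONNECT": 1, "TO_CONNECT": 0, "RIGHT_ASMBL": None},
    "1": {"FROM_CONNECT": 0, "TO_CONNECT": 1, "RIGHT_ASMBL": "23"},
}
ordered = order_tiling_path(path_nodes)
print(ordered)
```
-->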
453
454
455
456 <!--
457 SCAFFOLD is composed of SCAFFOLD_COMPONENT(s). Each SCAFFOLD_COMPONENT indicates the portion of a given nucleotide
458 assembly (e.g. a BAC) from which a segment of the pseudochromosome was constructed. By joining each of the SCAFFOLD_COMPONENT(s),
459 the entire pseudo-chromosome nucleotide sequence can be constructed.
460
461 -->
462
463
464
465 <!ELEMENT SCAFFOLD (SCAFFOLD_COMPONENT+) >
466
467 <!ELEMENT SCAFFOLD_COMPONENT (ASMBL_ID, CHR_LEFT_COORD, CHR_RIGHT_COORD, ASMBL_LEFT_COORD, ASMBL_RIGHT_COORD, ORIENTATION, DATE) >
468
469 <!ELEMENT CHR_LEFT_COORD (#PCDATA) >
470
471 <!ELEMENT CHR_RIGHT_COORD (#PCDATA) >
472
473 <!ELEMENT ASMBL_LEFT_COORD (#PCDATA) >
474
475 <!ELEMENT ASMBL_RIGHT_COORD (#PCDATA) >
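<!--
Joining SCAFFOLD_COMPONENT(s) into a pseudo-chromosome sequence might look like the sketch
below. Several points are assumptions rather than statements of this DTD: coordinates are
taken as 1-based and inclusive, ASMBL_LEFT_COORD is taken to be less than or equal to
ASMBL_RIGHT_COORD, components are taken to abut without gaps once sorted by CHR_LEFT_COORD,
and a component with ORIENTATION "-" is reverse complemented after slicing. The names
reverse_complement and build_pseudochromosome are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a nucleotide string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudochromosome(components, sequences):
    """Concatenate assembly slices in chromosome order.

    components: list of dicts with integer coordinates, one per SCAFFOLD_COMPONENT.
    sequences: {ASMBL_ID: ASSEMBLY_SEQUENCE string}.
    """
    pieces = []
    for comp in sorted(components, key=lambda c: c["CHR_LEFT_COORD"]):
        seq = sequences[comp["ASMBL_ID"]]
        # Convert the 1-based inclusive range to a Python slice.
        piece = seq[comp["ASMBL_LEFT_COORD"] - 1 : comp["ASMBL_RIGHT_COORD"]]
        if comp["ORIENTATION"] == "-":
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


# Toy data, invented for illustration.
toy_sequences = {"1": "AAAACCCC", "23": "GGGGTTCA"}
toy_components = [
    {"ASMBL_ID": "23", "CHR_LEFT_COORD": 7, "CHR_RIGHT_COORD": 10,
     "ASMBL_LEFT_COORD": 5, "ASMBL_RIGHT_COORD": 8, "ORIENTATION": "-"},
    {"ASMBL_ID": "1", "CHR_LEFT_COORD": 1, "CHR_RIGHT_COORD": 6,
     "ASMBL_LEFT_COORD": 1, "ASMBL_RIGHT_COORD": 6, "ORIENTATION": "+"},
]
chromosome = build_pseudochromosome(toy_components, toy_sequences)
print(chromosome)
```
-->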
476
477
478
479
480 <!--
481
482 MISC_INFO is the component in which we can store any comments regarding the ASSEMBLY. The FEATURE_DESC element
483 contains the feature description text, and a COORDSET element identifies the position the comment is referring to.
484 -->
485
486 <!ELEMENT MISC_INFO ( MISC_FEATURE+ ) >
487
488 <!ELEMENT MISC_FEATURE (COORDSET, DATE, FEATURE_DESC) >
489
490 <!ELEMENT FEATURE_DESC (#PCDATA) >
491
492
493
494 <!--
495 The HEADER element contains some basic attributes of the nucleotide assembly, including the identity of the
496 organism from which it was derived, the lineage, the group that sequenced the assembly, and information that is
497 provided to genbank within TIGR's annotation submissions.
498 The SEQ_LAST_TOUCHED field contains the date on which this sequence was last manipulated, which may or may not
499 include actual sequence changes.
500
501 -->
502
503
504 <!ELEMENT HEADER ( CLONE_NAME, SEQ_LAST_TOUCHED, GB_ACCESSION, ORGANISM, LINEAGE, SEQ_GROUP, KEYWORDS*, GB_DESCRIPTION*, GB_COMMENT*, AUTHOR_LIST ) >
505
506 <!ELEMENT CLONE_NAME (#PCDATA) >
507
508 <!ELEMENT SEQ_LAST_TOUCHED (DATE) >
509
510 <!ELEMENT GB_ACCESSION (#PCDATA) >
511
512 <!ELEMENT ORGANISM (#PCDATA) >
513
514 <!ELEMENT LINEAGE (#PCDATA) >
515
516 <!ELEMENT SEQ_GROUP (#PCDATA)>
517
518 <!ELEMENT KEYWORDS (#PCDATA) >
519
520 <!ELEMENT GB_DESCRIPTION (#PCDATA) >
521
522 <!ELEMENT GB_COMMENT (#PCDATA) >
523
524 <!ELEMENT AUTHOR_LIST ( AUTHOR*) >
525 <!ATTLIST AUTHOR_LIST CONTACT CDATA #IMPLIED >
526
527 <!ELEMENT AUTHOR EMPTY>
528 <!ATTLIST AUTHOR FNAME CDATA #IMPLIED >
529 <!ATTLIST AUTHOR LNAME CDATA #REQUIRED >
530 <!ATTLIST AUTHOR MNAME CDATA #IMPLIED >
531 <!ATTLIST AUTHOR SUFFIX CDATA #IMPLIED >
532
533
534 <!--
535 ASSEMBLY_SEQUENCE contains the entire nucleotide sequence of the ASSEMBLY. The sequence begins at position 1 in our coordinate space and is
536 assumed to exist in the positive strand orientation. No whitespace should interrupt the sequence; it should exist
537 as one loooooooong string.
538 -->
539
540 <!ELEMENT ASSEMBLY_SEQUENCE (#PCDATA) >
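<!--
Because ASSEMBLY_SEQUENCE starts at position 1 on the positive strand, the sequence of any
feature can be cut out using its COORDSET. A minimal sketch with hypothetical function names;
whitespace is stripped defensively even though the sequence is meant to be one uninterrupted
string, and only the letters A, C, G and T are complemented.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def clean_sequence(text):
    """Remove any whitespace that slipped into the ASSEMBLY_SEQUENCE text."""
    return "".join(text.split())


def feature_sequence(assembly_seq, end5, end3):
    """Return the feature's sequence read 5' to 3'.

    Coordinates are 1-based and inclusive. When END5 > END3 the feature lies on
    the negative strand, so the slice is reverse complemented.
    """
    if end5 <= end3:
        return assembly_seq[end5 - 1 : end3]
    return assembly_seq[end3 - 1 : end5].translate(_COMPLEMENT)[::-1]


toy_assembly = clean_sequence("AACC\nGT")
forward = feature_sequence(toy_assembly, 2, 5)
reverse = feature_sequence(toy_assembly, 5, 2)
print(forward, reverse)
```
-->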