1 |
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
2 |
<html><head> |
<meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"><title>Layout File Format</title></head> |
7 |
<body> |
8 |
<h3>Layout file format specification (proposal)<br> |
9 |
</h3> |
<b><br> |
15 |
</b>Besides the usual ACE files, <b>clview</b> also recognizes a proprietary "layout" format (stored in files that usually have the extension <tt>.lyt</tt>)
for representing multiple sequence alignment (MSA) layouts, where
typically smaller "component" sequences (henceforth called "<b>reads</b>") are aligned to (or make up) a larger sequence (henceforth called "the <b>contig</b>").<br>
<br> |
These layout files are text files in a pseudo-FASTA format, with each
FASTA record representing one contig's "layout", like this:<br>
<br> |
<big><b>&gt;</b></big><i><small>&lt;</small>contigName<small>&gt;</small></i> <small>&lt;</small><i>number_of_reads<small>&gt;</small> </i><small>&lt;</small><i>contig_start_coord<small>&gt;</small> </i><small>&lt;</small><i>contig_end_coord<small>&gt;</small> [&lt;sequence&gt;</i><i>]<br>
</i><i><small>&lt;</small>readName<small>&gt;</small></i> <small>&lt;</small><i>orientation<small>&gt;</small> </i><small>&lt;</small><i>read_length<small>&gt;</small> </i><small>&lt;</small><i>read_start_coord<small>&gt;</small></i> <small>&lt;</small><i>clip_left<small>&gt; </small></i><small>&lt;</small><i>clip_right<small>&gt; </small></i><i>[ &lt;extra_attributes...&gt;</i><i>]</i><i><br>
40 |
</i>.<br> |
.<br> |
All the fields on every line are whitespace-delimited (tab or plain space). Therefore no whitespace is allowed <i>within</i> the fields (so contig and read names must not contain spaces).<br>
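<br>
The record structure and field splitting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names <tt>Read</tt>, <tt>Layout</tt> and <tt>parse_layouts</tt> are hypothetical (they are not part of clview), and the sketch does not check that each record contains exactly the declared number of reads.<br>

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    orientation: str   # '+' or '-'
    length: int        # read_length, clipped ends included
    start: int         # read_start_coord
    clip_left: int
    clip_right: int
    attrs: dict = field(default_factory=dict)  # attr_code -> attr_data


@dataclass
class Layout:
    name: str
    num_reads: int
    start: int
    end: int
    sequence: str = ""
    reads: list = field(default_factory=list)


def parse_layouts(text):
    """Parse layout-file text into Layout records (hypothetical helper)."""
    layouts = []
    for line in text.splitlines():
        fields = line.split()  # no argument: splits on tabs and plain spaces
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith(">"):
            if len(fields) < 4:
                raise ValueError("short contig line: " + line)
            layouts.append(Layout(fields[0][1:], int(fields[1]),
                                  int(fields[2]), int(fields[3]),
                                  fields[4] if len(fields) > 4 else ""))
        else:
            if not layouts or len(fields) < 6:
                raise ValueError("unexpected read line: " + line)
            # Optional attributes have the form <attr_code>:<attr_data>
            attrs = dict(f.split(":", 1) for f in fields[6:])
            layouts[-1].reads.append(Read(fields[0], fields[1],
                                          int(fields[2]), int(fields[3]),
                                          int(fields[4]), int(fields[5]),
                                          attrs))
    return layouts
```

Because the fields are split on any run of whitespace, tab-delimited and space-delimited lines are handled the same way.<br>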
Each FASTA-like record in such a multi-layout file represents one layout
definition (a multiple alignment space). Every such layout definition
must start with a line beginning with the '<b>&gt;</b>' character and
containing some general contig/layout data (contig name, number of
component reads, the start/end coordinates for the layout space and
optionally the actual contig sequence, if any). This first
contig/layout general info line must be followed by exactly <small>&lt;</small><i>number_of_reads<small>&gt;</small></i> lines containing component/read information (one line per read). For each <i>read</i> line, the fields are as follows:<br>
<table bgcolor="#ccccff" border="0" cellpadding="2" cellspacing="2" width="100%"> |
<tbody> |
72 |
<tr> |
73 |
<td bgcolor="#ffffff" valign="top">1.<br> |
74 |
</td> |
75 |
<td bgcolor="#ffffff" valign="top"><i><small><</small>readName<small>></small></i></td> |
76 |
<td bgcolor="#ffffff" valign="top">a sequence identifier, unique within the current contig/layout</td> |
77 |
</tr> |
78 |
<tr> |
79 |
<td bgcolor="#ffffff" valign="top">2.<br> |
80 |
</td> |
81 |
<td bgcolor="#ffffff" valign="top"><small><</small><i>orientation<small>></small> </i></td> |
82 |
<td bgcolor="#ffffff" valign="top">one character: '+' or '-', representing the forward or reverse orientation of the read in the current layout<br> |
83 |
</td> |
84 |
</tr> |
85 |
<tr> |
86 |
<td bgcolor="#ffffff" valign="top">3.<br> |
87 |
</td> |
88 |
<td bgcolor="#ffffff" valign="top"><small><</small><i>read_length<small>></small> </i></td> |
89 |
<td bgcolor="#ffffff" valign="top">the actual length of the read (including the clipped ends). If the read is segmented (see the <b>G:</b> option of <i>&lt;extra_attributes&gt;</i>), the inter-segment gaps are not counted as part of the read length in the layout. <br>
90 |
</td> |
91 |
</tr> |
92 |
<tr> |
93 |
<td bgcolor="#ffffff" valign="top">4.<br> |
94 |
</td> |
95 |
<td bgcolor="#ffffff" valign="top"><small><</small><i>read_start_coord<small>></small></i></td> |
96 |
<td bgcolor="#ffffff" valign="top">the leftmost (lowest) |
97 |
coordinate of this read in the current layout. The position could be |
98 |
"virtual" if the read is clipped at that left end. The orientation of |
99 |
the read does not matter for this assessment.<br> |
100 |
</td> |
101 |
</tr> |
102 |
<tr> |
103 |
<td bgcolor="#ffffff" valign="top">5.<br> |
104 |
</td> |
105 |
<td bgcolor="#ffffff" valign="top"><small><</small><i>clip_left<small>> </small></i></td> |
106 |
<td bgcolor="#ffffff" valign="top">the number of nucleotides trimmed at the left end. (Orientation doesn't matter)<br> |
107 |
</td> |
108 |
</tr> |
109 |
<tr> |
110 |
<td bgcolor="#ffffff" valign="top">6.<br> |
111 |
</td> |
112 |
<td bgcolor="#ffffff" valign="top"><i><small> </small></i><small><</small><i>clip_right<small>> </small></i></td> |
113 |
<td bgcolor="#ffffff" valign="top">the number of nucleotides trimmed at the right end of this read.<br> |
114 |
</td> |
115 |
</tr> |
116 |
<tr> |
117 |
<td bgcolor="#ffffff" valign="top">7.-...<br> |
118 |
</td> |
119 |
<td bgcolor="#ffffff" valign="top"><i>[&lt;extra_attributes...&gt;]</i></td>
120 |
<td bgcolor="#ffffff" valign="top">One or more space-delimited <i>optional</i>
attributes for this read may follow. Their order is not enforced and
they should all start with a letter code followed by the ':' character
and then by attribute-specific text data (spaces not allowed).
124 |
General format is: <small><</small><i>attr_code<small>></small></i><small><big><b>:</b></big></small><small><</small><i>attr_data<small>><br> |
125 |
<br> |
126 |
</small></i>Attributes recognized by the current specification:<br> |
127 |
<table bgcolor="#cccccc" border="0" cellpadding="4" cellspacing="3" width="100%"> |
128 |
<tbody> |
129 |
<tr> |
130 |
<td bgcolor="#ffffff" valign="top"><b>C:</b><small><</small><i>grp#<small>></small></i></td> |
131 |
<td bgcolor="#ffffff" valign="top">group color for the current read. This instructs the viewer to
draw this read with a specific color uniquely associated with the given <small>&lt;</small><i>grp#<small>&gt;</small></i>
133 |
</td> |
134 |
</tr> |
135 |
<tr> |
136 |
<td bgcolor="#ffffff" valign="top"><b>L:</b><small><</small><i>other_readName<small>,..></small></i></td> |
137 |
<td bgcolor="#ffffff" valign="top">[clone] link to another |
138 |
read in the layout. This instructs the viewer to place the other read |
139 |
on the same vertical line in the layout display (if possible), with |
140 |
perhaps a dotted line connecting such reads; a comma delimited list of |
141 |
read names can be given if such links extend to more than one other |
142 |
read. Only one read in such a linked list needs to have such an L:
143 |
entry in order to declare the linked list/group of reads (that is, the |
144 |
other linked reads do not need to reciprocate by having the |
145 |
corresponding, but redundant L: attribute).<br> |
146 |
</td> |
147 |
</tr> |
148 |
<tr> |
149 |
<td bgcolor="#ffffff" valign="top"><b>S:</b><small><</small><i>sequence<small>></small></i></td> |
150 |
<td bgcolor="#ffffff" valign="top">The nucleotide sequence |
151 |
of the read exactly as it is included in the alignment. It must include |
152 |
the clipped ends and the small gaps (indels) introduced by the MSA |
153 |
(represented as '-' or '*' in the<small> <</small><i>sequence<small>></small></i>) -- so the length of such <small><</small><i>sequence<small>></small></i> string must be equal to <small><</small><i>read_length<small>></small> </i></td> |
154 |
</tr> |
155 |
<tr> |
156 |
<td bgcolor="#ffffff" valign="top"><b>G:</b><small>&lt;</small><i>seg1_end[</i><big><tt><b>c</b></tt></big><i><small>&lt;</small>seg1clipright<small>&gt;</small>]</i><i>[</i><big><tt><b>s</b><i>|</i><b>S</b></tt></big><i>]</i><tt><b>-</b></tt><i><br>
seg2_start</i><i>[</i><big><tt><b>c</b></tt></big><i><small>&lt;</small>seg2clipleft<small>&gt;</small>]</i><i>[</i><big><tt><b>s</b><i>|</i><b>S</b></tt></big><i>]</i><b>,</b><i>...&gt;</i></td>
158 |
<td bgcolor="#ffffff" valign="top">Segmented alignment |
159 |
(e.g. EST to genome). This is an indication that the read contains |
160 |
large internal gaps -- which should be displayed as <i>segments</i> |
161 |
connected by lines. The data for this attribute consists of a comma |
162 |
delimited list of coordinate pairs for the inter-segment gaps. |
163 |
Coordinates in a pair are separated by the '-' character. For each pair |
164 |
the first coordinate is the <i>end</i> position in the layout of the <i>previous</i> segment, while the second coordinate is the <i>start</i> position of the <i>next </i>segment in the layout.<br> |
165 |
<br> |
166 |
Example: say we have a "read" (e.g. a mRNA) called "MRNA244" which |
167 |
aligns onto the "contig" (e.g. genomic sequence) of length 34000 as 3 |
168 |
distinct segments (e.g. 3 exons), aligned at genomic coordinates: 300 |
169 |
to 500 (first segment), 800 to 1100 (second segment) and 1500 to 1900 |
170 |
(third segment) respectively. Assuming there is a 30nt clipping at the |
171 |
left end and 20nt clipping at the right end, and that the alignment has |
172 |
a "forward" orientation, the contig and the sequence line for MRNA244 |
173 |
in the layout file would look like this (let's assume there are 281 |
174 |
"reads" total in this imagined layout):<br> |
175 |
<br> |
176 |
<tt>>contig1 281 1 34000<br> |
177 |
MRNA244 + 953 270 30 20 G:500-800,1100-1500<br> |
178 |
</tt><br> |
179 |
The actual length of the "read" accounts for the length of each segment |
180 |
(201, 301 and 401 respectively) plus the clipping lengths at each end |
181 |
(20 and 30), so the total is 201+301+401+20+30=953<br> |
182 |
The left coordinate of the sequence in the alignment (270) is equal to |
183 |
the position of the first (leftmost) segment (300) minus the left-end |
184 |
clipping (30). <br> |
185 |
<br> |
186 |
Each segment may also be clipped individually. This can be
187 |
specified for each segment's end by appending the character <big><b><tt>c</tt></b></big> |
188 |
followed by the amount of clipping at that end. If in the example |
189 |
segment alignment above we had the 1st exon clipped 10 nucleotides at |
190 |
the right end, the 2nd exon clipped 5nt at the left end and 7nt at the |
191 |
right end, with the 3rd exon having 9nt clipped at the left end, the |
192 |
above read line may look like this:<br> |
193 |
<br> |
194 |
<tt> |
195 |
MRNA244 + 953 270 30 20 G:490c10-805c5,1093c7-1509c9<br> |
196 |
</tt><br> |
197 |
Please note that in this last example the actual coordinates of the |
198 |
alignment of the 3 segments (exons) to the genomic sequence are |
199 |
300-490, 805-1093 and 1509-1900 respectively. The way the clipping is |
200 |
specified in this G: attribute differs from the way the leftmost and |
201 |
rightmost clipping of the whole read is given. The difference is that |
202 |
the <big><b><tt>c</tt></b></big> |
203 |
clipping lengths in the G: attribute lie OUTSIDE the coordinates given |
204 |
for the segment ends in the same G: attribute, while the global leftmost
205 |
clipping (30 in the example above) is included in the offset coordinate |
206 |
for the whole read (270 here).<br> |
207 |
<br> |
208 |
For EST to genome alignments, an optional 'S' (or 's') character may |
209 |
follow the inter-segment ends, indicating that a splice consensus |
210 |
(major or minor, respectively) was found on that side of the intron |
211 |
corresponding to that inter-segment gap.<br> |
212 |
<br> |
213 |
</td> |
214 |
</tr> |
215 |
<tr> |
216 |
<td bgcolor="#ffffff" valign="top"><b>D:</b><small><</small><i>se</i><i>q_diffs<small>..><br> |
217 |
</small></i> or for segmented (G:) reads:<i><small><br> |
218 |
</small></i><b>D:</b><small><</small><i>se</i><i>g1_diffs<small>..></small></i><big><b>/</b></big><small><</small><i>se</i><i>g2_diffs<small>..></small></i><i><small><br> |
219 |
<br> |
220 |
</small></i><br> |
221 |
</td> |
222 |
<td bgcolor="#ffffff" valign="top">If the contig sequence is given, this read attribute provides the display application with only a list of point-<i>differences</i> between this read's sequence and the contig sequence (so S: is not needed). The <small>&lt;</small><i>seq_diffs<small>&gt;</small></i> is a concatenation of elements of this format:<br>
223 |
<small><</small><i>incremental_coordinate<small>></small></i><small><</small><i>character<small>></small></i><i><br> |
224 |
<br> |
225 |
</i>..where <small><</small><i>incremental_coordinate<small>></small></i> is the numeric position of such a difference <i>relative </i>to |
226 |
the previous difference -- or if no such previous difference exists, relative to |
227 |
the first (leftmost) non-clipped nucleotide of the read. This incremental |
228 |
coordinate (which is always at least 1) must be followed by a <small><</small><i>character<small>></small></i> |
229 |
code. This character can be either an actual DNA base letter ('A', 'G', 'C', 'T', 'N', etc.) -- which |
230 |
indicates a nucleotide mismatch at that position, or the "dash" |
231 |
character ('-') indicating a gap in the alignment of this read to the
232 |
contig sequence. <br> |
233 |
<br> |
234 |
Example: suppose the following alignment exists between the contig sequence and the read sequence:<br>
235 |
<br> |
236 |
<tt>contig:&nbsp;..A G T T G C T - C C T A - C T A C A G A C C N G...<br>
read:&nbsp;&nbsp;&nbsp;..A G T <font color="#990000">-</font> <font color="#990000">C</font> C T <font color="#990000">T</font> C C <font color="#990000">A</font> A <font color="#990000">N</font> C T <font color="#990000">- -</font> A <font color="#990000">T</font> A C C <font color="#990000">A</font> G...<br>
(increments:&nbsp;&nbsp;&nbsp;&nbsp;4&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4 ... )<br>
241 |
</tt><br> |
242 |
Assuming that the above alignment starts at position 200 in the contig |
243 |
and that the read called RDAAA of length 620 has 20 bp clipping right |
244 |
before this alignment (so the read left end coordinate in the layout is |
245 |
181), the following line description would apply to this read in the |
246 |
layout file (the ending ellipsis ... is not part of the actual text but |
247 |
just a placeholder for possible other differences to report):<br>
248 |
<br> |
249 |
<tt>RDAAA + 620 181 20 0 D:4-1C3T3A2N3-1-2T4A...<br> |
250 |
</tt><br> |
251 |
Note that in this compact MSA representation there is no information |
252 |
provided about the nucleotide content of the clipped ends of the read. |
253 |
The viewer application may choose to represent such clipped regions as |
254 |
empty or gray boxes (rectangles), with the actual nucleotides only |
255 |
displayed in the non-clipped regions.<br> |
256 |
<br> |
257 |
For segmented alignments (i.e. those reads having a <b>G:</b> |
258 |
attribute), multiple such lists should be given (one for each segment), |
259 |
separated by the '/' (slash) character. For each such |
260 |
segment-differences list, the first incremental coordinate will be the |
261 |
distance from the beginning of the first (leftmost) non-clipped |
262 |
nucleotide of that segment.<br> |
263 |
<br> |
264 |
</td> |
265 |
</tr><tr> |
266 |
<td bgcolor="#ffffff" valign="top"><b>I:</b><small><</small><i>se</i><i>q_indels<small>..><br> |
267 |
</small></i>or for segmented (G:) reads:<i><small><br> |
268 |
</small></i><b>I:</b><small><</small><i>se</i><i>g1_indels<small>..></small></i><big><b>/</b></big><small><</small><i>se</i><i>g2_indels<small>..></small></i><br> |
269 |
</td> |
270 |
<td bgcolor="#ffffff" valign="top">Similar to the D: attribute, only that the <i>original</i> |
271 |
sequence for the read is assumed known from other sources (e.g. an |
272 |
indexed multi-FASTA file) and only gaps and deletions are reported as |
273 |
the operations needed to make that sequence fit into the current MSA. <br> |
274 |
The coordinate system is now entirely based on the <i>original</i>, raw
275 |
read sequence, but with the same adjustment of the start coordinate |
276 |
based on the left clipping (i.e. all coordinates are relative to the |
277 |
first (leftmost) base in the read that is not clipped but actually used |
278 |
in the MSA).<br> |
279 |
The <small><</small><i>seq_indels<small>></small></i> is a concatenation of elements of this format:<br> |
280 |
<br> |
281 |
|
282 |
<small><</small><i>incremental_coordinate<small>></small></i><small><</small><i>indel_char<small>></small></i> <br> |
283 |
<br> |
284 |
..where <small><</small><i>indel_char<small>> </small></i>can |
285 |
be either '-' (gap) or 'd' (deletion) at the specific base |
286 |
position in the original read sequence. The actual base position can be |
287 |
obtained by this iterative formula:<br> |
288 |
<br> |
289 |
<i>&lt;base_position&gt;</i> = <i>&lt;incremental_coordinate&gt;</i> + <i>&lt;prev_base_position&gt;</i><br>
<br>
..where <i>&lt;prev_base_position&gt;</i> = <i>&lt;clip_left&gt;</i> for the first iteration (first element of <i>&lt;seq_indels&gt;</i>)<br>
292 |
<br> |
293 |
<i><small></small></i></td> |
294 |
</tr> |
295 |
<tr> |
296 |
<td bgcolor="#ffffff" valign="top"><b>R:</b><i><small><</small>gaplist<small>></small></i><b><big>/</big></b><i><small><</small>contig_gaplist<small>></small></i><br> |
297 |
</td> |
298 |
<td bgcolor="#ffffff" valign="top">a special attribute for |
299 |
the Assembly-on-Reference procedure, providing all gap information for |
300 |
the alignment of this read to the parent contig. The gapping |
301 |
information (as produced by mgblast with -D5 option) is stored directly |
302 |
in this attribute data: the gaps in the read are in the first list <i><small><</small>gaplist<small>></small></i> and the gaps in the contig (reference) sequence are stored in <i><small><</small>contig_gaplist<small>></small></i>. The two lists are separated by the '/' (slash) character. Just like |
303 |
mgblast's -D5 output, the gap list has the format:<br> |
304 |
<br> |
305 |
<i><small><</small>gap1pos<small>></small>[+<small><</small>gap1length<small>></small>]</i><b>,</b><i><small><</small>gap2pos<small>></small>[+<small><</small>gap2length<small>></small>],...</i><br> |
306 |
<br>The <i>nrcl</i> |
307 |
(non-redundification clustering) program automatically writes this
308 |
attribute in the layout file produced when the -y option is given (if |
309 |
the gap information is available in the parsed mgblast hits).<br> |
310 |
<br> |
311 |
The <i>mblaor</i> (mgblast assembly-on-reference) program requires this |
312 |
attribute to be present when parsing an input layout file in order to |
313 |
generate a full MSA, transforming this info into indel operations |
314 |
applied to the read.<br> |
315 |
|
316 |
</td> |
317 |
</tr> |
318 |
|
319 |
</tbody> |
320 |
</table> |
321 |
<br> |
322 |
</td> |
323 |
</tr> |
324 |
</tbody> |
325 |
</table> |
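<br>
The arithmetic of the <b>G:</b> examples in the table above can be checked with a short Python sketch. The function name <tt>segments_from_g</tt> is hypothetical, and the rule used to find the end of the last segment (subtracting the clipping lengths and all earlier segments from the read length) is inferred from the worked examples, not stated as a rule by the specification.<br>

```python
import re

# One inter-segment gap: seg_end[c<clip>][s|S]-seg_start[c<clip>][s|S]
_GAP = re.compile(r"(\d+)(?:c(\d+))?([sS])?-(\d+)(?:c(\d+))?([sS])?")


def segments_from_g(read_start, read_length, clip_left, clip_right, g_data):
    """Return the (start, end) layout coordinates of each aligned segment."""
    seg_start = read_start + clip_left        # first aligned base
    remaining = read_length - clip_left - clip_right
    segments = []
    for item in g_data.split(","):
        m = _GAP.fullmatch(item)
        if m is None:
            raise ValueError("malformed G: element: " + item)
        prev_end, next_start = int(m.group(1)), int(m.group(4))
        # Per-segment clipping lies outside the given segment coordinates
        inner_clip = int(m.group(2) or 0) + int(m.group(5) or 0)
        segments.append((seg_start, prev_end))
        remaining -= (prev_end - seg_start + 1) + inner_clip
        seg_start = next_start
    # Whatever read length is left belongs to the last segment (assumption)
    segments.append((seg_start, seg_start + remaining - 1))
    return segments


print(segments_from_g(270, 953, 30, 20, "500-800,1100-1500"))
print(segments_from_g(270, 953, 30, 20, "490c10-805c5,1093c7-1509c9"))
```

Both calls use the MRNA244 read lines from the table; the splice markers ('s' or 'S') are accepted by the pattern but ignored here.<br>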
<br> |
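The incremental coordinates of the <b>D:</b> attribute described in the table above can be turned into cumulative positions with a sketch like the one below. The function name <tt>decode_diffs</tt> is hypothetical; positions are counted as in the RDAAA example, where the first column shown in the alignment is position 1.<br>

```python
import re

_DIFF = re.compile(r"(\d+)(\D)")  # <incremental_coordinate><character>


def decode_diffs(d_data):
    """Return one list of (position, character) pairs per segment."""
    result = []
    for seg in d_data.split("/"):        # '/' separates per-segment lists
        if re.fullmatch(r"(?:\d+\D)*", seg) is None:
            raise ValueError("malformed D: data: " + seg)
        pos = 0
        diffs = []
        for incr, char in _DIFF.findall(seg):
            pos += int(incr)             # each coordinate is relative to the previous one
            diffs.append((pos, char))
        result.append(diffs)
    return result


print(decode_diffs("4-1C3T3A2N3-1-2T4A"))
```

For the example string this gives the positions 4, 5, 8, 11, 13, 16, 17, 19 and 23, which are the columns where the read differs from the contig in the alignment shown in the table.<br>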
<br> |
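The iterative formula given for the <b>I:</b> attribute in the table above can be sketched as follows. The function name <tt>indel_positions</tt> and the sample data <tt>5-3d</tt> are made up for illustration; only non-segmented reads are handled, since the specification does not say how the iteration restarts for later segments.<br>

```python
import re


def indel_positions(i_data, clip_left):
    """Return (base_position, indel_char) pairs in raw read coordinates."""
    if re.fullmatch(r"(?:\d+[-d])*", i_data) is None:
        raise ValueError("malformed I: data: " + i_data)
    prev = clip_left  # <prev_base_position> for the first iteration
    positions = []
    for incr, indel_char in re.findall(r"(\d+)([-d])", i_data):
        prev += int(incr)  # base_position = incremental_coordinate + prev_base_position
        positions.append((prev, indel_char))
    return positions


# Hypothetical read with 20 clipped bases at the left end
print(indel_positions("5-3d", 20))
```

With a left clipping of 20, the first element lands on base 25 of the original read and the second on base 28.<br>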
<br> |
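The two gap lists of the <b>R:</b> attribute described in the table above can be split with a sketch like this one. The function names and the sample data are hypothetical, and treating an omitted gap length as 1 is an assumption (the specification only says that the length is optional).<br>

```python
def parse_gaplist(gaplist):
    """Parse '<pos>[+<length>],...' into a list of (position, length) pairs."""
    gaps = []
    for item in gaplist.split(","):
        if not item:
            continue  # an empty list yields no gaps
        pos, _, length = item.partition("+")
        # Assumption: a gap without an explicit length is 1 base long
        gaps.append((int(pos), int(length) if length else 1))
    return gaps


def parse_r(r_data):
    """Return (read_gaps, contig_gaps) from the data of an R: attribute."""
    read_gaps, sep, contig_gaps = r_data.partition("/")
    if not sep:
        raise ValueError("R: data must contain the '/' separator")
    return parse_gaplist(read_gaps), parse_gaplist(contig_gaps)


# Made-up data: two gaps in the read, one gap in the contig
print(parse_r("12,40+3/7+2"))
```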
</body></html> |