root/gclib/gfview/Layout_File_Format.html
Revision: 45
Committed: Tue Sep 6 18:19:20 2011 UTC (8 years, 6 months ago) by gpertea
File size: 18397 byte(s)
Log Message:
added gfview files

1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2 <html><head>
3
4
5
6 <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"><title>Layout File Format</title></head>
7 <body>
8 <h3>Layout file format specification (proposal)<br>
9 </h3>
10
11
12
13
14 <b><br>
15 </b>Besides the usual ACE files, <b>clview </b>also recognizes a proprietary "layout" format (stored in files that usually have the extension <tt>.lyt</tt>)
16 for representing multiple sequence alignment (MSA) layouts -- where
17 typically smaller "component" sequences (henceforth called "<b>reads</b>") are aligned to (or make up) a larger sequence (henceforth called "the <b>contig</b>").<br>
18
19
20
21
22 <br>
23
24
25
26
27 These layout files are text files in a pseudo-FASTA format, with each
28 FASTA-like record representing one contig's "layout", like this:<br>
29
30
31
32
33 <br>
34
35
36
37
38 <big><b>&gt;</b></big><i><small>&lt;</small>contigName<small>&gt;</small></i> <small>&lt;</small><i>number_of_reads<small>&gt;</small> </i><small>&lt;</small><i>contig_start_coord<small>&gt;</small> </i><small>&lt;</small><i>contig_end_coord<small>&gt;</small> [&lt;sequence&gt;</i><i>]<br>
39 </i><i><small>&lt;</small>readName<small>&gt;</small></i> <small>&lt;</small><i>orientation<small>&gt;</small> </i><small>&lt;</small><i>read_length<small>&gt;</small> </i><small>&lt;</small><i>read_start_coord<small>&gt;</small></i> <small>&lt;</small><i>clip_left<small>&gt; </small></i><small>&lt;</small><i>clip_right<small>&gt; </small></i><i>[ &lt;extra_attributes...&gt;</i><i>]</i><i><br>
40 </i>.<br>
41
42
43
44
45 .<br>
46
47
48
49
50 All the fields on every line are space delimited (tab or plain space). Therefore no spaces are allowed <i>within</i> the fields (so contig and read names are not allowed to contain spaces).<br>
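As an illustration of the contig line described above, here is a minimal Python sketch that splits a '&gt;' line into its fields. The names <tt>ContigHeader</tt> and <tt>parse_contig_header</tt> are hypothetical and not part of clview; the sketch assumes the optional contig sequence, when present, is the fifth field.<br>
<pre>
```python
from typing import NamedTuple, Optional


class ContigHeader(NamedTuple):
    """Fields of the '>' line (hypothetical container, not part of clview)."""
    name: str
    num_reads: int
    start: int
    end: int
    sequence: Optional[str]  # optional fifth field


def parse_contig_header(line: str) -> ContigHeader:
    """Parse the '>' line that opens one layout record."""
    if not line.startswith(">"):
        raise ValueError("a layout record must start with '>'")
    # str.split() with no argument splits on any run of tabs or spaces.
    fields = line[1:].split()
    if len(fields) not in (4, 5):
        raise ValueError("expected 4 or 5 fields, got %d" % len(fields))
    sequence = fields[4] if len(fields) == 5 else None
    return ContigHeader(fields[0], int(fields[1]), int(fields[2]),
                        int(fields[3]), sequence)


header = parse_contig_header(">contig1 281 1 34000")
print(header.name, header.num_reads, header.start, header.end)
```
</pre>
Splitting on any whitespace run follows the rule that fields are tab or space delimited and never contain spaces themselves.<br>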
51
52
53
54
55 Each FASTA-like record in such a multi-layout file represents one layout
56 definition (a multiple alignment space). Every such layout definition
57 must start with a line beginning with the '<b>&gt;</b>' character and
58 containing some general contig/layout data (contig name, number of
59 component reads, the start/end coordinates for the layout space and
60 optionally the actual contig sequence, if any). This first
61 contig/layout general info line must be followed by exactly <small>&lt;</small><i>number_of_reads<small>&gt;</small></i> lines containing component/read information (one line per read). For each <i>read</i> line, the fields are as follows:<br>
62
63
64
65
66 <table bgcolor="#ccccff" border="0" cellpadding="2" cellspacing="2" width="100%">
67
68
69
70
71 <tbody>
72 <tr>
73 <td bgcolor="#ffffff" valign="top">1.<br>
74 </td>
75 <td bgcolor="#ffffff" valign="top"><i><small>&lt;</small>readName<small>&gt;</small></i></td>
76 <td bgcolor="#ffffff" valign="top">a sequence identifier, unique within the current contig/layout</td>
77 </tr>
78 <tr>
79 <td bgcolor="#ffffff" valign="top">2.<br>
80 </td>
81 <td bgcolor="#ffffff" valign="top"><small>&lt;</small><i>orientation<small>&gt;</small> </i></td>
82 <td bgcolor="#ffffff" valign="top">one character: '+' or '-', representing the forward or reverse orientation of the read in the current layout<br>
83 </td>
84 </tr>
85 <tr>
86 <td bgcolor="#ffffff" valign="top">3.<br>
87 </td>
88 <td bgcolor="#ffffff" valign="top"><small>&lt;</small><i>read_length<small>&gt;</small> </i></td>
89 <td bgcolor="#ffffff" valign="top">the actual length of the read (including the clipped ends).&nbsp; If the read is segmented (see the <b>G:</b> option of <i>&lt;extra_attributes&gt;</i>), the inter-segment gaps are not counted as part of the read length in the layout. <br>
90 </td>
91 </tr>
92 <tr>
93 <td bgcolor="#ffffff" valign="top">4.<br>
94 </td>
95 <td bgcolor="#ffffff" valign="top"><small>&lt;</small><i>read_start_coord<small>&gt;</small></i></td>
96 <td bgcolor="#ffffff" valign="top">the leftmost (lowest)
97 coordinate of this read in the current layout. The position could be
98 "virtual" if the read is clipped at that left end. The orientation of
99 the read does not matter for this assessment.<br>
100 </td>
101 </tr>
102 <tr>
103 <td bgcolor="#ffffff" valign="top">5.<br>
104 </td>
105 <td bgcolor="#ffffff" valign="top"><small>&lt;</small><i>clip_left<small>&gt; </small></i></td>
106 <td bgcolor="#ffffff" valign="top">the number of nucleotides trimmed at the left end. (Orientation doesn't matter)<br>
107 </td>
108 </tr>
109 <tr>
110 <td bgcolor="#ffffff" valign="top">6.<br>
111 </td>
112 <td bgcolor="#ffffff" valign="top"><i><small> </small></i><small>&lt;</small><i>clip_right<small>&gt; </small></i></td>
113 <td bgcolor="#ffffff" valign="top">the number of nucleotides trimmed at the right end of this read.<br>
114 </td>
115 </tr>
116 <tr>
117 <td bgcolor="#ffffff" valign="top">7.-...<br>
118 </td>
119 <td bgcolor="#ffffff" valign="top"><i>[&lt;extra_attributes...&gt;]</i></td>
120 <td bgcolor="#ffffff" valign="top">One or more space-delimited<i> optional</i>
121 attributes for this read may follow. Their order is not enforced and
122 they should all start with a letter code followed by the ':' character
123 and then followed by attribute specific text data (spaces not allowed).
124 General format is:&nbsp; <small>&lt;</small><i>attr_code<small>&gt;</small></i><small><big><b>:</b></big></small><small>&lt;</small><i>attr_data<small>&gt;<br>
125 <br>
126 </small></i>Attributes recognized by the current specification:<br>
127 <table bgcolor="#cccccc" border="0" cellpadding="4" cellspacing="3" width="100%">
128 <tbody>
129 <tr>
130 <td bgcolor="#ffffff" valign="top"><b>C:</b><small>&lt;</small><i>grp#<small>&gt;</small></i></td>
131 <td bgcolor="#ffffff" valign="top">Group color for the current read. This instructs the viewer to
132 draw this read with a specific color uniquely associated with the given <small>&lt;</small><i>grp#<small>&gt;</small></i>.
133 </td>
134 </tr>
135 <tr>
136 <td bgcolor="#ffffff" valign="top"><b>L:</b><small>&lt;</small><i>other_readName<small>,..&gt;</small></i></td>
137 <td bgcolor="#ffffff" valign="top">[clone] link to another
138 read in the layout. This instructs the viewer to place the other read
139 on the same vertical line in the layout display (if possible), with
140 perhaps a dotted line connecting such reads; a comma delimited list of
141 read names can be given if such links extend to more than one other
142 read. Only one read in such a linked list&nbsp; needs to have such a L:
143 entry in order to declare the linked list/group of reads (that is, the
144 other linked reads do not need to reciprocate by having the
145 corresponding, but redundant L: attribute).<br>
146 </td>
147 </tr>
148 <tr>
149 <td bgcolor="#ffffff" valign="top"><b>S:</b><small>&lt;</small><i>sequence<small>&gt;</small></i></td>
150 <td bgcolor="#ffffff" valign="top">The nucleotide sequence
151 of the read exactly as it is included in the alignment. It must include
152 the clipped ends and the small gaps (indels) introduced by the MSA
153 (represented as '-' or '*' in the<small> &lt;</small><i>sequence<small>&gt;</small></i>) -- so the length of such <small>&lt;</small><i>sequence<small>&gt;</small></i> string must be equal to <small>&lt;</small><i>read_length<small>&gt;</small> </i></td>
154 </tr>
155 <tr>
156 <td bgcolor="#ffffff" valign="top"><b>G:</b><small>&lt;</small><i>seg1_end[</i><big><tt><b>c</b></tt></big><i><small>&lt;</small>seg1clipright<small>&gt;</small>]</i><i>[</i><big><tt><b>s</b><i>|</i><b>S</b></tt></big><i>]</i><tt><b>-</b></tt><i><br>
157 &nbsp;seg2_start</i><i>[</i><big><tt><b>c</b></tt></big><i><small>&lt;</small>seg2clipleft<small>&gt;</small>]</i><i>[</i><big><tt><b>s</b><i>|</i><b>S</b></tt></big><i>]</i><b>,</b><i>...&gt;</i></td>
158 <td bgcolor="#ffffff" valign="top">Segmented alignment
159 (e.g. EST to genome). This is an indication that the read contains
160 large internal gaps -- which should be displayed as <i>segments</i>
161 connected by lines. The data for this attribute consists of a comma
162 delimited list of coordinate pairs for the inter-segment gaps.
163 Coordinates in a pair are separated by the '-' character. For each pair
164 the first coordinate is the <i>end</i> position in the layout of the <i>previous</i> segment, while the second coordinate is the <i>start</i> position of the <i>next </i>segment in the layout.<br>
165 <br>
166 Example: say we have a "read" (e.g. a mRNA) called "MRNA244" which
167 aligns onto the "contig" (e.g. genomic sequence) of length 34000 as 3
168 distinct segments (e.g. 3 exons), aligned at genomic coordinates: 300
169 to 500 (first segment), 800 to 1100 (second segment) and 1500 to 1900
170 (third segment) respectively. Assuming there is a 30nt clipping at the
171 left end and 20nt clipping at the right end, and that the alignment has
172 a "forward" orientation, the contig and the sequence line for MRNA244
173 in the layout file would look like this (let's assume there are 281
174 "reads" total in this imagined layout):<br>
175 <br>
176 <tt>&gt;contig1 281 1 34000<br>
177 MRNA244 + 953 270 30 20 G:500-800,1100-1500<br>
178 </tt><br>
179 The actual length of the "read" accounts for the length of each segment
180 (201, 301 and 401 respectively) plus the clipping lengths at each end
181 (20 and 30), so the total is 201+301+401+20+30=953<br>
182 The left coordinate of the sequence in the alignment (270) is equal to
183 the position of the first (leftmost) segment (300) minus the left-end
184 clipping (30). <br>
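The arithmetic above can be checked with a short Python sketch. The helper names <tt>read_length_and_start</tt> and <tt>gap_attribute</tt> are hypothetical; the sketch assumes segments are given leftmost first as inclusive (start, end) layout coordinates, with no per-segment clipping.<br>
<pre>
```python
def read_length_and_start(segments, clip_left, clip_right):
    """Return (read_length, read_start_coord) for a segmented read.

    segments: (start, end) layout coordinates, inclusive, leftmost first.
    """
    aligned = sum(end - start + 1 for start, end in segments)
    read_length = aligned + clip_left + clip_right
    # The start coordinate is "virtual": it includes the left clipping.
    read_start = segments[0][0] - clip_left
    return read_length, read_start


def gap_attribute(segments):
    """Build the G: attribute from the end of each segment and the start of the next."""
    pairs = ["%d-%d" % (segments[i][1], segments[i + 1][0])
             for i in range(len(segments) - 1)]
    return "G:" + ",".join(pairs)


exons = [(300, 500), (800, 1100), (1500, 1900)]
length, start = read_length_and_start(exons, clip_left=30, clip_right=20)
print("MRNA244 + %d %d 30 20 %s" % (length, start, gap_attribute(exons)))
# prints: MRNA244 + 953 270 30 20 G:500-800,1100-1500
```
</pre>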
185 <br>
186 Each segment can also have its own clipping. This is
187 specified for each segment's end by appending the character <big><b><tt>c</tt></b></big>
188 followed by the amount of clipping at that end. If in the example
189 segment alignment above we had the 1st exon clipped 10 nucleotides at
190 the right end, the 2nd exon clipped 5nt at the left end and 7nt at the
191 right end, with the 3rd exon having 9nt clipped at the left end, the
192 above read line would look like this:<br>
193 <br>
194 <tt>
195 MRNA244 + 953 270 30 20 G:490c10-805c5,1093c7-1509c9<br>
196 </tt><br>
197 Please note that in this last example the actual coordinates of the
198 alignment of the 3 segments (exons) to the genomic sequence are
199 300-490, 805-1093 and 1509-1900 respectively. The way the clipping is
200 specified in this G: attribute differs from the way the leftmost and
201 rightmost clipping of the whole read is given. The difference is that
202 the <big><b><tt>c</tt></b></big>
203 clipping lengths in the G: attribute lie OUTSIDE the coordinates given
204 for the segment ends in the same G: attribute, while the global leftmost
205 clipping (30 in the example above) is included in the offset coordinate
206 for the whole read (270 here).<br>
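<br>
The following Python sketch parses the G: data, including the optional <tt>c</tt> clipping and <tt>s</tt>/<tt>S</tt> markers, and recovers the aligned coordinates of each segment from the read line. All function names are hypothetical. The formula for the end of the last segment is inferred from the two examples above (inter-segment gaps excluded from the read length, <tt>c</tt> clipping lying outside the given coordinates) and is an assumption, not a statement of how clview computes it. Segment coordinates are assumed to be non-negative.<br>
<pre>
```python
import re

# One coordinate: digits, optional c + clip length, optional splice marker.
_COORD = re.compile(r"(\d+)(?:c(\d+))?([sS])?")


def parse_coord(text):
    m = _COORD.fullmatch(text)
    if m is None:
        raise ValueError("bad G: coordinate: %r" % text)
    return {"pos": int(m.group(1)), "clip": int(m.group(2) or 0),
            "splice": m.group(3)}


def parse_gaps(attr_data):
    """Return a list of (previous_segment_end, next_segment_start) pairs."""
    gaps = []
    for pair in attr_data.split(","):
        left, right = pair.split("-")
        gaps.append((parse_coord(left), parse_coord(right)))
    return gaps


def aligned_segments(read_start, read_length, clip_left, clip_right, gaps):
    """Recover the (start, end) layout coordinates of every segment."""
    first_start = read_start + clip_left
    # Layout positions skipped by each gap, measured between the
    # clipped (outer) ends of the neighbouring segments.
    skipped = sum((right["pos"] - right["clip"]) - (left["pos"] + left["clip"]) - 1
                  for left, right in gaps)
    last_end = read_start + read_length + skipped - clip_right - 1
    starts = [first_start] + [right["pos"] for _, right in gaps]
    ends = [left["pos"] for left, _ in gaps] + [last_end]
    return list(zip(starts, ends))


gaps = parse_gaps("490c10-805c5,1093c7-1509c9")
print(aligned_segments(270, 953, 30, 20, gaps))
# prints: [(300, 490), (805, 1093), (1509, 1900)]
```
</pre>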
207 <br>
208 For EST to genome alignments, an optional 'S' (or 's') character may
209 follow the inter-segment ends, indicating that a splice consensus
210 (major or minor, respectively) was found on that side of the intron
211 corresponding to that inter-segment gap.<br>
212 &nbsp;<br>
213 </td>
214 </tr>
215 <tr>
216 <td bgcolor="#ffffff" valign="top"><b>D:</b><small>&lt;</small><i>se</i><i>q_diffs<small>..&gt;<br>
217 </small></i>&nbsp;or for segmented (G:)&nbsp; reads:<i><small><br>
218 </small></i><b>D:</b><small>&lt;</small><i>se</i><i>g1_diffs<small>..&gt;</small></i><big><b>/</b></big><small>&lt;</small><i>se</i><i>g2_diffs<small>..&gt;</small></i><i><small><br>
219 <br>
220 </small></i><br>
221 </td>
222 <td bgcolor="#ffffff" valign="top">If the contig sequence is given, this read attribute is the way to provide the display application with only a list of point-<i>differences</i> between this read's sequence and the contig sequence (so S: is not needed).&nbsp; The <small>&lt;</small><i>seq_diffs<small>&gt;</small></i> is a concatenation of elements of this format:<br>
223 <small>&lt;</small><i>incremental_coordinate<small>&gt;</small></i><small>&lt;</small><i>character<small>&gt;</small></i><i><br>
224 <br>
225 </i>..where <small>&lt;</small><i>incremental_coordinate<small>&gt;</small></i> is the numeric position of such a difference <i>relative </i>to
226 the previous difference -- or&nbsp; if no such previous difference exists, relative to
227 the first (leftmost) non-clipped nucleotide of the read. This incremental
228 coordinate (which is always at least 1) must be followed by a <small>&lt;</small><i>character<small>&gt;</small></i>
229 code. This character can be either an actual DNA base letter ('A', 'G', 'C', 'T', 'N', etc.) -- which
230 indicates a nucleotide mismatch at that position, or the "dash"
231 character ('-') indicating a gap in the alignment of this read to the
232 contig sequence. <br>
233 <br>
234 Example: consider the following alignment between the contig sequence and the read sequence:<br>
235 <br>
236 <tt>contig:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ..A G T T G C T - C C T A - C T A C A G A C C N G...<br>
237 read: &nbsp; &nbsp; &nbsp;&nbsp; ..A G T <font color="#990000">-</font> <font color="#990000">C</font> C T <font color="#990000">T</font> C C <font color="#990000">A</font> A <font color="#990000">N</font> C T <font color="#990000">- -</font> A <font color="#990000">T</font> A C C <font color="#990000">A</font> G...<br>
238 (increments:&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; 4 1&nbsp; &nbsp;&nbsp; 3
239 &nbsp;&nbsp;&nbsp; 3 &nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 3 1 &nbsp; 2
240 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp; ... )<br>
241 </tt><br>
242 Assuming that the above alignment starts at position 200 in the contig
243 and that the read called RDAAA of length 620 has 20 bp clipping right
244 before this alignment (so the read left end coordinate in the layout is
245 181), the following line description would apply to this read in the
246 layout file (the ending ellipsis ... is not part of the actual text but
247 just a placeholder for possible other differences to report):<br>
248 <br>
249 <tt>RDAAA + 620 181 20 0 D:4-1C3T3A2N3-1-2T4A...<br>
250 </tt><br>
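Decoding the D: data is a running sum over the increments. A minimal Python sketch follows; the function names are hypothetical, and offsets are counted so that the first non-clipped nucleotide is offset 1, which is what the first increment (4) in the example above suggests.<br>
<pre>
```python
import re


def decode_diffs(seq_diffs):
    """Turn e.g. '4-1C' into [(4, '-'), (5, 'C')].

    Offsets are 1-based from the first non-clipped nucleotide.
    """
    if re.fullmatch(r"(?:\d+[A-Za-z-])*", seq_diffs) is None:
        raise ValueError("malformed D: data: %r" % seq_diffs)
    position = 0
    diffs = []
    for increment, char in re.findall(r"(\d+)([A-Za-z-])", seq_diffs):
        position += int(increment)  # each increment is relative to the previous diff
        diffs.append((position, char))
    return diffs


def decode_segmented_diffs(attr_data):
    """For reads with a G: attribute: one list per '/'-separated segment."""
    return [decode_diffs(part) for part in attr_data.split("/")]


print(decode_diffs("4-1C3T3A2N3-1-2T4A"))
# prints: [(4, '-'), (5, 'C'), (8, 'T'), (11, 'A'), (13, 'N'),
#          (16, '-'), (17, '-'), (19, 'T'), (23, 'A')]  (on one line)
```
</pre>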
251 Note that in this compact MSA representation there is no information
252 provided about the nucleotide content of the clipped ends of the read.
253 The viewer application may choose to represent such clipped regions as
254 empty or gray boxes (rectangles), with the actual nucleotides only
255 displayed in the non-clipped regions.<br>
256 <br>
257 For segmented alignments (i.e. those reads having a <b>G:</b>
258 attribute), multiple such lists should be given (one for each segment),
259 separated by the '/' (slash) character. For each such
260 segment-differences list, the first incremental coordinate will be the
261 distance from the beginning of the first (leftmost) non-clipped
262 nucleotide of that segment.<br>
263 <br>
264 </td>
265 </tr><tr>
266 <td bgcolor="#ffffff" valign="top"><b>I:</b><small>&lt;</small><i>se</i><i>q_indels<small>..&gt;<br>
267 </small></i>or for segmented (G:)&nbsp; reads:<i><small><br>
268 </small></i><b>I:</b><small>&lt;</small><i>se</i><i>g1_indels<small>..&gt;</small></i><big><b>/</b></big><small>&lt;</small><i>se</i><i>g2_indels<small>..&gt;</small></i><br>
269 </td>
270 <td bgcolor="#ffffff" valign="top">Similar to the D: attribute, only that the <i>original</i>
271 sequence for the read is assumed known from other sources (e.g. an
272 indexed multi-FASTA file) and only gaps and deletions are reported as
273 the operations needed to make that sequence fit into the current MSA. <br>
274 The coordinate system is now entirely based on the <i>original</i>, raw
275 read sequence, but with the same adjustment of the start coordinate
276 based on the left clipping (i.e. all coordinates are relative to the
277 first (leftmost) base in the read that is not clipped but actually used
278 in the MSA).<br>
279 The <small>&lt;</small><i>seq_indels<small>&gt;</small></i> is a concatenation of elements of this format:<br>
280 <br>
281
282 <small>&lt;</small><i>incremental_coordinate<small>&gt;</small></i><small>&lt;</small><i>indel_char<small>&gt;</small></i> <br>
283 <br>
284 ..where&nbsp; <small>&lt;</small><i>indel_char<small>&gt;&nbsp; </small></i>can
285 be either '-' (gap) or 'd' (deletion)&nbsp; at the specific base
286 position in the original read sequence. The actual base position can be
287 obtained by this iterative formula:<br>
288 <br>
289 <i>&lt;base_position&gt;</i>&nbsp; =&nbsp; <i>&lt;incremental_coordinate&gt;</i> + <i>&lt;prev_base_position&gt;</i><br>
290 <br>
291 ..where <i>&lt;prev_base_position&gt;</i> = <i>&lt;clip_left&gt;</i> for the first iteration (first element of&nbsp; <i>&lt;seq_indels&gt;</i>)<br>
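<br>
The iterative formula above can be sketched in Python as follows. The function name <tt>decode_indels</tt> and the sample I: string are made up for illustration; only the formula itself comes from this specification.<br>
<pre>
```python
import re


def decode_indels(seq_indels, clip_left):
    """Return [(base_position, indel_char), ...] in raw-read coordinates."""
    if re.fullmatch(r"(?:\d+[-d])*", seq_indels) is None:
        raise ValueError("malformed I: data: %r" % seq_indels)
    position = clip_left  # prev_base_position for the first element
    ops = []
    for increment, indel_char in re.findall(r"(\d+)([-d])", seq_indels):
        position += int(increment)  # base_position = increment + prev_base_position
        ops.append((position, indel_char))
    return ops


# Hypothetical data: a read with 20 clipped bases at its left end.
print(decode_indels("5-3d10-", clip_left=20))
# prints: [(25, '-'), (28, 'd'), (38, '-')]
```
</pre>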
292 <br>
293 <i><small></small></i></td>
294 </tr>
295 <tr>
296 <td bgcolor="#ffffff" valign="top"><b>R:</b><i><small>&lt;</small>gaplist<small>&gt;</small></i><b><big>/</big></b><i><small>&lt;</small>contig_gaplist<small>&gt;</small></i><br>
297 </td>
298 <td bgcolor="#ffffff" valign="top">a special attribute for
299 the Assembly-on-Reference procedure, providing all gap information for
300 the alignment of this read to the parent contig. The gapping
301 information (as produced by mgblast with -D5 option) is stored directly
302 in this attribute data: the gaps in the read are in the first list <i><small>&lt;</small>gaplist<small>&gt;</small></i> and the gaps in the contig (reference) sequence are stored in <i><small>&lt;</small>contig_gaplist<small>&gt;</small></i>. The two lists are separated by the '/' (slash) character. Just like
303 mgblast's -D5 output, the gap list has the format:<br>
304 <br>
305 <i><small>&lt;</small>gap1pos<small>&gt;</small>[+<small>&lt;</small>gap1length<small>&gt;</small>]</i><b>,</b><i><small>&lt;</small>gap2pos<small>&gt;</small>[+<small>&lt;</small>gap2length<small>&gt;</small>],...</i><br>
306 <br>The <i>nrcl</i>
307 (non-redundification clustering) program automatically writes this
308 attribute in the layout file produced when the -y option is given (if
309 the gap information is available in the parsed mgblast hits).<br>
310 <br>
311 The <i>mblaor</i> (mgblast assembly-on-reference) program requires this
312 attribute to be present when parsing an input layout file in order to
313 generate a full MSA, transforming this info into indel operations
314 applied to the read.<br>
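<br>
A purely syntactic Python sketch of reading the R: data is shown below. The function names and the sample string are hypothetical. Treating a gap with no <tt>+length</tt> part as having length 1 is an assumption made for this sketch; this specification does not state the default, and the meaning of the positions is defined by mgblast, not here.<br>
<pre>
```python
def parse_gaplist(text):
    """Parse 'pos[+length],pos[+length],...' into (pos, length) tuples."""
    gaps = []
    for item in text.split(","):
        if not item:
            continue  # an empty list yields no gaps
        pos, plus, length = item.partition("+")
        # ASSUMPTION: a missing '+length' part means a gap of length 1.
        gaps.append((int(pos), int(length) if plus else 1))
    return gaps


def parse_r_attribute(attr_data):
    """Split 'gaplist/contig_gaplist' and parse both halves."""
    read_part, slash, contig_part = attr_data.partition("/")
    if not slash:
        raise ValueError("R: data must contain a '/' separator")
    return parse_gaplist(read_part), parse_gaplist(contig_part)


# Hypothetical attribute data, not taken from real mgblast output.
read_gaps, contig_gaps = parse_r_attribute("12+3,40/7,95+2")
print(read_gaps, contig_gaps)
# prints: [(12, 3), (40, 1)] [(7, 1), (95, 2)]
```
</pre>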
315
316 </td>
317 </tr>
318
319 </tbody>
320 </table>
321 <br>
322 </td>
323 </tr>
324 </tbody>
325 </table>
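<br>
Putting the record structure described above together, here is a hedged Python sketch of a reader that checks each contig line is followed by exactly the declared number of read lines. The names <tt>parse_read_line</tt> and <tt>read_layouts</tt> are hypothetical, and skipping blank lines is an assumption of this sketch. Attribute data is kept as raw text, keyed by its letter code.<br>
<pre>
```python
import io


def parse_read_line(line):
    """Split one read line into its six mandatory fields plus attributes."""
    fields = line.split()
    if not len(fields) >= 6:
        raise ValueError("a read line needs at least 6 fields: %r" % line)
    if fields[1] not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-': %r" % line)
    length, start, clip_left, clip_right = (int(f) for f in fields[2:6])
    attrs = {}
    for token in fields[6:]:
        code, colon, data = token.partition(":")
        if not colon:
            raise ValueError("attribute without ':' in %r" % line)
        attrs[code] = data
    return {"name": fields[0], "orientation": fields[1], "length": length,
            "start": start, "clip_left": clip_left,
            "clip_right": clip_right, "attrs": attrs}


def read_layouts(handle):
    """Yield (contig, reads) for every record found in a layout file."""
    # ASSUMPTION: blank lines are ignored; the text does not mention them.
    lines = (ln.strip() for ln in handle if ln.strip())
    for line in lines:
        if not line.startswith(">"):
            raise ValueError("expected a '>' contig line, got %r" % line)
        head = line[1:].split()
        contig = {"name": head[0], "num_reads": int(head[1]),
                  "start": int(head[2]), "end": int(head[3]),
                  "sequence": head[4] if len(head) > 4 else None}
        reads = []
        for _ in range(contig["num_reads"]):
            read_line = next(lines, None)
            if read_line is None or read_line.startswith(">"):
                raise ValueError("too few read lines for " + contig["name"])
            reads.append(parse_read_line(read_line))
        yield contig, reads


# The two example reads from this page, with the read count set to 2.
sample = (">contig1 2 1 34000\n"
          "MRNA244 + 953 270 30 20 G:500-800,1100-1500\n"
          "RDAAA + 620 181 20 0 D:4-1C3T3A2N3-1-2T4A\n")
records = list(read_layouts(io.StringIO(sample)))
print(records[0][0]["name"], [r["name"] for r in records[0][1]])
# prints: contig1 ['MRNA244', 'RDAAA']
```
</pre>
A surplus read line is caught on the next iteration, because it does not start with '&gt;'.<br>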
326
327
328
329
330 <br>
331
332
333
334
335 <br>
336
337
338
339
340 <br>
341
342
343
344
345 </body></html>