ViewVC Help
View File | Revision Log | Show Annotations | Root Listing
Revision: (vendor branch)
Committed: Thu Sep 7 15:35:22 2006 UTC (9 years, 7 months ago) by knirirr
Branch: MAIN, cehox
CVS Tags: start, HEAD
Changes since 1.1: +0 -0 lines
Log Message:
Imported sources

Line File contents
1 <HTML>
2 <HEAD>
3 <TITLE>Msatfinder manual</TITLE>
4 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
5 <link rel="stylesheet" href="quickmineoutput.css" type="text/css">
6 </HEAD>
8 <!--<BODY bgcolor="#FFFFCC">-->
9 <body style='background-color:#FF6633;'>
10 <table align="center" style='border:solid;border-width:2px;border-color:navy;background-color:#FFFFCC'>
11 <tr><td>
14 <table border="2" cellspacing="2" cellpadding="2" bgcolor="#FFFFFF">
15 <tr valign="middle">
16 <td width="90%"><span class="quickminetitle">Msatfinder manual</span> </td>
17 <td width="10%"><img src="msatminer.png"></td>
18 </tr>
19 </table>
22 <a name="index"></a>
23 <!-- INDEX BEGIN -->
24 <!--
25 <a href="index.html">HOME</a> |
26 <a href="msatfinder_manual.html">msatfinder manual</a> |
27 <a href="">msatfinder on-line</a> |
28 <a href="online_help.html">msatfinder on-line help</a>
29 -->
30 <UL>
31 <li><a href="#intro">Introduction</a>
32 <li><a href="#install">Installation</a>
33 <ul>
34 <LI><A HREF="#base_install">The msatfinder script</A></LI>
35 <LI><A HREF="#deps_install">Dependencies</A></LI>
36 </ul>
37 <li><a href="#msatfinder">Using msatfinder</a></li>
38 <ul>
39 <LI><A HREF="#finder_overview">Overview</A></LI>
40 <LI><A HREF="#finder_files">Input and output files</A></LI>
41 <LI><A HREF="#finder_key">Key to column headers</A></LI>
42 <LI><A HREF="#finder_config">Configuration</A></LI>
43 <LI><A HREF="#running_finder">Searching for microsatellites</A></LI>
44 </ul>
45 <li><a href="#online_help">Online interface</a></li>
46 <ul>
47 <li><A HREF="#motif">Motif selection</A></li>
48 <li><A HREF="#threshold">Threshold selection</A></li>
49 <li><A HREF="#advanced">Advanced options</A></li>
50 <li><A HREF="#download">Download options</A></li>
51 <li><A HREF="#interrupts">Interrupted msats</A></li>
52 <li><A HREF="#upload">Upload file</A></li>
53 <li><A HREF="#paste">Paste sequence</A></li>
54 <li><A HREF="#output">Output</A></li>
55 <li><A HREF="#errors">What if it didn't work?</A></li>
56 </UL>
57 </ul>
58 <li><a href="#other_things">Other information</a>.
59 <ul>
60 <LI><A HREF="#bugs">Bug reporting/known bugs</A></LI>
61 <LI><A HREF="#convert">File conversion/handling</A></LI>
62 <LI><A HREF="#coding_style">Coding style</A></LI>
63 </ul>
64 <li><a href="index.html">Back to the main page</a>
65 </UL>
66 <!-- INDEX END -->
68 <hr>
69 <font color="green">
70 <center>
71 <h1><a name="intro">Introduction</h1>
72 </center>
73 </font>
74 <p>Msatfinder examines sequence files (generally, small genomes) in GenBank, FASTA, EMBL and Swissprot (though ASCII can also be read) formats, and determines the number, type and position of microsatellite repeats.
75 <p>The software is designed to run on unix or Linux computers. If you'd like to test it on Mac OSX, we'd like to hear your feedback.
77 <center>
78 <br><a href="#index">To the top.</a>
79 </center>
80 <hr>
82 <font color="green">
83 <center>
84 <h1><a name="install">Installation</h1>
85 </center>
86 </font>
88 <h3><a name="base_install">The msatfinder script</h3>
89 <ol>
90 <li><a href="">Download the tar or zip file</a>.
91 <li>Unpack the downloaded file and cd to the new directory.
92 <li>run <pre>./msatfinder [options] file(s)</pre>
93 <li>Er...
94 <li>That's it.
95 </ol>
98 <p>You may make a file containing a list of all input files that you'd like msatfinder to run on, or supply their names on the command line (e.g. with a glob). <p>If you use a list, then run msatfinder as:
99 <pre>
100 ./msatfinder -l name_of_list_file
101 </pre>
103 <p>To run on the sample files provided you could provide their names or glob the file suffixes. For example:
104 <pre>
105 ./msatfinder *.gbk
106 </pre>
107 to search the Genbank file provided.
109 <p>Msatfinder can in fact be placed anywhere you like, for example in /usr/local/bin. As long as you have a configuration file in the same directory as your data then msatfinder will work. For example:
110 <pre>
111 cd /home/user/msat_data/bacteria
112 /usr/local/bin/msatfinder *.gbk
113 </pre>
116 <p>Msatfinder should run without any need to configure it. If you'd like to change any of the parameters, please see the <A HREF="#finder_config">configuration</A> section, below. There are a variety of options that can be used each time msatfinder is run, and these are described in the <A HREF="#running_finder">searching for microsatellites</A> section.
119 <h3><a name="deps_install">Dependencies</h3>
120 <p>There are two types of dependency required by msatfinder: Perl modules, and external programs.
122 <h4>Perl modules</h4>
123 <p>All of the following must be installed for all the scripts to work properly:
125 <ul>
126 <li><a href="">BioPerl</a>.
127 <li><a href="">CGI</a>.
128 <li><a href="">Config::Simple</a>.
129 <li><a href="">File::Copy</a>.
130 <li><a href="">Getopt::Std</a>.
131 <li><a href="">Term::ReadLine</a>.
132 <li><a href="">Term::ANSIColor</a>.
133 <li><a href="">List::Util</a>.
134 </ul>
136 <p>More information on installing Perl modules can be found at <a href="">CPAN</a>. However, many of them will be installed as standard with Perl 5.8.3 (if so, <b>man</b> or <b>Perldoc</b> should provide information).
138 <h4>External programs</h4>
139 <p>The following external applications are used by msatfinder:
141 <ul>
142 <li><a href="">EMBOSS</a>.
143 <li><a href="">primer3</a>. primer3_core should be installed on your PATH, as it is required for the eprimer3 part of EMBOSS to function (used by msatfinder to determine primers).
144 </ul>
146 <p>We use <a href="">Gentoo Linux</a> and <a href="">BioLinux</a> as the former has packages available for all these dependencies and the latter has them already installed.
148 <h4>Dependency search priorities</h4>
149 <p>Msatfinder looks for (external program) dependencies as follows:
151 <ol>
152 <li>It will check whether the location given in the config file is present and executable.
153 <li>If not, it will use &ldquo;which&rdquo; to look for the executable on one's path.
154 <li>If none can be found, the script will die with a list of which dependencies were missing.
155 </ol>
157 <p>This system will allow you to specify alternative versions of dependencies, which will have priority over the ones on the user's PATH, if required. This may be useful if the user has more than one version of EMBOSS and running a specific version is required.
160 <center>
161 <br><a href="#index">To the top.</a>
162 </center>
165 <hr>
166 <font color="green">
167 <center>
168 <h1><a name="msatfinder">Using msatfinder</a></h1>
169 </center>
170 </font>
172 <H2><A NAME="finder_overview">Overview</A></H2>
173 <p>Msatfinder finds was designed to find perfect repeats (e.g. A(13) would be detected, but AAAAAATAAAAAA would not) in annotated (e.g. GenBank, EMBL, Swissprot) or unannotated (Fasta,raw) format files, but is also capable of finding interrupted microsatellites. It can be used to examine both protein and nucleic acid sequences. If given an annotated file it will extract information about each microsatellite and the sequence it is found in. In addition, for nucleic acid sequences it will determine whether it is possible to create PCR primers containing each repeat region found (EMBOSS/primer3 are required for this feature to work). The various features and output files may be controlled by editing the configuration file (msatfinder.rc) - notes are provided in the file on how to edit it, and more detail is given in this manual.
175 <H2><A NAME="finder_files">Input and output files</A></H2>
176 <p>Input may be many separate sequence files, or multiple sequences in a file. To convert file formats <a href="">readseq</a> may be useful.
178 <h3>Input files</h3>
179 <p>Input file types are automatically detected - if the file format cannot be determined a warning will be given. Allowed file types are:
180 <ul>
181 <li>GenBank.
182 <li>Swissprot.
183 <li>EMBL.
184 <li>FASTA.
185 <li>ASCII/raw.
186 </ul>
187 <p>Input sequences may be amino acid or nucleic acid. Please note that if you use &lsquo;ASCII&rsquo; format (i.e. raw sequence in a text file) then msatfinder will treat each new line as a new sequence, and blank lines are likely to cause it to fail.
189 <p>You may name your input files as you please, but naming of the sequences in input files is particularly important for the use of msatfinder. As it is possible to use a file containing many sequences as the input to msatfinder, the script uses BioPerl to extract a unique name for each sequence. This is then used to name output files, such as the microsatellite FASTA files. It's therefore possible to have (for example) an input file called &ldquo;MACO.gbk&rdquo;, but the FASTA files will be called such things as &ldquo;NC_004117.122038.CGA.6.fasta&rdquo;. In the case of FASTA files, the unique identifier extracted is the first entry following the &ldquo;&gt;&rdquo;. If you have no unique identifier in your files then msatfinder will attempt to generate one, but it is best if you can supply sequence files with each sequence already labeled.
192 <h3>Output files</h3>
193 <p>When run, msatfinder creates some directories to store the output files &mdash; this is convenient as there are often large numbers of output files. If you install a version of msatfinder on your own computer then these directory names can be changed by editing the msatfinder.rc file. An html file (results.html) is created in the same directory in which msatfinder is run &mdash; open this in a browser to view all of msatfinder's output. Results.html contains links to the contents of the following seven subdirectories:
195 <ul>
196 <li><b>Repeats</b>: Contains various data files summarising genome and microsatellite information, e.g. taxonomy, number of microsatellites (see below). An index file summarising the results is created here.
197 <li><b>Counts</b>: A summary of all microsatellites found in a genome by length and type or exact motif. Unlike the index and count files in Repeats/, the files here preserve the exact motifs rather than just their content. For example TAT and AAT would be recorded as different microsatellites, whereas in some files they would be counted as the same as they have the same base content - this saves space.
198 <li><b>Msat_tabs</b>, <b>Flank_tabs</b>: Contain feature tables for use with artemis. To use them, cd to either of these directories and run <b>artemis <i>file.gbk</i> +<i></i></b>. The msat_tab files show only the microsatellite itself, whilst the flank_tabs also include the flanking regions.
199 <li><b>Fasta</b>: The sequence of the microsatellite plus flanking regions, in fasta format. These would be suitable for use in blast searches.
200 <li><b>MINE</b>: These files contain the same information as shown in the &ldquo;repeats&rdquo; file in the Repeats directory (a summary of the details of each microsatellite), but one file per repeat is produced. These contain HTML formatting, and are designed to allow the creation of simple databases using our related <a href="">MINE</a> software, and are turned off by default.
201 <li><b>Primers</b>: Primer files are produced for each microsatellite, if possible. By default it produces information for possible PCR primers. This option will be disabled if you enter a swissprot format amino acid sequence, but should be turned off if you intend to enter your AA sequence in FASTA format.
202 </ul>
205 <p>The following files will be found in the directory called &ldquo;Repeats/&rdquo;:
207 <ul>
208 <li><b>genomes</b>: This contains the information on each genome, including number of microsatellites found, GC content, &amp;c. A full listing of the column headers is shown <a href="#finder_key">here</a>.
209 <li><b>repeats</b>: This file contains the details of every individual microsatellite, plus similar genomic information to the genomes file. An explanation of the column headers is shown in the main msatfinder manual, <a href="">here</a>. Both this file and the genomes file may easily be imported into <a href="">Open Office</a> (or similar), or imported into a database.
210 <li><b>type.count</b>: This is a table showing the number of microsatellites found, categorised by motif type (mono, di, tri &amp;c.) and number of repeat units.
211 <li><b>motif.count</b>: Similar to the type.count file, it shows the number of microsatellites found categorised by the bases/amino acids in the motif. These are ordered by the total base content only, thus AAT would be counted the same as TAT (exact summaries are available in the Counts/ directory).
212 <li><b>index</b>: A handy summary of the results.
213 <li><b>primers.csv</b>: A tabular summary of the information in all the primer files in the Primers/ directory.
214 <li><b>errors</b>: file contains details of anything that looked unusual, e.g. very short sequences, features that did not match the sequence, &amp;c.
215 </ul>
217 <h2><a name="finder_key">Column headers for &ldquo;repeats&rdquo; and &ldquo;genomes&rdquo; files</a></h2>
219 <p>The headers in these tables depend on the file format provided (e.g. Swissprot protein files don't have GC content). The headers are not found in the exact order shown below.
221 <h4>These column headings are found in both the repeats and genomes files in the Repeats/ directory.</h4>
223 <p><table border=1>
225 <tr><td>genome</td><td>The NC/Accession number for this genome.</td></tr>
226 <tr><td>specific_taxon, generic_taxon</td><td>Fields derived from the names of he directories you store your data in. Actual taxonomic information is captured in the fields below. You probably won't need this, but please e-mail if you need more information about this.</td></tr>
227 <tr><td>division,binomial,genus,species,strain/subspecies,common_name,organism,definition,taxid</td><td>Taxonomic features parsed from the genome file. These will not be available for some (e.g. FASTA) files.</td></tr>
228 <tr><td>strand</td><td>Whether the genome is single (ss) or double (ds) stranded.</td></tr>
229 <tr><td>alphabet</td><td>RNA, DNA or sometimes mRNA, depending on the genome annotation.</td></tr>
230 <tr><td>circular</td><td>Whether the genome is circular or linear</td><td>
231 <tr><td>date</td><td>Submission date from the annotated file. These are reformatted into YYYY-MM-DD format.</td></tr>
232 <tr><td>specific_host,lab_host</td><td>This information on host range is sometimes provided in annotated files, but this field may often be blank.</td></tr>
233 <tr><td>notes</td><td>Any notes from the annotated file, if present.</td></tr>
234 <tr><td>genome_length</td><td>Length of the sequence.</td></tr>
235 <tr><td>no_of_coding_regions</td><td>The number of CDS annotations found in the genome.</td></tr>
236 <tr><td>total_nt_coding,percent_coding</td><td>The proportion of nucleotides that occur within CDS regions.</td></tr>
237 </table>
239 <h4>These additional column headings are found in the genomes file.</h4>
240 <p><table border=1>
241 <tr><td>GC_content</td><td>GC content of the entire genome.</td></tr>
242 <tr><td>A, T, G, C </td><td>Base counts for the individual bases.</td></tr>
243 <tr><td>A/AT, C/GC</td><td>The &lsquo;askew&rsquo; and &lsquo;cskew&rsquo; values; the value is expressed as a percentage of no. of A in total number of AT, &amp;c.</td></tr>
244 <tr><td>total_rep_length</td><td>Total length of all microsatellites in the genome.</td></tr>
245 <tr><td>pc_repeats</td><td>Microsatellites as a percentage of the genome.</td></tr>
246 <tr><td>msats, No_of_msats</td><td>The first value is set to 1 if any microsatellites were found, the second value is set to the total number found.</td></tr>
247 </table>
249 <h4>These additional column headings are found in the repeats file.</h4>
250 <p><table border=1>
251 <tr><td>repeat</td><td>A unique name identifying this particular microsatellite.</td></tr>
252 <tr><td>flank_length</td><td>The length of the flanking regions (default: 300 each side).</td></tr>
253 <tr><td>repeat_plus_flank</td><td>Total length of the microsatellite and the flanks.</td></tr>
254 <tr><td>start, stop</td><td>Start and stop positions of the microsatellite within the genome.</td></tr>
255 <tr><td>motif_units</td><td>The motif and number of repeat units, e.g. AT(6).</td></tr>
256 <tr><td>motif, motifrevcom</td><td>The motif (e.g. AT) and its reverse complement (e.g TA).</td></tr>
257 <tr><td>repeat_units</td><td>Number of repeat units.</td></tr>
258 <tr><td>footprint</td><td>Total length of the microsatellite; no. of units multiplied by unit length.</td></tr>
259 <tr><td>dist_from_left, pc_from_left</td><td>Distance from the start of the sequence, as a number of bases and as a percentage, where the start of the microsatellite is found.</td></tr>
260 <tr><td>dist_from_right, pc_from_right</td><td>Distance from the end of the sequence, as a number of bases and as a percentage, where the start of the microsatellite is found.</td></tr>
261 <tr><td>motif_type</td><td>Mono, di, tri, tetra, penta or hexa.</td></tr>
262 <tr><td>reverse</td><td>whether the feature the msat is found in is on the forward or reverse strand (1 = reverse, 0 = forward or not in a feature).</td></tr>
263 <tr><td>primers</td><td>Was it possible to make a primer for this microsatellite (1 = yes, 0 = no).</td></tr>
264 <tr><td>GC_content(genome), GC_content(flank), GC_content(repeat)</td><td>GC content of the entire genome, the flanks only and the microsatellite itself.</tr></td>
265 <tr><td>coding_repeat</td><td>Set to 1 if msat is within a CDS region, 0 if it is not.</td></tr>
266 <tr><td>gene,product,protein</td><td>If the msat is withn a CDS region, msatfinder will parse all available annotations for that region and place them in these categories.</td></tr>
267 <tr><td>
268 </table>
270 <h2><a name="names">File names</a></h2>
271 <p>Some of the output files produced by msatfinder have names such as &ldquo;NC_000871.33400.AT.6.fasta&rdquo;. These files are the fasta files of msat and flank sequence, primer files, mine files and feature tables.
272 <p>The various sections are as follows.
274 <p><table border=1>
275 <tr><td>NC_000871</td><td>NC number of the genome in which the microsatellite was found. This is derived from the unique identifier for each sequence (typically the accession number), extracted by msatfinder.</td></tr>
276 <tr><td>33400</td><td>Start position of the microsatellite.</td></tr>
277 <tr><td>AT</td><td>Repeat motif. This could be a combination of motifs (e.g. AT-TA) if the msat is <a href="#ird">interrupts</a></td></tr>
278 <tr><td>6</td><td>Number of repeat units.</td></tr>
279 <tr><td>fasta</td><td>Format of the file. Other suffixes include &lsquo;primers.txt&rsquo; (primer files), &lsquo;db&rsquo; (MINE files) and &lsquo;count&rsquo; (summary of motif numbers). Feature tables have the filename NC_xxxxx.flank_tab or NC_xxxxx.msat_tab</td></tr>
280 </table>
284 <H2><A NAME="finder_config">Configuration</A></H2>
285 <p>All of the parameters that can be customised by the user are found in the msatfinder.rc file. A brief description of how to set each of these is described in the file, but here we give some supplementary information. Variables in the COMMON and FINDER sections of the configuration file control the behaviour of msatfinder. Most of the default values will be acceptable in most cases.
287 <ul>
288 <li><font color = "green">debug</font> - if set to 1, will print extra debugging information. Set to 2 for even more.
289 <li><font color = "green">flank_size</font> - the amount of sequence either side of the microsatellite that will be extracted and saved to the microsatellite FASTA file.
290 <li><font color = "green">mine_dir,repeat_dir,tab_dir,bigtab_dir,fasta_dir,prime_dir,align_dir,cont_dir</font> - several variables that set the name of the subdirectories that will be created when the script is run.
291 <li><font color = "green">run_eprimer</font> - set to 1 if you want to determine whether a primer can be made for each repeat.
292 <li><font color = "green">eprimer_args</font> - the eprimer man page has more information on what to put here, if you are dissatisfied with the default (pick PCR primers). Please note that the &ldquo;-task 0&rdquo; option works with EMBOSS 2.8.0. If you have 2.9.0 then you should use &ldquo;-primers&rdquo; instead.
293 <li><font color = "green">eprimer</font> - full path to the eprimer3 binary.
294 <li><font color = "green">primer3core</font> - the full path to the primer3_core binary.
295 <li><font color = "green">override</font> - turns off the following variables. It's easier than editing lots of lines in the config file.
296 <ul>
297 <li>artemis.
298 <li>mine.
299 <li>fastafile.
300 <li>sumswitch.
301 <li>screendump.
302 <li>run_eprimer.
303 </ul>
304 <li><font color = "green">motif_threshold</font> - this is particularly important, as it defines the thresholds <b>equal to or above</b> which microsatellites will be detected, and which types will be detected. The types may be set to any length, and the lowest the thresholds can be set is 1, which will find every single base, pair of bases, triplet &amp;c. It will take a long time to run if thresholds are set that low and the &ldquo;regex&rdquo; engine will not operate on such a small threshold. By default, mono-hexa will be searched for. Please refer to <a href="#setting_thresholds">setting thresholds and motif types</a> (below) for more information.
305 <li><font color = "green">artemis</font> - turns on the Artemis feature tables.
306 <li><font color = "green">mine</font> - turns on MINE summary files. These are equivalent to the "repeats" output file in the data they contain.
307 <li><font color = "green">fastafile</font> - turns on whether or not a FASTA format file containing the sequence information for each microsatellite found will be generated.
308 <li><font color = "green">taxon information</font> - two of the fields in the repeats and genomes files are &ldquo;specific_taxon&rdquo; and &ldquo;generic_taxon&rdquo;. See <a href="#taxon">here</a>.
309 <li><font color = "green">remote_link</font> - used to put a hyperlink into MINE files for looking at the original genomes.
310 <li><font color = "green">sumswitch</font> - determines whether or not the "repeats" output file will be created. This contains a large amount of information about each microsatellite and its genomic context, and can become rather large. However, it is very useful for importing into a database.
311 <li><font color = "green">screendump</font> - prints out verbose information to the screen whilst running if set to 1.
314 </ul>
316 <H3><a name="setting_thresholds">Setting thresholds and motif types</a></H3>
317 <p>It is possible to search for motif types of any length. However, configuring this may not be very intuitive. The default entries in msatfinder.rc dictate that msatfinder searches for motifs of length 1 to 6 (eg. A to ATTCCG). This, along with the thresholds for each motif length, is encoded by the line :
318 <p>
319 <pre>
320 motif_threshold = "1,12|2,8|3,5|4,5|5,5|6,5"
321 </pre>
322 <p>In this case, a motif of length 1 (eg. A) must be repeated 12 times before being reported. A motif of length 2 (eg. AC) must be repeated 8 times, &amp;c. If the user wishes to search for longer motifs, for example motifs 20 elements in length, then they simply need to add a pipe symbol (&lsquo;|&rsquo;), followed by the motif length required and the corresponding
323 threshold, eg. 2. The line in msatfinder.rc would then read:
324 <p>
325 <pre>
326 motif_threshold = "1,12|2,8|3,5|4,5|5,5|6,5|20,2"
327 </pre>
328 <p>Likewise, if the user does not wish to search for a certain type of motif, eg. those of length 1, then they simply delete the '1,12' and the unneeded pipe symbol (pipes should not be at the beginning and end of the string). Doing so would reduce the time needed for searching.
331 <H2><A NAME="running_finder">Searching for microsatellites</A></H2>
332 <p>Running msatfinder is described in the <a href="README.txt">README</a> and the <a href="#install">installation</a> section.
334 <p><b>N.B.</b>You must have an msatfinder.rc config file present in the same directory as your data for msatfinder to run (see the <A HREF="#finder_config">configuration</a> section. If one is not present, the program will stop and give an error message suggesting that you place a copy of msatfinder.rc in the data directory. You may still use the help option if you don't have an msatfinder.rc file present (see below).
336 <p>Msatfinder has a lot of output files, and users may sometimes want to back these up or suppress them. Running msatfinder with the --help or -h option will print out a list of the options available. The full list of options is:
338 <center><p><table border="1" cellpadding="1" cellspacing="1"><tr><td>
339 <pre>
340 -b backup your old data directories (adds date &amp; time suffix)
341 -d delete most recent data directories, don't search for microsatellites
342 -e &lt;1-3&gt; engine to use - see the manual (default = 1, the &ldquo;regex&rdquo; engine)
343 -f set flank size, overriding config file (default 300)
344 -h list these options
345 -l &lt;list_file&gt; read a list of genomes from a text file
346 -m &lt;N,N,...&gt; Types of msats to search for, overriding config file (default 1,2,3,4,5,6)
347 -s silence most output to screen (overrides config file)
348 -t &lt;N,N,N,N,N,N&gt; set thresholds, overriding config file (default 12,5,5,5,5,5)
349 -x delete all data directories, don't search for microsatellites
350 </pre>
351 </td></tr></table></center>
353 <H3>Microsatellite searching engines</A></H3>
354 <p>There are three &ldquo;engines&rdquo; currently implemented for microsatellite searching. These all operate in slightly different ways, and if you don't find the default useful then the others may prove effective. They are as follows.
356 <ul>
357 <li><b>Regex</b> (option 1): This uses fast regular expressions to search once through the sequence. It is the fastest method, but cannot detect very small repeats (threshold &lt;3). This threshold is much lower than most people require, hence we have selected this engine as the default.
358 <li><b>Multipass</b> (option 2): This steps through the sequence several times looking for microsatellites of one motif type on each pass, using regular expressions. This method may produce slightly different results, as it can find microsatellites that overlap.
359 <li><b>Iterative</b> (option 3): This steps through the sequence one base at a time and attempts to construct a microsatellite at that position without using regular expressions. If it succeeds, it continues searching from the end of the last microsatellite. If you'd like to split the entire genome into repeat units this is the engine to use (it is slow, however).
360 </ul>
362 <p>The program is reasonably fast. For example, searching a large, microsatellite-rich bacterial genome such as Xylella fastidiosa (NC_002488, 2.68 Mb) with default thresholds takes under two minutes. The exact time will vary depending on your computing power.
364 <H3><A NAME="ird">Interrupted microsatellites</A></H3>
365 <p>&ldquo;Interrupted&rdquo; microsatellites include several possible features. One is that a microsatellite tract consists of two motifs of the same class (e.g. dinucleotides) adjacent to each other (example 1). Another is that a long microsatellite tract may have one or more point mutations in it, making it appear to be several shorter tracts (example 2). Sometimes this latter category may include a pseudo-frameshift (example 3), thus appearing to be two different motifs.
367 <h4>Examples</h4>
368 <ol>
369 <li>ACACACACACACACAATGTGTGTGTGTGTGTG - a microsatellite tract consisting of (AC)8 with a single point mutation on the last unit (making AA), followed by a (TG)8 microsatellite.
370 <li>AAAAAAAAAAGAAAAAAAAAA - what appears to be 2x (A)10 is in fact an (A)21, with a point mutation.
371 <li>ATATATATATATTATATATATATAT - the insertion of an extra &ldquo;T&rdquo; into this msat makes it appear to be an AT followed by a TA motif, when it should be considered as an interrupted AT motif.
372 </ol>
374 <p>Once microsatellites have been detected by whichever engine has been selected, the complete list of msats is scanned to determine the relative positions of each microsatellite within each genome. Microsatellites are joined together when they meet two criteria:
376 <ul>
377 <li>The distance from on microsatellite to the preceding one is equal to or less than the footprint of the current microsatellite.
378 <li>The current microsatellite and the preceding one are of the same motif length (mono-, di- &amp;c.).
379 </ul>
381 <p>Each &ldquo;cluster&rdquo; of microsatellites thus found is combined into a single interrupted microsatellite, using the usual <a href="msatfinder_manual.html#names">nomenclature</a>. The motif type will be altered to contain all the motif types in the order they are found. So, example 1 (above) would become ac-tg.16, and example 2 would be a-a.21. In the latter case, the &ldquo;-&rdquo; is kept in to mark that this is an interrupted microsatellite.
384 <center>
385 <br><a href="#index">To the top.</a>
386 </center>
388 <hr>
389 <font color="green">
390 <center>
391 <h2><a name="online_help">Help for using msatfinder on-line</a></h2>
392 </center>
393 </font>
394 <p>The <a href="">on-line version</a> of msatfinder is available for those who don't want to do a local install. It offers the same features as the downloadable version, but due to server limitations can only accept sequences up to 10Mb. Explanations of each of the on-line options are shown below.
396 <h3><a name="motif">Motif selection</h3>
397 <p>Allows a selection of the motif types, mono (e.g. (A)12)to hexa (e.g. (ATAACA)20), for which one wishes to search. By default all the types are selected, but if you are not interested in mononucleotides, for example, then switching them off will save time and result in a smaller results file to download.
399 <h3><a name="threshold">Threshold selection</h3>
400 <p>For each motif type, there is a minimum number of repeat units that msatfinder will look for. For example, the default setting is 12 units for mononucleotides and 5 for everything else, so that an (AT)5 (i.e. ATATATATAT) will be detected but an (AT)4 will not. You can set these as low as 3, but be prepared for a long wait and a lot of output. The default values are recommended for most files, but if you find nothing of use you may wish to lower the thresholds a little to see if any smaller microsatellites are present.
401 <p>The interface states that there is a minimum of three repeat units, but there are some combinations of thresholds that may cause msatfinder to fail. One example is if you have a threshold for monos that is lower than the number of boxes ticked under &ldquo;choose the microsatellite motifs to search for&rdquo;. The reason for this is that the software may have trouble discriminating between some types of microsatellites, for example:
402 <pre>
403 ccccctccccctccccctccccct
404 </pre>
405 <p> ...could be (ccccct)4 but could also be 4x (c)5, as the computer sees it. To prevent this from happening, msatfinder will not run in cases where such confusion should occur. If you find that it fails when you are looking for very small microsatellites, try re-running but turning off the larger microsatellites, e.g. look for monos and dis only.
407 <h3><a name="advanced">Advanced options</h3>
408 <p>The default settings are probably best, but you may like to experiment with these. These options fall into three groups. The first is the list of options for disabling various output files that msatfinder produces (all except MINE files are enabled by default, which should be suitable for most uses). The second is the flank size, which determines the length of sequence either side of the microsatellite that will be saved as a FASTA format file for blast searching, &amp;c. We have found the default of 300 to be suitable for most purposes. The third option is the search engine to be used. It should not be necessary to change this unless you need to look for tiny microsatellites or your sequence has lots of unknown bases in it.
410 <h3><a name="download">Download options</h3>
411 <p>After running your analysis, you will be able to view the results on-line and/or download them as a compressed archive in tar.gz (unix) or zip formats. Select your preferred format before running the analysis. The default setting is tar.gz unless you connect with a Windows machine, in which case it will be set to zip.
413 <h3><a name="interrupts">Find interrupted msats</h3>
414 <p>If this box is checked (it's unchecked by default) then msatfinder will process the microsatellites found to determine if any could be joined into larger microsatellites, according to <a href="#ird">certain rules</a>. Typically, about 10% of microsatellites found could be so joined.
416 <h3><a name="upload">Upload file</h3>
417 <p>Choose a file from your local system to run msatfinder on. There is a 10Mb file size limit on uploads &mdash; see <A href="#finder_files">file formats</a> for allowable file types..
419 <h3><a name="paste">Paste sequence</h3>
420 <p>See <A href="#finder_files">file formats</a> for input file formats. There is a size limit of 10Mb on the sequence that may be pasted here. If you have larger sequences, or many of them, we recommend a local installation and are happy to offer assistance, or run your sequence for you. If you paste anything in here, it will be used instead of any uploaded file data, so make sure that this box is cleared if you'd like to upload a file.
422 <p>Once you've pasted your sequence, click "search" and wait for a few moments (an animated picture will be shown whilst msatfinder runs). A brief summary will be displayed and a link to the downloadable file and the viewable results will be shown. These links will become inactive after a couple of hours, so please download them immediately if you'd like to keep the results.
424 <h3><a name="output">Output</a></h3>
425 </center>
426 </font>
428 <p>The output will be viewable on-line for 2 hours, and can be downloaded as a tar or zip file within that time, so we recommend that you bookmark the link to your output. The various files that are included in the download directory are described below. The most important is "results.html" that includes links to all the other output files so that you may view them in a browser. Your input sequence and a configuration file will also be saved for future reference, in case of any problem with the results.
430 <p>The output files produced by the online version of msatfinder are identical to those produced by the local version. A detailed description of these is available in <a href="msatfinder_manual.html#finder_files">the output file section</a>. Please note that MINE files will not be produced by default - they must be enabled under &ldquo;advanced options&rdquo;.
432 <h3><a name="errors">What if it didn't work?</a></h3>
433 </center>
434 </font>
435 <p>Occasionally, Msatfinder will fail to run on some input files. There are various reasons that this might happen, which are described below and also mentioned in the <a href="#bugs">bugs</a> section.
436 <ul>
437 <li>The unique identifier extracted from the sequence may contain some characters that the operating system does not like. This does not happen very often, but if you suspect that this is the case then please send your sequence to <a href=""></a>. If you are using a FASTA format file you could also change the header to something inocuous like &ldquo;> sequence1&rdquo;.
438 <li>If your sequence contains lots of unknown bases then Msatfinder may have encountered a bug in Perl itself. The Perl developers know about this bug, but it seems that it cannot be fixed in the near future. Msatfinder will still work if the iterative engine is used, although this will be much slower.
439 <li>Another possibility is that your threshold settings are very low. As described <a href="#threshold">here</a>, low thresholds can cause ambiguities in the microsatellites found, and the program will not run if there is the potential for this. You can work around this problem by only looking for one type of microsatellite at a time, e.g. setting Msatfinder to find all monos of 3 or more units.
440 </ul>
441 <p>If you encounter a problem that does not seem to fit into these categories, please <a href="">contact us</a> and we will endeavour to fix the problem or analyse your data for you as soon as possible.
443 <center>
444 <br><a href="#index">To the top.</a>
445 </center>
447 <hr>
449 <font color="green">
450 <center>
451 <h1><a name="other_things">Other information</a></h1>
452 </center>
453 </font>
455 <h2><a name="bugs">Bug reporting/known bugs</a></h2>
456 <p>Though we are using this software successfully, there's a small chance of bugs turning up somewhere. Should you find any, please contact us (details below) and we will squash them. If you're using msatfinder, there are a few things that you ought to bear in mind.
457 <ul>
458 <li>The number of interrupted microsatellites is dependent upon the thresholds chosen, as the program joins together microsatellites that are above the thresholds specified. An interrupted microsatellite whose individual parts are below the thresholds will not currently be detected.
459 <li>The program is reasonably fast, unless the thresholds are set very low or the genome is very large. The slowest part of the script is the section that parses genomic features, but this will not be needed if FASTA files are used.
460 <li>With low thresholds (e.g. 1), the counts may be skewed somewhat. For example, (AAT)7 could also be detected as seven (A)2 repeats. In general, it is not necessary to set thresholds this low, though. N.B. the default engine will not allow msats of less than three units to be detected, and will fail to run if ambiguities like this could be encountered, so this problem will only be seen if using one of the other engines.
461 <li>There is a problem in Perl's regex engine that can cause stack overflow in some cases. This generally means that if your data contains many unknown bases, msatfinder may die with a segmentation fault. Msatfinder is not the only application to be affected by this, and the Perl developers plan to change their code to overcome this problem. Meanwhile, if your data are affected, you can try a newer version of Perl or running msatfinder with the -e3 option (this is slow, but bypasses the problem).
462 </ul>
464 <p>We welcome suggestions for new features or other improvements.
466 <h2><a name="convert">File conversion/handling methods</a></h2>
467 <p>For converting files one could use <a href="">Seqret</a> (from <a href="">EMBOSS</a>), or use <a href="">readseq</a>.
469 <h2><a name="coding_style">Coding style</a></h2>
470 <p>The braces in this code are written thus:
471 <center><p><table border="1" cellpadding="1" cellspacing="1"><tr><td>
472 <pre>
473 foreach (@item)
474 {
475 print $_;
476 }
477 </pre>
478 </td></tr></table></center>
479 <p>This is simply because it's easier to read this way. However, if you disagree, you may wish to try <a href="">this</a>.
481 <center>
482 <br><a href="#index">To the top.</a>
483 </center>
484 <hr>
486 <center>
487 <br><a href="index.html">To the main msatfinder page.</a>
488 </center>
491 <br>
492 <center>
493 <br>
494 <p><a href=""><img SRC="vim.gif" ALT="HTML edited by Vim" BORDER=0 ></a>
495 <br>
496 </center>
497 </td></tr>
498 </body>
499 </html>