root/yamap/pfam_scan.html
Revision: 1.1.1.1 (vendor branch)
Committed: Thu Sep 7 15:35:22 2006 UTC (10 years ago) by knirirr
Branch: MAIN, cehox
CVS Tags: start, HEAD
Changes since 1.1: +0 -0 lines
Log Message:
Imported sources

<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>pfam_scan.pl - search protein fasta sequences against the Pfam
library of HMMs.</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:root@localhost" />
</head>

<body style="background-color: white">

<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->

<ul>

<li><a href="#name">NAME</a></li>
<li><a href="#version">VERSION</a></li>
<li><a href="#requirements">REQUIREMENTS</a></li>
<li><a href="#how_to_install_pfam_locally">HOW TO INSTALL PFAM LOCALLY</a></li>
<li><a href="#searching_pfam">SEARCHING PFAM</a></li>
<li><a href="#bugs">BUGS</a></li>
<li><a href="#history">HISTORY</a></li>
<li><a href="#contact">CONTACT</a></li>
</ul>
<!-- INDEX END -->

<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>pfam_scan.pl - search protein fasta sequences against the Pfam
library of HMMs.</p>
<p>
</p>
<hr />
<h1><a name="version">VERSION</a></h1>
<p>This is version 0.5 of pfam_scan.pl. See the history section for
recent changes.</p>
<p>The behaviour of recent versions is significantly different from 0.1.
From version 0.5, overlapping matches to families within the same clan
are removed, keeping the best scoring hit. This behaviour can be
overridden with the --overlap option. From version 0.2, we can use
BLAST to preprocess the input sequences with the --fast option, so we
only have to search a subset of sequences against a subset of HMMs
using hmmpfam. For puritanical reasons we don't do this by default!
Read the notes about this below.</p>
<p>This version has been tested with Perl 5.6.1, Pfam 10.0 (through
13.0), Bioperl 1.2 and HMMER 2.3.1. It should work with any versions
higher than these.</p>
<p>
</p>
<hr />
<h1><a name="requirements">REQUIREMENTS</a></h1>
<pre>
- this script
- Perl 5.6 or higher (and maybe lower)
- The Pfam database (downloadable from
<a href="ftp://ftp.sanger.ac.uk/pub/databases/Pfam/">ftp://ftp.sanger.ac.uk/pub/databases/Pfam/</a>)
- HMMER software (from <a href="http://hmmer.wustl.edu/">http://hmmer.wustl.edu/</a>)
- NCBI BLAST binaries (from <a href="http://www.ncbi.nlm.nih.gov/Ftp/">http://www.ncbi.nlm.nih.gov/Ftp/</a>)
- Bioperl (from <a href="http://bio.perl.org/">http://bio.perl.org/</a>)</pre>
<p>The Bioperl modules directory must be in your perl library path, and
the HMMER and BLAST binaries must be in your executable path.</p>
<p>You also need to be able to read and write to /tmp on your machine.</p>
<p>Some of these requirements are easily circumvented, but this script
should at least give you a start.</p>
<p>
</p>
<hr />
<h1><a name="how_to_install_pfam_locally">HOW TO INSTALL PFAM LOCALLY</a></h1>
<p>1. Get the Pfam database from
<a href="ftp://ftp.sanger.ac.uk/pub/databases/Pfam/">ftp://ftp.sanger.ac.uk/pub/databases/Pfam/</a>. In particular you need
the files Pfam-A.fasta, Pfam_ls, Pfam_fs, and Pfam-A.seed.</p>
<p>2. Unzip them if necessary:</p>
<pre>
$ gunzip Pfam*.gz</pre>
<p>3. Grab and install HMMER, NCBI BLAST and Bioperl, and make sure your
paths etc. are set up properly.</p>
<p>4. Index Pfam-A.fasta for BLAST searches:</p>
<pre>
$ formatdb -i Pfam-A.fasta -p T</pre>
<p>5. Index the Pfam_ls and Pfam_fs libraries for HMM fetching:</p>
<pre>
$ hmmindex Pfam_ls
$ hmmindex Pfam_fs</pre>
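<p>Before indexing, it can help to confirm that the four files named in
step 1 are actually in place. The following is a minimal Python sketch
of such a check. It is not part of pfam_scan.pl, and the function name
missing_pfam_files is made up for illustration.</p>
<pre>
import os

# The four flatfiles named in step 1 above.
REQUIRED_FILES = ("Pfam-A.fasta", "Pfam_ls", "Pfam_fs", "Pfam-A.seed")


def missing_pfam_files(pfam_dir):
    """Return the required Pfam files that are not present in pfam_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(pfam_dir, name))]</pre>
<p>Calling missing_pfam_files on your Pfam directory should return an
empty list once all four files have been downloaded and unzipped.</p>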
<p>
</p>
<hr />
<h1><a name="searching_pfam">SEARCHING PFAM</a></h1>
<p>This script is really just a wrapper around hmmpfam.</p>
<p>Run pfam_scan.pl -h to get a list of options. Probably the only thing
to worry about is supplying the -d option with the location of your
downloaded Pfam database. Alternatively, set the PFAMDB environment
variable to point to the right place and things should work without
-d. You should also decide whether or not to use --fast.</p>
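<p>If you drive the script from another program, the choice between -d
and PFAMDB can be made in one place. The Python sketch below builds an
argument list under two assumptions that are not stated above: that the
fasta file is given as the last argument, and that the script is found
on your path as pfam_scan.pl. The function name build_command is
illustrative only.</p>
<pre>
import os


def build_command(fasta_file, pfam_dir=None, fast=False, environ=None):
    """Build an argument list for pfam_scan.pl.

    Assumption: the fasta file is passed as the final argument.
    """
    environ = os.environ if environ is None else environ
    argv = ["pfam_scan.pl"]
    if pfam_dir is not None:
        # An explicit database location is passed with -d.
        argv += ["-d", pfam_dir]
    elif "PFAMDB" not in environ:
        raise ValueError("give pfam_dir or set the PFAMDB environment variable")
    # Otherwise PFAMDB is set and the script is left to pick it up.
    if fast:
        argv.append("--fast")
    argv.append(fasta_file)
    return argv</pre>
<p>The returned list can be handed to subprocess.run, which avoids shell
quoting problems with unusual file names.</p>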
<p>A few things to note:</p>
<p>--fast uses BLAST as a preprocessor to reduce the amount of compute we
have to do with hmmpfam. This is known to reduce sensitivity in the
case of a very small number of families (those whose length is
exceptionally short, like the XYPPX repeat). If you're annotating
genomes then you *probably* don't care too much about these families.
Omitting this option may give you a small gain in sensitivity, but at
roughly a 10-fold time cost. If you want to exactly replicate the Pfam
web site results or distributed data, you probably shouldn't use --fast.</p>
<p>Overlapping above-threshold hits to families within the same clan are
removed -- only the best scoring hit is kept. You can override this
behaviour with the --overlap option.</p>
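<p>To make the clan rule concrete, here is a minimal Python sketch of one
way such filtering could work. It is not the script's own code: the Hit
record, the clan_of mapping, the family names in any example data and
the greedy best-score-first strategy are all assumptions made for
illustration.</p>
<pre>
from collections import namedtuple

# Hypothetical hit record; the field names are illustrative only.
Hit = namedtuple("Hit", "seq_id start end family score")


def overlaps(a, b):
    """True if two hits on the same sequence share at least one residue."""
    return a.seq_id == b.seq_id and a.end >= b.start and b.end >= a.start


def remove_clan_overlaps(hits, clan_of):
    """Keep the best scoring hit among overlapping hits in the same clan.

    clan_of maps a family to its clan. Families absent from the mapping
    are treated as belonging to no clan and are never removed.
    """
    kept = []
    # Visit hits from best to worst score, so a kept hit always outranks
    # any later hit that overlaps it.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clan = clan_of.get(hit.family)
        clash = clan is not None and any(
            overlaps(hit, other) and clan_of.get(other.family) == clan
            for other in kept)
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.seq_id, h.start))</pre>
<p>Returning the input list unchanged would correspond to the behaviour
you get with --overlap.</p>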
<p>Pfam provides two sets of models, called ls and fs models, for
whole-domain and fragment searches respectively. This wrapper basically
returns all hits to the ls models, and then adds to these all
non-overlapping hits to the fragment (fs) models. This mimics the
behaviour of Pfam web site searches. You can choose to search only one
set of models with the --mode option.</p>
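<p>The ls-then-fs merge can be sketched in a few lines of Python. This is
an illustration under assumptions, not the script's implementation: the
Hit record is hypothetical, and an fs hit is dropped here if it overlaps
any ls hit on the same sequence, whatever the family. The script itself
may apply a narrower test.</p>
<pre>
from collections import namedtuple

# Hypothetical hit record; the field names are illustrative only.
Hit = namedtuple("Hit", "seq_id start end family score")


def overlaps(a, b):
    """True if two hits on the same sequence share at least one residue."""
    return a.seq_id == b.seq_id and a.end >= b.start and b.end >= a.start


def merge_ls_fs(ls_hits, fs_hits):
    """Return all ls hits plus the fs hits that overlap none of them."""
    merged = list(ls_hits)
    for fs in fs_hits:
        if not any(overlaps(fs, ls) for ls in ls_hits):
            merged.append(fs)
    return sorted(merged, key=lambda h: (h.seq_id, h.start))</pre>
<p>Each fs hit is compared only against the ls hits, so two fs hits that
overlap each other are both kept in this sketch.</p>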
<p>Unless you want to grub around in the noise you should probably use
the default thresholds - these are hand curated for every family by
the Pfam team, such that we believe false positives will not score
above these levels. The consequence is that some families may miss
members.</p>
<p>You may want to adjust the threshold used for the preprocessing BLAST
search (default evalue 10). Raising this to 50 will slow everything
down a bit but may gain you a little sensitivity. Lowering the evalue
cutoff will speed things up but with an inevitable sensitivity cost.</p>
<p>It is important that each sequence in the fasta file has a unique
identifier. Note that the fasta header format should be:</p>
<p>&gt;identifier &lt;optional description&gt;</p>
<p>so the identifier should not contain whitespace.</p>
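<p>A quick way to check the uniqueness requirement before a long run is
to scan the header lines yourself. The Python sketch below takes the
identifier to be everything between the leading > and the first
whitespace, as described above. The function names are made up for
illustration.</p>
<pre>
from collections import Counter


def fasta_identifiers(lines):
    """Yield the identifier from each header line of a fasta file."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # A header with nothing after '>' has an empty identifier.
            yield fields[0] if fields else ""


def duplicate_identifiers(lines):
    """Return the identifiers that occur on more than one header line."""
    counts = Counter(fasta_identifiers(lines))
    return sorted(ident for ident, n in counts.items() if n > 1)</pre>
<p>Both functions accept any iterable of lines, so an open file handle
can be passed directly. An empty result means every identifier is
unique.</p>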
<p>The format of the output is:</p>
<p>&lt;seq id&gt; &lt;seq start&gt; &lt;seq end&gt; &lt;hmm acc&gt; &lt;hmm start&gt; &lt;hmm end&gt; &lt;bit score&gt; &lt;evalue&gt; &lt;hmm name&gt;</p>
<p>hmmpfam returns scores for sequence and domain matches separately.
For simplicity, the one-line-per-domain format returned here
reports domain scores.</p>
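<p>For downstream processing, each output line can be turned into a named
record. The following Python sketch assumes that the nine fields are
separated by runs of whitespace and that none of them contains
whitespace itself. The Domain record and parse_line function are
illustrative names, not part of the script.</p>
<pre>
from collections import namedtuple

Domain = namedtuple(
    "Domain",
    "seq_id seq_start seq_end hmm_acc hmm_start hmm_end "
    "bit_score evalue hmm_name")


def parse_line(line):
    """Parse one output line into a Domain record.

    Assumption: exactly nine whitespace-separated fields per line.
    """
    fields = line.split()
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    return Domain(
        seq_id=fields[0],
        seq_start=int(fields[1]),
        seq_end=int(fields[2]),
        hmm_acc=fields[3],
        hmm_start=int(fields[4]),
        hmm_end=int(fields[5]),
        bit_score=float(fields[6]),
        evalue=float(fields[7]),
        hmm_name=fields[8],
    )</pre>
<p>Coordinates are converted to integers and the scores to floats, so the
records can be sorted or filtered numerically.</p>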
<p>
</p>
<hr />
<h1><a name="bugs">BUGS</a></h1>
<p>Many options are not rigorously tested. Error messages are
uninformative. The documentation is inadequate. You may find it
useful. You may not.</p>
<p>
</p>
<hr />
<h1><a name="history">HISTORY</a></h1>
<pre>
Version     Main changes
-------     ------------

0.5         Removes overlapping above-threshold hits to families
            within the same clan. --overlap overrides.

0.4         Work-around for hmmpfam bug/feature that reports hits
            above domain threshold even if the sequence doesn't
            score above the sequence threshold.

0.3         Fix minor bugs to be compatible with HMM versions in
            Pfam 13.

0.2         --fast option to use BLAST preprocessing for significant
            speed-up.

0.1         First effort, simply wraps up hmmpfam without doing
            anything clever.</pre>
<p>
</p>
<hr />
<h1><a name="contact">CONTACT</a></h1>
<p>This script is copyright (c) Genome Research Ltd 2002-2005. Please
contact <a href="mailto:pfam@sanger.ac.uk">pfam@sanger.ac.uk</a> for help.</p>

</body>

</html>