################################################################################### # DENSERM_P: A package to DEtect Negative SElection on Recurrent Mutations # (almost exclusively via Perl modules and scripts). # Version 0.3: Copyright (C) 2012 Kiyoshi Ezawa # Version 0.5: Copyright (C) 2013 Kiyoshi Ezawa # # This package is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This package is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License, "GNU_GPL.txt", # along with this package. If not, see . # # The author can be contacted by e-mailing to # (replace " dot " and " at " with "." and "@", respectively). ################################################################################### This package contains Perl scripts and Bourne-shell scripts that, combined together, (i) perform a coalescent simulator 'msms' (Ewing and Hermission 2010), each session of which generates a set of sequences harboring a recurrently mutating locus and a specified number of flanking SNP sites in a population; and (ii) conduct various neutrality tests, traditional and devised by us, on the simulated sequence sets. The "traditional tests" are: Tajima's D test (Tajima 1989), Ewens' test (Ewens 1973), and the Ewens-Watterson (EW) test (Watterson 1978). The tests devised by us are (Ezawa, Landan and Graur (submitted)): (a) Max^D|_M, which is the maximum number of identical-by-descent mutants in a sample (Max^D), conditional on the number of detected forward mutations (M); (b) Tot^D|_M, which is the total number of mutants in a sample (Tot^D), conditional on the number of detected forward mutations (M); and (c) their "easy emulators" (explained below). The package also contains Perl modules that are necessary for the neutrality tests, including "MyTreeMap_IImlt.pm", which implements our new parsimony algorithm that, if necessary, partially resolves genealogical relationships on a multi-furcated input tree (Ezawa, Landan, and Graur (submitted)). -------------------------[CAVEAT1]------------------------------------------ The current version can only handle sequence data simulated via 'msms', under simple population genetics settings. Handling real-life data requires incorporating more realistic aspects, like demography, recombination, etc., which is currently underway. [NOTE ADDED on the release of ver 0.5: This version can handle an exponentially expanding population starting with a constant-size ancestral population (i.e. a bottleneck), in addition to a simple constant-size population.] ------------------------------------------------------------------------------- -------------------------[CAVEAT2]------------------------------------------ Thus far, I confirmed that the package works on the termnal of my Mac OS X platform (10.6 and 10.5 on Intel Mac). And I suspect that it should work on other UNIX-type platforms, such as Linux, as well (but I am not sure). On the other hand, I am almost sure that it won't work on Windows, because the modules and scripts heavily depend on Unix commands. So, for an exclusive Windows-user to use the package, he/she will have to learn how to use Unix (and maybe to install a Unix emulator). I hope that somebody with a goodwill will adapt the package to non-Unix platforms... ------------------------------------------------------------------------------- If you find any problems that is not solvable by yourself (and to which the above caveats do NOT apply), please e-mail the author at the address: (replace " dot " with ".", and " at " with "@"). [ Directory (i.e. folder) structure ] DENSERM_P/ is the "root directory" of the package. It contains the following sub-directories and files: * DISCLAIMER.txt Disclaimer of warranty and limitation of liability extracted from the GNU General Public License, version 3. * GNU_GPL.txt Copy of the GNU General Public License, version 3. * MyPerlModules/ Directory containing Perl modules necessary for the project. Among the modules, "MyTreeMap_IImlt.pm", "MyPopgenGenealogy.pm", and "MyPopgenAlleleFreq.pm" play main roles in the new and traditional neutrality tests. Especially, "MyTreeMap_IImlt.pm" implements our new parsimony algorithm (explained at the top; Ezawa, Landan, and Graur, submitted). Other modules either logistically support the three main modules or take care of the input/output of data. * Parsing3/ Directory containing Perl scripts and Bourne-shell scripts that parse the simulation results and perform our two new neutrality tests on simulated sequence sets. * Parsing5/ Directory containing Perl scripts and Bourne-shell scripts that parse the simulation results and perform traditional neutrality tests, as well as "easy emulators", on simulated sequence sets. * README.txt File you are now reading. * Simulations/ Directory containing a Perl script and a Bourne-shell script that perform a batch execution of the coalescent simulator, 'msms' (Ewing and Hermission 2010), under a simple constant-size population. * Simulations.Exp/ Directory containing a Perl script and a Bourne-shell script that perform a batch execution of 'msms,' under an exponentially expanding population, starting with a constant-size ancestral population. (Added in creation of ver. 0.5.) (The following gzipped tar-archives should accompany "DENSERM_P.vxxx.tar.gz" on the project FTP-directory at http: slash slash www dot bioinformatics dot org slash ftp slash pub slash DENSERM (replace " slash " and " dot " with "/" and ".", respectively).) * NullDistributions.vxxx.tar.gz [ --- now separated from the package.] Gzipped tar-archive containing "empirical" null distributions of almost all test statistics used for the analyses described in our paper (Ezawa, Landan, and Graur (submitted)). (v0.3, null distributions under a constant-size population; v0.5, null distributions under expanding populations.) * NullDistributions_comb.vxxx.tar.gz [ --- now separated from the package.] Gzipped tar-archive containing "empirical" null distributions of a "hybrid" test statistics, which is the average pairwise sequence divergence among mutants conditional on the number of detected forward mutations (described in Ezawa, Landan, and Graur (submitted)). [ How to use the package ] * GENERAL NOTE: In this package, you should run a Perl script "xxx.pl" by issuing a command "[time] perl xxx.pl", where "[time]" means that you could either prepend or omit the "time" command, and a Bourne-shell script "yyy.sh" should be run via a command "[time] sh yyy.sh". (Alternatively, you could change the permission mode of each script via, e.g., "chmod u+x {script anme}", and issue a command "[time] ./{script name}". I would not recommend it much, though...) 0. If not yet, install the coalescent simulator 'msms' (Ewing and Hermission 2010) into your platform. 'msms' and instructions are available at the URL, http: slash slash www dot mabs dot at slash ewing slash msms slash (replace " dot " and " slash " with "." and "/", respectively, but leave "at" as it is here). 1. Extract the archive, "additional_file_3.zip" or "DENSERM_P.vxxx.tar.gz", via the command, "unzip DENSERM_P.vxxx.zip" or "tar -xpzf DENSERM_P.vxxx.tar.gz" (, which I suspect you've already done). Then, 'cd' to the top directory, "DENSERM_P", which popped out of the archive. 2. Add the absolute path of the directory "MyPerlModules" to the environment variable "PERL5LIB". If you are using bash, add the following lines to an appropriate place in the ".bashrc" file in your home directory: ---------- lines to be added to "~/.bashrc" ----------------- if [ -n "$PERL5LIB" ]; then export PERL5LIB=${PERL5LIB}:{the absolute path of "MyPerlModules"} else export PERL5LIB={the absolute path of "MyPerlModules"} fi -------------- end --------------------------- If you are using tcsh (or csh), add the following lines to a proper place in '.tcshrc' (or '.cshrc') in your home directory: ---------- lines to be added to "~/.tcshrc" (or "~/.cshrc") ----------------- if (($?PERL5LIB) && ("$PERL5LIB" !~ "")) then setenv PERL5LIB ${PERL5LIB}:{the absolute path of "MyPerlModules"} else setenv PERL5LIB {the absolute path of "MyPerlModules"} endif -------------- end --------------------------- Alternatively, you could 'cp' the modules in "MyPerlModules" to a directory that is already listed in "PERL5LIB". (You can confirm the directories listed in the environment variable via, e.g., "echo $PERL5LIB".) 3. 'cd' to "Simulations/", edit the scripts "hyperserial_simul_msms_bch.pl" and "bch_hypser_siml_msms.sh" appropriately, and run the latter script, which will automatically run the former script sequentially. Especially, modify the following variables as you wish: (a) In "hyperserial_simul_msms_bch.pl", $Ne (= effective population size), $CT_REP (= #{simulated replicates} per a single setting), $CT_SEGSITES (= #{segregating sites}), @SMUs (= a list of 4*$Ne*mu, where mu is the forward mutation rate per generation), @SAAs (= a list of 2*$Ne*sAA, where 1 + sAA is the relative fitness of a mutant homozygote), @SSIZES (= a list of sample sizes); (b) In "bch_hypser_siml_msms.sh", the list of values that the variable "snu" can assume in the for-loop. Here "snu" = 4*$Ne*nu, where nu is the backward mutation rate per generation. The output will be produced into the sub-directory "FullScale/" (under "Simulations/") by default. CAUTION: Depending on the parameter setting (especially $CT_REP), this session can generate a very large amount (up to 100 GB, or maybe > 1TB!!) of output data. So you should start with a small value of $CT_REP (such as 5) and get an idea on how much storage will be necessary for your full-scale analysis. If you want, you could divide the simulation session into several sub-sessions, each of which generates a moderate amount of output. Then, after conducting each sub-session, perform 4 (i) and 5 (i) below, and move the output of the simulation session to a backup storage device. After all sub-sessions go through this cycle, you can merge the results by using the script "merge_br_pars_msms.{3,5}.pl" in "Parsing{3,5}". This way, you could manage to perform and parse a tremendous number of simulation replicates even if your storage device has only a limited volume. [Although a better way would be to parse the simulation results "on the fly", I did not implement that function. If you want, you could try it by combining the Perl scripts,"hyperserial_simul_msms_bch.pl", "hyperser_br_pars_out_msms.3b_bch.pl", and "hyperser_br_pars_out_msms.5b_bch.pl".] 4. 'cd' to "Parsing3/", and conduct our new neutrality tests. The standard order of works is as follows. (i) Edit "hyperser_br_pars_out_msms.3b_bch.pl" and "bch_hypser_br_pars_out_msms.sh" appropriately, and run the latter (as in the 3rd session). The following variables may have to be edited: (a) In "hyperser_br_pars_out_msms.3b_bch.pl", $CT_SSITES (=#{segregating sites to be kept}), which must be <= $CT_SEGSITES in the 3rd session; (b) In "bch_hypser_br_pars_out_msms.sh", The list of values that the variable "snu" can take in the for-loop. (Normally, this needs to be identical to that in the 3rd session.) The remaining variables (such as $CT_REP, @SMUs and @SAAs) will be automatically adjusted... The output will be produces in the sub-directory "FullScale/" (under "Parsing3/") by default. (ii) 'cd' to "Summary1/" (under "Parsing3/"), edit the Perl script "summarize_br_pars_msms.1.3.pl" if necessary, and run it. (Usually the script will not have to be edited. In a non-default setting, however, you may have to modify $intopdir, for example.) (iii) 'cd' to "NullDistributions/" (under "Parsing3/"), then further 'cd' to its subdirectories one by one. The subdirectories are: "Maxctmuts_cond_ctfwmuts/", for our new statistic, Max^D|_M (explained at the top); and "Totctmuts_cond_ctfwmuts/", for our new statistic, Tot^D|_M (explained at the top). In each of the subdirectories, edit the Perl script "make_nulldistri.xxx.pl", and run it. You may need to modify the following variables: $LB_SMD_WGT, which is the lower-boundary of the summed weight of the parameter category that will be incorporated into the null-distribution, $POWER_SMU, which specifies the exponent for the power-law distribution, which is assumed for the forward mutation rate that are selectively neutral, @SMUs, which lists the values of $Smu = 4*$Ne*mu, @LABELS_SNU, which lists the values of $Snu = 4*$Ne*nu expressed in terms of $Smu, @SSIZES, which lists the sample sizes, [the above three sets need to be identical to or included in the counterparts specified in 3 (a),] @SETS_WEIGHTS_RATIOS, which specifies the combinations of $Snu/$Smu incorporated in the null distribution, @NAMES_SETS_WEIGHTS_RATIOS, which lists the names of the combinations. ** Alternatively, you can extract the gzipped tar-archive "NullDistributions.tar.gz" into a directory of your choice (preferably into "DENSERM_P/"), and use the null distributions contained in the subdirectory "Parsing3/" under the resulting directory "NullDistributions/" . (iv) 'cd' to "StatTests/" (under "Parsing3/"), then further 'cd' to each of its two subdirectories, "Maxctmutss_cond_ctfwmuts" and "Totctmuts_cond_ctfwmuts", and edit and run the Perl script "stattests_via_xxx.pl" in each subdirectory. The variables that may need to be modified are: $POWER_SMU, @SMUs, @LABELS_SNU, @SSIZES, each of which has to be identical to or included in that specified in (iii); @LABELS_SAAs, which should be identical to or included in @SAAS in 3 (a); and $auxdir, which specifies the location of the null distribution to be used. (v) Finally, 'cd' to "StatTests/SUMMARY1", then further 'cd' to each of its subdirectories, "Maxctmutss_cond_ctfwmuts" and "Totctmuts_cond_ctfwmuts", and edit and run the Perl script "summarize_stattests1b.pl" in each subdirectory. The variables that may need to be modified are: $POWER_SMU, @SMUs, @LABELS_SNU, @SSIZES, @LABELS_SAAs, each of which has to be identical to that specified in (iv); $auxdir, which specifies the location of the null distribution to be used. NOTE: $auxdir may not be identical to that in (iv), because the relative paths differ. 5. 'cd' to "Parsing5/", and conduct traditional neutrality tests, as well as "easy emulators" of our new tests. The standard order of works is very similar to that for the 4th session. Here I only note remarkable differences. (ii) The counterpart of "Summary1/" now consists of three directories, "Summary1wh/", "Summary1mut/", and "Summary1mut_comb/". "Summary1wh/" contains the results of Tajima's D test (Tajima 1989), Ewens' test (Ewens 1973), the Ewens-Watterson test (Watterson 1978) on the simulated sequence sets; "Summary1mut/" contains the ingredients of easy emulators (Ezawa, Landan and Graur (submitted)); "Summary1mut_comb/" contains the ingredients of the "hybrid test" (Ezawa, Landan and Graur (submitted)). (iii), (iv), (v) The subdirectories "NullDiscributions/", "StatTests/", and "StatTests/SUMMARY1/" all contain the sub-directories of the same names. Among them: "D_taj/", "EW_test(.REV)/", and "Ewens_test(.REV)/" are self-explanatory; "Frqcmnstmut_cond_ctmuthaplos/" is for an "easy emulator", Max^Hm|_Km, which is the frequency of the commonest mutant haplotype in the sample (Max^HM), conditional on the number of all mutant haplotypes (Km); "Totctmuts_cond_ctmuthaplos/" is for a second "easy emulator", Tot^Hm|_Km, which is the total frequency of the mutant haplotypes (= total #{mutants}) in the sample, conditional on Km; "Pimut_cond_classctmuts/" and "Pimut_cond_ctmuthaplos/" are for a third "easy emulator", pi_Hm, which is the average pairwise sequence difference among mutant sequences (Innan H and Teshima MK, personal communication), as well as its conditional versions (the former and the latter being conditional on the number of mutants themselves and on the number of mutant haplotypes, respectively). Besides, if you want, similar analyses could be conducted also on "NullDistributions_comb/", "StatTests_comb/", and "StatTests_comb/SUMMARY1/", each of which contains a subroutine named "Pimut_cond_ctfwmuts/". This subroutine is for a "hybrid test statistic", pi_Hm|_M, which is the average pairwise sequence difference among mutant sequences, conditional on the number of forward mutations detected (Ezawa, Landan, and Graur, submitted). Because of this, you may have to specify the location of "Summary3/StatTests/Totctmuts_cond_ctfwmuts/..." via $auxtopdir (or $auxtopdir2). [ References ] * Ewens WJ. 1973. "Testing for increased mutation rate for neutral alleles." Theor Popul Biol 4:251-258. * Ewing G, Hermission J. "MSMS: a coalescent simulation program including recombination, demographic structure, and selection at a single locus." Bioinformatics 26:2064-2065. * Ezawa K, Landan G, and Graur D. (submitted). "Detecting negative selection on recurrent mutations using gene genealogy." Submitted to BMC Genetics. * Tajima F. 1989. "Statistical method for testing the neutral mutation hypothesis by DNA polymorphism." Genetics 123:585-595. * Watterson GA. 1978. "The homozygosity test of neutrality." 88:405-417.