###################################################################################
#
# ComplLiMment_P: A package to calculate the Complete-likelihood for a given Multiple-sequence alignment
#                 (almost exclusively via Perl modules and scripts).
#
# Version 0.6:     Copyright (C) 2015 Kiyoshi Ezawa
# Version 0.6.1:   Copyright (C) 2015 Kiyoshi Ezawa
# Version 0.6.1.1: Copyright (C) 2015 Kiyoshi Ezawa
# Version 0.6.1.2: Copyright (C) 2016 Kiyoshi Ezawa
# Version 0.6.1.3: Copyright (C) 2016 Kiyoshi Ezawa
# Version 0.6.1.5: Copyright (C) 2016 Kiyoshi Ezawa
#
# This package is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This package is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License, "GNU_GPL.txt",
# along with this package. If not, see .
#
# The author can be contacted by e-mailing to
# (replace " dot " and " at " with "." and "@", respectively).
###################################################################################
#

* [ Major modification from version 0.6.1.3 to version 0.6.1.5 ]
  The reference names, "Ezawa. unpublished a" and "Ezawa. unpublished b",
  were changed into "Ezawa 2016a" and "Ezawa 2016b", respectively.
  And their reference information was updated.

* [ Major modification from version 0.6.1.2 to version 0.6.1.3 ]
  A bug was removed from the subroutine, 'correl_linregr_sglexpl_wgt',
  in the Perl Module, 'MyPerlModules_LOLIPOG/MyLinearRegression.pm'.

* [ Modifications from version 0.6.1.1 to version 0.6.1.2 ]
  The templates of Dawg control files were replaced.
  In the old version, their simulation parameters were determined from a benchmark MSA set,
  which was based on 3D structural alignments.
  (These old templates and simulation scripts using them are still available in the package, "FA_LOLIPOG_P.")
  In this version, the parameters are based on the evolutionary analyses of mammalian DNA sequences,
  especially those undergoing neutral or nearly neutral evolution (e.g., Pollard et al. 2010).
  Accordingly, other parts of the package were also slightly modified where relevant.

* [ Modifications from version 0.6.1 to version 0.6.1.1 ]
  A few bugs were fixed in the Perl scripts,
  'classify_msa_errors_via_mblks.2.pl' and 'classify_msa_errors_via_mblks.alpha2.pl'.
  1) "$state_ue1 == $state_ue2" was corrected to "$state_ue1 eq $state_ue2".
  2) "$br2 == $equiv_branches{$br2}" was corrected to "$br2 == $equiv_branches{$br1}".

* [ Major modifications from version 0.6 to version 0.6.1 ]
  The Perl scripts to classify MSA errors
  ('classify_msa_errors_via_mblks.1.pl' and 'classify_msa_errors_via_mblks.alpha.pl')
  were replaced with their updated versions
  ('classify_msa_errors_via_mblks.2.pl' and 'classify_msa_errors_via_mblks.alpha2.pl', respectively).
  Accordingly, the related scripts and modules were also updated.

-------------------------[CAVEAT1]------------------------------------------
The current version was designed mainly to analyze (almost selectively neutral) DNA MSAs,
and performance tests were conducted only on simulated DNA MSAs generated by DAWG (Cartwright 2005).
Although the package is in principle applicable to real-life MSAs of DNA or protein sequences,
caution should be exercised when doing such analyses,
because a real-life MSA in general is a product of more complex evolutionary processes
(some examples are discussed in Ezawa, Graur and Landan (2015b)),
and because NO MSAs reconstructed by sequence alignment programs are guaranteed to be correct.
-------------------------------------------------------------------------------

-------------------------[CAVEAT2]------------------------------------------
Thus far, I have confirmed that the package works on the terminal of my Mac OS X platform (10.6 on Intel Mac),
and on a GNU/Linux platform (x86_64).
I suspect that it should work on other UNIX-type platforms as well (although I am not completely sure).
On the other hand, I am almost sure that it won't work on Windows,
because the modules and scripts heavily depend on Unix commands.
So, for an exclusive Windows-user to use the package,
he/she will have to learn how to use Unix (and maybe to install a Unix emulator).
I hope that somebody with goodwill will adapt the package to non-Unix platforms...
-------------------------------------------------------------------------------

If you find any problems that are not solvable by yourself (and to which the above caveats do NOT apply),
please e-mail the author at the address:
(replace " dot " with ".", and " at " with "@").


[ Directory (i.e. folder) structure ]

ComplLiMment_P.ver0.6.1/ is the "root directory" of the package (ver 0.6.1).
It contains the following sub-directories and files:

 * README.txt
     File you are now reading.

 * DISCLAIMER.txt
     Disclaimer of warranty and limitation of liability
     extracted from the GNU General Public License, version 3.

 * GNU_GPL.txt
     Copy of the GNU General Public License, version 3.

 * Sample_Scripts/
     Directory containing sample Perl scripts that concretely show
     how the main analyses in (Ezawa 2016b) were performed.
     Especially important are the two scripts,
     "compare_ref_vs_rec_compllimment.hs.alpha.pl" and "classify_msa_errors_via_mblks.alpha2.pl."
     The former compares the complete likelihood scores (ibid) of two MSAs ("reference" and "reconstructed")
     in the "erroneous segments," where the two MSAs disagree.
     The latter classifies the MSA error in each erroneous segment using the position-shift map (ibid).
     There are three other scripts, which will be useful for preparing the input files of the above two scripts.
     The file, "MANUAL.sample_scripts.txt," describes how to use these scripts.

 * MyPerlModules/
     Directory containing Perl modules that are necessary for this package
     but that are not available either as a part of the DENSERM_P package (Ezawa 2013a)
     or as a part of the FA_LOLIPOG_P package (Ezawa, Graur and Landan 2015a).

 * MyPerlModules_DENSERM/
     Directory containing Perl modules that are already available
     as a part of the DENSERM_P package (Ezawa 2013a; Ezawa, Landan and Graur 2013).
     Some of them are necessary for this ComplLiMment_P package as well.

 * MyPerlModules_LOLIPOG/
     Directory containing Perl modules that are already available
     as a part of the FA_LOLIPOG_P package (Ezawa 2013b; Ezawa, Graur and Landan 2015a).
     Some of them are necessary for this ComplLiMment_P package as well.

     In particular, the following modules are essential for this package:
      + "MyPerlModules/MyTreeMap_indels_spt_odr.pm,"
          which is essential for our sub-algorithm that attempts to enumerate
          all possible parsimonious insertion/deletion histories,
          each of which results in the homology structure of a gapped segment in a given MSA.
      + "MyPerlModules/MyTreeMap_indels_ML_hs.pm,"
          which is essential for calculating the logarithmic probability (i.e., log-likelihood)
          of the homology structure.
     The algorithms are described briefly in (Ezawa 2016a)
     and in detail in (Ezawa, Graur and Landan, 2015a).

 * ANALYSES/
     Directory containing major Perl scripts,
     a set of tables of pre-computed PWA multiplication factors,
     as well as control files of Dawg (Cartwright 2005),
     that were used for the analyses to characterize MSA errors
     using the complete-likelihood and the "position-shift map."
     Methods and Results of the analyses are described in (Ezawa 2016b).
     See "README.ANALYSES.txt" in the directory for more details.
     [NOTE: The directory also contains the script,
      "ANALYSES/MSA_Prcd_Errors/Contrast_Analyses/Classify_Errors/Erroneous_Regions/Validation/Scripts/classify_msa_errors_via_mblks.2val.pl,"
      which was used for the manual validation of our method to classify MSA errors (Ezawa 2016b).]


[ How to use the package ]

 * GENERAL NOTE:
     In this package, you should run a Perl script "xxx.pl" by issuing a command "perl xxx.pl."
     (Alternatively, you could change the permission mode of each script via, e.g., "chmod u+x {script name}",
      and issue a command "./{script name}". I would not particularly recommend this, though...)

 0. a) If you have not done so yet, you NEED TO install the maximum-likelihood phylogeny software,
       'PhyML' (Guindon et al. 2010), on your platform.
       'PhyML,' its manual, etc. are available at the URL,
         http colon slash slash www.atgc-montpellier.fr slash phyml slash
       (replace ' colon ' and ' slash ' with ':' and '/', respectively).

    b) If you have not done so yet, you might also want to install the molecular evolution simulator,
       'Dawg' (Cartwright 2005), on your platform,
       especially if you want to create simulated MSAs by yourself.
       'Dawg,' its brief manual, etc. are available at the URL,
         http colon slash slash scit.us slash projects slash dawg
       (replace ' colon ' and ' slash ' with ':' and '/', respectively).

    c) If you have not done so yet, you might also want to install the two multiple sequence aligners,
       'MAFFT' (Katoh and Toh 2008) and 'PRANK' (Loytynoja and Goldman 2008), on your platform,
       especially if you want to trace the path we took to characterize MSA errors (Ezawa 2016b).
       'MAFFT,' its manual, etc. are available at the URL,
         http colon slash slash mafft.cbrc.jp slash alignment slash software slash
       (replace ' colon ' and ' slash ' with ':' and '/', respectively).
       'PRANK,' its manual, etc. are available at the URL,
         http colon slash slash wasabiapp.org slash software slash prank slash
       (replace ' colon ' and ' slash ' with ':' and '/', respectively).
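       Before going further, it may help to check which of the programs named in step 0
       are already on your PATH. The following is only a minimal sketch: the function name
       "missing_tools" is hypothetical, and the executable names passed to it
       ("phyml", "dawg", "mafft", "prank") are assumptions that may differ on your installation.

```shell
# Hypothetical helper (not part of the package):
# print the name of each given command that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        # 'command -v' succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names; adjust them to match your installation.
missing_tools perl phyml dawg mafft prank
```

       Any name printed by the last command is a program that the shell could not find;
       no output means that all five were found.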
       (To enable our analyses on MSAs reconstructed via MAFFT, MAFFT needs to be slightly modified.
        See "ANALYSES/Mod_MAFFT/HOW_TO.txt" for details.)

 1. Extract the archive, "Additional_file_2.zip" or "ComplLiMment_P.vxxx.tar.gz", via the command,
    "unzip Additional_file_2.zip" or "tar -xpzf ComplLiMment_P.vxxx.tar.gz"
    (which I suspect you've already done).
    Then, 'cd' to the top directory, "ComplLiMment_P.vxxx/", which was extracted from the archive.

 2. Add the absolute paths of the sub-directories
    "MyPerlModules/," "MyPerlModules_DENSERM/" and "MyPerlModules_LOLIPOG/"
    to the environment variable "PERL5LIB."

    If you are using bash, add the following lines to an appropriate place
    in the ".bashrc" file in your home directory:

    ---------- lines to be added to "~/.bashrc" -----------------
    if [ -n "$PERL5LIB" ]; then
        export PERL5LIB=${PERL5LIB}:{the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
    else
        export PERL5LIB={the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
    fi
    -------------- end ---------------------------

    If you are using tcsh (or csh), add the following lines to a proper place
    in '.tcshrc' (or '.cshrc') in your home directory:

    ---------- lines to be added to "~/.tcshrc" (or "~/.cshrc") -----------------
    if (($?PERL5LIB) && ("$PERL5LIB" !~ "")) then
        setenv PERL5LIB ${PERL5LIB}:{the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
    else
        setenv PERL5LIB {the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
    endif
    -------------- end ---------------------------

    Alternatively, you could 'cp' the modules in
    "MyPerlModules/," "MyPerlModules_LOLIPOG/" and "MyPerlModules_DENSERM/"
    to a directory that is already listed in "PERL5LIB."
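    As a concrete illustration of the bash lines above, the following sketch wraps the same idea
    in a small shell function. The function name "add_compllimment_libs" and the example root
    directory in the usage note are hypothetical; the function also skips a directory that is
    already listed, which the lines above do not attempt.

```shell
# Hypothetical helper (not part of the package):
# append the three module directories under the given package root to PERL5LIB.
add_compllimment_libs() {
    local root dir
    root="$1"
    for dir in MyPerlModules MyPerlModules_LOLIPOG MyPerlModules_DENSERM; do
        case ":${PERL5LIB:-}:" in
            *":${root}/${dir}:"*)
                ;;  # already listed; leave PERL5LIB unchanged
            *)
                # Add a ':' separator only when PERL5LIB is already non-empty.
                PERL5LIB="${PERL5LIB:+${PERL5LIB}:}${root}/${dir}"
                ;;
        esac
    done
    export PERL5LIB
}
```

    For example, if the archive had been extracted under your home directory, you might call
    'add_compllimment_libs "$HOME/ComplLiMment_P.vxxx"' (with "vxxx" replaced by the actual version)
    from "~/.bashrc".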
    (You can confirm the directories listed in the environment variable via, e.g., "echo $PERL5LIB.")

 3. Now you can run the Perl script, "compllimment.alpha.pl," in the subdirectory, "Sample_Scripts/,"
    to find parsimonious local indel histories that can explain
    the homology structure of each gapped segment of a given MSA,
    calculate the probabilities of their relative contributions,
    calculate the approximate probability that the homology structure of each segment occurs,
    and calculate the complete-likelihood score of the input MSA.
    [Depending on the situation, you will also have to run one or both of the auxiliary scripts,
     "preprocess_msa_dawg.pl" and "Sample_Scripts/InData/calc_log_mfacs_pars_pwa_dawg.pl."]
    For instructions on how to do that, read the file,
    "manual.compllimment.alpha.txt," in the subdirectory.

 4. Or, alternatively, you could roughly trace the path that we took to characterize MSA errors
    using the complete-likelihood score and the "position-shift map,"
    as described in our manuscript (Ezawa 2016b).
    This can be done by 'cd'ing to "ANALYSES/" and running the scripts in its sub-directories in a proper order.
    See "README.ANALYSES.txt" for more detailed instructions.
    (NOTE: To trace the entire web of paths we took, you will need lots of minor Perl scripts.
     If you want to do this, please contact the author of this package (K.Ezawa).)


[ References ]

 * Cartwright RA. 2005.
     "DNA assembly with gap (Dawg): simulating sequence evolution."
     Bioinformatics 21:iii31-iii38.

 * Ezawa K. 2013a.
     "DENSERM: DEtecting Negative SElection on Recurrent Mutations,"
     in Bioinformatics.org
     [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash DENSERM"
      (replace ' colon ' and ' slash ' with ':' and '/', respectively)].

 * Ezawa K. 2013b.
     "LOLIPOG: LOg-LIkelihood for the Pattern Of Gaps in MSA,"
     in Bioinformatics.org
     [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash lolipog"
      (replace ' colon ' and ' slash ' with ':' and '/', respectively)].

 * Ezawa K. 2016a.
     "General continuous-time Markov model of sequence evolution via insertions/deletions:
      local alignment probability computation."
     BMC Bioinformatics 17:397. (DOI: 10.1186/s12859-016-1167-6).

 * Ezawa K. 2016b.
     "Characterization of multiple sequence alignment errors
      using complete-likelihood score and position-shift map."
     BMC Bioinformatics 17:133. (DOI: 10.1186/s12859-016-0945-5).

 * Ezawa K, Landan G, Graur D. 2013.
     "Detecting negative selection on recurrent mutations using gene genealogy."
     BMC Genetics 14:37.

 * Ezawa K, Graur D, Landan G. 2015a.
     "Perturbative formulation of general continuous-time Markov model of sequence evolution
      via insertions/deletions, Part III: Algorithm for first approximation."
     bioRxiv doi:10.1101/023614.

 * Ezawa K, Graur D, Landan G. 2015b.
     "Perturbative formulation of general continuous-time Markov model of sequence evolution
      via insertions/deletions, Part IV: Incorporation of substitutions and other mutations."
     bioRxiv doi:10.1101/023622.

 * Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. 2010.
     "New algorithms and methods to estimate maximum-likelihood phylogenies:
      assessing the performance of PhyML 3.0."
     Syst Biol. 59:307-321.

 * Katoh K, Toh H. 2008.
     "Recent developments in the MAFFT multiple sequence alignment program."
     Brief Bioinformatics. 9:286-298.

 * Loytynoja A, Goldman N. 2008.
     "Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis."
     Science 320:1632-1635.

 * Lunter G, Miklos I, Drummond A, Jensen JL, Hein J. 2005.
     "Bayesian coestimation of phylogeny and sequence alignment."
     BMC Bioinformatics 6:83.

 * Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010.
     "Detection of nonneutral substitution rates on mammalian phylogenies."
     Genome Res. 20:110-121.


# First version of this file was created on Sep 2nd (Wed), 2015 by K. Ezawa.
# It was rewritten on Oct 30 (Fri) and Oct 31 (Sat), 2015, when the package was updated from ver. 0.6 to ver. 0.6.1, by K. Ezawa.
# It was modified on Jan 12 (Tue), 2016, when the package was updated from ver 0.6.1.1 to ver. 0.6.1.2 by K. Ezawa.
# It was further modified on May ? (??), 2016, when the package was updated from ver 0.6.1.2 to ver. 0.6.1.3 by K. Ezawa.
# It was further modified on October 8 (Sat), 2016, when the package was updated from ver 0.6.1.3 to ver. 0.6.1.5 by K. Ezawa.