###################################################################################
#
# LASTPIECE_P (Local Alignment-STate Probabilities that Insertion-type and dEletion-type gaps Co-Exist, type P):
#   A package of programs to compute the probabilities of gapped segments in each of which an insertion-type gap and a deletion-type gap co-exist,
#   (which were referred to as case-(iv) gapped segments by (Ezawa 2016a)),
#   under a stochastic model of sequence evolution with biologically realistic insertions/deletions.
#   (Written almost exclusively in Perl.)
#
# Version 0.3: Copyright (C) 2020 Kiyoshi Ezawa
# Version 0.3.1: Copyright (C) 2020 Kiyoshi Ezawa
#
# This package is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This package is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License, "GNU_GPL.txt",
# along with this package. If not, see .
#
# The author can be contacted by e-mailing to
# (replace " dot " and " at " with "." and "@", respectively).
#
###################################################################################
#

* [ Major modification from version 0.3 to version 0.3.1 ]
  * Information on the references, Ezawa 2020a,b,c, was updated.

-------------------------[CAVEAT1]------------------------------------------
Although the package fairly accurately computes the probabilities of case-(iv) gapped segments
under a genuine sequence evolution model with realistic insertions/deletions,
it should be kept in mind that real-life DNA sequences may in general undergo other types of mutations, such as inversions and duplications (some examples are discussed in Ezawa, Graur and Landan (2015b)).
-------------------------------------------------------------------------------

-------------------------[CAVEAT2]------------------------------------------
Thus far, I have confirmed that the package works on the terminal of my Mac OS X platform (10.11 & 10.13 on Intel Mac with SSD storage).
I suspect that it should work on other UNIX-type platforms, including Linux, as well (although I am not completely sure).
On the other hand, I am almost sure that it won't work on Windows, because the modules and scripts heavily depend on Unix commands.
So, for an exclusive Windows-user to use the package, he/she will have to learn how to use Unix (and maybe to install a Unix emulator).
I hope that somebody with goodwill will adapt the package to non-Unix platforms...
-------------------------------------------------------------------------------

-------------------------[CAVEAT3]------------------------------------------
Currently, this package (fairly accurately) computes ONLY multiplication factors IN THE BULK.
As for the multiplication factors ON SEQUENCE ENDS, the package just creates a link to the corresponding bulk multiplication factors.
I apologize for the inconvenience this may cause.
Although, in principle, the factors on sequence ends could be computed by slight modifications of the formulas (in Ezawa 2020a), I had no time to implement this computation.
In any case, however, CAUTION should be exercised when doing analyses involving SEQUENCE ENDS,
because NO (genuine) sequence evolution MODELS AVAILABLE at present take serious account of indels on sequence ends,
whose behaviors could substantially vary depending on how the sequences are collected, biological properties near the ends, etc.
I hope that, some day, someone will propose a genuine sequence evolution model that takes serious account of sequence ends.
-------------------------------------------------------------------------------

When you read this "README.txt" file, it is most likely that I am no longer in this world.
Therefore, if you find any problems that are not solvable by yourself (and to which the above CAVEATS do NOT apply),
please consult nearby experts on IT (information technology) and/or molecular evolution (especially those familiar with the problems on insertions/deletions).


[ Directory (i.e. folder) structure ]

"LASTPIECE_P.ver0.3/" is the "root directory" of the package (ver 0.3).
It contains the following sub-directories and files:

* README.txt
    File you are now reading.

* DISCLAIMER.txt
    Disclaimer of warranty and limitation of liability extracted from the GNU General Public License, version 3.

* GNU_GPL.txt
    Copy of the GNU General Public License, version 3.

* Main_Scripts/
    Directory containing the Perl scripts that perform the MAIN jobs of this package.
    The main scripts are broadly classified into two groups: the master script, "lastpiece.alpha.pl," and the servant scripts in the sub-directory, "Servants/."
    In short, the master script calls the servant scripts to compute (exhaustively within a length upper-bound) the multiplication factors of gap-configurations of gapped segments in ancestor-descendant pairwise alignments (PWAs) at a series of time-lapses, using the method described in (Ezawa 2020a).
    The file, "MANUAL.main_scripts.txt," describes how to use these scripts.

* MyPerlModules_DENSERM/
    Directory containing Perl modules that are already available as a part of the DENSERM_P package (Ezawa 2013a; Ezawa, Landan and Graur 2013).
    Some of them are necessary for this LASTPIECE_P package as well.

* MyPerlModules_LASTPIECE/
    Directory containing Perl modules that have been developed exclusively for this package, either to provide input/output subroutines, to supplement the functions of main scripts, or to validate and test the main scripts.

* ANALYSES/
    Directory containing major Perl scripts, a set of tables of pre-computed PWA multiplication factors, as well as control files of Dawg (Cartwright 2005), that were used for the analyses to validate the main scripts of this package and to examine their performance.
    Methods and Results of the analyses are described in (Ezawa 2020a).
    See "README.ANALYSES.txt" in the directory for more details.

*** An accompanying tar-gzipped archive, "ExOutputs_LASTPIECE.tgz," contains some of the output files (& log files),
*** especially those briefly discussed in (Ezawa 2020a),
*** to enable you to examine more detailed features.


[ How to use the package ]

* GENERAL NOTE:
    In this package, you should run a Perl script "xxx.pl" by issuing the command "perl xxx.pl".
    (Alternatively, you could change the permission mode of each script via, e.g., "chmod u+x {script name}", and issue the command "./{script name}".
    I would not recommend it much, though...)

a) If you have not done so yet, you might want to install the molecular evolution simulator, 'Dawg' (Cartwright 2005), on your platform, especially if you want to create simulated PWAs by yourself.
   'Dawg,' its brief manual, etc. are available at the URL, http colon slash slash scit.us slash projects slash dawg (replace ' colon ' and ' slash ' with ':' and '/', respectively).

1. Extract the archive, "LASTPIECE_P.verxxx.tar.gz", via the command, "tar -xpzf LASTPIECE_P.verxxx.tar.gz" (which I suspect you've already done).
   Then, 'cd' to the top directory, "LASTPIECE_P.verxxx/", which popped out of the archive.

2. Add the absolute paths of the sub-directories, "MyPerlModules_DENSERM/" and "MyPerlModules_LASTPIECE/," to the environment variable "PERL5LIB."

   If you are using bash, add the following lines to an appropriate place in the ".bashrc" file in your home directory:

---------- lines to be added to "~/.bashrc" -----------------
# Append the two module directories if PERL5LIB is already non-empty; otherwise define it.
if [ -n "$PERL5LIB" ]; then
    export PERL5LIB=${PERL5LIB}:{the absolute path of "MyPerlModules_DENSERM"}:{the absolute path of "MyPerlModules_LASTPIECE"}
else
    export PERL5LIB={the absolute path of "MyPerlModules_DENSERM"}:{the absolute path of "MyPerlModules_LASTPIECE"}
fi
-------------- end ---------------------------

   If you are using tcsh (or csh), add the following lines to a proper place in '.tcshrc' (or '.cshrc') in your home directory:

---------- lines to be added to "~/.tcshrc" (or "~/.cshrc") -----------------
# Append the two module directories if PERL5LIB is already non-empty; otherwise define it.
if (($?PERL5LIB) && ("$PERL5LIB" !~ "")) then
    setenv PERL5LIB ${PERL5LIB}:{the absolute path of "MyPerlModules_DENSERM"}:{the absolute path of "MyPerlModules_LASTPIECE"}
else
    setenv PERL5LIB {the absolute path of "MyPerlModules_DENSERM"}:{the absolute path of "MyPerlModules_LASTPIECE"}
endif
-------------- end ---------------------------

   Alternatively, you could 'cp' the modules in "MyPerlModules_DENSERM/" and "MyPerlModules_LASTPIECE/" to a directory that is already listed in "PERL5LIB."
   (You can check the directories listed in the environment variable via, e.g., "echo $PERL5LIB".)

3. Now you can run the Perl script, "lastpiece.alpha.pl," in the subdirectory, "Main_Scripts/," to fairly accurately compute the probabilities of case-(iv) gapped segments, in which a run of insertion-type gaps and a run of deletion-type gaps adjoin each other, under a stochastic model of sequence evolution with biologically realistic insertions/deletions.
   The "lastpiece.alpha.pl" is the "master-program"; it calls other "servant-programs", each of which performs one step of the computation procedure, and compiles the results into the final form of the output.
   The final output files can be used as an input of other packages, such as ANEX (Ezawa 2020b), LOLIPOG (Ezawa 2013b, Ezawa 2016c), and ComplLimMent (Ezawa 2016a).
   For instructions on how to do that, read the file, "MANUAL.main_scripts.txt," in the subdirectory, "Main_Scripts/".

   [IMPORTANT NOTE:
    In the current version, a normal run of this program, "lastpiece.alpha.pl," requires a TREMENDOUS amount of TIME!!
    For example, with the "computational" cut-off length 150 (bases) and with 100 sub-time-intervals, it took 13 days to finish all the computations on a Mac Pro with a 3.5 GHz 6-core Intel Xeon processor.
    (By design, the current version should use only one core.)
    Therefore, BEFORE performing a FULL-fledged COMPUTATION, it would be better to TEST the program USING a SMALLER parameter SETTING (for example, with the "computational" cut-off length 20 and with 20 sub-time-intervals).
    This practice will substantially reduce your mental stress (and the burden on your computer).
   ]

4. Or, alternatively, you could roughly trace the path that we went through to validate the programs and examine their performance.
   This can be done by 'cd'ing to "ANALYSES/" and running the scripts in its sub-directories in a proper order.
   See "README.ANALYSES.txt" (in "ANALYSES/") for more detailed instructions.


[ References ]

* Cartwright RA. 2005. "DNA assembly with gap (Dawg): simulating sequence evolution." Bioinformatics 21:iii31-iii38.

* Ezawa K. 2013a. "DENSERM: DEtecting Negative SElection on Recurrent Mutations," in Bioinformatics.org [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash DENSERM" (replace ' colon ' and ' slash ' with ':' and '/', respectively)].

* Ezawa K. 2013b.
"LOLIPOG: LOg-LIkelihood for the Pattern Of Gaps in MSA," in Bioinformatics.org [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash lolipog" (replace ' colon ' and ' slash ' with ':' and '/', respectively)]. * Ezawa K. 2016a. "Characterizing multiple sequence alignment errors using complete-likelihood score and position-shift map." BMC Bioinformatics 17:133; DOI: 10.1186/s12859-016-0945-5. * Ezawa K. 2016b. "General continuous-time Markov model of sequence evolution via insertions/deletions: Are alignment probabilities factorable?" BMC Bioinformatics 17:304; DOI: 10.1186/s12859-016-1105-7. * Ezawa K. 2016c. "General continuous-time Markov model of sequence evolution via insertions/deletions: local alignment probability computation." BMC Bioinformatics 17:397; DOI: 10.1186/s12859-016-1167-6. * Ezawa K, Landan G, Graur D. 2013. "Detecting negative selection on recurrent mutations using gene genealogy." BMC Genetics. 14:37. * Ezawa K, Graur D, Landan G. 2015a. "Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part III: Algorithm for first approximation." bioRxiv doi:10.1101/023614. ## * Ezawa K, Graur D, Landan G. 2015b. "Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part IV: Incorporation of substitutions and other mutations." bioRxiv doi:10.1101/023622. * Ezawa K. 2020a. "New perturbation method to compute probabilities of mutually adjoining insertion-type and deletion-type gaps in ancestor-descendant pairwise sequence alignment under genuine sequence evolution model with realistic insertions/deletions: the 'last piece of the puzzle'." (preprint "KEZW_BI_ME00005.lastpiece.pdf" available at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/.) * Ezawa K. 2020b. 
"Alingment Neighborhood EXplorer (ANEX): First attempt to apply genuine sequence evolution model with realistic insertions/deletions to Multiple Sequence Alignment reconstruction problem." (preprint "KEZW_BI_ME00006.anex.pdf" available at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/.) * Ezawa K. 2020c. "Substitutional Residue-Difference Map (SRD Map) to help locate mis-alignments in Multiple Sequence Alignment (MSA): toward Artificial-Intelilgence-assisted probability distribution of alternative MSAs." (preprint "KEZW_BI_ME00007.srdmap.pdf" available at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/.) # K. Ezawa started writing the 1st version of this file on Jan 23rd, 2020. # ended writing the 1st version on Aug 12th (Wed), 2020. # # It was rewritten on August 13th (Thu), 2020, by K. Ezawa, to update information on (Ezawa 2020a,b,c). #