###################################################################################
#
# ANEX_P (Alignment Neighborhood Explorer, type P):
# A package including a program to construct approximate probability distributions of alternative Multiple Sequence Alignments (MSAs)
# by exploring the neighborhoods of an input MSA;
# the MSA probabilities are computed under a given genuine sequence evolution model with realistic insertions/deletions.
# (Written almost exclusively in Perl.)
#
# Version 0.3: Copyright (C) 2015 Kiyoshi Ezawa
# Version 0.5: Copyright (C) 2019 Kiyoshi Ezawa
# Version 0.6: Copyright (C) 2020 Kiyoshi Ezawa
# Version 0.7: Copyright (C) 2020 Kiyoshi Ezawa
# Version 0.7.1: Copyright (C) 2020 Kiyoshi Ezawa
#
# This package is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This package is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License, "GNU_GPL.txt",
# along with this package. If not, see .
#
# The author can be contacted by e-mailing to
# (replace " dot " and " at " with "." and "@", respectively).
#
###################################################################################
#
[ * Major modifications from version 0.7 to version 0.7.1. ]
* Information on the references, Ezawa 2020a,b,c, was updated.
[ * Major modifications from version 0.6 to version 0.7. ]
* Now, the main scripts, "anex.ver0.7.pl," "anex_for_sgl_wd.ver0.7.pl" and "anex_rev_purge_for_sgl_wd.ver0.7.pl,"
can take FULL advantage of the outputs of "LASTPIECE(\_P)" (Ezawa 2020a),
which quite accurately pre-computes the multiplication factors contributed from gapped segments in ancestor-descendant PWAs
under a given genuine sequence evolution model.
This must greatly enhance the accuracy of MSA probabilities computed by these main scripts.
* The supplementary scripts have been moved to a new sub-directory, "Supplementary_Scripts/."
*** This version 0.7 is the first version that has been released to the public.
[ * Major modifications from version 0.5 to version 0.6. ]
In version 0.6, the main scripts, "anex_for_sgl_wd.ver0.5.pl" and "anex_rev_purge_for_sgl_wd.ver0.5.pl," are equipped with
a function to find the maximum value among the probabilities of MSAs explored by multiple-"shift"s,
and a function to compute the total probability summed over MSAs explored by multiple-"shift"s.
In version 0.5, the script, "detect_purge_cands.ver0.5.pl," used parsimonized Dollo parsimonious indel histories
when deciding which branches (and sites) should be included in the window analysis;
In version 0.6, it uses RAW Dollo parsimonious indel histories when making this decision.
From version 0.6, the package also contains some supplementary scripts, namely,
"ingr_correl_alnshear_cb_vs_srd.ver0.6.pl," "ingr_stat_srds_in_correct_segs.ver0.6.pl,"
"map_aln_shear.ver0.6.pl," "map_aln_shear_clm_based.ver0.6.pl," "map_alnshear_and_srd.ver0.6.pl,"
"map_alnshear_cb_and_srd.ver0.6.pl," and "srd_map.ver0.6.pl,"
which enable analyses using "substitutional residue-difference map"s (or "SRD Map"s for short)
and/or "Alignment-Shear Map"s (see Ezawa 2020c).
[ * Version 0.3 of the package was still "incomplete,"
in the sense that it lacked the functions to "explore" the neighborhoods of an input MSA;
It is version 0.5 that became equipped with such functions for the first time;
In this sense, version 0.5 could be regarded as the first "completed" version. ]
-------------------------[CAVEAT1]------------------------------------------
#The current version was designed mainly to analyze (almost selectively neutral) DNA MSAs,
#and performance tests were conducted only on simulated DNA MSAs generated by DAWG (Cartwright 2005).
#
#Although the package is in principle applicable to real-life MSAs of DNA or protein sequences,
#caution should be exercised when doing such analyses,
#because a real-life MSA in general is a product of more complex evolutionary processes (some examples are discussed in Ezawa, Graur and Landan (2015b)).
# (NOTE: Because this program package, "ANEX(_P)," aims to "correct" MSA errors,
# the second reason cited in [CAVEAT1] of "FA_LOLIPOG(_P)" and of "ComplLiMment(_P)" does NOT apply here.)
-------------------------------------------------------------------------------
-------------------------[CAVEAT2]------------------------------------------
Thus far, I have confirmed that the package works on the terminal of my Mac OS X platforms
(10.11 and 10.13 on Intel Mac, each with an SSD storage).
And I suspect that it should work on other UNIX-type platforms (including Linux) as well (although I am not completely sure).
On the other hand, I am almost sure that it won't work on Windows,
because the modules and scripts heavily depend on Unix commands.
So, for an exclusive Windows-user to use the package,
he/she will have to learn how to use Unix (and maybe to install a Unix emulator).
I hope that somebody with goodwill will adapt the package to non-Unix platforms...
-------------------------------------------------------------------------------
When you read this "README.txt" file, it is most likely that I am no longer in this world.
Therefore, if you find any problems that are not solvable by yourself (and to which the above CAVEATS do NOT apply),
please consult nearby experts on IT (information technology) and/or molecular evolution (especially those familiar with the problems on insertions/deletions).
[ Directory (i.e. folder) structure ]
"ANEX_P.ver0.7/" is the "root directory" of the package (ver 0.7).
It contains the following sub-directories and files:
* README.txt
File you are now reading.
* DISCLAIMER.txt
Disclaimer of warranty and limitation of liability extracted from the GNU General Public License, version 3.
* GNU_GPL.txt
Copy of the GNU General Public License, version 3.
* Main_Scripts/
Directory containing MAIN Perl scripts, which are the core of this program package.
Briefly, the master script, "anex.ver0.7.pl," creates windows and calls satellite scripts to perform actual analyses (by exploring MSA neighborhoods and computing the probabilities of MSAs visited);
the satellite script, "anex_for_sgl_wd.ver0.7.pl," performs an ordinary window analysis;
the satellite script, "anex_rev_purge_for_sgl_wd.ver0.7.pl," performs a PCC ("purge-like error candidate"-containing) window analysis;
and the satellite script, "detect_purge_cands.ver0.5.pl," detects "purge-like error candidate" regions in the input MSA.
The file, "MANUAL.main_scripts.txt," describes how to use these scripts.
+ Main_Scripts/Ingredients/
Sub-directory containing some Perl scripts used as "ingredients" of the main scripts, as well as of the supplementary scripts.
Many of these "ingredients" are scripts used for testing individual subroutines (in the Perl modules in the directories described below).
* Supplementary_Scripts/
Directory containing Supplementary Perl scripts, which provide additional convenience to this package.
Especially, the scripts, "coordinate_point_lcl_msa.ver0.7.pl" and "test_cmpt_log_total_qnt_in_multidim_storage.pl,"
literally supplement the main scripts (described in (Ezawa 2020b));
The script, "preprocess_msa_dawg.alpha.pl," pre-processes MSAs to facilitate the identification of homology structures (Lunter et al. 2005);
and the script, "InData/calc_log_mfacs_pars_pwa_dawg.pl," pre-computes the multiplication factors (of ancestor-descendant PWAs) via a parsimony-based approximation;
The other scripts enable us to conduct analyses via "substitutional residue-difference map" (or "SRD Map" for short)
and/or "Alignment-Shear Map" (described in (Ezawa 2020c)).
The file, "MANUAL.sppl_scripts.txt," describes how to use these scripts.
* MyPerlModules_ANEX/
Directory containing Perl modules that are newly created for this package. (Some of them are described in (Ezawa 2020b).)
* MyPerlModules_ComplLiMment/
Directory containing Perl modules that are already available as a part of the ComplLiMment(_P) package (Ezawa 2016a).
Some of them are necessary for this ANEX(_P) package as well.
* MyPerlModules_DENSERM/
Directory containing Perl modules that are already available as a part of the DENSERM(_P) package (Ezawa, Landan and Graur 2013).
Some of them are necessary for this ANEX(_P) package as well.
* MyPerlModules_LOLIPOG/
Directory containing Perl modules that are already available as a part of the (FA_)LOLIPOG(_P) package (Ezawa, Graur and Landan 2015a).
Some of them are necessary for this ANEX(_P) package as well.
(Some of the algorithms are described briefly in (Ezawa 2016c) and in detail in (Ezawa, Graur and Landan, 2015a).)
[NOTE: Some newly created modules and modified modules have not yet been incorporated in (FA_)LOLIPOG(_P).]
Actually, the following are essential to this package:
+ "MyPerlModules_LOLIPOG/MyTreeMap_indels_spt_odr.pm,"
& "MyPerlModules/MyTreeMap_indels_spt_odr_hs.pm,"
which are essential for our sub-algorithm that attempts to enumerate all possible parsimonious effective-insertion/deletion histories each of which results in the homology structure of a gapped segment in a given MSA.
+ "MyPerlModules_LOLIPOG/MyTreeMap_indels_ML_hs.pm,"
& "MyPerlModules/MyTreeMap_indels_ML_hs_hs.pm,"
& "MyPerlModules/MyTreeMap_indels_ML_hs_hs_wLP.pm,"
which are essential for calculating the logarithmic probability (i.e., log-likelihood) of the homology structure.
Especially, the last one (i.e., "..._wLP.pm") can take full advantage of the multiplication factors
(of gapped segments in ancestor-descendant PWAs) pre-computed by LASTPIECE(_P) (Ezawa 2020a).
+ "MyPerlModules_LOLIPOG/MyReadAlnProbIngredients.pm,"
which is essential for inputting and re-using multiplication factors (of gapped segments in ancestor-descendant PWAs) that have been pre-computed.
* ANALYSES/
Directory containing major Perl scripts and input files (including a set of tables of pre-computed PWA multiplication factors, as well as control files of Dawg (Cartwright 2005)) that were used for the analyses to validate the main and supplementary scripts of ANEX(_P) and to examine their performances.
Methods and Results of the analyses are described in (Ezawa 2020b) and in (Ezawa 2020c).
See "README.ANALYSES.txt" in the directory for more details.
*** An accompanying tar-gzipped archive, "ExOutputs_ANEX.tgz," contains some of the output files (& log files),
*** as well as Excel spreadsheets that summarize the results.
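After extraction, the directory structure listed above can be checked with a small shell function. This is only a sketch: the function name "check_anex_layout" is hypothetical and NOT part of the package; the file and directory names are those listed above.

```shell
# check_anex_layout: hypothetical helper (NOT part of the ANEX_P package).
# Reports every entry from the directory listing in this README that is
# missing under the given package root.
# Returns 0 if everything is present, and 1 otherwise.
check_anex_layout() {
    root="${1:-.}"
    missing=0
    for f in README.txt DISCLAIMER.txt GNU_GPL.txt; do
        if [ ! -f "$root/$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done
    for d in Main_Scripts Main_Scripts/Ingredients Supplementary_Scripts \
             MyPerlModules_ANEX MyPerlModules_ComplLiMment \
             MyPerlModules_DENSERM MyPerlModules_LOLIPOG ANALYSES; do
        if [ ! -d "$root/$d" ]; then
            echo "missing directory: $d/"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, 'check_anex_layout {the absolute path of "ANEX_P.ver0.7"} && echo "layout OK"' prints "layout OK" only when nothing is missing.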
### RESTART FROM HERE ###
[ How to use the package ]
* GENERAL NOTE:
In this package,
you should run a Perl script "xxx.pl" by issuing a command "perl xxx.pl."
(Alternatively, you could change the permission mode of each script via, e.g., "chmod u+x {script name}", and issue a command "./{script name}". I would not recommend it much, though...)
a) If not yet, you might also want to install the molecular evolution simulator, 'Dawg' (Cartwright 2005) into your platform, especially if you want to create simulated MSAs by yourself.
'Dawg,' its brief manual, etc. are available at the URL, http colon slash slash scit.us slash projects slash dawg (replace ' colon ' and ' slash ' with ':' and '/', respectively).
b) If not yet, you might also want to install the two multiple sequence aligners, 'MAFFT' (Katoh and Toh 2008) and 'PRANK' (Loytynoja and Goldman 2008), into your platform, especially if you want to trace the path we took to validate the program and examine its performance (Ezawa and Yada, planned).
'MAFFT,' its manual, etc. are available at the URL, http colon slash slash mafft.cbrc.jp slash alignment slash software slash (replace ' colon ' and ' slash ' with ':' and '/', respectively).
'PRANK,' its manual, etc. are available at the URL, http colon slash slash wasabiapp.org slash software slash prank slash (replace ' colon ' and ' slash ' with ':' and '/', respectively).
1. Extract the archive, "ANEX_P.verxxx.tgz", via the command,
"tar -xpzf ANEX_P.verxxx.tgz"
(which I suspect you've already done).
Then, 'cd' to the top directory, "ANEX_P.verxxx/", which popped out of the archive.
2. Add the absolute paths of the sub-directories "MyPerlModules_ANEX/," "MyPerlModules_ComplLiMment/," "MyPerlModules_DENSERM/" and "MyPerlModules_LOLIPOG/" to the environment variable "PERL5LIB."
If you are using bash, add the following lines to an appropriate place in the ".bashrc" file in your home directory:
---------- lines to be added to "~/.bashrc" -----------------
if [ -n "$PERL5LIB" ]; then
export PERL5LIB=${PERL5LIB}:{the absolute path of "MyPerlModules_ANEX"}:{the absolute path of "MyPerlModules_ComplLiMment"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
else
export PERL5LIB={the absolute path of "MyPerlModules_ANEX"}:{the absolute path of "MyPerlModules_ComplLiMment"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
fi
-------------- end ---------------------------
If you are using tcsh (or csh), add the following lines to a proper place in '.tcshrc' (or '.cshrc') in your home directory:
---------- lines to be added to "~/.tcshrc" (or "~/.cshrc") -----------------
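# Test existence first: csh/tcsh expands "$PERL5LIB" before it evaluates "&&", so a combined test fails when the variable is undefined.
if ($?PERL5LIB) then
  if ("$PERL5LIB" != "") then
    setenv PERL5LIB ${PERL5LIB}:{the absolute path of "MyPerlModules_ANEX"}:{the absolute path of "MyPerlModules_ComplLiMment"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
  else
    setenv PERL5LIB {the absolute path of "MyPerlModules_ANEX"}:{the absolute path of "MyPerlModules_ComplLiMment"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
  endif
else
  setenv PERL5LIB {the absolute path of "MyPerlModules_ANEX"}:{the absolute path of "MyPerlModules_ComplLiMment"}:{the absolute path of "MyPerlModules_LOLIPOG"}:{the absolute path of "MyPerlModules_DENSERM"}
endif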
-------------- end ---------------------------
Alternatively, you could 'cp' the modules in "MyPerlModules_ANEX/," "MyPerlModules_ComplLiMment/," "MyPerlModules_LOLIPOG/" and "MyPerlModules_DENSERM/" to a directory that is already listed in "PERL5LIB."
(You can confirm the directories listed in the environment variable via, e.g., "echo $PERL5LIB.")
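For bash users who prefer not to type the four absolute paths by hand, the value of "PERL5LIB" could also be built from the package's root directory. The following is a minimal sketch, assuming bash (or another POSIX-type shell); the function name "anex_perl5lib" is hypothetical and NOT part of the package, and only the four module directory names come from this package.

```shell
# anex_perl5lib: hypothetical helper (NOT part of the ANEX_P package).
# Prints a PERL5LIB value in which the four module directories under the
# given package root are appended to the current PERL5LIB (if it is set and
# non-empty). Directories that are already listed are not added twice.
anex_perl5lib() {
    root="$1"
    new="${PERL5LIB:-}"
    for d in MyPerlModules_ANEX MyPerlModules_ComplLiMment \
             MyPerlModules_LOLIPOG MyPerlModules_DENSERM; do
        case ":$new:" in
            *":$root/$d:"*) ;;                   # already listed; skip it
            *) new="${new:+$new:}$root/$d" ;;    # append, adding ":" only if needed
        esac
    done
    printf '%s\n' "$new"
}
```

It could be used as 'export PERL5LIB="$(anex_perl5lib {the absolute path of "ANEX_P.verxxx"})"'. Afterwards, the command, perl -e 'print "$_\n" for @INC', lists the directories that Perl will actually search, so you can confirm that the four module directories appear there.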
3. Now you can run the Perl script, "anex.verxxx.pl," in the subdirectory, "Main_Scripts/,"
to construct probability distributions of alternative MSAs by exploring the neighborhoods of an input MSA.
The MSA probabilities are computed under a genuine sequence evolution model with realistic indels.
## [Depending on the situation, you will also have to run one or both of the auxiliary scripts,
## "preprocess_msa_dawg.pl" and "Sample_Scripts/InData/calc_log_mfacs_pars_pwa_dawg.pl."] (... Necessary???) ... Needs modification.
For instructions on how to do that, read the file, "MANUAL.main_scripts.txt," in the subdirectory.
4. Or, alternatively, you could roughly trace the path that we took to validate the main and/or supplementary scripts
and to examine their performances.
This can be done by 'cd'ing to "ANALYSES/" and running the scripts in its sub-directories in a proper order.
See "README.ANALYSES.txt" (in "ANALYSES/") for more detailed instructions.
5. If you prefer, you could run some of the supplementary scripts in the subdirectory, "Supplementary_Scripts/,"
to perform the analyses you want, either on the input MSA or on the output of the main script, "anex.verxxx.pl."
To do this, issue a command "perl {the/path/to/the/supplementary/script/you/want} [command-line-arguments-required]".
[Or, you could change the modes of the scripts (to make them executable by yourself) and append the path to the
directory, "Supplementary_Scripts/," to the environment variable, PATH.
Then, the scripts can be run by merely invoking their names.]
See "MANUAL.sppl_scripts.txt" (in "Supplementary_Scripts/") for more detailed instructions.
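The bracketed alternative in step 5 (making the scripts executable and extending PATH) might look like the following in bash. This is a sketch under assumptions: the function name "anex_enable_scripts" is hypothetical and NOT part of the package, and it presumes that each script begins with a "#!" line pointing at your Perl interpreter, which you should confirm before relying on it.

```shell
# anex_enable_scripts: hypothetical helper (NOT part of the ANEX_P package).
# Makes every "*.pl" file directly under the given directory executable by
# its owner, then prints a PATH value with that directory appended (once).
anex_enable_scripts() {
    dir="$1"
    for s in "$dir"/*.pl; do
        # The test also guards against an unmatched "*.pl" pattern.
        if [ -f "$s" ]; then
            chmod u+x "$s"
        fi
    done
    case ":$PATH:" in
        *":$dir:"*) printf '%s\n' "$PATH" ;;         # already on PATH
        *)          printf '%s\n' "$PATH:$dir" ;;    # append the directory
    esac
}
```

It could be used as 'export PATH="$(anex_enable_scripts {the absolute path of "Supplementary_Scripts"})"'. If a script lacks a suitable "#!" line, keep running it via "perl {script name}" instead.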
[ References ]
* Cartwright RA. 2005. "DNA assembly with gap (Dawg): simulating sequence evolution." Bioinformatics 21:iii31-iii38.
* Ezawa K. 2013a. "DENSERM: DEtecting Negative SElection on Recurrent Mutations," in Bioinformatics.org [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash DENSERM" (replace ' colon ' and ' slash ' with ':' and '/', respectively)].
* Ezawa K. 2013b. "LOLIPOG: LOg-LIkelihood for the Pattern Of Gaps in MSA," in Bioinformatics.org [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash lolipog" (replace ' colon ' and ' slash ' with ':' and '/', respectively)].
* Ezawa K. 2016a. "Characterizing multiple sequence alignment errors using complete-likelihood score and position-shift map." BMC Bioinformatics 17:133; DOI: 10.1186/s12859-016-0945-5.
* Ezawa K. 2016b. "General continuous-time Markov model of sequence evolution via insertions/deletions: Are alignment probabilities factorable?" BMC Bioinformatics 17:304; DOI: 10.1186/s12859-016-1105-7.
* Ezawa K. 2016c. "General continuous-time Markov model of sequence evolution via insertions/deletions: local alignment probability computation." BMC Bioinformatics 17:397; DOI: 10.1186/s12859-016-1167-6.
* Ezawa K, Landan G, Graur D. 2013. "Detecting negative selection on recurrent mutations using gene genealogy." BMC Genetics. 14:37.
* Ezawa K, Graur D, Landan G. 2015a. "Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part III: Algorithm for first approximation." bioRxiv doi:10.1101/023614.
## * Ezawa K, Graur D, Landan G. 2015b. "Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part IV: Incorporation of substitutions and other mutations." bioRxiv doi:10.1101/023622.
* Ezawa K. 2020a. "New perturbation method to compute probabilities of mutually adjoining insertion-type and deletion-type gaps in ancestor-descendant pairwise sequence alignment under genuine sequence evolution model with realistic insertions/deletions: the 'last piece of the puzzle'." (preprint "KEZW_BI_ME00005.lastpiece.pdf" available at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/.)
* Ezawa K. 2020b. "Alignment Neighborhood EXplorer (ANEX): First attempt to apply genuine sequence evolution model with realistic insertions/deletions to Multiple Sequence Alignment reconstruction problem." (preprint "KEZW_BI_ME00006.anex.pdf" available at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/.)
* Ezawa K. 2020c. "Substitutional Residue-Difference Map (SRD Map) to help locate mis-alignments in Multiple Sequence Alignment (MSA): toward Artificial-Intelligence-assisted probability distribution of alternative MSAs." (preprint "KEZW_BI_ME00007.srdmap.pdf" available at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/.)
## * Ezawa K, Yada T. (planned). "Alignment Neighborhood EXplorer (ANEX): A program to 'proofread' an input MSA taking account of stochasticity, evolutionary consistency and biological realism."
## * Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. 2010. "New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0." Syst Biol. 59:307-321.
* Katoh K, Toh H. 2008. "Recent developments in the MAFFT multiple sequence alignment program." Brief Bioinformatics. 9:286-298.
* Loytynoja A, Goldman N. 2008. "Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis." Science 320:1632-1635.
* Lunter G, Miklos I, Drummond A, Jensen JL, Hein J. 2005. "Bayesian coestimation of phylogeny and sequence alignment." BMC Bioinformatics 6:83.
* Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010. "Detection of nonneutral substitution rates on mammalian phylogenies." Genome Res. 20:110-121.
# First version of this file was created from August 9th (Sun) to 11th (Tue), 2020 by K. Ezawa.
#
# On August 13 (Thu), 2020, it was rewritten by K. Ezawa, to update information on (Ezawa 2020a,b,c).
#