Phylogenetic and phylogenomic analyses for large datasets

Abstract: The phylogenetic tree is a principal tool for studying the evolutionary relationships among species. Computational methods for building phylogenetic trees from gene/protein sequences have been developed for decades and have come of age. Efficient approaches, including distance-based methods, maximum likelihood methods, and classical maximum parsimony methods, are now able to analyze datasets with thousands of sequences. Advanced sequencing technologies have produced a huge amount of data, including whole genomes. A number of methods have been proposed to analyze whole-genome datasets; however, numerous challenges must be addressed to translate phylogenomic inference into practice. In this paper, we analyze widely used methods for constructing large phylogenetic trees and the available methods for building phylogenomic trees from whole-genome datasets. We also give recommendations for best practices when performing phylogenetic and phylogenomic analyses. The paper will enable researchers to comprehend the state-of-the-art methods and available software for efficiently studying the evolutionary relationships among species from large datasets.


Research and Development on Information and Communication Technology
Vol. 2019, No. 2, December

[...] iteratively adds new sequences into the current tree to build a whole tree. The key idea of quartet-based methods for inserting new sequences is the following condition: if (0, 1 | 2, H) is the best quartet tree of four sequences, comprising a new sequence H and three leaves 0, 1, 2 of the current tree, then the new sequence H should not be placed on the path connecting the two leaves 0 and 1. Quartet-based methods insert new sequences into the current tree such that this condition holds for most of the available quartets.

The quartet puzzling method is the first quartet-based method to build a maximum likelihood tree [29]. The method eval- [...]

[...] an efficient stochastic search algorithm for building large maximum likelihood trees [11]. They implemented the IQ-TREE software, which combines quick hill-climbing optimizations with a stochastic perturbation method to search for maximum likelihood trees. IQ-TREE perturbs the current locally optimal trees to escape from local optima and subsequently optimizes the perturbed trees by quick hill-climbing optimizations to search for the globally optimal tree. The search is repeated several times, and the best locally optimal tree found is taken as the best tree. Experiments showed that IQ-TREE performed better than both PhyML and RAxML in the majority of the cases tested. The IQ-TREE software is user-friendly and widely used by biologists for studying the evolution of species from molecular data.

IV. PHYLOGENOMIC ANALYSIS

Analyzing whole genomes to investigate the relationships among species is a comprehensive and challenging problem. In phylogenetic analysis, a genome is separated into a list of loci, each corresponding to a gene or a region of interest in the genome. Homologous loci are aligned to create multiple sequence alignments. As a result, the input for phylogenomic analysis is a list of multiple sequence alignments. Phylogenetic tree construction approaches have been extended to build phylogenomic trees from a list of multiple sequence alignments.

The first challenge in analyzing whole genomes is the occurrence of structural variants in the genomes. Besides variants occurring inside genes, other common types of structural changes are gene insertions/deletions, gene inversions (inverting the order of genes in the genome), gene transpositions (moving a number of genes from one position in the genome to another), and inverted transpositions (combining a gene inversion and a gene transposition into one event). Table III gives an example of structural variants in genomes: there is a gene inversion between the human genome and the gorilla genome (genes G3, G4, G5, G6 in the human genome are inverted into G6, G5, G4, G3 in the gorilla genome).

TABLE III
THE STRUCTURAL VARIANTS IN THE GENOMES

          1    2    3    4    5    6    7    8    9
Human     G1   G2   G3   G4   G5   G6   G7   G8   G9
Gorilla   G1   G2   G6   G5   G4   G3   G7   G8   G9
Monkey    -    G5   G6   G7   G8   G9   G2   G3   G4
Dog       -    G5   G6   G7   G8   G9   G4   G3   G2

These structural changes result in genomes with different structures. The structural differences between genomes can be used as phylogenetic signals for studying the evolution of species, i.e., the number of structural changes needed to explain the structural difference between two genomes can be considered as the genetic distance between the genomes when evaluating their relationships. Overall, the genetic distance between two genomes is a combination of character changes inside genes and structural changes. Weighting and combining these changes properly when estimating the overall genetic distance between genomes enables us to build better phylogenetic trees using distance-based methods [31].

The second challenge when analyzing genomes is model selection. The evolutionary models, including rate models and substitution models, might vary among loci. One model cannot properly reflect the evolutionary process of all loci. Phylogenomic analyses should use model selection methods to assign a proper model to each locus. As the estimation of model parameters is not strongly affected by the tree structure, model selection methods normally include two steps: building an initial tree, and estimating model parameters based on the initial tree and the alignment. Building an initial tree can be done efficiently by distance-based methods such as the NJ method. Provided that the initial tree is reasonably close to the best tree, it is good enough for estimating model parameters. Estimating model parameters for an alignment based on the initial tree can be solved by numerical optimization methods such as Brent’s algorithm or the more efficient Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. Software like IQ-TREE provides options to automatically determine the best model for each locus when analyzing whole-genome datasets. Note that normally we do not optimize the parameters of amino acid models from each alignment, because a single alignment does not contain sufficient data to estimate such a large number of parameters.

Finally, the computational expense is a critical burden of phylogenomic inference. A genome dataset contains several dozen to thousands of genomes with lengths of up to several hundred million nucleotides. To overcome the problem, heuristic algorithms that are more efficient in terms of both running time and memory requirement should be developed to handle whole-genome datasets. As genomes are separated into loci, parallel computing is a promising approach to perform phylogenomic inference on individual alignments simultaneously. Most of the current widely used phylogenetic software, such as RAxML and IQ-TREE, provides options to conduct phylogenetic inference in parallel.

V. DISCUSSIONS

Phylogenetic inference is a core study in molecular biology. It has been an active research field for several decades and the main focus of prominent research groups. Phylogenetic reconstruction for single or several genes has perhaps come of age. The distance-based methods are able to build large, reasonable phylogenies that can be used as starting points to search for maximum parsimony or maximum likelihood trees. Nowadays, maximum likelihood methods such as RAxML, PhyML, or IQ-TREE are able to efficiently construct trees with thousands of sequences.

All popular phylogenetic tree reconstruction software is based on heuristic search methods; therefore, the results from different software, or even from different runs of the same software, might not be completely congruent. It is
especially true when analyzing datasets whose phylogenetic signals support polytomy tree structures. We recommend that researchers perform bootstrap analyses to assess the reliability of branches in the constructed tree. Although phylogenetic bootstrapping is computationally expensive, ultrafast bootstrap methods such as UFBoot2 [32] are able to build large bootstrap trees in an acceptable time.

Determining proper evolutionary models (i.e., site rate models and/or substitution models) for the datasets under study is critical in phylogenetic inference. Using wrong evolutionary models in analyzing data will lead to inaccurate results [32]. The evolutionary models are normally selected from a list of existing models, and the model parameters can be directly estimated from the input data.

The advance of sequencing technologies has produced large genome datasets consisting of dozens to thousands of genomes with lengths of up to billions of nucleotides. These large genome datasets provide an unprecedented opportunity to study the relationships among species from their whole genomes. However, new efficient computational methods should be developed for phylogenomic inference. The relationships among genomes should be analyzed at both levels: the point level (i.e., nucleotide/amino acid changes) and the structural level (gene insertions/deletions as well as gene rearrangements). Combining changes at both levels into a comprehensive evaluation is a new challenge for researchers in phylogenomic analysis. Another challenge in analyzing whole genomes is the heterogeneity of evolutionary processes between loci. Thus, determining proper evolutionary models is critical when analyzing multiple genes or whole genomes. Currently, several software packages such as IQ-TREE are able to perform phylogenomic inference for large genome datasets.

REFERENCES

[1] N. Saitou and M. Nei, “The neighbor-joining method: a new method for reconstructing phylogenetic trees,” Molecular Biology and Evolution, vol. 4, no. 4, pp. 406–425, 1987.
[2] O. Gascuel, “BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data,” Molecular Biology and Evolution, vol. 14, no. 7, pp. 685–695, 1997.
[3] L. S. Vinh and A. von Haeseler, “Shortest triplet clustering: reconstructing large phylogenies using representative sets,” BMC Bioinformatics, vol. 6, no. 1, p. 92, 2005.
[4] J. C. Wilgenbusch and D. Swofford, “Inferring evolutionary trees with PAUP*,” Current Protocols in Bioinformatics, no. 1, pp. 6–4, 2003.
[5] P. A. Goloboff, J. S. Farris, and K. C. Nixon, “TNT, a free program for phylogenetic analysis,” Cladistics, vol. 24, no. 5, pp. 774–786, 2008.
[6] D. T. Hoang, L. S. Vinh, T. Flouri, A. Stamatakis, A. von Haeseler, and B. Q. Minh, “MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation,” BMC Evolutionary Biology, vol. 18, no. 1, p. 11, 2018.
[7] A. Varón, L. S. Vinh, and W. C. Wheeler, “POY version 4: phylogenetic analysis using dynamic homologies,” Cladistics, vol. 26, no. 1, pp. 72–85, 2010.
[8] S. Guindon and O. Gascuel, “A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood,” Systematic Biology, vol. 52, no. 5, pp. 696–704, 2003.
[9] L. S. Vinh and A. von Haeseler, “IQPNNI: moving fast through tree space and stopping in time,” Molecular Biology and Evolution, vol. 21, no. 8, pp. 1565–1571, 2004.
[10] A. Stamatakis, “Using RAxML to infer phylogenies,” Current Protocols in Bioinformatics, vol. 51, no. 1, pp. 6–14, 2015.
[11] L.-T. Nguyen, H. A. Schmidt, A. von Haeseler, and B. Q. Minh, “IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies,” Molecular Biology and Evolution, vol. 32, no. 1, pp. 268–274, 2015.
[12] J. Thompson, D. Higgins, and T. Gibson, “ClustalW,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[13] R. C. Edgar, “MUSCLE: multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792–1797, 2004.
[14] J. Felsenstein, Inferring Phylogenies. Sunderland, MA, USA: Sinauer Associates, 2003.
[15] Z. Yang, “Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites,” Molecular Biology and Evolution, vol. 10, no. 6, pp. 1396–1401, 1993.
[16] S. Whelan and N. Goldman, “A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach,” Molecular Biology and Evolution, vol. 18, no. 5, pp. 691–699, 2001.
[17] S. Q. Le and O. Gascuel, “An improved general amino acid replacement matrix,” Molecular Biology and Evolution, vol. 25, no. 7, pp. 1307–1320, 2008.
[18] C. C. Dang, Q. S. Le, O. Gascuel, and V. S. Le, “FLU, an amino acid substitution model for influenza proteins,” BMC Evolutionary Biology, vol. 10, no. 1, p. 99, 2010.
[19] D. T. Jones, W. R. Taylor, and J. M. Thornton, “The rapid generation of mutation data matrices from protein sequences,” Computer Applications in the Biosciences, vol. 8, no. 3, pp. 275–282, 1992.
[20] L. L. Cavalli-Sforza and A. W. Edwards, “Phylogenetic analysis: models and estimation procedures,” American Journal of Human Genetics, vol. 19, pp. 233–257, 1967.
[21] A. Rzhetsky and M. Nei, “Theoretical foundation of the minimum-evolution method of phylogenetic inference,” Molecular Biology and Evolution, vol. 10, no. 5, pp. 1073–1095, 1993.
[22] W. H. Day and D. Sankoff, “Computational complexity of inferring phylogenies by compatibility,” Systematic Biology, vol. 35, no. 2, pp. 224–229, 1986.
[23] O. Gascuel, “BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data,” Molecular Biology and Evolution, vol. 14, no. 7, pp. 685–695, 1997.
[24] A. W. Edwards and L. L. Cavalli-Sforza, “The reconstruction of evolution,” Annals of Human Genetics, vol. 27, pp. 105–106, 1963.
[25] W. M. Fitch, “Toward defining the course of evolution: minimum change for a specific tree topology,” Systematic Biology, vol. 20, no. 4, pp. 406–416, 1971.
[26] R. Graham and L. Foulds, “Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time,” Mathematical Biosciences, vol. 60, no. 2, pp. 133–142, 1982.
[27] Z. Yang, “Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods,” Journal of Molecular Evolution, vol. 39, no. 3, pp. 306–314, 1994.
[28] B. Chor and T. Tuller, “Maximum likelihood of evolutionary trees is hard,” in Annual International Conference on Research in Computational Molecular Biology, 2005, pp. 296–310.
[29] K. Strimmer and A. von Haeseler, “Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies,” Molecular Biology and Evolution, vol. 13, no. 7, pp. 964–969, 1996.
[30] B. Q. Minh, L. S. Vinh, A. von Haeseler, and H. A. Schmidt, “PIQPNNI: parallel reconstruction of large maximum likelihood phylogenies,” Bioinformatics, vol. 21, no. 19, pp. 3794–3796, 2005.
[31] L. S. Vinh, A. Varón, and W. C. Wheeler, “Pairwise alignment with rearrangements,” Genome Informatics, vol. 17, no. 2, pp. 141–151, 2006.
[32] D. T. Hoang, O. Chernomor, A. von Haeseler, B. Q. Minh, and L. S. Vinh, “UFBoot2: improving the ultrafast bootstrap approximation,” Molecular Biology and Evolution, vol. 35, no. 2, pp. 518–522, 2018.

Le Sy Vinh obtained his PhD in Bioinformatics from Heinrich Heine University, Duesseldorf, Germany, in 2005, and subsequently held a postdoctoral fellowship at the American Museum of Natural History, NYC, from 2005 to 2008. He is currently the Dean of the Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi.

Le Sy Vinh is an expert in phylogenetic analysis and an author of widely used software such as IQPNNI, POY4, and UFBoot2. He is the group leader of many human genome projects in Vietnam, including the first Vietnamese human genome, the building of a comprehensive Vietnamese human genome database, and autism spectrum disorder in Vietnamese children.
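The search strategy attributed to IQ-TREE in the text (hill-climbing to a local optimum, stochastic perturbation, re-optimization, keep the best) can be illustrated on a toy problem. The sketch below is not IQ-TREE's algorithm: it searches a one-dimensional list of scores rather than tree space, and the names `hill_climb`, `perturbed_search`, `LANDSCAPE`, and the perturbation range are all assumptions made for illustration.

```python
import random


def hill_climb(x, score, neighbors):
    """Greedy local search: move to the best neighbor until none improves."""
    while True:
        best = max(neighbors(x), key=score)
        if score(best) <= score(x):
            return x  # x is a local optimum
        x = best


def perturbed_search(start, score, neighbors, perturb, rounds, rng):
    """Repeatedly perturb the best local optimum and re-optimize it."""
    best = hill_climb(start, score, neighbors)
    for _ in range(rounds):
        candidate = hill_climb(perturb(best, rng), score, neighbors)
        if score(candidate) > score(best):
            best = candidate  # keep the better local optimum
    return best


# Hypothetical score landscape with several local maxima (global maximum 9 at index 5).
LANDSCAPE = [1, 3, 2, 5, 4, 9, 6, 7, 3, 8, 2]


def score(i):
    return LANDSCAPE[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def perturb(i, rng):
    """Random jump of at most three positions, clamped to the valid range."""
    return min(len(LANDSCAPE) - 1, max(0, i + rng.randint(-3, 3)))


local_only = hill_climb(0, score, neighbors)
best = perturbed_search(0, score, neighbors, perturb, rounds=200, rng=random.Random(0))
print("hill-climbing only:", local_only, "with perturbation:", best)
```

Plain hill-climbing from position 0 stops at the first local maximum, whereas the perturbed search can leave it. In a real tree search the neighbors would be tree rearrangements and the score would be the tree likelihood.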
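The rearrangement types listed in Section IV (inversion, transposition, inverted transposition) can be made concrete on gene orders like those of Table III. This is a minimal sketch: the function names are hypothetical, and reading the monkey and dog rows as "G1 deleted, then block G2..G4 moved to the end" is an assumption about how those rows arose, not a claim made by the paper.

```python
def invert(order, start, end):
    """Reverse the genes at positions start..end (0-based, inclusive)."""
    return order[:start] + order[start:end + 1][::-1] + order[end + 1:]


def transpose(order, start, end, dest):
    """Cut the block start..end and reinsert it at index dest of the remainder."""
    block = order[start:end + 1]
    rest = order[:start] + order[end + 1:]
    return rest[:dest] + block + rest[dest:]


def inverted_transpose(order, start, end, dest):
    """Invert the block start..end in place, then transpose it to dest."""
    return transpose(invert(order, start, end), start, end, dest)


human = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9"]

# Inversion of G3..G6, as described in the text for human vs. gorilla.
gorilla = invert(human, 2, 5)

# Assumed scenario for the rows that lack G1: delete G1, then move G2..G4 to the end.
without_g1 = human[1:]
monkey = transpose(without_g1, 0, 2, 5)
dog = inverted_transpose(without_g1, 0, 2, 5)

print(gorilla)
print(monkey)
print(dog)
```

Counting how many such operations are needed to turn one gene order into another is one way to obtain the structural part of the genetic distance that the text describes.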

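Section IV notes that, once a tree is fixed, model parameters can be estimated by numerical optimization such as Brent's algorithm. The sketch below shows the idea in its smallest form, under several assumptions: the "tree" is a single branch between two sequences, the substitution model is Jukes-Cantor (JC69), and a plain golden-section search stands in for Brent's method (which adds parabolic interpolation on top of golden-section steps). All function names are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d under JC69 for two aligned sequences."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def golden_section_max(f, lo, hi, tol=1e-9):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


# Illustrative data: 100 aligned sites, 10 of which differ.
n_sites, n_diff = 100, 10
d_hat = golden_section_max(
    lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-6, 5.0)

# Closed-form JC69 estimate, used only to check the numerical optimum.
d_closed = -0.75 * math.log(1.0 - 4.0 * (n_diff / n_sites) / 3.0)
print(d_hat, d_closed)
```

For JC69 the optimum has a closed form, which makes the example easy to check. Richer models have no closed form, which is why iterative optimizers such as Brent's method or BFGS are used in practice.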

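Because the input to a phylogenomic analysis is a list of per-locus alignments, the per-locus work can be dispatched to parallel workers, as Section IV suggests. The following is a rough sketch only: `analyze_locus` is a hypothetical placeholder that computes pairwise p-distances, the toy alignments are invented, and a thread pool is used for simplicity where a real pipeline would more likely use processes or the parallel options of the phylogenetic software itself.

```python
from concurrent.futures import ThreadPoolExecutor


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def analyze_locus(alignment):
    """Placeholder per-locus analysis: all pairwise p-distances of one alignment."""
    names = sorted(alignment)
    return {(s, t): p_distance(alignment[s], alignment[t])
            for i, s in enumerate(names) for t in names[i + 1:]}


def analyze_genome(loci, workers=4):
    """Run the per-locus analysis on every alignment concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with locus names.
        return dict(zip(loci, pool.map(analyze_locus, loci.values())))


# Invented toy alignments, one dictionary per locus.
loci = {
    "locus1": {"human": "ACGTACGT", "gorilla": "ACGTACGA", "dog": "ACCTACGA"},
    "locus2": {"human": "TTGACA", "gorilla": "TTGACA", "dog": "TAGA-A"},
}
results = analyze_genome(loci)
for locus, distances in results.items():
    print(locus, distances)
```

The loci are independent of one another at this stage, so no coordination between workers is needed until the per-locus results are combined.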