Showing posts with label Introgression. Show all posts
Monday, October 8, 2018
A proper network of Europeans
Back in May this year, Iosif Lazaridis submitted a paper to the arXiv, called: "The evolutionary history of human populations in Europe". It is now online as part of the December 2018 issue of Current Opinion in Genetics & Development (53: 21-27).
Its interest for readers of this blog is the one and only figure that the paper contains. It is a genealogical network, showing the obvious — that the human "family tree" has quite a few reticulations, mostly due to introgression (or admixture, as human geneticists like to call it). Here is the figure, along with the legend. Note that not all of the edges in the network have a direction, so that it is not really a directed acyclic graph (see also First-degree relationships and partly directed networks).
A sketch of European evolutionary history based on ancient DNA
Bronze Age Europeans (~4.5-3kya) were a mixture of mainly two proximate sources of ancestry: (i) the Neolithic farmers of ~8-5kya who were themselves variable mixtures of farmers from Anatolia and hunter-gatherers of mainland Europe (WHG), and (ii) Bronze Age steppe migrants of ~5kya who were themselves a mixture of hunter-gatherers of eastern Europe (EHG) and southern populations from the Near East. Thus, we only have to go ~8 thousand years backwards in time to find at least four sources of ancestry for Europeans. But, each of these sources was also admixed: European hunter-gatherers received genetic input from Siberia and ultimately also from archaic Eurasians, and Near Eastern populations interacted in unknown ways with Europe and Siberia and also had ancestry from ‘Basal Eurasians’, a sister group of the main lineage of all other non-African populations. Dates correspond to sampled populations; in the case of a cluster of populations (such as the WHG), they correspond to the earliest attestation of the group.
Labels:
Genealogy,
Introgression,
Neanderthal
Tuesday, August 15, 2017
Is reticulation as important in rice as in wheat?
I have previously discussed the use of phylogenetic networks to study the Complex hybridizations in wheat, due to the very reticulate evolutionary history. It seems that the situation for the other major world food source, rice, also requires network analysis, although this time introgression is the biological source of reticulation, rather than hybridization.
Jae Young Choi, Adrian E. Platts, Dorian Q. Fuller, Yue-Ie Hsing, Rod A. Wing, and Michael D. Purugganan (2017) The rice paradox: multiple origins but single domestication in Asian rice. Molecular Biology & Evolution 34: 969-979.
The authors note:
The Asian rice Oryza sativa is the world’s most important food crop, and is a staple for more than one-third of the world’s population. Oryza sativa is genetically differentiated into several groups, the main ones being japonica and indica, which have been considered as subspecies / subpopulations with distinct morphological and physiological characteristics
The origin of domesticated Asian rice has been a contentious topic, with conflicting evidence for either single or multiple domestication of this key crop species. We examined the evolutionary history of domesticated rice by analyzing de novo assembled genomes from domesticated rice and its wild progenitors. Our results indicate multiple origins, where each domesticated rice subpopulation (japonica, indica, and aus) arose separately from progenitor O. rufipogon and / or O. nivara.
We also show that there is significant gene flow from japonica to both indica (c. 17%) and aus (c. 15%), which led to the transfer of domestication alleles from early-domesticated japonica to proto-indica and proto-aus populations. Our results provide support for a model in which different rice subspecies had separate origins, but that de novo domestication occurred only once, in O. sativa ssp. japonica, and introgressive hybridization from early japonica to proto-indica and proto-aus led to domesticated indica and aus rice.
Similar reticulation histories have, of course, been reported for most domesticated organisms (see Are phylogenetic trees useful for domesticated organisms?), including dogs, cattle, horses, sheep, grapes, etc.
Labels:
Introgression
Tuesday, June 20, 2017
Cichlids, species and trees
Lake Malawi, in south-eastern Africa, is famous for its large diversity of cichlid fishes. Indeed, it sometimes seems to have more biologists studying these fish than there are actual fish in the lake, even though there are allegedly hundreds of cichlid fish species in that lake. In this sense, it is somewhat similar to Lake Baikal, in southern Siberia, home to the sole species of freshwater seals.
The cichlid biologists are interested in describing the extensive fish diversity, pondering its origin, and thus its contribution to the study of speciation. After all, we are talking about what is usually claimed to be "the most extensive recent vertebrate adaptive radiation". So, we are talking here as much about population genetics as we are about ichthyology.
Inevitably, the genome biologists have been spotted in the vicinity of the lake; and we now have a preliminary report from them:
Milan Malinsky, Hannes Svardal, Alexandra M. Tyers, Eric A. Miska, Martin J. Genner, George F. Turner, Richard Durbin (2017) Whole genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. BioRxiv 143859.
These authors summarize the situation like this:
We characterize [the] genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times.
The last sentence seems to be somewhat disingenuous. How could a single tree be expected to describe this scale of biodiversity? Any rapid radiation of diversity is unlikely to be completely tree-like. The increase in diversity can be modeled as a tree, sure, but it is very unlikely that there will be instant separation of the taxa, and so the tree model will be ignoring a large part of the evolutionary action. There will, for example, be ongoing introgression between the diverging taxa, as well as hybridization due to incomplete breeding barriers. These avenues for gene flow can best be modeled as a network, not a tree.
The issue here is that the authors write the paper solely from the perspective of an expected phylogenetic tree, and then feel compelled to explain why they do not produce such a tree. Indeed, the authors present their paper as a study of "violations of the species tree concept".
For data analysis, they proceed as follows:
To obtain a first estimate of between-species relationships we divided the genome into 2543 non-overlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for each window, obtaining trees with 2542 different topologies.
So, only two sequence blocks produced the same tree, presumably by random chance. An example "tree" for 12 OTUs is shown in the diagram. It superimposes a possible mitochondrial tree on a summary of the "genome tree".
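The bookkeeping behind a statement like "2543 windows, 2542 topologies" can be sketched in a few lines. This is only an illustration, not the authors' pipeline: it assumes that each window tree is rooted and has already been parsed into nested tuples of taxon labels, and the function names and the toy input are hypothetical.

```python
from collections import Counter

def canonical(tree):
    """Return an order-independent form of a rooted tree.

    Assumption: a tree is either a leaf label (str) or a tuple of subtrees.
    Branch lengths are ignored, so only the topology is compared.
    """
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)

def topology_counts(window_trees):
    """Count how many windows support each distinct rooted topology."""
    return Counter(canonical(tree) for tree in window_trees)

# Hypothetical toy input: four windows, four taxa.
window_trees = [
    (("A", "B"), ("C", "D")),
    (("B", "A"), ("D", "C")),   # same topology, children written in another order
    ((("A", "C"), "B"), "D"),
    ((("A", "B"), "C"), "D"),
]

counts = topology_counts(window_trees)
print(len(window_trees), "windows,", len(counts), "distinct topologies")
```

With real data, the number of distinct keys in `counts` is the number of topologies, and the largest values show whether any one tree is repeated across the genome.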
The authors continue:
The fact that we are using over 25 million variable sites suggests these differences are not due to sampling noise, but reflect conflicting biological signals in the data. For example, gene flow after the initial separation of species can distort the overall phylogeny and lead to intermediate placement of admixed taxa in the tree topology.
Note that gene flow is seen to "distort" the phylogeny rather than being an integral part of it. In this case, "phylogeny" apparently refers solely to the diversification part of evolutionary history, rather than to the whole history.
The ultimate questions from this paper are: "what is a species concept?", and "what is a species tree?". The authors write a lot about species and trees, and yet their data provide very clear evidence that both "species" and "tree" are very restrictive concepts for studying the cichlids of Lake Malawi.
Coincidentally, another recent paper tackles the same problems:
Britta S. Meyer, Michael Matschiner, Walter Salzburger (2017) Disentangling incomplete lineage sorting and introgression to refine species-tree estimates for Lake Tanganyika cichlid fishes. Systematic Biology 66: 531-550.
The authors describe their work, on the same fish group but in a lake further north-west, as follows:
Because of the rapid lineage formation in these groups, and occasional gene flow between the participating species, it is often difficult to reconstruct the phylogenetic history of species that underwent an adaptive radiation. In this study, we present a novel approach for species-tree estimation in rapidly diversifying lineages, where introgression is known to occur, and apply it to a multimarker data set containing up to 16 specimens per species for a set of 45 species of East African cichlid fishes (522 individuals in total), with a main focus on the cichlid species flock of Lake Tanganyika. We first identified, using age distributions of most recent common ancestors in individual gene trees, those lineages in our data set that show strong signatures of past introgression ... We then applied the multispecies coalescent model to estimate the species tree of Lake Tanganyika cichlids, but excluded the lineages involved in these introgression events, as the multispecies coalescent model does not incorporate introgression. This resulted in a robust species tree.
Once again, phylogeny = species tree.
Labels:
Gene flow,
Hybridization network,
Introgression
Tuesday, June 13, 2017
Bayesian inference of phylogenetic networks
Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.
The earliest work on this topic seems to be the thesis of:
Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
Apparently, the only part of this work to be published has been:
Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.
More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.
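To make the ILS half of this concrete, here is a minimal sketch of the standard three-species result under the multispecies coalescent, with one sampled lineage per species. It is not taken from any of the papers listed here, and the function name is hypothetical. For a species tree ((A,B),C) whose internal branch has length t in coalescent units, each of the two discordant gene-tree topologies has probability exp(-t)/3.

```python
import math

def three_taxon_gene_tree_probs(t):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units
    (generations divided by twice the effective population size).
    Only ILS is modeled; there is no gene flow in this sketch.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    p_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_discordant,  # matches the species tree
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }

for t in (0.1, 1.0, 3.0):
    probs = three_taxon_gene_tree_probs(t)
    print(t, {topology: round(p, 3) for topology, p in probs.items()})
```

The point of interest for networks is the symmetry: under ILS alone the two discordant topologies are expected to be equally frequent, so a marked excess of one of them is the kind of signal that gets attributed to introgression or hybridization.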
The first of these publications was:
Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.
In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.
Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
This method has also been implemented in PhyloNet.
Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.
Tuesday, March 14, 2017
Detecting introgression versus hybridization
There has been considerable interest in recent years in developing methods that will detect hybridization in the presence of incomplete lineage sorting (ILS), which will allow the construction of a realistic hybridization network. Clearly, both ILS and hybridization create conflicting gene trees, which will lead to a very complex data-display network. However, if the ILS signals in the data can be used to construct a small collection of gene-tree groups, in which the gene trees within each group are congruent with a single species tree (under the ILS model), then the incongruence between groups can be used to construct a hybridization network. This network will then be an hypothesis for a realistic evolutionary network.
Recently, a paper has appeared that uses simulations to evaluate several of these methods:
Olga K. Kamneva and Noah A. Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 2017:13.
I am not a great fan of simulations, because they exist under very restricted and usually unrealistic mathematical conditions. They are, however, useful for exploring the mathematical properties of various methods, even if they are hard to connect to the biological properties.
My interpretations of the results from the particular scenarios explored by Kamneva and Rosenberg are:
- Most of the methods improve as the internal network edges increase in length.
- Most of the methods improve as the number of gene trees increases.
- Under good conditions the maximum-likelihood methods do better than the parsimony and consensus methods.
- The maximum-likelihood methods are more affected by gene-tree error than are the other methods.
- There are conditions under which none of the methods work well.
For me, the most interesting part of the paper is the examination of balanced versus skewed parental contributions to the hybrid taxon. A balanced genetic contribution in the simulations is analogous to homoploid or polyploid hybridization, whereas a skewed contribution is analogous to introgression or horizontal gene transfer (HGT). The simulations seem to show that the methods examined do not deal very well with skewed contributions.
So, these methods may literally be hybridization-network methods only, with separate network methods needed for detecting introgression or HGT — for example, the admixture methods used for genomes (see the recent post on Producing admixture graphs).
This would mean that we cannot first produce networks with reticulations, and then afterwards explore what is causing the reticulations. Instead, we will need to decide on the possible biological mechanisms of reticulation before the analysis, and then mathematically explore possible networks that reflect those mechanisms.
This is not an issue for constructing trees, of course, since the only recognized mechanisms are speciation and extinction, both of which are explored post hoc rather than a priori. This is an important difference of networks versus trees.
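One way to see why skewed parental contributions might be harder to detect than balanced ones is a back-of-envelope calculation. The sketch below is my own simplification, not the simulation design of Kamneva and Rosenberg: it assumes that each gene tree independently places the hybrid taxon with the minor parent with probability equal to the minor parent's genomic contribution, and it ignores ILS and gene-tree error entirely. The function name is hypothetical.

```python
def prob_minor_parent_unseen(minor_fraction, n_gene_trees):
    """Probability that no sampled gene tree reflects the minor parent.

    Assumption: each gene tree independently traces the hybrid lineage to
    the minor parent with probability minor_fraction.
    """
    if not 0.0 <= minor_fraction <= 1.0:
        raise ValueError("minor_fraction must be between 0 and 1")
    if n_gene_trees < 0:
        raise ValueError("n_gene_trees must be non-negative")
    return (1.0 - minor_fraction) ** n_gene_trees

# Balanced (hybridization-like) versus skewed (introgression-like) contributions.
for minor_fraction in (0.5, 0.05):
    for n_gene_trees in (10, 20, 100):
        p = prob_minor_parent_unseen(minor_fraction, n_gene_trees)
        print(minor_fraction, n_gene_trees, round(p, 4))
```

Under these assumptions, a balanced contribution is almost certain to show up even in a handful of gene trees, whereas a 5% contribution can easily leave no trace in a small sample, before ILS is even considered.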
Labels:
Hybridization network,
Introgression
Tuesday, October 4, 2016
The practical limits of networks?
Network techniques are becoming more widespread in biology and anthropology. However, the data in both of these disciplines can form very complicated patterns, indeed; and there must be practical limits to what one can do with a network analysis. This post discusses an example that covers both disciplines, and which may well exceed those limits.
The data come from:
Pugach I, Matveev R, Spitsyn V, Makarov S, Novgorodov I, Osakovsky V, Stoneking M, Pakendorf B (2016) The complex admixture history and recent southern origins of Siberian populations. Molecular Biology and Evolution 33: 1777-1795.
The authors note:
Siberia is an extensive geographical region of North Asia stretching from the Ural Mountains in the west to the Pacific Ocean in the east, and from the Arctic Ocean in the north to the Kazakh and Mongolian steppes in the south. This vast territory is inhabited by a relatively small number of indigenous peoples, with most populations numbering only in the hundreds or few thousands. These indigenous peoples speak a variety of languages belonging to the Turkic, Tungusic, Mongolic, Uralic, Yeniseic, Chukotko-Kamchatkan, and Aleut-Yupik-Inuit families, as well as a few isolates. There is also variation in traditional subsistence patterns ... This linguistic and cultural diversity suggests potentially different origins and historical trajectories of the Siberian peoples.
Previous studies of the genetic history of Siberian populations were hampered by the extensive admixture that appears to have taken place among these populations, because commonly used methods assume a tree-like population history and at most single admixture events.
This suggests the use of network techniques, instead of tree-based ones. However, under the circumstances described here it may be unwise to try to produce a phylogenetic network. The situation, as described, does not resemble a "tree with reticulations" but more of an "anastomosing plexus". The latter may be more confusing than helpful, when visualized as a network.
So, the authors do not mention the word "network" nor even "reticulation". Instead:
Here we analyze geogenetic maps and use other approaches to distinguish the effects of shared ancestry from prehistoric migrations and contact, and develop a new method based on the covariance of ancestry components, to investigate the potentially complex admixture history. We furthermore adapt a previously devised method of admixture dating for use with multiple events of gene flow, and apply these methods to whole-genome genotype data [genome-wide SNPs] from over 500 individuals belonging to 20 different Siberian ethnolinguistic groups [plus 9 reference populations].
The results of these analyses indicate that there have been multiple layers of admixture detectable in most of the Siberian populations, with considerable differences in the admixture histories of individual populations.
The admixture (or introgression) patterns among the populations are illustrated using a map. Each bar represents a population, with the colors denoting the different ethnolinguistic groups. Note that every population shows admixture.
The reconstructed migration relationships among the populations are also illustrated using a map. This time, the colors of the arrows represent the different ethnolinguistic groups.
I would not like to have to represent these patterns using a network, and make that network comprehensible. So, this dataset may exceed the practical limits of networks.
Labels:
Admixture,
Anthropology,
Introgression
Wednesday, August 31, 2016
Network thinking in phylogeography?
This blog has, of course, long championed the importance of network models in phylogenetics. Slowly, very slowly, the rest of the world is catching up.
Apparently, the world of phylogeography has now woken up:
Scott V. Edwards, Sally Potter, C. Jonathan Schmitt, Jason G. Bragg and Craig Moritz (2016) Reticulation, divergence, and the phylogeography–phylogenetics continuum. Proceedings of the National Academy of Sciences of the USA 113: 8025-8032.
Phylogeography was conceived as some sort of connection between population biology and phylogenetics. It has always seemed odd that the tree model has been used in phylogeography at all, because there is no a priori reason to expect within-species phylogenetic patterns to be tree-like. Indeed, inter-breeding seems to suggest quite the opposite. Nevertheless, phylogeographic studies are full of trees.
But apparently no more. To quote the authors:
As phylogeography moves into the era of next-generation sequencing, the specter of reticulation at several levels — within loci and genomes in the form of recombination and across populations and species in the form of introgression — has raised its head with a prominence even greater than glimpsed during the nuclear gene PCR era ... We discuss a variety of forces generating reticulate patterns in phylogeography, including introgression, contact zones, and the potential selection-driven outliers on next-generation molecular markers. We emphasize the continued need for demographic models incorporating reticulation at the level of genomes and populations ...
That phylogeography sits centrally in this process-oriented space emphasizes the importance of understanding interactions between reticulation (gene flow / introgression and recombination), drift, and protracted isolation. This combination of processes sets phylogeography apart from traditional population genetics and phylogenetics.
Scanning entire genomes of closely related organisms has unleashed a level of heterogeneity of signals that was largely of theoretical interest in the PCR era. This genomic heterogeneity is profoundly influencing our basic concepts of phylogeography and phylogenetics, and indeed our views of speciation processes. It is now routine to encounter a diversity of gene trees across the genome that is often as large as the number of loci surveyed.
The new genome-scale analyses are causing evolutionary biologists to reevaluate the very nature of species, which, in some cases, appear to maintain phenotypic distinctiveness despite extensive gene flow across most of the genome, and to recognize introgression as an important source of adaptive traits in a variety of study systems.
The role of horizontal gene flow in speciation and phylogeography, particularly for animal taxa, has long been championed by Michael L. Arnold (see the references). However, the authors ignore this literature, and claim that this is a recent insight, instead. They also mention only in passing the extensive genomics literature on human introgression, where it is called "admixture". Indeed, they mention only a data-analysis technique, rather than the biological insights that have arisen. It is still disappointing just how little information-connection there is between different fields of biology.
Finally, the authors manage to mention the word "network" only three times in the whole paper. Their key word is "reticulation", instead, in the sense that a phylogeny is a tree with reticulation, rather than any other form of network. So, they are still only one step away from tree-thinking, and at least one step from true network-thinking.
In the context of trees versus networks, the authors mention so-called "species tree" methods based on the multispecies coalescent, which try to account for incomplete lineage sorting in genome studies (see also Edwards et al. 2016). Unfortunately, these have recently been shown to be inconsistent in the presence of gene flow (Solís-Lemus et al. 2016), thus emphasizing the need for proper network methods.
References
Arnold ML (1997) Natural Hybridization and Evolution. Oxford University Press.
Arnold ML (2006) Evolution Through Genetic Exchange. Oxford University Press.
Arnold ML (2009) Reticulate Evolution and Humans – Origins and Ecology. Oxford University Press.
Arnold ML (2016) Divergence With Genetic Exchange. Oxford University Press.
Edwards SV, Xi Z, Janke A, Faircloth BC, McCormack JE, Glenn TC, Zhong B, Wu S, Lemmon EM, Lemmon AR, Leaché AD, Liu L, Davis CC (2016) Implementing and testing the multispecies coalescent model: a valuable paradigm for phylogenomics. Molecular Phylogenetics & Evolution 94: 447-462.
Solís-Lemus C, Yang M, Ané C (2016) Inconsistency of species tree methods under gene flow. Systematic Biology 65: 843–851.
Labels:
Admixture,
Introgression,
Phylogeny
Wednesday, October 21, 2015
Studying gene flow using genomes
Continuing the recent blog theme of researchers analyzing potentially reticulate relationships without explicitly using networks (Are networks actually used to explore reticulate histories? ; Problems with manually constructing networks), there is this just-published paper:
Nater A, Burri R, Kawakami T, Smeds L, Ellegren H (2015) Resolving evolutionary relationships in closely related species with whole-genome sequencing data. Systematic Biology 64: 1000-1017.
The authors note:
Using genetic data to resolve the evolutionary relationships of species is of major interest in evolutionary and systematic biology. However, reconstructing the sequence of speciation events, the so-called species tree, in closely related and potentially hybridizing species is very challenging. Processes such as incomplete lineage sorting and interspecific gene flow result in local gene genealogies that differ in their topology from the species tree, and analyses of few loci with a single sequence per species are likely to produce conflicting or even misleading results ... Although gene tree incongruences caused by ILS are still fully compatible with a strictly bifurcating species tree, gene flow among species requires a more complex representation of evolutionary histories, resembling reticulate networks rather than trees.
Unfortunately, this is the sole mention of the word "network" in the text.
The authors addressed the issues of incomplete lineage sorting and interspecific gene flow using whole-genome sequence data from 198 individuals of four flycatcher species, plus two outgroup genomes. They found that, for most genomic regions, none of the 15 possible rooted gene tree topologies appeared consistently at high frequencies — the most frequent gene tree occurred 17.7% of the time, with the second at 14.3% and the third at 10.5%.
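For those wondering where the number 15 comes from: the number of rooted, strictly bifurcating, labeled topologies for n taxa is the double factorial (2n-3)!! = 3 x 5 x ... x (2n-3), which gives 15 for the four flycatcher species. A minimal sketch follows; the function name is mine, and the comparison with equal frequencies is only a naive yardstick, not an expectation under any coalescent model.

```python
def n_rooted_topologies(n_taxa):
    """Number of rooted, bifurcating, labeled tree topologies: (2n - 3)!!"""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):  # 3, 5, ..., 2n - 3
        count *= k
    return count

n_topologies = n_rooted_topologies(4)
print(n_topologies)                       # topologies for four species
print(round(100.0 / n_topologies, 1))     # percent each, if all were equally frequent
```

Against an equal share of roughly 6.7% per topology, the reported frequencies of 17.7%, 14.3% and 10.5% show that a few trees are favored, but none of them comes close to dominating the genome.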
They investigated this gene-tree diversity using four programs that attempt to resolve a species tree in the context of incomplete lineage sorting and the coalescent: MP-EST, SNAPP, Fastsimcoal2, and ABC. The latter two approaches also allow for post-divergence gene flow. All four methods have limited applicability when applied to 200 genomes, and so in each case only a subset of the data was analyzed or a subset of the possible species trees was tested. All four methods produced the same species tree, which was also the same as the most commonly encountered gene tree.
Unfortunately, the authors found almost no evidence of gene flow using these methods, although their detailed gene-tree analyses do suggest its existence. This indicates that there are problems with these methods. Perhaps the main problem is that the authors approached their analyses almost exclusively in the context of a species tree rather than a network. There are other methods that one could try, including the one used by researchers studying introgression in archaic hominoids (as discussed in Are networks actually used to explore reticulate histories?).
In addition, the authors seem to be unclear about their concept of what is a species. For example, they note that "gene flow among lineages in the species tree can confound the true order of speciation events", which seems to preclude use of the biological species concept. Furthermore, they note that "lack of species monophyly is common in this study system", which seems to preclude the phylogenetic species concept. What then constitutes speciation?
Finally, the authors seem to have a common misconception of ancestral character states. Their approach includes this statement: "If both outgroup individuals were monomorphic for the same allele, this allele was considered ancestral." This argument has been repeatedly rejected in the literature. See, for example, Crisp MD, Cook LG. (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.
Labels:
Evolutionary network,
Introgression
Wednesday, October 14, 2015
Problems with manually constructing networks
I wrote recently about whether explicit network methods are currently used in practice to construct evolutionary networks (Are networks actually used to explore reticulate histories?), and noted that they usually are not. Here I explore in a bit more detail another example, and point out a couple of limitations of constructing such networks manually.
Earlier this year a paper was published exploring the Anopheles gambiae species complex, this group of mosquitoes being the principal vector of the malaria parasite:
Fontaine MC, et al. (2015) Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347: 1258524.
There are about 450 known species of anopheline mosquitoes, which transmit five species of malaria to humans, and many other malaria species to most other vertebrates. The genomes of the six Anopheles species were included as part of a genome study published simultaneously, which also included other Anopheles species:
Neafsey DE, et al. (2015) Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science 347: 1258522.
Both groups of researchers constructed a phylogenetic tree of their organisms, but Fontaine et al. then added reticulations to their tree (thus manually forming an evolutionary network). The reticulations represent putative introgression among members of the An. gambiae species complex, many of which have overlapping distributions within sub-Saharan Africa.
Fontaine et al. constructed their network by trying to take into account incomplete lineage sorting (which Neafsey et al. apparently did not — they left the An. gambiae species complex as an unresolved polychotomy). This is all well and good, and it matches the current paradigm in the literature where hybridization / introgression (a process involving horizontal gene flow that creates gene-tree discordance) is studied in association with ILS (a process involving vertical inheritance but which also creates gene-tree discordance). The alternative paradigm is that lateral gene transfer (a process involving horizontal gene flow that creates gene-tree discordance) is studied in association with gene duplication–loss (a process involving vertical inheritance but which also creates gene-tree discordance).
However, this might not be the best strategy in this particular case. In the companion paper by Neafsey et al., they note that for their 16 genomes:
Copy-number variation in homologous gene families also reveals striking evolutionary dynamism. Analysis of 11,636 gene families ... indicates a rate of gene gain / loss higher by a factor of at least 5 than that observed for 12 Drosophila genomes.
Under these circumstances, why ignore the possibility that gene duplication and selective loss has created gene-tree discordance? This possibility is not even mentioned by Fontaine et al. Also not mentioned are other possible sources of gene-tree discordance that are associated with vertical inheritance (eg. balancing selection), but they do at one stage concern themselves with the possibility of unequal rates of evolution among the chromosomes.
Their data-analysis strategy was this:
To infer the correct species branching order in the face of anticipated ILS and introgression, maximum-likelihood (ML) phylogenies were constructed from 50-kilobase (kb) non-overlapping windows across the alignments (referred to here as "gene trees" regardless of their protein-coding content), considering six in-group species rooted alternatively with An. christyi or An. epiroticus (n = 4063 windows).
They found a total of 85 different gene-tree topologies, some of them occurring much more frequently than others. They plotted these onto the four autosomal chromosomes plus the X chromosome, and found that the X chromosome favoured very different gene trees than did the autosomes.
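As a rough illustration of this kind of tabulation (not the authors' actual pipeline), the sketch below tallies topology labels per chromosome and reports the most frequent one. The function names, the chromosome labels and the topology labels ("T1", "T2") are all hypothetical, and the window list is toy data, not the published results.

```python
from collections import Counter, defaultdict

def tally_topologies(windows):
    """Count how often each topology label occurs on each chromosome.

    `windows` is an iterable of (chromosome, topology_label) pairs,
    one pair per genomic window (hypothetical input format).
    """
    per_chrom = defaultdict(Counter)
    for chrom, topo in windows:
        per_chrom[chrom][topo] += 1
    return per_chrom

def dominant_topology(per_chrom):
    """Return the most frequent topology label for each chromosome."""
    return {chrom: counts.most_common(1)[0][0]
            for chrom, counts in per_chrom.items()}

# Toy data only: labels and counts are invented for illustration.
toy_windows = [
    ("X", "T1"), ("X", "T1"), ("X", "T2"),
    ("autosome_1", "T2"), ("autosome_1", "T2"), ("autosome_1", "T1"),
]
toy_counts = tally_topologies(toy_windows)
toy_dominant = dominant_topology(toy_counts)
```

A disagreement between the dominant label on the X chromosome and on the autosomes is the pattern that then needs a biological explanation; the tally itself says nothing about which process caused it.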
From this analysis, the authors constructed a phylogenetic network (shown in the next figure) based on a species tree (black lines) with reticulations added (green arrows) to indicate introgression. I have added two labels ("Vertical" and "Horizontal") to emphasize the authors' interpretation of the evolutionary flow of genetic information, separated into vertical inheritance and horizontal gene flow (introgression).
The authors interpret the horizontal gene flow as being introgression because:
Autosomal introgression between An. arabiensis and the ancestor of An. gambiae [gam] + An. coluzzii [col] has long been postulated and could explain the strong discordance between the dominant tree topologies of the X and autosomes.
The idea of the introgression being autosomal seems to be based on the idea that the "true species tree" is the one shown by the genes that mediate male and female fertility (ie. the sex chromosomes).
The authors note that, for a "definitive interpretation of these conflicting signals" between the gene trees, they need to have "the correct species branching order". I have raised a number of times in this blog the difficulty of constructing a "species tree" in the face of reticulation. If there is evidence for horizontal gene flow in the data then how do we first extract just the vertical inheritance? The authors attempted to address this question in a section entitled "Tree height reveals the true species branching order in the face of introgression". Their argument is this:
To infer the correct historical branching order, we applied a strategy based on sequence divergence ... Because introgression will reduce sequence divergence between the species exchanging genes, we expect that the correct species branching order revealed by gene trees constructed from non-introgressed sequences will show deeper divergences than those constructed from introgressed sequences. If the hypothesis of autosomal introgression is correct, this implies that the topologies supported by the X chromosome should show significantly higher divergence times ... than topologies supported by the autosomes.
This, indeed, was what they found; and so they concluded that the X chromosome topology represents the species tree, and the autosomes are showing introgression. However, this seems to be a somewhat specious argument. Maybe introgression does lower tree height, but I don't think that we should conclude from this that lowered tree height indicates introgression. We cannot simply invert this argument (ie. A causes B, and therefore B implies A), because there may be other differences between the autosomes and the X chromosome that also affect relative tree height, such as unequal gene duplication-loss, convergence, unequal evolutionary rates, balancing selection, and so on.
Therefore, we should not be surprised if the authors have got it wrong about whether the X chromosome or the autosomes is showing the "true species tree" (if there is one). That is, the edge labelled "Vertical" in the above network may actually represent the horizontal gene flow, while the edge labelled "Horizontal" may actually represent the vertical inheritance.
Finally, there is a published commentary on the two Anopheles papers:
Clark AG, Messer PW (2015) Conundrum of jumbled mosquito genomes. Science 347: 27-28.
These authors appropriately note that:
Fontaine et al. adhere to a classical view that there is a "true species tree" ... But given that the bulk of the genome has a network of relationships that is different from this true species tree, perhaps we should dispense with the tree and acknowledge that these genomes are best described by a network, and that they undergo rampant reticulate evolution.
This alternate philosophy requires an integrated method for constructing the network, rather than manually constructing a species tree and then adding reticulations. Such a method would construct a network from first principles, and then reveal whether the species phylogeny is tree-like or not, rather than assuming that it is a tree a priori. There are a number of methods being developed for doing this.
Labels:
Evolutionary network,
Introgression
Wednesday, July 1, 2015
Networks of admixture or introgression
There are several processes that create reticulate phylogenetic topologies, including hybridization, introgression (or admixture) and horizontal gene transfer (HGT). Biologically, introgression operates via the same mechanism as does hybridization (ie. during sexual reproduction), but it results in only a small amount of genetic material entering the recipient genome, making an admixed genome that is similar to the end result of HGT.
In practice, the construction of phylogenetic networks in situations where introgression or HGT have occurred has differed somewhat from the approach used for hybridization. Hybridization has usually been tackled by merging incongruent tree topologies, based on the idea that the different topologies represent the phylogenetic history of the different genomes of the hybrid taxon. Introgression and HGT have usually been tackled by adding reticulation edges to a phylogenetic tree, on the basis that the tree represents the phylogenetic history of the main part of the genome.
So, the study of introgression (and HGT) involves (a) constructing a phylogenetic tree from some genomic sample, and (b) detecting the introgressed (or HGT) parts of the genome. This is potentially a problematic procedure, because how do we construct a phylogenetic tree from data that already contain non-tree components? Apparently, the expectation is that a single tree will be supported by the majority of the data, and the remainder will represent the introgressed (or HGT) pathway(s), plus whatever other components have created the observed genomic variability (such as incomplete lineage sorting, gene duplication-loss, and stochastic mutations).
Recently, there have been quite a few studies published that have adopted a specific protocol for this procedure, usually under the rubric of admixture. Most of these have involved the study of ancient human DNA, but there have also been studies of contemporary humans, as well as ancient non-humans. An example of the latter is shown in the next two figures, which represent parts (a) and (b), respectively. They are taken from this study of the relatives of horses: Hákon Jónsson, et alia (2014) Speciation with gene flow in equids despite extensive chromosomal plasticity. Proceedings of the National Academy of Sciences of the USA 111: 18655-18660.
The phylogenetic tree (step a) was constructed using "maximum likelihood inference and 20,374 protein-coding genes ... based on a relaxed molecular clock." So, only stochastic mutations were accounted for when constructing the tree, and not incomplete lineage sorting or gene duplication-loss.
The detection of introgression (step b) used "the D statistics approach, which tests for an excess of shared polymorphisms between one of two closely related lineages (E1 or E2) and a third lineage (E3)". The reticulations representing the detected gene flow were then added to the tree manually.
The D-statistic is also known as the ABBA-BABA test (see: Patterson NJ et alia. 2012. Ancient admixture in human history. Genetics 192: 1065-1093). It operates as follows for sets of four taxa, applied to character data.
Let the species tree be this, where E1–E3 are the three taxa being compared, and O is the outgroup:
There are three possible allele trees for each binary character (ie. single nucleotide polymorphism) in which states are shared pairwise:
In the first tree, E3 shares the ancestral character state with the outgroup, which is expected to be the most common pattern in the absence of gene flow. E1 and E2 share the ancestral state with the outgroup in the second and third trees, respectively.
The admixture test compares the ABBA tree to the BABA tree. The expectation is that if there has been no introgression then the data support for these two trees should be equal. That is, under the null hypothesis that there is no gene flow between the species (and the underlying species tree is correct), the difference in the expected number of occurrences of the ABBA and BABA patterns should be zero. Deviation from this expectation is statistically evaluated using a jackknife procedure.
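The comparison can be sketched in a few lines of code. This is a minimal illustration, assuming the usual form D = (ABBA - BABA) / (ABBA + BABA), with site patterns written in the order (E1, E2, E3, O) so that a positive D indicates excess sharing between E2 and E3. The standard error comes from a simple unweighted delete-one-block jackknife, which is a simplification of the weighted block jackknife used in practice. The function names and the block counts are hypothetical.

```python
from math import sqrt

def d_statistic(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA); zero is expected with no gene flow."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (abba - baba) / total

def block_jackknife(blocks):
    """Estimate D, its standard error and a Z-score.

    `blocks` is a list of (abba_count, baba_count) pairs, one per genomic
    block. Uses an unweighted delete-one-block jackknife (a simplification).
    """
    n = len(blocks)
    if n < 2:
        raise ValueError("need at least two blocks for a jackknife")
    tot_abba = sum(a for a, _ in blocks)
    tot_baba = sum(b for _, b in blocks)
    d_full = d_statistic(tot_abba, tot_baba)
    # Recompute D with each block left out in turn.
    leave_one_out = [d_statistic(tot_abba - a, tot_baba - b) for a, b in blocks]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    se = sqrt(variance)
    z = d_full / se if se > 0 else float("nan")
    return d_full, se, z

# Toy counts, invented for illustration only.
toy_blocks = [(30, 10), (25, 12), (28, 9), (31, 11)]
d_hat, se_hat, z_hat = block_jackknife(toy_blocks)
```

A Z-score far from zero is then read as evidence against the null hypothesis, subject to the assumptions of the test.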
When there are more than three ingroup taxa, they are tested in groups of three (plus the outgroup). No correction for multiple hypothesis testing seems ever to be applied. Recently, the test has been extended to five taxa (Pease JB, Hahn MW. 2015. Detection and polarization of introgression in a five-taxon phylogeny. Systematic Biology 64: 651-662).
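To give a feel for the scale of the multiple-testing issue, here is a small sketch that counts the ingroup triplets and shows what a Bonferroni-adjusted Z threshold would look like. Bonferroni is my choice for illustration, not something used in the studies discussed, and the sketch assumes one test per triplet (the species tree fixing which two taxa are treated as sisters). The function names are hypothetical.

```python
from math import comb
from statistics import NormalDist

def count_triplet_tests(n_ingroup):
    """Number of ingroup triplets, assuming one D test per triplet."""
    return comb(n_ingroup, 3)

def bonferroni_z_threshold(n_tests, alpha=0.05):
    """Two-sided |Z| threshold after a Bonferroni correction."""
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)

n_tests = count_triplet_tests(6)        # six ingroup taxa
z_single = bonferroni_z_threshold(1)    # a single uncorrected test
z_corrected = bonferroni_z_threshold(n_tests)
```

With only six ingroup taxa there are already 20 triplets, so the corrected threshold is noticeably stricter than the conventional one for a single test.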
Note that this test assumes that:
- the "excess of shared polymorphisms" arises solely from gene flow, with or without incomplete lineage sorting, rather than from any other tree-like processes such as gene duplication-loss or ancestral population structure
- there are no other sources of co-ordinated polymorphisms, such as character-state reversals due to adaptation / selection
- any gene flow that does exist is due to introgression, rather than to hybridization or HGT.
Labels:
Admixture,
Introgression
Wednesday, June 24, 2015
Trees, networks and dogs
One of the perennially most popular posts in this blog has been the one about the domestication of dogs: Why do we still use trees for the dog genealogy?
In that post I noted that, up to 2012, there were three distinct trends in the presentation of the genealogy of dog breeds:
- the study of whole-genome data, in which the results are presented solely as a neighbor-joining tree
- the study of mtDNA sequence data, in which the results are presented both as a tree and as a haplotype network
- the study of combined Y-chromosome and mtDNA sequence data, in which the results are presented solely as a haplotype network.
Skoglund P, Ersmark E, Palkopoulou E, Dalén L (2015) Ancient wolf genome reveals an early divergence of domestic dog ancestors and admixture into high-latitude breeds. Current Biology 25:1515-1519.
The tree is based on mitochondrial genome data for the highlighted fossil, compared to the mitochondrial sequences of modern-day dogs and wolves, as well as ancient canids. The use of a phylogenetic tree seems to be based on the idea that mitochondria consist of tightly linked genes that are uniparentally inherited. However, neither of these characteristics is universal, and so a network might be more appropriate.
The dog genealogy is recognized as being characterized by introgression with wolves, as the authors themselves note. Also, the origin of dogs is not directly from wolf ancestors, but both modern wolves and modern dogs are derived from a common ancestor. For example, this next diagram is from:
Freedman AH, et alia (2014) Genome sequencing highlights the dynamic early history of dogs. PLoS Genetics 10:e1004016.
The width of each population branch is proportional to inferred population size. Note that wolves and dogs originated at roughly the same time, as the result of bottlenecks in the ancestral population size. Wolves diversified slightly earlier than dogs. Also, Skoglund et al. dispute the dating of the splits, suggesting that the dog-wolf divergence was "at least 27,000 years ago".
As a final note, there is a tendency to credit Charles Darwin with originating just about everything in the study of genealogy, although he was a synthesizer as much as an innovator. For example, David Grimm suggests (Dawn of the dog. Science 348: 274-279):
Charles Darwin fired the first shot in the dog wars. Writing in 1868 in The Variation of Animals and Plants under Domestication, he wondered whether dogs had evolved from a single species or from an unusual mating, perhaps between a wolf and a jackal.
However, the first hypothesized genealogy was actually published more than a century earlier, by Georges-Louis Leclerc, comte de Buffon (see the blog post on The first phylogenetic network), who suggested a common origin with wolves.
Labels:
Hybridization network,
Introgression,
Phylogeny
Wednesday, February 18, 2015
Representing macro- and micro-evolution in a network
In biology we often distinguish microevolutionary events, which occur at the population level, from macroevolutionary events, which involve species. We have traditionally treated phylogenetics as a study of macroevolution. However, more recently there has been a trend to include population-level events, such as incomplete lineage sorting and introgression.
This is of particular importance for the resulting display diagrams. A phylogenetic tree was originally conceived to represent macroevolution. For example, speciation and extinction occur as single events at particular times, and these events apply to discrete groups of organisms. The taxa can be represented as distinct lineages in a tree graph, and the events by having these lineages stop or branch in the graph.
This idea is easily extended to phylogenetic networks, where the gene-flow events are also treated as singular, so that hybridization or horizontal gene transfer can be represented as single reticulations among the lineages.
These are sometimes called "pulse" events. However, there are also "press" events that are ongoing. That is, a lot of genetic variation is generated where populations repeatedly mix, so that every gene-flow instance is part of a continuous process of mixing. This often occurs, for example, in the context of isolation by distance, such as ring species or clinal variation. Under these circumstances, processes like introgression and HGT can involve ongoing events.
For instance, in an earlier life I once studied three species of plant in the Sydney region (Morrison DA, McDonald M, Bankoff P, Quirico P, Mackay D. 1994. Reproductive isolation mechanisms among four closely-related species of Conospermum (Proteaceae). Botanical Journal of the Linnean Society 116: 13-31). One of the species was ecologically isolated from the other two (it occurred in dry rather than damp habitats), and the other two were geographically isolated from each other (they occurred on separate sandstone uplands with a large valley in between). These species look very different from each other, as shown in the picture above, but looks are deceiving. Where the ecological isolation was incomplete, introgression occurred and admixed populations could be found.
These dynamics are more difficult to represent in a phylogenetic tree or network. We do not have discrete groups that can be represented by lines on a graph, but instead have fuzzy groups with indistinct boundaries. Furthermore, we do not have discrete events, but instead have ongoing (repeated) processes.
Nevertheless, it seems clear that there is a desire in modern biology to integrate macroevolutionary and microevolutionary dynamics in a single network diagram. That is, some parts of the diagram will represent pulse events involving discrete groups and other parts will represent press events among fuzzy groups. This situation seems to be currently addressed by practitioners by first creating a tree to represent the pulse events (and possibly their times), and then adding imprecisely located dashed lines as a representation of ongoing gene flow — see the example in Producing trees from datasets with gene flow. This particular mixture of precision and imprecision seems rather unsatisfactory.
Perhaps someone might like to have a think about this aspect of phylogenetic networks, to see if there is some way we can do better.
Labels:
Admixture,
HGT network,
Introgression,
Phylogeny