Showing posts with label Hybridization network. Show all posts

Monday, June 3, 2019

A phylogenetic network outside science


I have written before about the presentation of historical information using the pictorial representation of a phylogeny (eg. Phylogenetic networks outside science; Another phylogenetic network outside science). These diagrams are often representations of the evolutionary history of human artifacts, and so a phylogeny is quite appropriate. They are of interest because:
  • they are usually hybridization networks, rather than divergent trees, because the artifact ideas involve horizontal transfer (ideas added) and recombination (ideas replaced);
  • they are often not time consistent, because ideas can leap forward in time, so that the reticulations do not connect contemporary artifacts (see Time inconsistency in evolutionary networks); and
  • they are sometimes drawn badly, in the sense that the diagram does not reflect the history in a consistent way.
The latter point often involves poor indication of the time direction (see Direction is important when showing history), or involves subdividing the network into a set of linearized trees.

One particularly noteworthy example that I have previously discussed is of the GNU/Linux Distribution Timeline, which illustrates the complex history of the computer operating system. The problems with this diagram as a phylogeny are discussed in the blog post section History of Linux distributions.

In this new post I will simply point out that there is a more acceptable diagram, showing the key Unix and Unix-like operating systems. I have reproduced a copy of it below.


This version of the information correctly shows the history as a network, not a series of linearized trees (each with a central axis). It also draws the reticulations in an informative manner, rather than having them be merely artistic fancies.

It is good to know that phylogenetic diagrams can be drawn well, even outside biology and linguistics.

Monday, December 3, 2018

The pedigree of grape varieties


We are all familiar with the concept of a family tree (formally called a pedigree). People have been compiling them for at least a thousand years, as the first known illustration is from c.1000 CE (see the post on The first royal pedigree). However, these are not really tree-like, in spite of their name, unless we exclude most of the ancestors from the diagram. After all, family histories consist of males and females inter-breeding in a network of relationships, and this cannot be represented as a simple tree-like diagram without leaving out most of the people. I have written blog posts about quite a few famous people who have really quite complex and non-tree-like family histories (including Cleopatra, Tutankhamun, Charles II of Spain, Charles Darwin, Henri Toulouse-Lautrec, and Albert Einstein).

A history of disease within an Amish community

Clearly, the history of domesticated organisms is even more complex than that of humans. After all, in most cases we have gone to a great deal of trouble to make these histories complex, by deliberately cross-breeding current varieties (of plants) and breeds (of animals) to make new ones. So, I have previously raised the question: Are phylogenetic trees useful for domesticated organisms? The answer is the same: no, unless you leave out most of the ancestry.

In most cases, we have no recorded history for domesticated organisms, because most of the breeding and propagating was undocumented. Until recently, it was effectively impossible to reconstruct the pedigrees. This has changed with modern access to genetic information; and there is now quite a cottage industry within biology, trying to work out how we got our current varieties of cats, dogs, cows and horses, as well as wheat, rye and grapes, etc. I have previously looked at some of these histories, including Complex hybridizations in wheat, and Complex hybridizations in barley and its relatives.

Grapes

One example of particular interest has been grape varieties. I have discussed some of the issues in a previous post: Grape genealogies are networks, not trees, including the effects of unsampled ancestors when trying to perform the reconstruction.

There are a number of places around the web where you can see heavily edited summaries of what is currently known about the grape pedigree. However, these simplifications defeat the purpose of this blog post, which is to emphasize the historical complexity. The only diagram that I know of that shows you the full network (as currently known) is one provided by Pop Chart (The Genealogy of Wine), a commercial group who provide infographic posters for just about anything. They will sell you a full-sized poster of the pedigree (3' by 2'), but here I have provided a simple overview (which you can click on to see somewhat larger).

Grape variety genealogy from Pop Chart

You can actually zoom in on the diagram on the Pop Chart web page to see all of the details. This allows you to spend a few happy hours finding your favorite varieties, and to see how they are related. You will presumably get lost among the maze of lines, as I did.

Tuesday, June 20, 2017

Cichlids, species and trees

Lake Malawi, in south-eastern Africa, is famous for its large diversity of cichlid fishes. Indeed, it sometimes seems to have more biologists studying these fish than there are actual fish in the lake, even though there are allegedly hundreds of cichlid fish species in that lake. In this sense, it is somewhat similar to Lake Baikal, in southern Siberia, home to the sole species of freshwater seals.

The cichlid biologists are interested in describing the extensive fish diversity, pondering its origin, and thus its contribution to the study of speciation. After all, we are talking about what is usually claimed to be "the most extensive recent vertebrate adaptive radiation". So, we are talking here as much about population genetics as we are about ichthyology.


Inevitably, the genome biologists have been spotted in the vicinity of the lake; and we now have a preliminary report from them:
Milan Malinsky, Hannes Svardal, Alexandra M. Tyers, Eric A. Miska, Martin J. Genner, George F. Turner, Richard Durbin (2017) Whole genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. BioRxiv 143859.
These authors summarize the situation like this:
We characterize [the] genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times.
The last sentence seems to be somewhat disingenuous. How could a single tree be expected to describe this scale of biodiversity? Any rapid radiation of diversity is unlikely to be completely tree-like. The increase in diversity can be modeled as a tree, sure, but it is very unlikely that there will be instant separation of the taxa, and so the tree model will be ignoring a large part of the evolutionary action. There will, for example, be ongoing introgression between the diverging taxa, as well as hybridization due to incomplete breeding barriers. These avenues for gene flow can best be modeled as a network, not a tree.

The issue here is that the authors write the paper solely from the perspective of an expected phylogenetic tree, and then feel compelled to explain why they do not produce such a tree. Indeed, the authors present their paper as a study of "violations of the species tree concept".

For data analysis, they proceed as follows:
To obtain a first estimate of between-species relationships we divided the genome into 2543 non-overlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for each window, obtaining trees with 2542 different topologies.
So, only two sequence blocks produced the same tree, presumably by random chance. An example "tree" for 12 OTUs is shown in the diagram. It superimposes a possible mitochondrial tree on a summary of the "genome tree".
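The arithmetic behind this remark can be checked with a small sketch. The topology labels below are hypothetical stand-ins (plain integers), not the authors' data, and the function name is my own. The point is only that 2543 windows yielding 2542 distinct topologies implies that exactly one topology is shared, by exactly two windows.

```python
from collections import Counter

def shared_topologies(topologies):
    """Return {topology: count} for topologies seen in more than one window."""
    counts = Counter(topologies)
    return {topo: n for topo, n in counts.items() if n > 1}

# Hypothetical labels, one integer per window: the first two windows share topology 0.
n_windows = 2543
labels = [0] + list(range(n_windows - 1))  # 2543 labels, 2542 distinct values

assert len(labels) == n_windows
assert len(set(labels)) == 2542
print(shared_topologies(labels))  # the single shared topology and its count
```

With real data the labels would need to be canonical encodings of each tree topology, so that identical topologies compare as equal.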

Example phylogeny from Malinsky et al. (2017)

The authors continue:
The fact that we are using over 25 million variable sites suggests these differences are not due to sampling noise, but reflect conflicting biological signals in the data. For example, gene flow after the initial separation of species can distort the overall phylogeny and lead to intermediate placement of admixed taxa in the tree topology.
Note that gene flow is seen to "distort" the phylogeny rather than being an integral part of it. In this case, "phylogeny" apparently refers solely to the diversification part of evolutionary history, rather than to the whole history.

The ultimate questions from this paper are: "what is a species concept?", and "what is a species tree?". The authors write a lot about species and trees, and yet their data provide very clear evidence that both "species" and "tree" are very restrictive concepts for studying the cichlids of Lake Malawi.

Coincidentally, another recent paper tackles the same problems:
Britta S. Meyer, Michael Matschiner, Walter Salzburger (2017) Disentangling incomplete lineage sorting and introgression to refine species-tree estimates for Lake Tanganyika cichlid fishes. Systematic Biology 66: 531-550.
The authors describe their work, on the same fish group but in a lake further north-west, as follows:
Because of the rapid lineage formation in these groups, and occasional gene flow between the participating species, it is often difficult to reconstruct the phylogenetic history of species that underwent an adaptive radiation. In this study, we present a novel approach for species-tree estimation in rapidly diversifying lineages, where introgression is known to occur, and apply it to a multimarker data set containing up to 16 specimens per species for a set of 45 species of East African cichlid fishes (522 individuals in total), with a main focus on the cichlid species flock of Lake Tanganyika. We first identified, using age distributions of most recent common ancestors in individual gene trees, those lineages in our data set that show strong signatures of past introgression ... We then applied the multispecies coalescent model to estimate the species tree of Lake Tanganyika cichlids, but excluded the lineages involved in these introgression events, as the multispecies coalescent model does not incorporate introgression. This resulted in a robust species tree.
Once again, phylogeny = species tree.

Tuesday, June 13, 2017

Bayesian inference of phylogenetic networks


Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

Network from Radice (2012)

The earliest work on this topic seems to be the thesis of:
Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
Apparently, the only part of this work to be published has been:
Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.

The first of these publications was:
Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.

In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
This method has also been implemented in PhyloNet.

Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

Tuesday, March 14, 2017

Detecting introgression versus hybridization


There has been considerable interest in recent years in developing methods that will detect hybridization in the presence of incomplete lineage sorting (ILS), which will allow the construction of a realistic hybridization network. Clearly, both ILS and hybridization create conflicting gene trees, which will lead to a very complex data-display network. However, if the ILS signals in the data can be used to construct a small collection of gene-tree groups, in which the gene trees within each group are congruent with a single species tree (under the ILS model), then the incongruence between groups can be used to construct a hybridization network. This network will then be an hypothesis for a realistic evolutionary network.

Recently, a paper has appeared that uses simulations to evaluate several of these methods:
Olga K. Kamneva and Noah A. Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 2017:13.
I am not a great fan of simulations, because they exist under very restricted and usually unrealistic mathematical conditions. They are, however, useful for exploring the mathematical properties of various methods, even if they are hard to connect to the biological properties.

My interpretations of the results from the particular scenarios explored by Kamneva and Rosenberg are:
  1. Most of the methods improve as the internal network edges increase in length.
  2. Most of the methods improve as the number of gene trees increases.
  3. Under good conditions the maximum-likelihood methods do better than the parsimony and consensus methods.
  4. The maximum-likelihood methods are more affected by gene-tree error than are the other methods.
  5. There are conditions under which none of the methods work well.
I doubt that any of this is controversial, in the sense that model-based methods usually work well when their models apply, but not necessarily otherwise. Reality is more complex than the models, and so the methods are likely to fail for real data.

For me, the most interesting part of the paper is the examination of balanced versus skewed parental contributions to the hybrid taxon. A balanced genetic contribution in the simulations is analogous to homoploid or polyploid hybridization, whereas a skewed contribution is analogous to introgression or horizontal gene transfer (HGT). The simulations seem to show that the methods examined do not deal very well with skewed contributions.
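As a rough illustration of what "balanced" versus "skewed" means here, the following sketch assumes that each gene of the hybrid taxon traces independently to one of two parents with a fixed probability. The parameter name `gamma` and the function are my own illustrative choices; this is not the simulation design used by Kamneva and Rosenberg.

```python
import random

def parental_counts(n_genes, gamma, seed=0):
    """Count how many of n_genes trace to parent A (probability gamma) vs parent B."""
    rng = random.Random(seed)
    from_a = sum(1 for _ in range(n_genes) if rng.random() < gamma)
    return from_a, n_genes - from_a

balanced = parental_counts(1000, gamma=0.5)   # analogous to hybridization
skewed = parental_counts(1000, gamma=0.05)    # analogous to introgression or HGT
print("balanced:", balanced)
print("skewed:  ", skewed)
```

Under the skewed setting only a small share of the genes carry the minority parent's signal, which is one intuitive reason why such events might be harder for these methods to detect.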

So, these methods may literally be hybridization-network methods only, with separate network methods needed for detecting introgression or HGT — for example, the admixture methods used for genomes (see the recent post on Producing admixture graphs).

This would mean that we cannot first produce networks with reticulations, and then afterwards explore what is causing the reticulations. Instead, we will need to decide on the possible biological mechanisms of reticulation before the analysis, and then mathematically explore possible networks that reflect those mechanisms.

This is not an issue for constructing trees, of course, since the only recognized mechanisms are speciation and extinction, both of which are explored post hoc rather than a priori. This is an important difference between networks and trees.

Tuesday, September 13, 2016

An old network of the Shepherd and Herdsman's dogs


I have previously noted that the first known phylogenetic network concerned dog breeds, in 1755 (The first phylogenetic network) — a network is needed because many dog breeds are hybrids between other breeds. I have also noted the inappropriate recent tendency to use phylogenetic trees for these breeds, instead (Why do we still use trees for the dog genealogy?). I have also provided a sampling of known phylogenetic networks from the early 20th century (Phylogenetic networks 1900-1990). This post combines all of these themes.

Max Emil Friedrich von Stephanitz was a cavalry captain (rittmeister), but his enduring legacy was as a dog breeder and historian of German dog breeds. His best known book is (available here):
Der Deutsche Schäferhund in Wort und Bild (1921) Ant. Kämpfe, Jena.
This was translated into English as:
The German Shepherd Dog in Word and Picture (1923), translated by J. Schwabacher.
In this book von Stephanitz argued that the German Shepherd was a specific type of shepherd or herdsman's dog. Indeed, over the previous two decades he had tried to standardize the German Shepherd breed as a working dog, rather than as a show dog in the British tradition of the 19th century.

As part of his argument, he presented a stammbaum of related breeds.


The second picture shows the genealogy from the English translation.


Note that the German Shepherd is not involved in a recent reticulate history, as are all of the herdsman's breeds. This is part of von Stephanitz's argument for the importance of preserving the German Shepherd's identity.

You can read a bit more about the German Shepherd and its ancestor the Hoffwart, the farm guard dog, at König the Hovawart founder revisited: the myth of the Hoffwart.

Tuesday, July 12, 2016

Coal — trees and networks of knowledge


The Tree of Knowledge is a well-known concept, and the tree can indeed be used to arrange information. One possible use is to describe the relationships of derivative products (ie. the chemical derivatives of other substances). Indeed, these can be viewed as having a "phylogeny", since the processing follows a time sequence.

The U.S. Geological Survey (in the U.S. Department of the Interior) has provided one such example in Geological Survey Circular 1143 Coal — a Complex Natural Resource. The centerfold of that publication shows:
Coal byproducts in tree form showing basic chemicals as branches and derivative substances as twigs and leaves. [Modified from an undated public domain illustration provided by the Virginia Surface Mining and Reclamation Association.]

However, a tree is a simplification of a network, and the network can thus show more information. In this case, the same information has previously been illustrated using a reticulating network, not a tree.

In the 7th edition (1924) of Joseph Meyer's Große Conversations-lexikon für gebildete Stände (first edition 1840-1855) there is a Steinkohle: Stammbaum der Steinkohlenerzeugnisse [Coal: family tree of coal products]:


This has three reticulations, showing coal products produced as a result of combining two different processing routes. This is thus a hybridization network.

Thanks to the Trees of Knowledge page (by Paul Michel) of the "Encyclopedias as Indicators of Change in the Social Importance of Knowledge, Education and Information" web site, for pointing out this unexpected use of trees of knowledge.

Tuesday, July 5, 2016

Hybridization in the world of duplication-transfer-loss


It seems to me that the study of reticulate evolutionary histories currently boils down to two options:
(1) reconstructing a species "tree" from multiple gene trees using a coalescent model that includes hybridization (either homoploid or polyploid);
(2) reconciling multiple gene trees with a known [sic] species tree using a model that includes gene duplication, loss and transfer (as well as speciation) - a DTL model.

This often leads me to wonder where hybridization fits into option (2) and where gene transfer fits into option (1). They must fit somewhere. For example, Jacox et al. (2016. ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics 32: 2056-2058) describe their DTL as:
comprehensive as it includes the following evolutionary events: speciation, speciation-loss (speciation followed by a loss of one gene copy), gene duplication, gene loss, gene transfer and transfer-loss (gene transfer with loss of the original gene) between two sampled species, and gene transfer and transfer-loss from/to an unsampled species (i.e. a species that is not represented in the dataset) to/from a sampled one.

Since the model is "comprehensive", then hybridization must be included. The only parts of the model that include reticulate histories are gene transfer and transfer-loss, so this is where hybridization must be. Possibly, polyploid hybridization is included in "gene transfer" (an increase in the number of gene copies), and homoploid hybridization is included in "transfer-loss" (maintaining the same number of genes).

This seems to be a simple example of the idea that different types of reticulation events cannot be distinguished from each other. Genomic material moves from one place to another in contemporaneous organisms, either sexually (introgression, hybridization) or asexually (lateral gene transfer). There is nothing intrinsic about gene trees to tell us which mechanism is involved in any given reticulation, other than the relative positions of the donor and recipient in the "species tree" and the possibility of time inconsistency.

This leads to the question of why horizontal gene movement is called "transfer" in one model (2) and "hybridization" in the other (1).

Tuesday, June 14, 2016

Grape genealogies are networks, not trees


I have noted before that the genealogies for all domesticated organisms are networks not trees, and specifically they are hybridization networks. That is, in sexually reproducing species, every offspring is the hybrid of two parents. If we include both parents in the pedigree, plus all of their relatives, then this will form a complex network every time inbreeding occurs.
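A minimal sketch of this idea, assuming a pedigree stored as a mapping from each offspring to its two parents (the names and the data are hypothetical): an individual closes a loop in the pedigree whenever its two parents are themselves related.

```python
def ancestors(pedigree, individual):
    """Return the set of all recorded ancestors of an individual."""
    found = set()
    stack = list(pedigree.get(individual, ()))
    while stack:
        person = stack.pop()
        if person not in found:
            found.add(person)
            stack.extend(pedigree.get(person, ()))
    return found

def inbred_individuals(pedigree):
    """Individuals whose parents share an ancestor, or where one parent descends from the other."""
    result = []
    for child, (mother, father) in pedigree.items():
        maternal = ancestors(pedigree, mother) | {mother}
        paternal = ancestors(pedigree, father) | {father}
        if maternal & paternal:
            result.append(child)
    return sorted(result)

# Hypothetical pedigree: offspring -> (mother, father)
pedigree = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),  # the parents are full siblings
    "F": ("E", "G"),  # G is unrelated to E
}
print(inbred_individuals(pedigree))
```

Here only "E" is flagged, because its parents share the ancestors "A" and "B".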

I have previously illustrated this phenomenon using genealogies of grape cultivars:
     Are phylogenetic trees useful for domesticated organisms?
     First-degree relationships and partly directed networks

Reconstructing grape genealogies is often a tricky business. This was originally done using phenotypic characters and historical records, of course, but these days we use DNA from whatever cultivars are available for sampling. Perhaps the biggest problem is that many of the cultivars are no longer known (there have been at least 10,000 of them recorded at some time in history), so that the genealogies are full of question marks representing unknown (unsampled) parents.

The practical consequence of this is that the time direction of the genealogy will be ambiguous whenever there is a missing parent. Estimates of identity-by-descent (IBD) are calculated based on linkage analysis for all pairwise comparisons of samples, and complex crossing schemes can generate IBD values that are indistinguishable from sibling relationships. So, in these cases we cannot distinguish parent-offspring relationships from sibling relationships.

A simple example is shown in the most detailed current book on grape cultivars:
Jancis Robinson, Julia Harding, José Vouillamoz (2012) Wine Grapes: a Complete Guide to 1,368 Vine Varieties, including their Origins and Flavours. Allen Lane / Ecco.
This example involves the grand-parentage of the Shiraz grape, usually called Syrah in the effete monarchies of the Old World. The authors present three possible scenarios, as shown here.


There are five sampled cultivars and two inferred unknowns, arranged in an unrooted network. Because the unknowns are inferred to be parents, the network can be rooted in any of three different places, as shown by the three Options illustrated.

The authors (or, more specifically, the third author, who is the one responsible for the genealogies) are in favour of Option A. This means that Mondeuse Noir and Viognier are Syrah's half-siblings rather than either being the grandparent.

This small genealogy is a tree, but when we move to larger genealogies the network nature of the cultivars should become obvious.

However, the authors resort to a standard subterfuge to hide this fact. This strategy is to show cultivars multiple times in the genealogies, to avoid drawing reticulate relationships. I have illustrated this approach a couple of times before in this blog:

     Reducing networks to trees
     Thoroughbred horses and reticulate pedigrees

In the following genealogy of the Pinot cultivar, the authors note: "For the sake of clarity, Trebbiano Toscano and Folle Blanche appear twice in the diagram."
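The effect of this strategy can be sketched as follows, assuming a small acyclic network stored as a mapping from each node to its descendants (the node names are hypothetical, not the Pinot data). Unfolding the network into a tree means that every node reachable by more than one path has to be drawn more than once.

```python
from collections import Counter

def appearances_when_unfolded(children, root):
    """Count how often each node is drawn when an acyclic network is unfolded into a tree."""
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        counts[node] += 1                      # one drawn copy per path from the root
        stack.extend(children.get(node, ()))
    return counts

# Hypothetical network: H has two parents (P1 and P2), so it is a reticulation node.
children = {
    "Root": ["P1", "P2"],
    "P1": ["H"],
    "P2": ["H"],
    "H": ["X"],
}
counts = appearances_when_unfolded(children, "Root")
duplicated = sorted(node for node, n in counts.items() if n > 1)
print(duplicated)
```

Note that the duplication propagates: the descendants of a reticulation node also appear more than once in the unfolded tree.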


Trees reign supreme as simplifications of networks!

Tuesday, May 24, 2016

Y chromosome and mitochondrial DNA phylogenies — networks?


If one combines a Y chromosome genealogy, which usually shows the paternal ancestry, with a mitochondrial genealogy, which usually shows the maternal ancestry, it is likely that the resulting phylogeny will be reticulate. After all, if a sexually reproducing group of organisms is monophyletic then there is, in theory, a common ancestral pair of organisms, although in practice it is likely to be a small group of inter-breeding organisms. That being so, the ancestry of each descendant individual must consist of a pair of intersecting trees, one maternal and one paternal.

This can be illustrated using this recent paper:
Pille Hallast, Pierpaolo Maisano Delser, Chiara Batini, Daniel Zadik, Mariano Rocchi, Werner Schempp, Chris Tyler-Smith, and Mark A. Jobling (2016) Great ape Y chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal. Genome Research 26: 427-439.
The authors sequenced autosomal DNA, as well as the Y chromosome (MSY) and the mitochondrial DNA (mtDNA), for each of 19 great ape males (orangutans, gorillas, chimpanzees, bonobos, and humans), and added this to the data for 24 published genomes. For the 19 individuals:
we carried out principal component analysis (PCA) of autosomal SNP variation (∼10,000–48,000 variable sites, depending on species) ... 17 of our 19 sequenced individuals lie within known subspecies clusters ... Two of the sequenced chimpanzees lie mid-way between clusters in the PCA, suggesting recent inter-subspecies hybridization in their ancestry (Tommy: Pan troglodytes verus / Pan troglodytes troglodytes hybrid; EB176JC: Pan troglodytes verus / Pan troglodytes ellioti hybrid).
This seems quite clear in their ordination diagram, as shown here.


Furthermore, for the other two datasets (43 males) the authors note:
PHYLIP v3.69 was used to create maximum parsimony phylogenetic trees for both MSY and mtDNA. Three independent trees were constructed with DNAPARS using randomization of input order with different seeds, each 10 times. Output trees of these runs were used to build a consensus tree with the consense program included in the PHYLIP package. Intraspecific MSY trees were rooted using the ancestral sequence generated and described in the Supplemental Text [basically, the allele matching the outgroup]. Intraspecific mtDNA trees were rooted using the Human Revised Cambridge Reference Sequence.
The two resulting trees for the 19 chimpanzees are shown here, with the MSY tree on the left and the mtDNA tree on the right.


One of the hybrid individuals identified in the autosomal analysis was labelled EBC176JV, and he is clearly shown in a different place in each of the two trees; he is shown as having a Pan troglodytes verus (PTV) father and a Pan troglodytes ellioti (PTE) mother. Consequently, he will be placed at a reticulation node in any attempt to combine the two trees.

More oddly, the other individual, named Tommy, does not show this pattern at all. In the two trees he is shown as having both a Pan troglodytes troglodytes (PTT) father and mother, rather than one of them being identified as Pan troglodytes verus (PTV), as expected from the autosomes. The authors do not even note this apparently contradictory situation, let alone suggest an explanation. Clearly, however, no reticulation node will be needed in a combined phylogeny.
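The comparison described above can be sketched as a simple check: an individual needs a reticulation node in a combined phylogeny when his paternal (MSY) and maternal (mtDNA) clades differ. The clade labels for EBC176JV and Tommy follow the description above; the dictionary layout, the function name and the individual "Ind3" are illustrative assumptions.

```python
def reticulation_candidates(msy_clade, mtdna_clade):
    """Individuals whose paternal (MSY) and maternal (mtDNA) clade assignments differ."""
    shared = msy_clade.keys() & mtdna_clade.keys()
    return sorted(ind for ind in shared if msy_clade[ind] != mtdna_clade[ind])

# Clade of each individual in the MSY tree and in the mtDNA tree.
# "Ind3" is a hypothetical non-hybrid individual added for contrast.
msy = {"EBC176JV": "PTV", "Tommy": "PTT", "Ind3": "PTV"}
mtdna = {"EBC176JV": "PTE", "Tommy": "PTT", "Ind3": "PTV"}

print(reticulation_candidates(msy, mtdna))
```

Only EBC176JV is flagged. Tommy is not, even though the autosomal data suggest hybrid ancestry, which shows that the two uniparental markers alone can miss a hybrid.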

Monday, April 4, 2016

GeneaQuilts


The drawing of large genealogies is not easy, and phylogeneticists (among others) have tried a number of solutions, including circular diagrams as well as interactively zoomable displays. One interesting solution that does not appear to have yet been used in phylogenetics is the concept of GeneaQuilts.

These were introduced by the Visual Analytics Project:
A. Bezerianos, P. Dragicevic, J.-D. Fekete, J. Bae, B. Watson (2010) GeneaQuilts: a system for exploring large genealogies. In: IEEE InfoVis '10: IEEE Transactions on Visualization and Computer Graphics, Oct 2010, Salt-Lake City, USA.
The web page has a video introducing the concept, which does a better job than I can do here. The basic idea is to abandon the tree / network representation, and to use a diagonally-filled matrix instead, where the rows are individuals and the columns show parent-offspring relationships.

Here is an example genealogy, based on the reported relationships among the Greek Gods.


If the relationships are tree-like then the diagram will be concentrated on the diagonal of the matrix. However, network relationships (inbreeding) will cause off-diagonal elements, two of which are shown in the example: one involves Hades and his niece Persephone.
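A much simplified sketch of the matrix idea, not the GeneaQuilts implementation itself: rows are individuals, columns are families, and each cell records whether the individual is a parent ("P") or a child ("C") in that family. The three families listed are a small illustrative subset of the mythology, and the data layout is an assumption of mine.

```python
def quilt_matrix(individuals, families):
    """Rows are individuals, columns are families; 'P' marks a parent, 'C' a child."""
    rows = []
    for person in individuals:
        row = []
        for parents, children in families:
            if person in parents:
                row.append("P")
            elif person in children:
                row.append("C")
            else:
                row.append(".")
        rows.append("".join(row))
    return rows

individuals = ["Cronus", "Rhea", "Zeus", "Demeter", "Hades", "Persephone"]
families = [
    ({"Cronus", "Rhea"}, {"Zeus", "Demeter", "Hades"}),
    ({"Zeus", "Demeter"}, {"Persephone"}),
    ({"Hades", "Persephone"}, set()),  # the uncle-niece union; no children listed here
]
for name, row in zip(individuals, quilt_matrix(individuals, families)):
    print(f"{name:<12}{row}")
```

In the last column the two marks sit in rows belonging to different generations (Hades and Persephone), which is the kind of departure from the diagonal that signals a network rather than a tree.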

Several much larger examples are displayed on the GeneaQuilts website. There is a program that can be downloaded, which takes as its input standard family-history files.

There seems to be no intrinsic reason why this display form could not also be used in phylogenetics.

Monday, October 12, 2015

Buffon and the origin of the tree and network metaphors


I have written before about Georges-Louis Leclerc, Comte de Buffon (1707-1788). (Actually, he was called Georges-Louis Leclerc from 1707-1725, and Georges-Louis Leclerc De Buffon from 1725–1773, before becoming a count.) His role in the development of the theory of organic evolution was such that he is worth considering again here, especially given his important role in introducing the tree and network metaphors in phylogenetics.


Buffon

Buffon is usually credited with being in the top triumvirate of influential people in the development of modern biology, along with Aristotle and Darwin. Buffon followed the lead of the physicist Isaac Newton, by trying to explain natural phenomena solely in terms of other observable natural phenomena, rather than resorting to super-natural explanations. (Indeed, Buffon translated one of Newton's books from Latin to French.)

This was Newton's main contribution to science, his insistence on empirical explanations. He did not invent this idea, but he was the one who effectively created modern science by consistently applying it. Hence the importance of the apple — the explanation for the small-scale phenomenon of a falling apple, which we can see and study experimentally, is the same as for the large-scale orbits of the planets, which we can see but not experiment upon. Consistency of natural explanations, rather than invoking super-natural forces, creates a coherent scientific whole that is amenable to description, explanation and prediction.

Buffon adopted this same scientific approach and applied it to biology. Once again, he did not invent this idea, but he was the one who applied it consistently across all of biology. He did this principally in his Histoire naturelle, générale et particulière, an ambitious work planned to cover all of nature in 50 volumes (it included geology, anthropology and cosmogony, as well as biology). The work was begun in 1749; he and a few collaborators completed 36 volumes before his death in 1788, and 8 more were compiled by others shortly afterwards.

In the process of trying to find natural explanations for all empirically observable biological phenomena, Buffon not unexpectedly encountered the idea of mutation of species, as part of his thoughts about an irreversible history of nature. He thus grappled both with species concepts and with temporal change within and between species. He is therefore credited as the first modern evolutionist, because he introduced the time element in comparative biology, so that common structure is explained in terms of common ancestry. However, his ideas, published over many decades, were often inconsistent: sometimes he was an evolutionist and sometimes not. This seems to be, at least in part, due to increasing religious pressure, as he was an important person in the ancien régime of France, and not in a position to easily reject the teachings of the Catholic church.

By modern standards, Buffon was wrong on most things (see Buffon's genealogical ideas), as was Aristotle — being first means that you are also the first to get it wrong, to one extent or another. This does not in any way reduce the impressive nature of his work as a pioneer. He was not a cataloguer of information like his great Swedish rival von Linné — he wanted to explain things, not organize them, as he was interested principally in causes. He also moved away from trying to explain biology in terms of physics (eg. the concept of universal essences), and tried to explain it in terms of itself.

Metaphors

Of principal interest for this blog is Buffon's role in the development of metaphors for biological relationships. Given his role as an early adopter of evolutionary ideas, he was also an early adopter of metaphors to depict those ideas about historical relationships.

Buffon argued for temporal continuity rather than eternal types, modification of both natural and domesticated species through time (but only up to a certain point), and an underlying unity of organismal types. The latter idea suggested common ancestry for all animals, but Buffon considered and rejected this hypothesis. Indeed, he also rejected the idea that species descend from each other, thus accepting only within-species evolution. He did, however, have a broad concept of species, based on inter-breeding, so that some of his species correspond to modern taxonomic families.

In a previous blog post (The first phylogenetic network 1755) I noted that Buffon put his thoughts into action when he considered the within-species evolution of dog breeds in volume V of his Histoire naturelle. In doing so, he published what is usually considered to be the first avowedly evolutionary diagram. It shows the origin and diversification of dog domestication as known at the time. It includes both temporal and spatial variation among dogs, since Buffon believed that morphological variation was related to different climates, so that climatic differences were the ultimate cause of biological variation.

Although Buffon labeled the diagram as a "Table", in his text he noted that it is [translated] "a table or, if one prefers, a kind of genealogical tree where one may grasp at a glance all the varieties". In modern terms it is actually a hybridization network, since it shows repeatedly that some dog breeds arose as a result of hybridization between other breeds. It is also, of course, a map, since it shows spatial variation, although the geographical content is not strictly respected. The diagram is thus a hybrid of a network and a map.

Note that Buffon used the idea of a tree long before Simon Pallas (1776), who is usually credited with introducing the tree metaphor. However, Buffon was writing solely about within-species relationships, whereas Pallas discussed a much broader scale (specifically, both plants and animals).

Indeed, Buffon's genealogical ideas had first appeared in volume IV of the Histoire naturelle, in 1753 (the same year as Linné's Species Plantarum). In this volume there is a presentation of his ideas on species in "Discours sur la nature des animaux" [Discourse on the nature of animals] and his ideas about animal genealogy in "L'asne" [The ass]. The latter contains this text:
que l'homme et le singe ont eu une origine commune comme le cheval et l'âne; que chaque famille, tant dans les animaux que dans les végétaux, n'a eu qu'une seule souche, et même que tous les animaux sont venus d'un seul animal qui, dans la succession des temps, a produit, en se perfectionnant et en dégénérant, toutes les races des autres animaux. [that man and ape have had a common origin like the horse and the donkey; every family, both in animals and in plants, had only a single stem [stock], and even all the animals came from a single animal which, in the succession of time has produced by perfection and degeneration, all the races of the other animals.]
Buffon was, however, not consistent in his uses of metaphors. This topic is discussed in detail by Giulio Barsanti (1992), and he has provided a convenient chart of Buffon's metaphors — the following version is taken from Ruse and Travis (2009).


Note that Buffon used the traditional chain analogy most often, since this can be used for ancestor–descendant relationships. However, he simultaneously used the tree and map in 1755 (as discussed above), and he effectively replaced the tree with the map after 1780. The map had previously been introduced by von Linné in 1751 ("All plants show affinities on either side, like territories in a geographical map").

It is interesting to see the rapid rise and fall of the family-tree metaphor in the mid 1700s, before its resurgence a century later. The cluster of tree references in 1766 is from "De la dégénération", in volume XIV of Histoire naturelle. "Dégénération" was Buffon's term for evolution.

References

Barsanti G (1992) Buffon et l'image de la nature: de l'échelle des êtres à la carte géographique et à l'arbre généalogique [Buffon and the image of nature: the scale of being to the map and to the family tree]. In: Gayon J (ed.) Buffon 88: Actes du Colloque International [pour le bicentenaire de la mort de Buffon] (Paris-Montbard-Dijon, 14-22 juin 1988), pp. 255-296. Paris: Librairie Philosophique J. Vrin.

Ruse M, Travis J (2009) Evolution: The First Four Billion Years. Belknap Press, Cambridge MA, p 458.

Monday, September 28, 2015

Complex hybridizations in barley and its relatives


In a former blog post I discussed the complex series of polyploid hybridizations that led to modern wheat cultivars (Complex hybridizations in wheat). A recent paper has discussed the even more complex series of polyploid hybridizations involved in the genus Hordeum, which includes cultivated barley:
Jonathan Brassac and Frank R. Blattner (2015) Species-level phylogeny and polyploid relationships in Hordeum (Poaceae) inferred by next-generation sequencing and in silico cloning of multiple nuclear loci. Systematic Biology 64: 792-808.

The authors note:
With nearly half of the species being polyploids (tetra- and hexaploids), including allo- and autopolyploids, the genus Hordeum is a good model to study speciation through polyploidization ... Studies on polyploid taxa are generally impeded by the complex evolution of these organisms, involving recurrent formation, gene loss or retention, and homoeologous recombination ... [However,] Chloroplast DNA is usually maternally inherited in angiosperms, [and thus] can be used to identify the direction of hybrid speciation in polyploids, that is, to determine maternal parents.
Here we present an analysis that is based on 12 nuclear loci, distributed on six of the seven barley chromosomes, and one chloroplast region ... Phylogenetic analyses were conducted on single loci and concatenated data from all loci ... We included 105 individuals representing all 33 species and most subspecies of the genus.
After aligning the sequences from all loci, (i) models of sequence evolution were determined for each locus. Gene trees were calculated for each locus with (ii) the sequences derived from the diploid taxa by Bayesian phylogenetic inference (BI), and (iii) sequences from all diploid plus, consecutively, single polyploid individuals were clustered by neighbor-joining analysis to determine phylogenetic affiliation (phasing) of the homoeologous gene copies found in polyploid taxa. Concatenated sequences from all loci (supermatrices) were used for BI of (iv) diploid and (v) diploid plus phased homoeologs of polyploid taxa. (vi) A MSC-based [multispecies coalescent] analysis was conducted to infer species trees from gene trees for the diploid individuals. (vii) To date nodes within the Hordeum phylogeny a molecular clock approach was conducted together with the MSC. (viii) A BCA [Bayesian concordance analysis] was conducted on the diploid taxa to estimate gene tree incongruences. Finally, (ix) chloroplast matK sequences were analyzed by BI to detect the maternal lineages in allopolyploids.
The results of this analysis were summarized into a scheme where polyploids were integrated in the modified diploid species tree. The MSC topology was modified to take into account the incongruences between the different methods and to integrate the inferred extinct lineages. The polyploid relationships could mostly be identified with confidence. The wide genetic variety found in some species probably indicates multiple origins of such polyploids.

This was obviously a rather complex procedure, and the use of a MUL-tree would be simpler for much of the work. The authors ended up drawing a hybridization network manually, as explained in the legend to their figure. (Note that MSC is the multi-species coalescent and BI is Bayesian inference.)


The authors do finally note that "It could also be interesting to test the strategy suggested by Marcussen et al. (2015) to evaluate potential network topologies for such a particularly complex polyploid taxon." This would certainly be a more direct way to produce a phylogenetic network for polyploids.

References

Jakob SS, Blattner FR (2006) A chloroplast genealogy of Hordeum (Poaceae): long-term persisting haplotypes, incomplete lineage sorting, regional extinction, and the consequences for phylogenetic inference. Molecular Biology and Evolution 23: 1602-1612.

Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS (2015) From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Systematic Biology 64: 84-101.

Wednesday, September 23, 2015

Uses of MUL-trees for evolutionary networks


Creating evolutionary phylogenetic networks is currently a somewhat ad hoc procedure, with a number of competing strategies based on various models of how gene flow occurs.

One possibility is to use multi-labeled trees. Here, multiple gene trees can be represented by a single multi-labeled tree (a MUL-tree), which in turn can also be represented as a reticulating network. A MUL-tree has leaves that are not uniquely labeled by a set of species (ie. each species can appear more than once). This means that multiple gene trees can be represented by a single MUL-tree, with different combinations of the leaf labels representing different gene trees.
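The defining property of a MUL-tree, repeated leaf labels, is easy to check mechanically. The following is a minimal sketch, assuming the tree is available as a simple Newick string without quoted labels or comments; the function names and the example trees are hypothetical.

```python
import re
from collections import Counter


def leaf_label_counts(newick):
    """Count how often each leaf label occurs in a Newick string.

    A leaf label is taken to be a name that directly follows '(' or ','.
    Quoted labels and comments are not handled in this sketch.
    """
    labels = re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
    return Counter(labels)


def is_mul_tree(newick):
    """True if at least one leaf label appears more than once."""
    return any(n > 1 for n in leaf_label_counts(newick).values())


# Hypothetical examples: species B appears twice in the first tree.
mul_tree = "((A,B),(B,C));"
single_tree = "((A,B),C);"

print(leaf_label_counts(mul_tree))
print(is_mul_tree(mul_tree), is_mul_tree(single_tree))
```

The labels with a count above one are the candidates for reticulation when the MUL-tree is later folded into a single-labeled network.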

The most obvious uses of a MUL-tree are where there are multiple copies of genes within an organism, as each gene copy can be represented independently in the MUL-tree. This will apply when there has been gene duplication, for example, or when there has been polyploidy (ie. multiple copies of the entire genome). Computer programs such as PADRE or MulRF can then be used to derive an optimal single-labeled species network from the MUL-tree.

However, this same strategy can also be used whenever there is conflict among gene trees. In this scenario, the conflicting genes are treated as different leaves in the MUL-tree. One labeled leaf would have the data for the first gene, with the second gene entered as missing data, and the second leaf would then have the inverse situation (the data for gene one are missing and those for gene two are present).
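The duplication scheme just described can be sketched as follows. This is a hedged illustration only, not the procedure of any particular program: the function name, the `?` missing-data symbol, the taxon and gene names, and the assumption that every taxon has an aligned sequence for every gene are all choices made for the example.

```python
MISSING = "?"


def taxon_duplication_matrix(sequences, conflicting):
    """Duplicate each conflicting taxon, once per gene.

    sequences:   {taxon: {gene: aligned sequence}} (every taxon has every gene)
    conflicting: set of taxa whose genes give conflicting signals
    Each duplicate keeps the data for one gene and has missing data elsewhere.
    """
    genes = sorted({g for per_gene in sequences.values() for g in per_gene})
    out = {}
    for taxon, per_gene in sequences.items():
        if taxon in conflicting:
            for kept in genes:
                out[f"{taxon}_{kept}"] = {
                    g: per_gene[g] if g == kept else MISSING * len(per_gene[g])
                    for g in genes
                }
        else:
            out[taxon] = dict(per_gene)
    return out


# Hypothetical two-gene data set in which sp2 shows gene tree conflict.
sequences = {
    "sp1": {"cp": "ACGT", "nuc": "GG"},
    "sp2": {"cp": "ACGA", "nuc": "GC"},
}
matrix = taxon_duplication_matrix(sequences, conflicting={"sp2"})
for label, per_gene in matrix.items():
    print(label, per_gene)
```

Note that the missing data in the resulting matrix are concentrated in the duplicated rows, so their distribution is non-random by construction.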

This can be illustrated by a recent example of the Erica (heather plants) genus, from Mugrabi de Kuppler et al. (2015). The authors were interested in whether the observed gene tree conflict in Erica lusitanica could be the result of hybridisation between morphologically dissimilar species, as this has previously been suggested.

They collected sequence data for a number of plastid regions as well as the nuclear ribosomal ITS region. The observed conflict was between the plastid (chloroplast) and nuclear sequences. They note:
A targeted supermatrix strategy was employed, whereby more variable ITS and trnL-trnF spacer sequences were obtained for most samples, and the other, mostly less variable chloroplast markers were added for selected taxa in order to improve resolution of deeper nodes in the chloroplast tree. 
Where gene tree conflict was identified, the taxa with conflicting phylogenetic signals were duplicated in a combined matrix following the approach of Pirie et al. (2008, 2009) in order to infer a single multi-labelled "taxon duplication" tree. [This occurred for only one species. Thus, one leaf label for E. lusitanica has the data only for the chloroplast sequences, and the other leaf has the data only for the nuclear sequence.]


The figure shows the result of the coalescent BEAST analysis of the multi-labeled data, with E. lusitanica appearing twice in the MUL-tree. Inset is the resulting single-labeled network, with E. lusitanica appearing once, as a reticulation.

This is an interesting application of MUL-trees. However, there are two issues that I wish to highlight about the procedure.

First, the reticulation as shown in the example is not actually time-consistent, given that the horizontal axis of the MUL-tree is scaled to time. This could, for example, be resolved by having "E. lusitanica CP" attached to a ghost lineage.

Second, the data matrix from which the MUL-tree is created will have a non-random distribution of missing data, by definition. This non-randomness is known to have a bad effect on likelihood analyses (Simmons 2012). In the example, the non-randomness is exacerbated by further non-randomness in the acquisition of the plastid sequences. So, if this form of MUL-tree analysis is to be pursued then maybe this potential limitation should be investigated.

References

Mugrabi de Kuppler AL, Fagúndez J, Bellstedt DU, Oliver EGH, Léon J, Pirie MD (2015) Testing reticulate versus coalescent origins of Erica lusitanica using a species phylogeny of the northern heathers (Ericeae, Ericaceae). Molecular Phylogenetics and Evolution 88: 121-131.

Pirie MD, Humphreys AM, Galley C, Barker NP, Verboom GA, Orlovich D, Draffin SJ, Lloyd K, Baeza CM, Negritto M, Ruiz E, Cota Sanchez JH, Reimer E, Linder HP (2008) A novel supermatrix approach improves resolution of phylogenetic relationships in a comprehensive sample of danthonioid grasses. Molecular Phylogenetics and Evolution 48: 1106-1119.

Pirie MD, Humphreys AM, Barker NP, Linder HP (2009) Reticulation, data combination, and inferring evolutionary history: an example from Danthonioideae (Poaceae). Systematic Biology 58: 612-628.

Simmons MP (2012) Radical instability and spurious branch support by likelihood when applied to matrices with non-random distributions of missing data. Molecular Phylogenetics and Evolution 62: 472-484.

Wednesday, June 24, 2015

Trees, networks and dogs


One of the perennially most popular posts in this blog has been the one about the domestication of dogs: Why do we still use trees for the dog genealogy?

In that post I noted that, up to 2012, there were three distinct trends in the presentation of the genealogy of dog breeds:
  1. the study of whole-genome data, in which the results are presented solely as a neighbor-joining tree
  2. the study of mtDNA sequence data, in which the results are presented both as a tree and as a haplotype network
  3. the study of combined Y-chromosome and mtDNA sequence data, in which the results are presented solely as a haplotype network.
This pattern has continued. For example, the following diagram is taken from:
Skoglund P, Ersmark E, Palkopoulou E, Dalén L (2015) Ancient wolf genome reveals an early divergence of domestic dog ancestors and admixture into high-latitude breeds. Current Biology 25:1515-1519.

The tree is based on mitochondrial genome data for the highlighted fossil, compared to the mitochondrial sequences of modern-day dogs and wolves, as well as ancient canids. The use of a phylogenetic tree seems to be based on the idea that mitochondria consist of tightly linked genes that are uniparentally inherited. However, neither of these characteristics is universal, and so a network might be more appropriate.

The dog genealogy is recognized as being characterized by introgression with wolves, as the authors themselves note. Also, the origin of dogs is not directly from wolf ancestors, but both modern wolves and modern dogs are derived from a common ancestor. For example, this next diagram is from:
Freedman AH, et alia (2014) Genome sequencing highlights the dynamic early history of dogs. PLoS Genetics 10:e1004016.

The width of each population branch is proportional to inferred population size. Note that wolves and dogs originated at roughly the same time, as the result of bottlenecks in the ancestral population size. Wolves diversified slightly earlier than dogs. Also, Skoglund et al. dispute the dating of the splits, suggesting that the dog-wolf divergence was "at least 27,000 years ago".

As a final note, there is a tendency to credit Charles Darwin with originating just about everything in the study of genealogy, although he was a synthesizer as much as an innovator. For example, David Grimm suggests (Dawn of the dog. Science 348: 274-279):
Charles Darwin fired the first shot in the dog wars. Writing in 1868 in The Variation of Animals and Plants under Domestication, he wondered whether dogs had evolved from a single species or from an unusual mating, perhaps between a wolf and a jackal.
However, the first hypothesized genealogy was actually published more than a century earlier, by Georges-Louis Leclerc, comte de Buffon (see the blog post on The first phylogenetic network), who suggested a common origin with wolves.

Wednesday, May 20, 2015

A limitation of turning splits graphs into reticulate networks


Splits graphs are a useful way of displaying contradictory information within evolutionary datasets, either incompatible characters (ie. those that cannot fit onto a single tree) or incompatible trees. Since the graphs are unrooted, they are usually treated as a form of multivariate data display, rather than interpreted as depicting evolutionary history.
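Whether two splits can fit onto a single tree can be tested with the standard pairwise criterion: the splits A|B and C|D are compatible if at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The following minimal sketch applies that test; the function name and the five-taxon example are hypothetical.

```python
def splits_compatible(side1, side2, taxa):
    """Pairwise compatibility test for two splits of the same taxon set.

    side1, side2: one side of each split (any iterable of taxon names)
    taxa:         the full taxon set
    The splits are compatible if at least one of the four pairwise
    intersections of their sides is empty.
    """
    taxa = frozenset(taxa)
    a, c = frozenset(side1), frozenset(side2)
    b, d = taxa - a, taxa - c
    return any(not (x & y) for x in (a, b) for y in (c, d))


taxa = "ABCDE"
print(splits_compatible("AB", "ABC", taxa))  # nested sides: fit on one tree
print(splits_compatible("AB", "BC", taxa))   # overlapping sides: conflict
```

A splits graph draws a pair of incompatible splits as a box of parallel edges, which is where the reticulate appearance of these graphs comes from.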

However, it is possible to turn a splits graph into an evolutionary network (sometimes called a reticulation network) once a root is specified (Huson and Klöpper 2007). This is true irrespective of whether the splits are derived from character data (Huson and Kloepper 2005), in which case it is usually called a recombination network, or whether they come from a set of trees (Huson et al. 2005), in which case it is usually called a hybridization network.

The SplitsTree4 program (Huson and Bryant 2006) carries out the relevant calculations under algorithms entitled Reticulation Network, Recombination Network or Hybridization Network, although these all produce the same outcome once the set of splits has been determined. These options are no longer available from the menu system (in the current release of the program), but they can still be effected via the Configure Pipeline menu option.

The purpose of this post is to point out that the calculations are affected by the same limitation that has been noted before under other circumstances (see the post A fundamental limitation of hybridization networks?). That is, reticulation cycles with three or fewer outgoing arcs are not uniquely defined with respect to rooted splits: there are three equally optimal mathematical solutions. In practice, this means that in a situation where two taxa are involved in producing a third taxon we cannot decide from the splits alone which is the reticulate taxon and which are the two "parents" (eg. which one is the hybrid).

An example

I will illustrate this point with a simple example. The data are taken from Wendel et al. (1991). The data consist of the presence-absence of 76 nuclear allozyme loci and 13 nuclear restriction sites, for five plant taxa, one of which is the outgroup. The first graph shows the splits graph using the default options in SplitsTree4 — both the NeighborNet and the ParsimonySplits analyses produce the same graph, which identifies a single reticulation.


In SplitsTree4, the outgroup for rooting the splits graph must be the first taxon in the datafile, which in this case is Gossypium robinsonii. The following three graphs are the result of then choosing the ReticulateNetwork analysis. They differ by having, respectively, Gossypium bickii as the final taxon in the dataset, Gossypium sturtianum as the final taxon, and Gossypium australe + Gossypium nelsonii as the final two taxa. Note that the ReticulateNetwork algorithm always identifies the dataset's final taxon as the reticulate one.




So, the hybrid taxon is indeterminable from the data given, and the algorithm simply makes a (consistent) choice from among the three possibilities. [That is, the algorithm chooses as the reticulate arc whichever of the three outgoing arcs is latest in the dataset.]
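The ambiguity, and the tie-break described above, can be mimicked in a few lines. This is only a sketch of the behaviour as described in this post, not SplitsTree4's actual code; the function names are hypothetical, and the G. australe + G. nelsonii pair is collapsed to a single label for simplicity.

```python
def candidate_reticulations(cycle_taxa):
    """List the equally good solutions: each taxon in turn is the hybrid."""
    return [
        {"reticulate": t, "parents": [p for p in cycle_taxa if p != t]}
        for t in cycle_taxa
    ]


def choose_by_file_order(cycle_taxa, file_order):
    """Pick as reticulate the cycle taxon that occurs latest in the data file."""
    return max(cycle_taxa, key=file_order.index)


cycle = ["G. bickii", "G. sturtianum", "G. australe"]

# Two orderings of the same data, with the outgroup first in both.
order1 = ["G. robinsonii", "G. sturtianum", "G. australe", "G. bickii"]
order2 = ["G. robinsonii", "G. bickii", "G. australe", "G. sturtianum"]

for solution in candidate_reticulations(cycle):
    print(solution)
print(choose_by_file_order(cycle, order1))
print(choose_by_file_order(cycle, order2))
```

The data are identical in both orderings, so the change in the chosen reticulate taxon reflects only the order of the file, not any evidence about which taxon is the hybrid.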

The original authors suggest that the nuclear and other data "indicate a biphyletic ancestry of G. bickii. Our preferred hypothesis involves an ancient hybridization, in which G. sturtianum, or a similar species, served as the maternal parent with a paternal donor from the lineage leading to G. australe and G. nelsoni." This doesn't quite match any of the three rooted networks shown above.

References

Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254-267.

Huson DH, Kloepper TH (2005) Computing recombination networks from binary sequences. Bioinformatics 21: ii159-ii165.

Huson DH, Klöpper TH (2007) Beyond galled trees – decomposition and computation of galled networks. Lecture Notes in Bioinformatics 4453: 211-225.

Huson DH, Klöpper T, Lockhart PJ, Steel MA (2005) Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.

Wendel JF, Stewart JM, Rettig JH (1991) Molecular evidence for homoploid reticulate evolution among Australian species of Gossypium. Evolution 45: 694-711.

Monday, April 20, 2015

Domestication networks are complicated


Phylogenetic networks were developed as a professional tool for displaying complicated evolutionary histories. However, this does not mean that such networks cannot be used elsewhere.

As an example, Pete Buchholz produces drawings of dinosaurs as the artist Ornithischophilia at the DeviantArt web site. Among these drawings are some phylogenies, and two of them are networks.

The first one is labelled Citrus is complicated, and refers to the origin of citrus cultivars.


The phylogenetic tree at the left is sourced from the American Journal of Botany, while the network at the right is from information in Wikipedia. The combination of the two appears to be original to the artist. The network is read from left to right; for example, the Limequat is a hybrid of the Key Lime and the Kumquat. Compared to the original Wikipedia text, the picture speaks a thousand words.

The second network is labelled Apples are complicated, and refers to the origin of some of the apple cultivars.


No source is given for the information, but I assume that it also comes from Wikipedia. Note that, as before, the network is read from left to right, but this time there is a time scale at the top. The artist refers to it as a "spaghetti diagram", and notes that:
Colors are based on the major parent that the "story" revolves around; purple for Honeycrisp, Yellow for Golden Delicious, Red for Jonathan, Maroon for Red Delicious, Orange for Cox's Orange Pippin, Teal for McIntosh, Green for Granny Smith, and Blue for Topaz.

Wednesday, April 8, 2015

Using networks, not trees, to display hybrids


Phylogenetic networks are intended to display reticulate evolutionary histories, rather than strictly divergent or transformational histories. This idea applies both to species and higher taxa (where the ancestors might be inferred), and to individuals and populations (where some of the ancestors might be sampled). However, the literature is still replete with studies that use one or more phylogenetic trees for displaying reticulate phylogenies.

A recent example is shown by: Umer Chaudhry, Elizabeth M. Redman, Muhammad Abbas, Raman Muthusamy, Kamran Ashraf, John S. Gilleard (2015) Genetic evidence for hybridisation between Haemonchus contortus and Haemonchus placei in natural field populations and its implications for interspecies transmission of anthelmintic resistance. International Journal for Parasitology 45: 149-159.

These authors sampled nematode parasites from sheep, goats, cattle and buffaloes at abattoirs in Pakistan and southern India. These parasites were morphologically characterized as being predominantly either Haemonchus contortus or Haemonchus placei. The worms were then genotyped in several ways, including: SNPs of rDNA ITS-2, microsatellite markers, sequences of nuclear isotype-1 of β-tubulin, and sequences of mitochondrial NADH dehydrogenase subunit 4. The genotyping revealed several individual worms that were considered to be inter-species F1 hybrids.

The phylogenetic tree from the β-tubulin sequences is shown in the first figure. There were 25 haplotypes identified among the worms. Most of the worms were homozygous, with haplotypes that were identified as either H. contortus or H. placei. However, five worms were discovered to be heterozygous, with one haplotype considered to have come from each of the species.


The hybrid status of the worms is shown in the phylogenetic tree by having the hybrids appear twice, once for each of their haplotypes, with the other worms appearing only once. Thus, the actual reticulate history is not made visually obvious.

A better approach would be to use a phylogenetic network. This is straightforward in this case. From the perspective of the worms (rather than the haplotypes), the phylogenetic tree is a so-called MUL-tree, in which some of the taxon labels appear multiple times (and some appear only once). The labels that appear once represent homozygous worms, which can be seen as being "monoploid" for this locus. The labels that appear twice represent heterozygous worms, which can be seen as being "diploid".
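The way in which genotypes translate into MUL-tree leaves can be sketched as follows. This is a minimal illustration with made-up worm and haplotype names, not the data from the paper; the function name is also hypothetical.

```python
def mul_tree_leaves(genotypes):
    """Turn per-individual genotypes into MUL-tree leaves.

    genotypes: {individual: (haplotype1, haplotype2)}
    A homozygous individual yields one leaf, a heterozygous one yields two,
    so heterozygotes are the labels that appear twice in the tree.
    """
    leaves = []
    for individual, haplotypes in genotypes.items():
        for hap in sorted(set(haplotypes)):
            leaves.append((individual, hap))
    return leaves


# Hypothetical genotypes: worm2 carries one haplotype from each species.
genotypes = {
    "worm1": ("Hc1", "Hc1"),
    "worm2": ("Hc2", "Hp1"),
    "worm3": ("Hp1", "Hp1"),
}
leaves = mul_tree_leaves(genotypes)
print(leaves)
```

Each pair is one leaf of the haplotype tree, labelled by the worm that carries it; worm2 is the only label that occurs twice.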

MUL-trees where the labels represent different ploidy levels can easily be turned into a network using the PADRE program. The result is shown in the next figure, which is therefore a hybridization network.


The actual history of the worms is now clear. Interestingly, one of the hybridization events seems to be older than the other four.

As an aside, it is also worth pointing out a mis-interpretation of the phylogenetic tree produced from the mitochondrial ND4 sequences. This tree is shown in the next figure — I have added the annotations at the right.


The phylogeny shows 12 haplotypes considered to be H. contortus and 14 haplotypes considered to be H. placei. One of the hybrids clearly has a H. contortus haplotype, indicating that its maternal parent came from this species. However, the other four hybrids cannot be unequivocally identified as having H. placei mothers (as claimed by the authors), as their haplotypes are all sisters to the H. placei haplotypes — all of the H. placei haplotypes share a common ancestor that is not shared with the hybrids. Given the root of the tree, H. placei is a more likely identification than is H. contortus, but the tree does not provide unequivocal evidence.

Wednesday, March 25, 2015

Network of Australian marsupials


In the literature, phylogenetic trees often appear even when the paper is discussing non-tree evolutionary histories.

A case in point is the paper by: Susanne Gallus, Axel Janke, Vikas Kumar, Maria A. Nilsson (2015) Disentangling the relationship of the Australian marsupial orders using retrotransposon and evolutionary network analyses. Genome Biology and Evolution, in press.

The authors discuss the relationship between the four Australian marsupial orders, and use data from transposable element (retrotransposon) insertions for resolving the inter- and intra-ordinal relationships of the Australian and South American orders. They plot the retrotransposon presence/absence onto a tree derived from alignments of 28 nuclear gene fragments. This is shown in the first figure, with the retrotransposons indicated as dots on the internal branches.


For comparison, the next figure is a Median-Joining network based on the presence/absence of the retrotransposons.


With the exception of the Monito del monte, Shrew opossum and Western quoll, the network matches the basic tree structure. However, it emphasizes more strongly the fact that the retrotransposons do not resolve the relationships among the Marsupial orders. As the authors note:
The retrotransposon insertions support three conflicting topologies regarding Peramelemorphia, Dasyuromorphia and Notoryctemorphia, indicating that the split between the three orders may be best understood as a network ... The rapid divergences left conflicting phylogenetic information in the genome possibly generated by incomplete lineage sorting or introgressive hybridisation, leaving the relationship among Australian marsupial orders unresolvable as a bifurcating process million years later.

Monday, March 23, 2015

Phylogenetic network of pairwise alignment methods


Phylogenetic networks can be used to illustrate the history of any set of objects or concepts, provided that this history is a divergent one (ie. the history is not simply the transformation of objects through time).

Since I have recently been writing about sequence alignments, it is worthwhile to show an example of applying a network to sequence alignment programs. This comes from the paper by Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13: 238.

The authors discuss programs that map reads from a sample genome onto a reference sequence. They note: "the relationship between many existing alignment methods is qualitatively illustrated in the figure."


Their legend reads:
The applications / corresponding computational restrictions shown are: (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.
The reticulation refers to their new program, which "maps reads using coarse alignment methods developed during WGA [whole genome alignment] studies, while speeding up these methods by using the advanced data structures employed in many NGS [next generation sequencing] mapping studies."