
Monday, June 22, 2020

Can we dig too deep? Signal conflict in mitochondrial genes of land plants


In an earlier post, Supernetworks and gene tree incongruence, I illustrated what Supernetworks can tell us about incongruent mitochondrial gene trees, using the dataset of Sousa et al. (PeerJ 8: e8995, 2020). Here, I will take a closer look at these data, in order to illustrate another point.

Fig. 1 A Supernetwork based on 34 individual mitochondrial gene trees (atp1 and atp8 missing due to an alignment glitch). Groups (clades) referring to splits not found in any of the gene trees, including the "Septaphyta", are shown in gray font; groups referring to clades seen in Sousa et al.'s preferred tree are shown in blue.

Sousa et al.'s set of analyses aimed to filter signal in order to get a better all-inclusive tree, and succeeded in producing support for a "Septaphyta" clade, comprising liverworts and mosses, which is a split not found in any of the inferred (Bayesian) gene trees.

Fig. 2 Comprehensive but branch-length- and frequency-ignorant Supernetwork of Sousa et al.'s Bayesian MRC gene trees (trees are provided as supplementary online data on Zenodo), inferred from nucleotide sequences. The trees show several alternatives (colored and labeled) regarding the sister lineages of mosses and liverworts. Any split found in at least one gene tree is represented in this Supernetwork.

This split did occur, however, when the amino acid sequences were used, instead of the nucleotide sequences.

Fig. 3 Comprehensive branch-length- and frequency-ignorant Supernetwork of Sousa et al.'s amino acid Bayesian MRC gene trees. The sister taxon of liverworts is either the mosses (= Septaphyta clade) or the outgroup green alga Coleochaete (further inclusive splits include subsequently more outgroups, placing the liverworts as sister to all other land plants). Realized alternatives for mosses further include being sister to hornworts (in which case liverworts would be sister to higher land plants), or sister to all land plants (brown split: green algae + mosses | rest).

The alternative to the Septaphyta clade, which does appear in the gene trees, recognizes the liverworts as the closest relative of the vascular plants, while the mosses are resolved as the first branch. As Sousa et al. point out:
The tree inferred from the concatenated nucleotide data set of 36 mitochondrial genes shows mosses as the sister-group to the remaining land plants, as previous analyses of mitochondrial nucleotide data have shown (Liu et al., 2014). ...
The concatenated tree hence only reflects a minor aspect of the Supernetwork (Fig. 1) of the individual gene trees:
... However, the mosses are replaced by the liverworts in the same position when analysing codon-degenerate recoded data.
This seems to be the preferred placement when summarizing the gene trees using the Supernetwork.

In this post, we will take a closer look. Is there a deep, easily obscured signal for the Septaphyta clade in the mitochondria of plants, a signal that only surfaces in some amino acid gene trees (Fig. 3) and the filtered concatenated tree (Sousa et al.'s fig. 2)? Or is it just a branching artifact?

An example of a (weakly supported) artificial clade (cf. Schliep et al., Methods Ecol. Evol. 8: 1212–1220, 2017). The trees in (a), (b) and (c) give the paternal (Y-chromosome), maternal (mt genes) and biparental (nuclear autosomal introns) genealogies. While the paternal and biparental genealogies are compatible (congruent), the maternal is in strong conflict. When combining these three data sets, this substantial conflict decreases branch support and results in the artificial red clade. The support for the artificial clade is low but worrisome: the tips differ substantially from each other and from the hypothetical, alternative common ancestors, and there are literally no patterns in the data supporting a sister relationship of Sloth and Sun bears. The artificial red clade is a secondary product of the inference: the sister relationship of Polar and Brown Bear is trivial (data and inference wise), with the American Black Bear the more likely sister (note the length of competing branches in a/c vs. b). The signal for a clade of Asian Black and Sloth bears is less pronounced; here the mt genealogy clade is strongly incongruent and forces the combined tree to resolve the conflict by introducing the artificial red clade.

Starting simply

The Supernetwork in Fig. 1 shows that, no matter which gene we look at, liverworts and mosses were originally most similar to each other, and, absolutely speaking, still close to the (hypothetical) mitochondrion of the ancestor of all land plants. We can illustrate the general situation about the signals using a Neighbor-net inferred from the concatenated data of all 36 genes.

Fig. 4 Neighbor-net based on uncorrected p-distances inferred from the concatenated gene data.

Note that we used no substitution model, only a naive distance matrix, for a set of coding genes that include saturated third codon positions. Some phylogenetic relationships are obviously based on trivial signals: the Neighbor-net in Fig. 4 includes ± prominent edge bundles defining neighborhoods in line with generally accepted clades (in bold). To capture these evolutionary lineages (some going back nearly half a billion years), we just need the raw data but no sophisticated phylogenetic analysis.
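What "uncorrected p-distance" means can be shown with a minimal sketch. The toy sequences, the taxon labels and the choice to skip sites with gaps or ambiguity codes (pairwise deletion) are assumptions made for illustration only; the software used for Fig. 4 may treat such sites differently.

```python
from itertools import combinations

VALID = set("ACGT")  # only unambiguous bases are compared (assumption)

def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites where both sequences
    show an unambiguous base; no substitution model is applied."""
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID and b in VALID:
            compared += 1
            if a != b:
                differing += 1
    return differing / compared if compared else float("nan")

def distance_matrix(alignment):
    """Pairwise p-distances for a dict {taxon name: aligned sequence}."""
    names = sorted(alignment)
    dist = {(n, n): 0.0 for n in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist

# Hypothetical toy alignment, not taken from the Sousa et al. matrix.
toy_alignment = {
    "moss": "ATGGCTAAAC",
    "liverwort": "ATGGCTAAAT",
    "alga": "ATGACTCA-G",
}
dist = distance_matrix(toy_alignment)
```

A matrix like this is all a distance-based network method needs as input, which is why saturated sites feed directly into the result.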

In the case of the probably monophyletic gymnosperms, the gymnosperm neighborhood competes with a neighborhood excluding the Gnetidae Welwitschia, which is the most distinct of the seed plants in this taxon set (this applies to Gnetidae in general, no matter which data are used). In addition, we see a neighborhood defined by the pink edge bundle: a split of green algae + Welwitschia versus all other land plants. This is a case of obvious long-edge attraction, enforced (here) by missing data (Welwitschia lacks data for 12 out of the 34 genes).

The center of the graph with respect to all tips would be a candidate for the ancestral mitochondrion of the common ancestor of all land plants. Closest to this point are the Septaphyta (mosses + liverworts) and the lycophyte Huperzia (the better represented taxon, only missing out on five genes, while Isoetes misses 15).

One can depict a phylogenetic hypothesis by just dropping the less pronounced neighborhoods in Fig. 4:
  • Most prominent edge bundles define three main clusters (= lineages): green algae, seed plants, and other land plants.
  • Within green algae:
    • Closterium is sister to Gonatozygon, next is Roya → Zygnematales
    • there is no prominent edge bundle connecting Chaetosphaeridium with the Zygnematales; its closest relative is, however, the last green alga (→ Coleochaetales; only group without a neighborhood).
  • Within seed plants:
    • Brassica may be the sister of Liriodendron (more prominent edge bundle), Oryza complements the clade as first diverged member → angiosperms
    • Cycas is the sister of Ginkgo, the two are sister to the angiosperms
    • this leaves Welwitschia as the first diverged branch.
  • Taking the green algae as outgroups:
    • the ferns are the sister group of the seed plants (edges longer than the alternative of a primitive land plant clade)
    • mosses are sister to liverworts (→ Septaphyta); Huperzia shares the same edge bundle but is apparently sister of Isoetes (→ lycophytes), and the lycophytes appear to be ± primitive sisters of ferns and seed plants
    • this leaves the hornworts, a highly coherent group sharing no prominent edge bundles with any other member of the land plant cluster, and hence are a candidate for the first diverging land plant lineage.
This is a tree hypothesis that is strikingly similar to Sousa et al.'s preferred tree.

Sousa et al.'s fig. 2

The only differences lie in terminal subtrees (Oryza as sister to Liriodendron; Marchantia-Treubia grade, the position of the latter two within liverworts being unclear based on the Neighbor-net).

Something that is easily overlooked in Sousa et al.'s rooted tree, but that is apparent from the Neighbor-net, is that we should be aware of ingroup-outgroup long-branch attraction (LBA). The green algae are not only highly divergent but also very distant from all ingroup taxa, the land plants; and the first ingroup branch in Sousa et al.'s tree has the longest root.

Additive and subtractive support

In principle, when comparing single gene tree samples to combined trees, we face four sorts of signals in our data:
  • Very strong signals imprinted in one or a few genes; they will outcompete any conflicting signal, and possibly even be reinforced by it. Walker et al. (PeerJ 7: e7747, 2019) studied this phenomenon for the case of angiosperm plastomes (see also our miniseries The emperor has no clothes on).
  • Phylogenetically sorted, weak but consistent signals; they will add up, as branch support will increase with each gene added. In this category fall signals reflecting deep splits obscured by terminal noise, when analyzing a single gene or few genes – like the one found by Sousa et al. supporting a Septaphyta clade.
  • Disparate gene histories; eg. because of intergenomic recombination. The support will be diminished with every added gene not sharing the same history.
  • General conflict; eg. when combining data from different genomes reflecting different genealogies, such as combining chloroplast (product of biogeographic history) and nuclear data (product of speciation processes) of tree genera. This will be expressed by split bootstrap (BS) support, and may result in artificial clades in the combined/concatenated tree (eg. bear example shown above).
Adding to this is the absence of signal: short-branch culling, a special case of long-branch attraction, which could also explain the inference of a (paraphyletic) Septaphyta clade. If there are few tips in the data that are close (absolute, not only regarding their phylogenetic distance) to the all-ancestor without clear affinities, they may be collected in a subtree, being leftovers from optimizing all other tips with certain affinities and higher distance to the all-ancestor.


Fig. 5 Short-branch culling. Let's assume liverworts are the sister clade of higher land plants (an alternative with near-unambiguous support from cox1). The signal for this in mitochondrial data is weak (short root). On the other hand, there is a high risk for ingroup-outgroup long-branch attraction (LBA) leading eventually to an artificial Septaphyta clade. Because of (inevitable) LBA, even though the false branch is very short, its support can be high (unambiguous when using Bayesian inference).

By compiling the support for all alternatives, we can assess where the support is additive or subtractive. We do this using my re-analysis, not Sousa et al.'s Bayesian analysis, because:
  1. BS support is more sensitive to internal signal conflict than Bayesian PP,
  2. to extract this information, we need the tree samples used to establish the branch support.
When doing this, we find that the split defining the Septaphyta clade is not only missing from the nucleotide gene trees but also rarely found in the BS pseudoreplicate samples. Only for seven gene regions (atp4, atp8, nad2, rpl16, rps2, rps3, rps13) do we find BS ≥ 25; the highest support comes from rps3 (BS = 65; however, the split is not found in the corresponding Bayesian MRC of Sousa et al.).

On the other hand, the main alternatives find much higher and more consistent support, as shown here.

Fig. 6 Competing support for (purple) and against a Septaphyta clade (greens and yellows). Placing the hornworts as sister to all other land plants (pink) is compatible with the hypothesis of a Septaphyta clade as well as the competing alternative of placing the liverworts as sister to higher land plants; note the high support from the cox1 gene for a corresponding tree. * For these genes, no hornwort data were included or available.
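Compiling split support from a bootstrap tree sample can be sketched as follows. This is a toy illustration under stated assumptions, not the workflow behind Fig. 6: trees are written as nested tuples instead of being read from Newick files, the five tip labels are placeholders, and the helper names (leaf_set, splits, split_support) are hypothetical.

```python
from collections import Counter

def leaf_set(node):
    """All tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def splits(tree):
    """Non-trivial splits of a tree, each stored as the side that does
    not contain the alphabetically first taxon (so rooting is ignored)."""
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        # both sides of a non-trivial split hold at least two taxa
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side if anchor not in side else all_taxa - side)
        for child in node:
            walk(child)

    walk(tree)
    return found

def split_support(replicates):
    """Percentage of replicate trees in which each split occurs."""
    counts = Counter()
    for tree in replicates:
        counts.update(splits(tree))
    return {s: 100.0 * c / len(replicates) for s, c in counts.items()}

# Four hypothetical pseudoreplicate trees over five placeholder tips.
replicates = [
    ("alga", ("hornwort", (("liverwort", "moss"), "vascular"))),
    ("alga", ("moss", ("hornwort", ("liverwort", "vascular")))),
    ("alga", ("hornwort", (("liverwort", "moss"), "vascular"))),
    ("alga", ("hornwort", ("moss", ("liverwort", "vascular")))),
]
support = split_support(replicates)
septaphyta = frozenset({"liverwort", "moss"})
```

Because every split seen in any replicate is tabulated, competing alternatives can be compared gene by gene, including those that never make it into the best-known tree.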

Short-branch culling, a special form of ingroup-outgroup LBA

Now, my BS analyses were deliberately naive, because they did not apply any data partitioning. However, both liverworts and mosses have short branches while the outgroup, the green algae, are extremely long-branched. If substitution saturation is an issue for misplacing either liverworts or mosses as sisters to all other land plants, then there should also be ingroup-outgroup LBA. A false split of liverwort + outgroup versus the rest, or moss + outgroup versus the rest, has a lower chance of being supported than would a false hornwort + outgroup versus the rest split. The latter directly opens the door for a Septaphyta clade (see Fig. 5).

Let's have a look at the trees of the four genes supporting the Septaphyta split, as the best alternative. ("AA tree/PP" is the amino acid tree provided by Sousa et al.; BS support refers to my unpartitioned ML analyses)
  • atp8: The AA tree is a star tree (comb), strongly distorted by LBA: a Coleochaetales + seed plants | Zygnematales + all other land plants split has PP = 1; the short- and long-branched lycophytes are not resolved as sisters.
  • rpl16: Also here, the AA tree is star-like regarding deep relationships: (i) green algae (unresolved), PP = 1; (ii) liverworts, very long root, little internal resolution, PP = 1; (iii) mosses (unresolved), root half as long as for liverworts, PP = 1; (iv) higher land plants, short root, PP = 0.88.
  • rps3: No ingroup-outgroup LBA; the shortest-branched ingroup, the liverworts, is resolved as sister to mosses + rest (PP = 0.77); thus, the AA tree, not affected by saturation issues, rejects the Septaphyta (PP < 0.23).
  • rps13: Again, the AA tree is star-like, with five tips: (i) green algae (PP = 1); (ii) long-rooted hornworts (PP = 1); (iii) liverworts, relatively short root (PP = 1); (iv) mosses, longer root (PP = 1); (v) higher land plants, shortest root (PP = 0.89).
The Septaphyta root is either extremely short or non-existent, as we would expect for a false clade, because there are no character splits in the matrix that support the taxon split.

Fig. 7 Sousa et al.'s amino acid Bayesian MRC trees (top row) compared to codon-naive nucleotide ML trees (bottom row) producing highest BS support for a Septaphyta clade. Note that in two cases, rpl16 and rps13, the 'best-known' ML tree shows a competing split with much lower support.

Typically, since we are looking at a deep split, we would expect that support increases when shifting from (codon-naive) nucleotide to amino acid analysis, because we eliminate terminal noise. However, we observe the opposite (Bayesian PP more easily converges to unambiguous support than BS values). The difference between our codon-naive nucleotide ML and Sousa et al.'s amino acid MRC trees tells us that it is mostly information from the 3rd codon position that triggers a Septaphyta versus the rest split for these four genes — ie. potentially synonymous substitutions that Sousa et al. filtered against.
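How much of a pairwise difference sits in the 3rd codon position can be checked with a few lines. The sketch below assumes an in-frame alignment that starts at a first codon position; the two toy sequences and the function names are hypothetical and only illustrate the bookkeeping, not the analysis behind Fig. 7.

```python
def codon_position_sites(seq, positions):
    """Keep only sites at the given codon positions (1, 2 or 3),
    assuming the reading frame starts at the first column."""
    return "".join(base for i, base in enumerate(seq) if (i % 3) + 1 in positions)

def p_distance(seq_a, seq_b):
    """Uncorrected proportion of differing sites (unambiguous bases only)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        return float("nan")
    return sum(a != b for a, b in pairs) / len(pairs)

# Two hypothetical coding sequences differing only at 3rd codon positions.
seq_a = "ATGGCTAAACGT"  # ATG GCT AAA CGT
seq_b = "ATGGCCAAGCGA"  # ATG GCC AAG CGA

d12 = p_distance(codon_position_sites(seq_a, {1, 2}),
                 codon_position_sites(seq_b, {1, 2}))
d3 = p_distance(codon_position_sites(seq_a, {3}),
                codon_position_sites(seq_b, {3}))
```

In this constructed case, all of the divergence is carried by the 3rd positions, which is the kind of signal that disappears once the data are translated or recoded.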

Where does the high support come from for the Septaphyta clade in their combined tree? That tree is based on a matrix that should have a signal in-between our codon-naive nucleotide and their amino acid analysis.

A five-taxon problem with a glitch

Sousa et al.'s study is exemplary, in that it provides a careful, and well documented, analysis of the combined data. If you want to infer a potentially good tree, this is one way to do it.

However, their Septaphyta clade is most likely a branching artifact. It still combines data that, genuinely, provide not only diffuse but conflicting information about how the main lineages of land plants diverged from each other (Fig. 6). No analysis, no matter how sophisticated and well-crafted, can compensate for the deficits of the underlying data. By filtering out "noise", one also filters out actual conflicting signal. In this case, this is about how liverworts, mosses, and hornworts stand in relation to the extremely long-branched and divergent outgroup, the green algae, and their increasingly evolved siblings, the higher land plants (lycophytes, ferns, and seed plants). It is another example of what I pointed out in last week's post: Big Data invites big (ie. well supported) errors.

It is important to realize that, although we use many more OTUs, we are still looking at a five-taxon problem. When our data supports one split (or prefers it, being biased or not), there are only three more alternatives to select from.
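The counting behind the five-taxon problem can be made explicit with a small sketch. It rests on one reading of the paragraph above, which is an assumption: that the "alternatives" are the splits that can still occur in the same unrooted tree once one split is fixed. The five labels stand for the main lineages, and the function names are hypothetical.

```python
from itertools import combinations

TAXA = frozenset({"green algae", "hornworts", "liverworts",
                  "mosses", "higher land plants"})

def nontrivial_splits(taxa):
    """With five taxa, every non-trivial split is a pair versus a triple,
    so it can be identified by its two-taxon side."""
    return [frozenset(pair) for pair in combinations(sorted(taxa), 2)]

def compatible(side_a, side_b, taxa):
    """Two splits fit on the same tree if at least one of the four
    intersections of their sides is empty."""
    sides_a = (side_a, taxa - side_a)
    sides_b = (side_b, taxa - side_b)
    return any(not (x & y) for x in sides_a for y in sides_b)

all_splits = nontrivial_splits(TAXA)
septaphyta = frozenset({"liverworts", "mosses"})

# Splits that can accompany the Septaphyta split in a fully resolved tree.
partners = [s for s in all_splits
            if s != septaphyta and compatible(septaphyta, s, TAXA)]
```

Of the ten possible non-trivial splits, only three remain once the liverwort + moss split is accepted, no matter how many OTUs represent each of the five lineages.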

Ingroup-outgroup LBA draws the hornworts, as the genetically most distinct (longest-rooted) lineage of the "bryophytes", away from liverworts and the lycophyte Huperzia, which connects the much more diverged higher land plants to the bryophytes. This leaves three alternatives:
  1. Liverworts are the sister of higher land plants. Their mitochondria show some affinity, but only to the lycophytes, mostly the low-divergent and better sampled Huperzia; and often together with the hornworts, ie. a split incompatible with the hornwort-green algae versus the rest split.
  2. Mosses are the sister of higher land plants, but their mitochondria show very little affinity to any of them (including Huperzia). In fact, they seem to have the most primitive of all land plant mitochondria.
  3. Septaphyta are monophyletic, as the trade-off with the least conflict. Being (much) less diverged than the higher seed plants, they are genetically closer, and ± equally close, to the hornworts and the least-evolved higher land plant, the lycophyte Huperzia.
Sousa et al.'s codon-degenerate approach enforced ingroup-outgroup LBA between the hornworts (the worst-sampled ingroup) and the green algae, while decreasing the absolute distance between liverworts and mosses, and increasing their distance to the higher land plants. That is, Alternative 3 outcompetes Alternative 1. Alternative 2 has no support in the data.

Are the mosses sister to all land plants?

Probably not. Just because the Septaphyta clade is an artifact, it doesn't mean the Septaphyta cannot be monophyletic — it just means the mitochondrial genes don't provide any clear signal to support or reject such a hypothesis, or any other alternative. The same applies to the mosses as the first diverging lineage; their position in earlier trees is likely also to be an artifact — not a branching, but a data artifact. If their mitochondrial genomes are still very similar to that of the common ancestor of all land plants, then they should be placed like an ancestor in the tree — as a short-branched sister to all of their "offspring", the remaining land plant mitochondria.

Eight of the nine genes that support a moss + outgroup versus the rest split fail to resolve a moss clade. This is a clear indication that the moss mitochondria are simply primitive (at all gene positions that matter). What divides them from most (or all) other land plants are symplesiomorphies, ie. shared but ancestral sequence patterns. The only gene that prefers both splits at once, mosses as sister to all other land plants as well as a moss clade, is nad4 (BS = 67 and 62, respectively); but only when using nucleotides.

Fig. 8 A small but important difference: the codon-naive ML nucleotide (nt) tree (left) shows a moss clade as sister to all other land plants. The Bayesian amino acid (aa) MRC tree for the same gene shows a wrong split (purple internode) between green algae + ferns + angiosperms (long-branch, prominent roots) and bryophytes + lycophytes (mostly short-branched, short roots). By translating nucleotides into amino acids one may eliminate genuine discriminative signal encoded in synonymous substitutions, while in other, faster evolving parts of the tree, the same site/gene is oversaturated/biased. The poorly supported sister relationship of Roya and Gonatozygon within the green algae in the nt ML tree is an artifact, correctly resolved by the aa tree based on the same gene.

The shift from nucleotide data (ML / BS) to amino acid data (Bayesian MRC) triggers ingroup-outgroup LBA between green algae and ferns + seed plants (PP = 0.53; 'short-branch culling' of bryophytes and lycophyte Huperzia), and results in a branching artifact — the monophyly of higher land plants is well established, and hence they should form a clade.

By contrast, the genes providing strong support for a moss clade (such as atp1, atp8, ccmB, cob, cox1, cox3, nad2, nad5, rpl6, rpl16, rps3, rps13, and rps14) fail to resolve any deep relationships at all, or prefer different alternatives (including the Septaphyta hypothesis: atp8, rps3, rps13). The combined tree's solution is therefore a least-conflicting one, again — a moss clade (based on a consistent signal in the majority of genes: 13 with BS ≥ 90; in total 24 with BS ≥ 58) as sister to the rest of the land plants (based on a signal found in other genes not reflecting the monophyly of mosses). This solution adds to the phenomenon that moss mitochondria are generally primitive (ie. show a variant basically ancestral to all other land plants), and doesn't conflict with a wide range of otherwise conflicting splits strongly supported by individual genes (in contrast to the Septaphyta clade, see Fig. 6).

Conclusion

Having spent some time with the data and gene trees, I have little hope that mitochondrial data can be used to resolve the deep relationships between land plants. Each tweaking may result in something different, and the support-after-tweaking will be inflated.

Nevertheless, it will be worthwhile to close the data gaps, especially for the hornworts. This may not solve the 5-taxon problem,* but may give unique insights into how the mitochondrial genome evolved and sorted during the initial radiation of land plants.

Notably, the mitochondriomes of land plants can differ in the arrangement of their genes; which means that they recombined with or within the nucleome (or even plastome). While in some plants the mitochondriome is passed on via both parents (like in Ginkgo or Cycas), in others it is only the mother (most, maybe all, angiosperms). Plants may have colonized land more than once, and expanded quickly, so that lineage crossing and also lineage sorting may be an issue — marine species can be cosmopolitan and genetically heterogeneous (cryptic speciation). Thus, some mitochondrial genes may tell different stories from others. Instead of trying to solve which of the alternatives is correct (which is what most phylogenetic literature revolves around), we should find out which gene or part of the genome agrees with which alternative, as they may be all true.

The question to address with mitochondrial data cannot be whether mosses, liverworts or hornworts are the first diverging branch of extant land plants, but should be why moss, liverwort and hornwort mitochondriomes show different stages of evolution, as exemplified by the nad4 trees in Fig. 8.

Data availability

An archive including the support consensus networks (in Splits-NEXUS format) and inferred gene ML trees (plain NEWICK), as well as the comprehensive split support table (XLSX format), has been uploaded to figshare.



* It may help to have an in-depth analysis of a more focused taxon set with no data gaps that minimizes the risk of LBA. This starts with a better selection of taxa representing the higher land plants:
  • Oryza (the rice) is a domesticated, much cultivated, and thus extremely evolved and polyploid monocot. If there is any deep signal embedded in the mitochondria of seed plants, the mitochondrion of rice is probably the last place to look for it.
  • When trying to resolve the deepest land relationships, including a Gnetidae like Welwitschia (a genus that is an evolutionary oddball to start with), makes equally little sense — like any of the three surviving genera of this unique gymnosperm lineage, it is genetically the outer-most tip of an iceberg. Each mutation in its genome is the product of an unknown number of divergences in the past.
  • If any seed plant should be included at all, it would be more than sufficient to have: Liriodendron, a magnoliid, and thus a member of the least-diverged angiosperm lineage, plus Cycas, as a representative of an ancient, slow-evolving gymnosperm lineage. These are much more recent additions to the plant Tree of Life.
  • Being a tip of an iceberg applies even more to Isoetes. It is strikingly similar only to the other lycophyte, but it has more data gaps and is much more diverged, and thus can invite branching artifacts. When one wants to dig deep, the much more primitive Huperzia is obviously the better representative.
  • Last, the green algae are the only possible outgroups for inference, but they are poor for this – apparently, their mitochondria have evolved much farther from the common ancestor than those of the land plants. Rather than inferring trees including them, one should infer trees without them, and then optimize their position within trees that will then potentially be unbiased by outgroup-LBA — eg. using the evolutionary placement algorithm, to test the land plant root. An interesting experiment could also be to infer the sequence of the common ancestor(s) of modern-day green algae (lacking a time machine to sample it), and use them instead. The new RAxML-NG, for example, allows for ancestral state reconstruction of nucleotides.
In addition, standard 4-base substitution models are not the best choice when analyzing matrices with a high proportion of ambiguous base calls, like Sousa et al.'s codon-degenerate matrix (note that Sousa et al. already applied models that compensate for substitutional bias). This is especially so, given the importance of synonymous mutations to resolve relationships in the slow-evolving lineages, and slow evolving genes. One could try to use ambiguity-aware substitution models instead. The newest releases of RAxML-NG (Kozlov et al. 2019, Bioinformatics 35: 4453–4455) include models for "phased" and heterozygous data — ie. models that can make use of ambiguity codes as additional information during tree inference (see also Potts et al. 2014, Syst. Biol. 63:1–16).

Monday, November 18, 2019

Why the emperor has no clothes on – conflict or not?


In the final part of this series dissecting angiosperm gene trees (see: Why the emperor has no clothes on, part 1 and part 2), we will enter muddy ground. Using our example data set, we will try to make a call on whether or not there has been any (detectable) major reticulation in the deep branches of the angiosperm tree.

What triggers conflicting gene histories

Before we look at the data, it may be a good idea to set the scene using simple theoretical examples of what we may look at.


Our two genes, represented by circle and pentagon (could be multigene regions or entire genomes), both follow the same evolutionary history (the gray background tree). In the left lineage, we have a bit of incomplete lineage sorting, because the ancestor was polymorphic for the circles. In the right lineage, we have different fixation rates: the circles evolve faster than the pentagons. With molecular data we usually don't have the ancestors, which would make any inference straightforward; we only have the tips.


Because of incomplete lineage sorting and different fixation rates in the left and right lineages, the circle gene tree gets the phylogeny pretty wrong. The pentagon gene tree comes closer to the reality – we only infer two sister clades where there is a grade. (With real-world data, the branch support values could give one a clue that three of the inferred blue clades have a higher quality than the fourth supporting a pseudo-monophylum.) The circle and pentagon trees are largely incongruent despite sharing the same history; and we may infer a pseudo-hybrid (the first diverging lineage within the right clade).

Combining these data may allow us to infer a tree that fits the real tree much better. In the left clade the trivial pentagon signal can out-compete the misleading circle signal, and avoid the misplacement of the first diverging lineage of the right clade. In the right clade, the circle signal can help to correct for the pseudo-clade.

Now we can add a late reticulation, and re-infer the gene trees.


Because of the reticulation (the circles are biparentally inherited, the pentagons maternally), the gene trees are more congruent than in the example above (circle and pentagon get it a bit wrong in the left clade), except for the hybrid and its pseudo-hybrid parent. The gene conflict in placing the lineage cross (part of the left clade in the circle-based tree, part of the right clade in the pentagon tree) well reflects its hybrid origin.

Different histories of nuclear genes vs. plastid / mitochondrial genes?

The easiest way to catch reticulation is to compare trees based on plastid / mitochondrial data (maternally inherited) vs. nuclear data (biparentally inherited). If reticulation happened in the past, we can expect that the maternal and biparental genealogy diverge from each other (see part 2).

Strict Consensus network of the plastid (data from 3 protein-coding genes + 1 partly coding gene region), mitochondrial (3 protein-coding genes) and nuclear trees (2 nrDNAs). The bold lines represent generally accepted phylogenetic splits (APG IV tree, see also Stevens' comprehensive Angiosperm Phylogeny Website).

This network is much more box-like compared to what one would have expected based on the combined tree that can be inferred from the data (Part 1). But are we looking at largely decoupled histories?

This mess is hardly surprising. The combined tree is constrained by the plastid tree, specifically by the signal from the matK gene (Part 1), while the remaining plastid genes (from a different part of the plastome) fall into line. The mitochondrial tree combines genes that on their own inform poorly resolved trees riddled with branching artifacts (Part 2). The nuclear tree, on the other hand, combines the most and least divergent nuclear genes widely known. Because of this, they show topological conflict between each other.

18S-25S rDNA tanglegram. The branch numbers show each gene's bootstrap support (BS) deviating from the combined BS support for the respective branch (indicated by line thickness): green, increased BS support when combining both genes, red, decreased BS support.

However, they are part of the same multi-copy coding unit (the 35S nuclear rDNA) that has very particular evolutionary constraints, such as structural constraints, affected by completeness of concerted evolution and intra-genomic recombination. Polyploid grasses, for example, can have up to three different collections of 35S rDNA, reflecting four different evolutionary origins, being part of the A, B, C or D genomes. You end up with what is called a multi-labelled tree: the A, B, C and D-genome variants of the same taxon pop up (consistently) in different parts of the tree, and you can have recombinants. If we look into the 18S vs. 25S data, however, we find no consistent sequence patterns supporting the topological conflicts between the two trees, or examples for recombination.

As in our theoretical example, each of the trees has certain strengths, and its own set of weaknesses, some of which can be overcome when combining the data (eg. branches with increased combined support in the 18S-25S tanglegram).

Bootstrap (BS) Consensus networks for the combined cp (upper left), mt (upper right), nc (lower left) and full data (lower right). Branches without numbers: BS = 100. Splits conflicting with those present based on the full data highlighted by red font (all with BS < 100).

In contrast to the boxy network appearance and the substantial conflict between the single gene trees (Part 2), most of the relationships (eg. the major clade roots but also many intra-clade relationships) receive high or unambiguous support in all three trees*. Aside from the disparate signals, the data seem to converge on a coalescent. If the genomes had different histories, they wouldn't converge so easily. Also, we would expect to see more consistent conflict between the "genome" trees than between the single-gene trees of the same genome, since the nuclear rDNA is biparentally inherited while the plastid and mitochondrial DNAs are passed on via the mothers only. Many of the angiosperms in our data reproduce sexually.

So far, no conclusive evidence for reticulation

Mere gene-tree incongruence is a poor basis for concluding that gene histories are decoupled. We need to dig for sequence-based evidence for reticulation and recombination. For instance, we might find a clearly derived sequence pattern exclusive to the right clade in a member of the left clade.

The importance of rare genomic changes when interpreting conflicting gene trees. The left and right clades obtained a unique and conserved gene or sequence feature before they diversified. The hybrid is the only taxon showing both.

This is where the Walker et al. (2019) and Sullivan et al. (2017) studies seem to fall short — they don't give any example, gene, gene region, or recognizable lineage-diagnostic sequence pattern that could be used as direct evidence for decoupled gene histories and/or reticulation.

For my data set, I cannot pinpoint such evidence either. All high(er)-supported conflict seems to be related to lineage sorting and data/signal issues, the inability of certain gene regions to resolve relationships in parts of the angiosperm tree, or falling prey to (more local than global) long-branch attraction. When looking at the sequences, there's no reason to question, for example, the assumed monophyly of the main lineages and orders, in spite of the topological conflict we face when analyzing these data. If there was reticulation between the ancestors of angiosperm lineages, or later on between the already formed lineages, it left no obvious imprint in the data.

Thus, after having investigated aspects of the seeming conflict by going back to the data (checking highly divergent and conserved sequence patterns, tabulating the partly competing BS support of the single genes, and minus-one gene analyses), I did not hesitate to combine these data and use a Bayesian total-evidence dating procedure. (We never published the results because mid-Cretaceous angiosperm fossils have much too derived morphologies for total evidence dating; when left unconstrained, MrBayes optimized towards an angiosperm root age of 4.5 Ba, which was the in-built maximum.)

A total-evidence Bayes tree based on the full data set. Stars indicate the position of fossil taxa (mid-Cretaceous). Note their relatively long terminal branches, a situation total-evidence dating cannot handle. The matrix can be found at figshare: A basic total evidence matrix for basal angiosperms — combining Soltis et al (2011) with Doyle & Endress (2010).

An example for actual reticulation resulting in gene tree conflict

Working at the coal-face of evolution, I have encountered examples of apparently real reticulation (when analysing biparentally inherited nuclear data). The most compelling was probably the ancient relictual genotypes and pseudogenes that point towards ancient reticulation in the widely known plane trees, Platanus. Platanus subgenus Platanus (which includes all but one species, P. kerrii, a relict of a distant lineage growing in tropical-hot subtropical lowland forests of North Vietnam) falls into two main lineages characterized by unique sets of genotypes, the ANA clade (Atlantic-facing North and Mesoamerica) and the PNA-E clade (NW. Mexico, California and Mediterranean).

Haplo/-genotypic composition of Platanus (Grimm & Denk, Taxon, 2010, ES2 [PDF]). Platanus kerrii represents the sole surviving relative within the Platanaceae (genetically very distinct), an old lineage of angiosperm trees (going back deep into the Cretaceous). Their next kin today are, according to angiosperm molecular trees, the enigmatic Proteaceae, a Gondwanan relict (represented in our angiosperm data by Petrophile). For an even more comprehensive genotypic study that also covers plastid markers check out De Castro et al., Ann. Bot., 2013 [open access]).

Individuals in the contact zone between species of the two main lineages (including hybrids) can be heterozygotic / polymorphic for at least one of the sequenced nuclear regions, so that identification of recent hybrids is straightforward. Beyond this, genetically inconspicuous members of the ANA clade may show ITS pseudogenes from the PNA-E clade (stippled line in the figures above and below). Furthermore, two of the ANA clade species, P. palmeri (pa) and P. rzedowskii (rz), which grow closest to the populations of the PNA-E clade, show (predominantly) a PNA-E LEAFY genotype. However, this is not the genotype found in the close-by American PNA-E species (ra, ge), but one whose sequence is phylogenetically closer to that of the Mediterranean species, P. orientalis (or), on the other side of the globe.

Overlay of the LEAFY, 5S-IGS and ITS histories in Platanus. This doodle is based on tree- and network-inferences coupled with PCR-RFLP-based genotyping and in-depth analysis of mutation patterns in length-polymorphic sequence regions (Grimm & Denk 2010, ES1). P. x hispanica is the well-known ornamental alley/park tree, the 'London plane', a cultivated historical hybrid (mid 18th century) of the most hardy North American plane, P. occidentalis, and the frost-vulnerable Mediterranean plane, P. orientalis. In the Mediterranean, due to frequent backcrossing, one can find morphologically mixed individuals showing only the P. orientalis genotypes, or homogenous (American or European) type individuals showing both occidentalis and orientalis genotypes (see eg. Pilotti et al., Euphytica, 2009).

Further reading

An animal example of seemingly incongruent single-gene trees that may well be the product of a largely shared evolutionary history is the autosomal intron data compiled for bears by Kutschera et al. (2014. Bears in a forest of gene trees: Phylogenetic inference is complicated by incomplete lineage sorting and gene flow. Mol. Biol. Evol. 31:2004–2017). Rather than forming a "forest of trees", each gene tree is poorly resolved but, when combined, the data allow inferring a phylogeny that matches quite well the paternal genealogy based on Y-chromosome data, both in strong conflict with the maternal genealogy inferred from mitochondriomes (see Part 2).

In Supplement File S6 [PDF] of Grímsson et al. (2018, Grana 57:16–116), I outline how ambiguous signal from combined gene regions relate to the poor support of critical branches in the Loranthaceae tree; see also the related posts: Using consensus networks to understand poor roots and Trivial but illogical – reconstructing the biogeographic history of the Loranthaceae (again). Some gene-tree conflicts are possibly linked to different histories (nuclear vs. chloroplast data), while others are a mix of insufficient signal and missing data (between chloroplast genes).

In a previous post (All solved a decade ago: the asterisk branch in the Fagales phylogeny), I give another example using an old Fagales matrix, which resulted in a tree that, even today, is the gold standard of Fagales phylogeny. The matrix combines a highly conserved nuclear gene (18S) conflicting with the plastid genes and complemented by an entirely uninformative mitochondrial gene (matR) to provide a "tree based on all three genomes". Also in this case the three-genome tree is essentially the matK tree.



* That doesn't mean that all highly supported, unconflicted relationships must be true. Note that just by combining a few genes, we obtain a near-unambiguous support for the split between Mesangiosperms and the ANA-grade + gymnosperms, one of the splits defining the root and "basal" part of the angiosperm tree. The outgroup-inferred root is well fixed, even when using nuclear data, despite the fact that the 18S signal (the one showing the least ingroup-outgroup genetic distance) doesn't support such a root, whereas the 25S does (see part 2), being more divergent and prone to ingroup-outgroup long branch attraction (LBA). That we have LBA issues with the data is obvious from a tiny detail: Ginkgo is supported with BS > 70 as sister of Podocarpus, which is wrong, based on all we know about gymnosperms (see also Earle's gymnosperm database and literature cited therein). The likely correct split, Ginkgo as sister to Cycas, is present in the nc tree, but represents a much less supported alternative (BS ≤ 25). It is also obvious when one looks at the alignment(s): Cycas and Ginkgo share some potential genetic 'synapomorphies' in the low-divergent, generally conserved regions (eg. 18S, stem-regions of 25S), but there are essentially none for Ginkgo + Podocarpus.

Monday, November 4, 2019

Why the emperor has no clothes on – a thicket of trees


A critical question in phylogenetics, and this applies to both the detection and inference of reticulation, is: How much trust do we put in the inferred tree? A phylogenetic tree is just the simplest of all possible phylogenetic networks. Let's assume that there was some phylogenetic reticulation in the past (lineage mixing and crossing), then, in the best-case scenario, our inferred tree shows one of the intertwining pathways but misses the tangles, the crossroads.

An example of simple reticulate evolution: pink is the product of very recent lineage crossing between an early diverged (and otherwise lost) member of the blue lineage and the more recently, hence genetically more coherent, red lineage. Bold lines show the tree we would likely infer in such a situation.

In the worst case, summarizing data with substantially different signals will give us branching artifacts such as:
  • terminal branches that are too long,
  • too long internal branches with conspicuously low support (ie. BS << 100, PP < 1.0),
  • artificial branches representing the least-conflicting solution for the conflicting data,
  • low branch support in general.
See eg. the bear data we used as a real-world example for our Intertwining trees and networks paper (Schliep et al. 2017, open access).

Three possible trees for bears, (a), Y-chromosome, paternal, and (c), nuclear-encoded autosomal introns, biparentally inherited, are congruent but disagree with the maternal genealogy (b), based on the mitochondrial genes. When fusing all three data sets, we get a (poorly) supported sister relationship for Sloth and Sun bears (red clade), not supported by any of the three fused data sets – a branching artifact.

Topological incongruence between gene trees and parental genealogies (as above) is commonly taken as evidence for reticulation. If one gene provides high support for taxon A as sister to B, and another gene has high support for B as sister to C, then B is likely the product of reticulation (eg. hybridization).
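The A-B versus B-C situation can be written down as two splits (bipartitions) of the taxon set that cannot sit on the same tree. Here is a minimal Python sketch of the standard compatibility test, assuming a hypothetical five-taxon set A to E and a helper name, `is_compatible`, that is mine and not taken from any package. Note that the test only detects the conflict; it says nothing about its cause (reticulation, incomplete lineage sorting, or inference artifacts).

```python
def is_compatible(split_a, split_b, taxa):
    """Two splits of the same taxon set fit on one tree if at least one of the
    four pairwise intersections of their sides is empty."""
    taxa = frozenset(taxa)
    a1 = frozenset(split_a)
    a2 = taxa - a1
    b1 = frozenset(split_b)
    b2 = taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


# Hypothetical taxon set and splits, for illustration only.
taxa = {"A", "B", "C", "D", "E"}
gene1 = {"A", "B"}        # gene 1: A sister to B
gene2 = {"B", "C"}        # gene 2: B sister to C
nested = {"A", "B", "C"}  # a more inclusive clade containing A + B

print(is_compatible(gene1, gene2, taxa))   # the two sister pairs conflict
print(is_compatible(gene1, nested, taxa))  # nested splits do not conflict
```

In a splits graph, each pair of incompatible splits shows up as a box-like structure, while compatible splits can be drawn as a tree.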

One simple possibility to put together a phylogenetic network is to summarize all of the trees in the form of a Consensus network, as shown next. (Technically this is a splits graph, it becomes a phylogenetic network as soon as we determine a root, which, here, would be at the edge leading to the Giant Panda.)

A strict Consensus network of the paternal, biparental, and maternal bear genealogies.
The numbers show the non-parametric bootstrap support for each (competing) split.

In this case, low support for a branch in a combined tree (the values on top) can result from strong conflict. For instance, the brownish splits, which are poorly supported using the combined data (BS = 21, 29), receive near unambiguous support from the mitochondrial genes, but are largely or entirely rejected by the Y-chromosome and nc-intron data. In the combined tree, this deep conflict is resolved by introducing the artificial red clade, with similarly low support: the signal in the data is ambiguous and they support splits between equally possible alternatives.

We know lineage crossing took place in bears (the mitochondrial and Y-chromosome trees are very much in conflict). However, does the above mean that earliest bear-ish creatures hybridized, too? Note that the conflict is associated with a short-branched part of the graph, where apparently little evolution happened. Fast ancient radiations usually come with incomplete lineage sorting and diffuse signals. The only data set producing longer roots, but with notably lower support, are the biparentally inherited introns.

We are closing in on our own tail and have to ask again: Is this low support in the autosomal intron tree due to internal conflict, (sets of) introns preferring different topologies, supporting an ancient mixing hypothesis, or just reflecting lack of resolution? Check out the original paper by Kutschera et al. (2014, Mol. Biol. Evol. 31: 2004–2017), and make up your own mind.

On to the angiosperms

In my last post, I exemplified what Walker et al. (PeerJ, 2019) found in their angiosperm study: when we look at a plastome tree we are not looking at a summary of all gene trees but instead at a topology forced by very few of the genes in the chloroplast genome, such as the matK. We also have seen that one misplaced sequence (outgroup Podocarpus-matK) doesn't affect at all the combined analysis — it didn't even reduce the ingroup - outgroup split support. Also, I noted that the low-supported part of the combined tree goes hand in hand with lack of decisive signal from the matK.

It's time to take a look at what the other genes in this example data set come up with.

The eight gene trees. Terminal subtrees collapsed. Scales fit to size, scale bar = 0.1 expected substitutions per site. Upper left, matK tree which is very similar to the combined tree using all gene regions (cp = chloroplast, mt = mitochondrial, nc = nuclear genes). Note the low performance of the mt genes.

One thing is obvious: for most genes (except the nuclear-encoded rRNA genes) including the outgroup taxa adds little ingroup information of use — they are just too distant to any of the ingroup taxa. Outgroup rooting is tricky for angiosperms. Outgroup taxa will always be attracted to the ingroup taxon that is the least similar to any other part of the ingroup: Amborella in this case.

Generally, all of the ANA-grade water plants are genetically distinct and topologically isolated; any outgroup-inferred root must be placed in this part of the tree (all other living seed plants are very distant relatives of angiosperms looking back at, at least, ~250 million years of independent evolution, see eg. Age of Angiosperms... and What is an angiosperm pt. 2). The relatively conserved plastid rbcL and mitochondrial matR prefer an Amborella-Nymphaeales clade as sister to all other angiosperms, while the more divergent atpB (plastid), nad5 (mitochondrial), and 25S (nuclear) prefer the Amborella-root: a direct indication of ingroup-outgroup long-branch attraction. Any other placement of the outgroup subtree within the ingroup would necessarily decrease the likelihood of the tree (but note the position of the root in the 18S tree, lower left, the tree based on the most-conserved, evolution-constrained gene in our sample; see also All solved a decade ago..., fig. 4A).

We can look at these trees with the strict consensus network, using uninformed edge lengths— that is, the network counterpart to the strict consensus cladograms still common in plant phylogenetic literature.



This is a nice piece of computer-art, but is scientifically quite useless (the boxiness and general graph structure is, however, reminiscent of strict consensus networks of most-parsimonious tree samples inferred for extinct animals, one example, and plants).

We can add some discriminatory information by counting how often each split occurs in the set of gene trees.

Same set of trees, different way of summarizing it. Note how the main clades emerge: one or two genes may have misplaced the one or other OTU but the others get it right.
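Counting how often each split occurs needs no special software. The following is a minimal Python sketch, assuming that every gene tree covers the same taxa and is written as nested tuples. The three toy trees and the function names (`leaves`, `splits`, `split_frequencies`) are hypothetical and only illustrate the bookkeeping; they are not the eight gene trees shown above.

```python
from collections import Counter


def leaves(node):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Collect the non-trivial splits of a tree given as nested tuples."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # fixed reference taxon to normalise each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        # Keep only splits with at least two taxa on either side.
        if 1 < len(side) < len(all_taxa) - 1:
            # Always store the side that does not contain the anchor taxon,
            # so the same split is counted once regardless of rooting.
            found.add(all_taxa - side if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def split_frequencies(trees):
    """Count in how many trees each split occurs (same taxon set assumed)."""
    counts = Counter()
    for tree in trees:
        counts.update(splits(tree))
    return counts


# Three hypothetical gene trees; the first two differ only in their rooting.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),
    ((("B", "C"), "A"), ("D", "E")),
]
freq = split_frequencies(gene_trees)
for split, n in freq.most_common():
    print(sorted(split), n)
```

A count-based consensus network uses exactly such frequencies as edge lengths, which is why splits found in most gene trees stand out against the one-off alternatives.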

Alternatively, we can average the actual tree branch lengths to inform the edge length of the consensus network.

The light green, sand-colored, light brown and dark olive (clockwise) splits are likely branching artifacts. The light blue split is the one that supports the ANA-grade when the (combined) tree is rooted with the very distant outgroups.

A pretty little thicket of trees. Some agreement is found towards the leaves, but even here we have conflict among the gene trees. In some trees, there are long branches grouping non-related OTUs, obvious tree inference artifacts. The general rule is that the deeper we go (ie. the farther back in time), the messier it gets. Adding to this is that, irrespective of which gene is used, some OTUs are much closer to the hypothetical common ancestor (of Mesangiosperms, ie. all but ANA grade) than others – in the eudicots, the least-evolved taxa are Platanus (very old tree genus) and Euptelea (the basalmost Ranunculales); in the Magnoliales, the only angiosperm clade that lacks synapomorphies, it's Magnolia and Liriodendron (again, very old and primitive tree genera). Darwin's Abominable Mystery, the sudden appearance and quick dominance of angiosperms, resulted in an abominable chaos of gene trees and signals. How can they possibly converge to a single tree with amply high support along most branches?

The combined tree from the first post.
When compared to the bears, the answer may well be: because there has been very little to no reticulation between these lineages. Our thicket may be not a forest of trees but just a poorly trimmed, wildly overgrown bush. The genes share the same history, but when analyzed one by one, each of their trees gets some aspects right and some others (severely) wrong. Misplacing one OTU (e.g. the light green, dark olive, sand-colored and dark yellow splits in the averaged Consensus network) may have further topological effects; it didn't matter for the matK gene, because we misplaced only one very alien OTU in a data set that otherwise is hardly affected by adding or removing OTUs.

I argue here that, if there had been substantial reticulation messing up the signal of contemporary lineages and reflecting decoupled histories (like in the case of bears), we would expect at least some (artificial) branching patterns with low support in the combined tree, as well. This would also be the case looking at the gene-tree consensus networks, not only in the deepest parts but also closer to the leaves.

We will explore this alternative hypothesis in the next (and final) post of this mini-series.

Monday, January 14, 2019

Phylogenetic ambiguity: data gaps, indifference and internal conflict

A tweet by my favourite journal (not only because they insist that authors make their data available) pointed me to their most viewed paper of 2018, with a nice title (for a network-fan):
Genus-level phylogeny of cephalopods using molecular markers: current status and problematic areas, by Sanchez et al. (PeerJ, 6:e4331).
"Problematic areas" are exactly my cup of tea. However, the graphical representation of these falls a bit short. The authors show three maximum-likelihood phylograms, one for the Cephalopoda with support annotated at some branches (their Fig. 1), and one each for two of the constituent lineages, the Decabrachia (their Fig. 2) and the Octobrachia (Fig. 3, reproduced below, because we will take a look at the data behind it).

Original: "Figure 3: Maximum-likelihood tree of the Octobrachia under the
GTR + Gamma model with the morphological character set mapped onto the tree.
Taxa highlighted in red represents discrepancy to previously published studies."

Unfortunately, we don't know the actual support for each of the branches — there is a legend in the lower right, but no signatures etc. associated with it. You will find some information throughout the text, of course. For example:
The use of concatenated sequences of all markers (Fig. 2) resulted in a resolved topology for monophyly of the Octobrachia (BS = 58%), and strong support for monophyly of the Decabrachia (BS = 98%), with both clades strongly supported by the Bayesian approach with PP = 0.78 and 0.75 respectively
The latter is quite strange, as PP are expected (methodologically) to be ≥ BS.
Although monophyly was demonstrated for several families contained within both superorders, the relationships of the families contained within Octobrachia were better supported than those in Decabrachia (Fig. 2). Of the 37 nodes in the Octobrachia portion of the general tree containing all taxa, the majority were resolved above the 50% level (31 nodes with BS > 50%); but only 28 out of 80 nodes in the Decabrachia were resolved at BS >50%, most of which were located at family level.
BS = 51 could be lack of signal (all other alternatives BS ~ 0) or conflict (one alternative has a BS = 49).

What we can infer directly from the alignment

Let's have a look at the first three gene regions in the matrix provided, using Mesquite's bird-view option.


We can see from the alignment that the first gene (left; mitochondrial 12S rDNA) splits the taxon set (the taxon order seems to be arbitrary) into two (three if we include those with no data) main groups with substantially divergent 12S rDNAs. However, in the second, much more homogeneous gene, no such differentiation is obvious, with the exception of two accessions that remain very different from the rest. This is quite puzzling, because the second gene is the (close-by) mitochondrial 16S rDNA.

Without going into details, the 12S rDNA unambiguously supports (and enforces) an Octopodia core clade defined by a 12S rDNA entirely different from that of other taxa, and comprising five of this order's families, in which Amphioctopus and Octopus make up a subclade with strongly deviating 16S rDNA.

With respect to the tree, we also have to assume that the 12S rDNA of the Octopodia core clade is derived, strongly evolved, whereas it remained largely unmodified (ie. is primitive) in the other, earlier diverged (according to the tree) lineages. However, some of these lineages have equally long terminal branches: there has been more evolution going on in other genes.



The third gene, the nuclear-encoded gene for the 18S rRNA (18S nrDNA), shows another (and quite typical) pattern: large stretches with very little variation, hence devoid of differentiating signal that would allow the tree algorithm to make a decision (and letting Bayes get lost in the treespace, resulting in PP < 1.0).


For half of the taxa, no information is available, but this hardly matters because even genera with strongly different mt 12S rDNA have nearly the same 18S nrDNA. There is a little hiccup in the second part in one accession (a gap in Cirrothauma with a small, off-alignment strand in between), but this could just be a sequencing artefact. Limited to a single taxon, it has no topological effect (we need at least four to make a call); it will only increase the length of the terminal branch.

The remainder of the matrix mirrors the situation in the first three partitions, eg. in the well-sampled (only six taxa missing) mt coI gene, Callistoctopus is visibly distinct from all other genera, while most general variation is concentrated at the 3rd codon position. All other mt-genes, accounting for 58% of the matrix's characters, are covered for four of the taxa (the sister taxon used to root, Vampyroteuthis, and three of the core Octopodia); hence, they can only support a single split within this group and be used to test for its alternatives.



What networks could have shown

The matrix provided for the shown tree (made available via figshare) has 40 taxa and 16104 characters, quick to run these days. Here's the tree with branch support annotated along branches.

ML phylogram inferred from Sanchez et al.'s matrix, taxa ordered as in the original fig. 3. Members of the same taxon (order, superfamily, family, as annotated in Sanchez et al.'s fig. 3) colored accordingly. Values at branches indicate ML-BS support using a single partition for the entire data ("unpart.") or using the gene-wise partition scheme provided in the figshare submission ("part.").

Even though I ran an unpartitioned analysis, my tree is very similar to the original tree, with a near identical topology except for Ameloctopus being moved one node up and placed as sister to Hapalochlaena (ML, unpartitioned-BS = 52 vs. 46[!] for the alternative seen in Sanchez et al.'s fig. 3). I never understood the fuss about model and partition testing, when we usually work with data where any model will inevitably be suboptimal (see alignments). As a geneticist, I also believe data partitions should be informed by function, not computer programmes (eg. one for 1st and 2nd codon position, another for the 3rd codon position, and one for the rDNAs).

We have unambiguously supported branches (BS ~ 100), and others, the "problematic areas" (BS << 100). Ambiguity in support values for branches of a tree can have two reasons:
  1. Lack of signal, the data is indifferent regarding the placements of certain taxa and/or subtrees (PP < 1.0 are indicative for lack of signal).
  2. Conflicting signal, parts of the data (data partitions) prefer one topological alternative, others a (partly) conflicting one (keep in mind that even in the presence of substantial signal conflict, PP ~ 1).
Short branches with low (BS) support point to the former, long branches with low (BS) support are a direct indication of the latter. Two apparent sources of conflict would be that the data include gene regions from the biparentally inherited nucleome and the (usually maternally inherited, not sure how this is in squids) mitochondriome and combine protein-coding genes (amino-acids coded by codons) with rRNA genes (directly encoding a certain secondary, tertiary structure).

In our tree here, we notice a general correlation between the branch lengths and the support: the shorter the branch, the lower the support. There are a few exceptions, eg. the Octopodida core clade, triggered by the unique, strongly diverged sequences of the 12S rDNA, has a long root branch with comparatively low support (ML-BS = 63; collapses when using the authors' partitioning scheme that treats each gene region as an individual partition).

Full BS Consensus network based on 450 ML pseudoreplicates (result of the unpartitioned analysis). Edge lengths are proportional to the BS support (frequency of the splits in the BS tree sample), trivial splits not collapsed. Arrow points to the root (cf. Sanchez et al.'s fig. 1).

The BS Consensus network shows us that some of the "problematic areas", ie. branches with ambiguous support, are not really problematic (alternatives have no to very little support), but others are. The latter include the 12S rDNA-based Octopodida core clade and, connected to this, the division of the Megaleledonidae, as annotated in Sanchez et al.'s fig. 3, into two clades (not discussed in the paper). A clade including all Megaleledonidae has a BSunpart./part. = 34/55 and competes with the 12S rDNA split (BS = 63/37) and the placement of Cistopus as sister to the Octopodida core clade (BS = 52/34). It doesn't conflict with the alternative topology placing Cistopus as sister to all of them (BS = 38/50). The reason for this is, of course, that by using a different partition for the highly divergent mt-12S rDNA, we allow RAxML to estimate high probabilities for all mutations, effectively down-weighting each mutation in this gene compared to those in other, more conservatively structured gene regions, which seem to prefer alternative splits.

Vice versa, the poorly supported sister relationship (BS = 45/21) of Bathypolypus with the Enteroctopodidae (light green) + part of the Argonautoidea (pink) stands unopposed, alternative splits have BSunpart. < 10. In the partitioned analysis, however, there is an equally poorly supported alternative sticking out a bit: Bathypolypus as sister to the (all-including) Megaleledonidae clade (BSpart. = 23).

While we see little effect on the tree topology, partitioning affects some of the support values. A nice example is the structure of the Megaleledonidae s.str. subtree. The root is unambiguously supported, as is the sister relationship of Graneldone and Bentheledone. The remaining branches have ambiguous support.


Here, the partitioning scheme is a game changer. Unpartitioned, the favored alternative is an Adelieledone-Pareledone-Megaleledone (APM) grade "basal" to Graneldone and Bentheledone (BS = 68/49); using the authors' partitioning scheme, the data favor an APM clade sister to the latter two (quite a difference, since we often equate clades with monophyly and grades with paraphyly).

It doesn't matter whether a clade has a BS support of 30, 50 or 70. We need to know whether the remaining 70%, 50%, or 30% of bootstrap replicates show random or the same alternative(s). When a tree has ambiguously supported branches, BS Consensus networks should be obligatory.
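What a BS Consensus network shows graphically can also be tabulated: for each ambiguously supported branch, look up the best-supported split that conflicts with it. Below is a minimal Python sketch of that lookup, assuming the bootstrap sample has already been condensed into a dictionary of split frequencies. The taxa, the BS values and the function names are hypothetical and chosen only to contrast the two situations behind one and the same BS value.

```python
def is_compatible(split_a, split_b, taxa):
    """Splits fit on one tree if one of the four side intersections is empty."""
    a1, b1 = frozenset(split_a), frozenset(split_b)
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def strongest_alternative(focal, bs_support, taxa):
    """Return (split, BS) of the best-supported split conflicting with the
    focal split, or None if nothing in the sample conflicts with it."""
    rivals = [(split, bs) for split, bs in bs_support.items()
              if not is_compatible(focal, split, taxa)]
    return max(rivals, key=lambda rival: rival[1], default=None)


# Hypothetical BS values (in %); only a few splits of the sample are listed.
taxa = frozenset("ABCDEF")
bs_support = {
    frozenset("AB"): 51,  # focal branch 1
    frozenset("BC"): 49,  # one strong competitor: conflicting signal
    frozenset("EF"): 51,  # focal branch 2
    frozenset("DE"): 4,   # only weak competitors: diffuse signal
    frozenset("DF"): 3,
}

for focal in (frozenset("AB"), frozenset("EF")):
    print(sorted(focal), bs_support[focal],
          strongest_alternative(focal, bs_support, taxa))
```

Both focal branches have the same BS value, but only the first one competes with a single, nearly equally supported alternative.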

Instead of reading sentences like this:
Benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) together with the Megaleledonidae (possessing a single row of suckers) formed a well-supported monophyletic group (BS = 72%, PP = 0.61).
we should read this:
A clade including all benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) and the Megaleledonidae (possessing a single row of suckers) received ambiguous support (BS = 72%, PP = 0.61), but potential alternatives received no support at all. The combination of a relatively high BS but low PP points towards a faint, but consistent signal in the available data.
And include the Consensus networks at least in the supplement.

When we aim to map morphological traits (which is a nice touch of Sanchez et al.'s paper), why not consider the topological alternatives we see there?

Running single-gene trees is never wrong, too. But, in the case of these data, that would be the topic of another post, using a different type of network: a super-network.

Final note. This post is not intended to criticize Sanchez et al.'s paper (my squid-expertise ends with having seen them in aquaria). My impression is they put a lot of effort into getting the matrix together. Having been forced to harvest molecular data myself in the past, I know how important and tedious this work is. Instead, this post stresses and shows, using an easy-to-access example that raised a lot of interest (attracted many views), that we often have to work with suboptimal data not providing trivial results in the form of fully resolved trees. This is a situation in which easy-to-generate networks offer a lot. No peer reviewer should, in such a case, be content with seeing just a tree (although they, in my experience, always are).