The Genealogical World of Phylogenetic Networks: Modeling

Showing posts with label Modeling. Show all posts

Monday, March 23, 2020

Evolution unchained: The development of person names and the limits of sequences

What do person names like Jack and Hans have in common, and what unites Joe and Pepe? Both name pairs go back to a common ancestor. For Jack and Hans, this would be John (ultimately going back to Iōánnēs in Greek), and for Joe and Pepe, this would be Josef (originally from Hebrew). Given the striking dissimilarity of the names in their current form, the pathways of change by which they have evolved into their current shape are quite complicated.

While the German name Hans can be easily shown to be a short form of the German variant Johannes, the evolution of Jack is more complicated. First (at least this is what people on Wikipedia suppose), Iōánnēs becomes John in English, similar to the process that transformed German Johannes into Hans. Then, in an ancient form of English, a diminutive was built for John, which yielded the form Jenkin, with the diminutive suffix -kin that has a homologous counterpart in German -chen (which can be attached to Hans as well, yielding Hänschen). Etymologically, Jack is little Johnny.

While Joe in English is a shortening of Josef, the development of Pepe is again a bit more complex. First, we find the form Giuseppe as an Italian counterpart of Josef. How this form then yielded Pepe as a diminutive is not completely clear to me; but since we find the pe in the Italian form, we can think of a process by which Giuseppe becomes Giuseppepe, leaving Pepe after the deletion of the initial two syllables.

The complexity of person-name evolution

Even from these two examples alone, we can already see that the evolution of person names can easily become quite complex. If all words in all spoken languages in the world evolved in the same way in which our person names evolve, we would have a big problem in historical linguistics, since the amount of speculation in our etymologies would drastically increase.

When comparing etymologically related words from different languages, we generally assume that they show regular correspondences among their sound segments. This presupposes that there is still enough sound material that reflects these correspondences, allowing us to detect and assess them. But since the evolution of person names rarely consists of the regular modification of sounds, but rather results in the deletion, reduplication, and rearrangement of whole word parts, there is rarely enough left in the end that could be used as the basis for a classical sequence comparison.

With the name Tina in German being the short form of Bettina, Christina, and at times even Katharina, and with Bettina itself going back to Elisabeth, and with Tina becoming Tinchen, Tinka, or Tine, we face an almost insurmountable challenge when trying to model the complexity of the various patterns by which names can change.

Modeling word derivation with directed networks

That words do not evolve solely by the alternation of sounds, but also by different forms of derivation, is nothing new for historical linguistics. We face the problem, for example, when looking for etymologically related words in the basic lexicon of phylogenetically related languages. However, these phenomena can be easily investigated by enhanced means of annotation. The evolution of person names, on the other hand, presents us with larger challenges.

While working as a research fellow in France in 2015-2016, I had the time to develop a small tool that allows us to represent derivational relations between related words with help of a directed network, and thus allows us to model these relations in a rough way. Such a graph is directed, and our words are the nodes in the network, with the edges drawn between the assumed ancestor word forms and their descendants. This tool, which I then called DeriViz, is still available online. and makes it possible to visualize network relations between words.

I have now conducted a small experiment with this tool, by taking name variants of Elisabeth, as they are listed in Wikipedia, and trying to model them in a directed network, along with intermediate stages. You can do this easily yourself, by copying the network that I have constructed in text form below, and pasting it into the field for data entry at the DeriViz-Homepage. The network will be visualized when you press on the OK button; and you can play with it by dragging it around.

Elisabeth → BETT
BETT → Betty
BETT → Bettina
BETT → Bettine
BETT → Betsi
Elisabeth → ELISABETH
ELISABETH → Elise
ELISABETH → Elsbeth
ELISABETH → Else
ELISABETH → Elina
Elisabeth → ILSA
ILSA → Ilsa
ILSA → Ilse
Elisabeth → Isabella
Elisabeth → LISA
LISA → Lieschen
LISA → Liese
LISA → Liesel
LISA → Lis
LISA → Lisa
LISA → Lisbeth
LISA → Lisette
LISA → Lise
LISA → Liesl
Elisabeth → LILA 
LISA → Lila
LISA → Liliane
LISA → Lilian
LISA → Lilli
Elisabeth → Sisi

I intentionally reduced the amount of data here, in order to make sure that the graphic can still be inspected. But it is clear that even this simple model, which assumes unique ancestor-descendant relations among all of the derived person names, is stretched to its limits when applied to names as productive as Elisabeth, at least as far as the visualization is concerned.


Derivation network of names derived from Elisabeth

If you now imagine that there are various processes that turn an ancestral name into a descendant name, and that one would ideally want to model the differences between these processes as well, one can see easily that it is indeed not a trivial problem to model the evolution of person names (and we are not even speaking of inferring any of these relations).

How names evolve

Names evolve in various ways along different dimensions. With respect to their primary function, or their use, we tend to use, among others, nick names. Formally, nick names are often a short form of an original name, but depending on the community of speakers, it is also possible that there is a formal procedure by which a nick name can be derived from a base name. Thus, every speaker of Russian should know that Jekaterina can be turned into Katerina, which can be turned into Katja, which can be turned into Katjuscha, or, in the case of a Vocative, into Katj. Once the primary function of a name changes, its form usually also changes, as we can now see in many examples.

But the form can also change when a name crosses language borders. If you go with your name into another country, and the speakers have problems pronouncing certain sounds that occur in your name, it is very likely that they will adjust your name's pronunciation to the phonetic needs of their own language, and modify it. Names cross language borders very quickly, since we tend not to leave them at home when visiting or migrating to foreign countries. As a result, a great deal of the diversity of person names observed today is due to the migration of names across the world's larger linguistic communities.

How we change names when building short forms or nick names, or when trying to adapt a name to a given target language, depends on the structure of the language. The most important part is the phonology of the language in which the change happens. For example, when transferring a name from one language to another, and the new language lacks some of the sounds in the original name, speakers will replace them with those sounds which they perceive to be closest to the lacking ones.

But the modification is not restricted to the replacement of sounds. My own given name, Mattis, for example, usually has the stress on the first syllable, but in France, most people tend to call me Matisse, with the accent on the second syllable, reflecting the general tendency to stress the last syllable of a word in French. In Russian, on the other hand, Mattis could be perfectly pronounced, but since people do not know the name, they often confuse it with its variant Matthias, which then sounds like Matjes when pronounced in Russian (which is the name for soused herring in Germany). There are more extreme cases; and both English and German speakers are also good at drastically adjusting foreign names to the needs of their mother tongues.

It would be nice if it was possible to investigate the huge diversity in the evolution of person names more systematically. In principle, this should be possible. I think, starting from directed networks is definitely a good idea; but it would probably have to be extended by distinguishing different types of graph edges. Even if a given selection may not handle all of the processes known to us, it might help to collect some primary data in the first place.

With a large enough set of well-annotated data, on the other hand, one might start to look into the development of algorithms that could infer derivation relationships between person names; or one could analyze the data and search for the most frequent processes of person name evolution. Many more analyses might be possible. One could see to which degree the processes differ across languages, or how names migrate from one language to another across times, usage types, and maybe even across fashions.

Outlook

I assume that the result of such a collection would be interesting not only for couples who are about to replicate themselves, but would also be interesting for historical research and research in the field of cultural evolution. Whether such a collection will ever exist, however, seems less likely. The problem is that there are not enough scholars in the world who would be interested in this topic, as one can see from the very small number of studies that have been devoted to the problem up to now (as one of the few exceptions known to me, compare the nice overview of person name classification by Handschuh 2019). I myself would not be able to help in this endeavour, given that I lack the scholarly competence of investigating name evolution. But I would sure like to investigate and inspect the results, if they every become available.

Reference

Handschuh, Corinna (2019) The classification of names. A crosslinguistic study of sex-specific forms, classifiers, and gender marking on personal names. STUF — Language Typology and Universals 72.4: 539-572.

Monday, May 7, 2018

Keeping it simple in phylogenetics

This is a post by Guido, with a bit of help from David.

There's an old saying in physics, to the effect that: "If you think you need a more complex model, then you actually need better data." This is often considered to be nonsense in the biological sciences and the humanities, because the data produced by biodiversity is orders of magnitude more complex than anything known to physicists:

The success of physics has been obtained by applying extremely complicated methods to extremely simple systems ... The electrons in copper may describe complicated trajectories but this complexity pales in comparison with that of an earthworm. (Craig Bohren)

Or, more succinctly:

If it isn’t simple, it isn’t physics. (Polykarp Kusch)

So, in both biology and the humanities there has been a long-standing trend towards developing and using more and more complex models for data analysis. Sometimes, it seems like every little nuance in the data is important, and needs to be modeled.

However, even at the grossest level, complexity can be important. For example, in evolutionary studies, a tree-based model is often adequate for analyzing the origin and development of biodiversity, but it is inadequate for studying many reticulation processes, such as hybridization and transfer (either in biology or linguistics, for example). In the latter case, a network-based model is more appropriate.

Nevertheless, the physicists do have a point. After all, it is a long-standing truism in science that we should keep things simple:

We may assume the superiority, all things being equal, of the demonstration that derives from fewer postulates or hypotheses. (Aristoteles)

It is futile to do with more things that which can be done with fewer. (William of Ockham)

Plurality must never be posited without necessity. (William of Ockham)

Everything should be as simple as it can be, but not simpler. (Albert Einstein)

To this end, it is often instructive to investigate your data with a simple model, before proceeding to a more complex analysis.

Simplicity in phylogenetics

In the case of phylogenetics, there are two parts to a model: (i) the biodiversity model (eg. chain, tree, network), and (ii) the character-evolution model. A simple analysis might drop the latter, for example, and simply display the data unadorned by any considerations of how characters might evolve, or what processes might lead to changes in biodiversity.

This way, we can see what patterns are supported by our actual data, rather than by the data processed through some pre-conceived model of change. If we were physicists, then we might find the outcome to be a more reliable representation of the real world. Furthermore, if the complex model and the simple model produce roughly the same answer, then we may not need "better data".

Modern-day geographic distribution of Dravidian languages (Fig. 1 of Kolipakam, Jordan, et al., 2018)

Historical linguistics of Dravidian languages

Vishnupriya Kolipakam, Fiona M. Jordan, Michael Dunn, Simon J. Greenhill, Remco Bouckaert, Russell D. Gray, Annemarie Verkerk (2018. A Bayesian phylogenetic study of the Dravidian language family. Royal Society Open Science) dated the splits within the Dravidian language family in a Bayesian framework. Aware of uncertainty regarding the phylogeny of this language family, they constrained and dated several topological alternatives. Furthermore, they checked how stable the age estimates are when using different, increasingly elaborate linguistic substitution models implemented in the software (BEAST2).

The preferred and unconstrained result of the Bayesian optimization is shown in their Figs 3 and 4 (their Fig. 2 shows the neighbour-net).

Fig. 3 of Kolipakam et al. (2018), constraining the North (purple), South I (red) and South II (yellow) groups as clades (PP := 1)

Fig. 4 of Kolipakam et al. (2018), result of the Bayesian dating using the same model but not constraints. The Central and South II group is mixed up.

As you can see, many branches have rather low PP support, which is a common (and inevitable) phenomenon when analyzing non-molecular data matrices providing non-trivial signals. This is a situation where support consensus networks may come in handy, which Guido pointed out in his (as yet unpublished) comment to the paper (find it here).

On Twitter, Simon Greenhill (one of the authors) posted a Bayesian PP support network as a reply.

A PP consensus network of the Bayesian tree sample, probably the one used for Fig. 3 of Kolipakam et al. 2018, constraining the North, South I, and South II groups as clades (S. Greenhill, 23/3/2018, on Twitter).

Greenhill, himself, didn't find it too revealing, but for fans of exploratory data analysis it shows, for example, that the low support for Tulu as sister to the remainder of the South I clade (PP = 0.25) is due to lack of decisive signal. In case of the low support (PP = 0.37) for the North-Central clade, one faces two alternatives: it's equally likely that the Central Parji and Olawi Godha are related to the South II group which forms a highly supported clade (PP = 0.95), including the third language of the Central group (one of the topological alternatives tested by the authors).

A question that pops up is: when we want to explore the signal in this matrix, do we need to consider complex models?

Using the simplest-possible model

The maximum-likelihood inference used here is naive in the sense that each binary character in the matrix is treated as an independent character. The matrix, however, represents a binary sequence of concepts in the lexica of the Dravidian languages (see the original paper for details).

For instance, the first, invariant, character encodes for "I" (same for all languages and coded as "1"), characters 2–16 encode for "all", and so on. Whereas "I" (character 1) may be independent from "all" (characters 2–16), the binary encodings for "all" are inter-dependent, and effectively encode a micro-phylogeny for the concept "all": characters 2–4 are parsimony-informative (ie. split the taxon set into two subsets, and compatible); the remainder are parsimony-uninformative (ie. unique to a single taxon).

The binary sequence for "All" defines three non-trivial splits, visualized as branches, which are partly compatible with the Bayesian tree; eg. Kolami groups with members of South I, and within South II we have two groups matching the subclades in the Bayesian tree.

Two analyses were run by the original authors, one using the standard binary model, Lewis’ Mk (1-paramter) model, and allowing for site-specific rate variation modelled using a Gamma-distribution (option -m BINGAMMA). As in the case of morphological data matrices (or certain SNP data sets), and in contrast to molecular data matrices, most of the characters are variable (not constant) in linguistic matrices. The lack of such invariant sites may lead to so-called “ascertainment bias” when optimizing the substitution model and calculating the likelihood.

Hence, RAxML includes an option to correct for this bias for morphological or other binary or multi-state matrices. In the case of the Dravidian language matrix, four out of the over 700 characters (sites) are invariant and were removed prior to rerun the analysis applying the correction (option -m ASC_BINGAMMA). The results of both runs show a high correlation— the Pearson correlation co-efficient of the bipartition frequencies (bootstrap support, BS) is 0.964. Nonetheless, BS support for individual branches can differ by up to 20 (which may be a genuine or random result, we don't know yet). The following figures show the bootstrap consensus network of the standard analysis and for the analysis correcting for the ascertainment bias.

Maximum likelihood (ML) bootstrap (BS) consensus network for the standard analysis. Green edges correspond to branches seen in the unconstrained Bayesian tree in Kolipakam et al. (2018, fig. 4), the olive edges to alternatives in the PP support network by S. Greenhill. Edge values show ML-BS support, and PP for comparison.

ML-BS consensus network for the analysis correcting for the ascertainment bias. BS_asc annotated at edges in bold font, with BS_unc and PP (graph before) provided for comparison. Note the higher tree-likeness of the graph.

Both graphs show that this characters’ naïve approach is relatively decisive, even more so when we correct against the ascertainment bias. The graphs show relatively few boxes, referring to competing, tree-incompatible signals in the underlying matrix.

Differences involve Kannada, a language that is resolved as equally related to Malayam-Tamil and Kodava-Yeruva — BS_asc = 39/35, when correcting for ascertainment bias; but BS_unc < 20/40, using the standard analysis); and Kolami is supported as sister to Koya-Telugu (BS_asc = 69 vs. BSunc. = 49) rather than Gondi (BS_asc < 20, BS_unc = 21).

They also show that from a tree-inference point of view, we don't need highly sophisticated models. All branches with high (or unambiguous) PP in the original analysis are also inferred, and can be supported using maximum likelihood with the simple 1-parameter Mk model. This also means that if the scoring were to include certain biases, the models may not correct against this. At best, they help to increase the support and minimize the alternatives, although the opposite can also be true.

For relationships within the Central-South II clade (unconstrained and constrained analyses), the PP were low. The character-naïve Maximum likelihood analysis reflects some signal ambiguity, too, and can occasionally be higher than the PP. BS > PP values are directly indicative of issues with the phylogenetic signal (eg. lack of discriminative signal, topological ambiguity), because in general PP tend to overestimate and BS underestimate. The only obvious difference is that Maximum likelihood failed to provide support for the putative sister relationship between Ollari Gadba and Parji of the Central group.

The crux with using trees

When inferring a tree as the basis of our hypothesis testing, we do this under the assumption that a series of dichotomies can model the diversification process. Languages are particularly difficult in this respect, because even when we clean the data of borrowings, we cannot be sure that the formation of languages represents a simple split of one unit into two units. Support consensus networks based on the Bayesian or bootstrap tree samples can open a new viewpoint by visualizing internal conflict.

This tree-model conflict may be genuine. For example, when languages evolve and establish they may be closer or farther from their respective sibling languages and may have undergone some non-dichotomous sorting process. Alternatively, the conflict may be due to character scoring, the way one transforms a lexicon into a sequence of (here) binary characters. The support networks allow exploring these phenomena beyond the model question. Ideally, a BS of 40 vs. 30 means that 40% of the binary characters support the one alternative and 30% support the competing one.

In this respect, historical-linguistic and morphological-biology matrices have a lot in common. Languages and morphologies can provide tree-incompatible signals, or contain signals that infer different topologies. By mapping the characters on the alternatives, we can investigate whether this is a genuine signal or one related to our character coding.

Mapping the binary sequences for the concept "all" (example used above to illustrate the matrix basic properties; equalling 15 binary characters) on the ML-BS consensus network. We can see that its evolution is in pretty good agreement with the overall reconstruction. Two binaries support the sister relationship of the South II languages Koya and Telugo, and a third collects most members of the South I group. All other binaries are specific to one language, hence, do not produce a conflict with the edges in the network.

Tuesday, April 25, 2017

The siteswap annotation in juggling, and the power of annotation and modeling

I have been a juggler for more than 20 years now. It started when I was thirteen, and primarily interested in doing magic tricks, but I quickly realized that there are more transparent ways of presenting ones manipulation skills. About 15 years ago, when I was starting my studies in Berlin, there was a booming juggling scene in that city, with many young people, including many geeks who studied mathematics, programming, or physics. I, myself, was studying Indo-European linguistics by then, a field deprived of formalisms and formulas, devoted to the implicit as reflected in scientific prose that is not amenable to formalization, modeling, or transparent annotation.

It was at that time that some jugglers began to develop an annotation system for juggling patterns. The system was very simple, using numbers to denote the height and the direction of balls (or other objects) flying around from hand to hand. The 1 denoted the transfer of one ball from hand to hand without tossing it, the 2 denoted to hold one ball in one hand, the 3 to throw it from one hand to the other with a height required to juggle three balls, the 4 to throw one ball up in the air so that one would catch it with the same hand, and the 5 denoted the crossing from one hand to the other, but this time slightly higher, as required when juggling five balls. Some of these numbers are indicated in these animated GIFs.

The people called this system siteswap, and they claimed that it was a good idea to formalize juggling to increase creativity, since one was not required to throw all of the balls with the same number, but one could combine them, following some basic mathematical ideas.

When people told me about this, I was extremely skeptical, probably due to my classical education, which gave me the conviction that juggling is an art, and an art cannot be describe in numbers. When people tried to teach me siteswaps, I ridiculed them, showing them some complicated patterns involving body movements (see the next GIFs), and told them they would never be able to describe all the creativity of all the jugglers in the world in numbers.

Only a couple of years later, I realized that the geeks had proven me wrong, when, after a longer break, I was again participating in one of the many juggling conventions that take place throughout the year, in different locations in Europe and the whole world. I saw people doing tricks with three balls that I had never thought of before, and I asked them what they were doing. They answered, that these were siteswaps, and they were juggling patterns they called 441, or 531, respectively, as shown in these GIFs:

I gave in completely, when I saw how they applied the same logic to routines with five and more balls, which they called 654, 97531, or 744, respectively. Especially the 97531 fascinated me. During this routine, all of the balls end up in one vertical line in the air, for just a moment, but enough even for laymen to see the vertical line, which then immediately breaks down to a normal five-ball pattern, as shown here.

I realized, how wrong it was to take the un-annotability of something for granted. But even more importantly, I also understood that models, as restrictive as they may seem to be at first sight, may open new pathways for creativity, showing us things we had been ignoring before.

Only recently, when I promised colleagues to juggle during a talk on linguistics, I detected the parallel with my own studies in historical linguistics. For a long time, the field has been held back by people claiming that things could not be handled formally, for various reasons.

But I am realizing more and more that this is not true. We just need to start with something, some kind of model, which may not be as ideal and as realistic as we might wish it to be, but that may eventually help us to detect things we did not see before. We just need to start doing it, walking in baby-steps, improving our models and our annotation, as well as improving our understanding of the limits and the chances of a given formalization.

Needless to say, the patterns that I deemed to be un-annotatable 10 years ago in juggling can now easily be handled by my colleagues. They did not stop with the normal number system, but kept (and keep) developing it, and they take a lot of inspiration from this.