Showing posts with label Family tree. Show all posts

Tuesday, June 27, 2017

Trees do not necessarily help in linguistic reconstruction


In historical linguistics, "linguistic reconstruction" is a rather important task. It can be divided into several subtasks, like "lexical reconstruction", "phonological reconstruction", and "syntactic reconstruction" — it comes conceptually close to what biologists would call "ancestral state reconstruction".

In phonological reconstruction, linguists seek to reconstruct the sound system of the ancestral language or proto-language, the Ursprache that is no longer attested in written sources. The term lexical reconstruction is less frequently used, but it obviously points to the reconstruction of whole lexemes in the proto-language, and requires sub-tasks, like semantic reconstruction where one seeks to identify the original meaning of the ancestral word form from which a given set of cognate words in the descendant languages developed, or morphological reconstruction, where one tries to reconstruct the morphology, such as case systems, or frequently recurring suffixes.

In a narrow sense, linguistic reconstruction refers only to phonological reconstruction, which is something like the holy grail of computational approaches, since, so far, no method has been proposed that would convincingly show that one can do without expert insights. Bouchard-Côté et al. (2013) use language phylogenies to climb a language tree from the leaves to the root, using sophisticated machine-learning techniques to infer the ancestral states of words in Oceanic languages. Hruschka et al. (2015) start from sites in multiple alignments of cognate sets of Turkic languages to infer both a language tree and the ancestral states, along with the sound changes that regularly occurred at the internal nodes of the tree. Both approaches show that phylogenetic methods could, in principle, be used to automatically infer which sounds were used in the proto-language; and both approaches report rather promising results.

Neither of the approaches, however, is fully convincing, for both practical and methodological reasons. First, they are applied to language families that are considered to be rather "easy" to reconstruct. The tough cases are larger language families with more complex phonology, like Sino-Tibetan or any of its subbranches, including even shallow families like Sinitic (Chinese), or Indo-European, where the greatest achievements of the classical methods for language comparison have been made.

Second, they rely on the wrong assumption that the sounds used in a set of attested languages are necessarily the pool of sounds that would also be the best candidates for the Ursprache. For example, Saussure (1879) proposed that Proto-Indo-European had at least two sounds that did not survive in any of the descendant languages, the so-called laryngeals, which are nowadays commonly represented as h₁, h₂, and h₃, and which leave complex traces in the vocalism and the consonant systems of some Indo-European languages. Ever since then, it has been a standard assumption that it is always possible that none of the ancestral sounds in a given proto-language is still attested in any of its descendants.

A third interesting point, which I consider a methodological problem of the methods, is that both of them are based on language trees, which are either given to the algorithm or inferred during the process. Given that most if not all approaches to ancestral state reconstruction in biology are based on some kind of phylogeny, even if it is a rooted evolutionary network, it may sound strange that I criticize this point. But in fact, when linguists use the classical methods to infer ancestral sounds and ancestral sound systems, phylogenies do not necessarily play an important role.

The reason for this lies in the highly directional nature of sound change, especially in the consonant systems of languages, which often makes it extremely easy to predict the ancestral sound without invoking any phylogeny more complex than a star tree. That is, in linguistics we often have a good idea about directed character-state changes. For example, if a linguist observes a [k] in one set of languages and a [ts] in another set of languages in the same alignment site of multiple cognate sets, then they will immediately reconstruct a *k for the proto-language, since they know that [k] can easily become [ts] but not vice versa. The same holds for many sound correspondence patterns that can be frequently observed among all languages of the world, including cases like [p] and [f], [k] and [x], and many more. Why should we bother about any phylogeny in the background, if we already know that it is much more likely that these changes occurred independently in several languages? Directed character-state assessments make a phylogeny unnecessary.
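
The reasoning in this paragraph can be sketched in a few lines of Python. This is only a minimal illustration, assuming a tiny hand-made table of directed changes (the three pairs named above); the table `DIRECTED_CHANGES` and the function `reconstruct_site` are hypothetical names, not part of any existing tool. No tree is consulted: the function looks at the sounds in one alignment site and asks which attested sound could be the source of all the others.

```python
# Hypothetical mini-catalogue of directed sound changes (source -> outcomes).
# Only the pairs mentioned in the text are listed; a real catalogue would be far larger.
DIRECTED_CHANGES = {
    "k": {"ts", "x"},
    "p": {"f"},
}


def reconstruct_site(reflexes):
    """Reconstruct one alignment site without any phylogeny (star tree).

    Returns "*" plus the attested sound from which every other attested
    sound can derive in one step, or None if no unique such sound exists.
    """
    sounds = set(reflexes)
    candidates = [
        source
        for source in sorted(sounds)
        if all(
            other == source or other in DIRECTED_CHANGES.get(source, set())
            for other in sounds
        )
    ]
    return "*" + candidates[0] if len(candidates) == 1 else None


# How often [ts] occurs does not matter; only the direction of change does.
print(reconstruct_site(["ts", "ts", "ts", "k"]))
```

Note that the sketch deliberately ignores frequency: even if most languages show [ts], the single [k] wins, which is exactly where a parsimony count on a tree would go wrong. Its obvious limitation is that it only considers sounds that are still attested.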

Sound change in this sense is simply not well treated in any paradigm that assumes some kind of parsimony, as it simply occurs too often independently. The question is less acute with vowels, where scholars have observed cycles of change in ancient languages that are attested in written sources. Even more problematic is the change of tones, where scholars have even less intuition regarding preference directions or preference transitions; and also because ancient data does not describe the tones in the phonetic detail we would need in order to compare it with modern data. In contrast to consonant reconstruction, where we can do almost exclusively without phylogenies, phylogenies may indeed provide some help to shed light on open questions in vowel and tone change.

But one should not underestimate this task, given the systemic pressure that may crucially affect vowel and tone systems. Since there are considerably fewer empty spots in the vowel and tone space of human languages, it can easily happen that the most natural paths of vowel or tone development (if they exist in the end) are counteracted by systemic pressures. Vowels can be more easily confused in communication, and this holds even more for tones. Even if changes are "natural", they could create conflict in communication, if they produce very similar vowels or tones that are hard for speakers to distinguish. As a result, these changes could provoke mergers in sounds, with speakers no longer distinguishing them at all; or alternatively, changes that are less "natural" (physiologically or acoustically) could be preferred by a speech community in order to maintain the effectiveness of the linguistic system.

In principle, these phenomena are well-known to trained linguists, although it is hard to find any explicit statements in the literature. Surprisingly, linguistic reconstruction (in the sense of phonological reconstruction) is hard for machines precisely because it is easy for trained linguists. Every historical linguist has a catalogue of existing sounds in their head as well as a network of preference transitions, but we lack a machine-readable version of those catalogues. This is mainly because transcription systems differ widely across subfields and families, and because no efforts to standardize these transcriptions have been successful so far.

Without such catalogues, however, any efforts to apply vanilla-style methods for ancestral state reconstruction from biology to linguistic reconstruction in historical linguistics will be futile. What we need for linguistic reconstruction is not the trees, but the network of potential pathways of sound change.
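
To make the idea of a machine-readable network of pathways a little more concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the network `PATHWAYS` contains a handful of changes that are commonly reported ([k] to [ts] or [x], [ts] to [s], [x] to [h], [p] to [f], [f] to [h]), and the function names are hypothetical. The sketch searches the network for sounds that can reach all attested reflexes, so the proto-sound may be one that no descendant language preserves.

```python
from collections import deque

# Hypothetical pathway network: each edge is a sound change assumed to be common.
PATHWAYS = {
    "k": ["ts", "x"],
    "ts": ["s"],
    "x": ["h"],
    "p": ["f"],
    "f": ["h"],
}


def reachable(source):
    """Map every sound reachable from `source` to its distance in change steps."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for target in PATHWAYS.get(current, []):
            if target not in dist:
                dist[target] = dist[current] + 1
                queue.append(target)
    return dist


def proto_candidates(reflexes):
    """Rank all catalogue sounds that can reach every reflex, fewest total steps first."""
    inventory = set(PATHWAYS) | {t for targets in PATHWAYS.values() for t in targets}
    ranked = []
    for sound in sorted(inventory):
        dist = reachable(sound)
        if all(reflex in dist for reflex in reflexes):
            ranked.append((sum(dist[reflex] for reflex in reflexes), sound))
    return [sound for _, sound in sorted(ranked)]


# [s] and [h] are attested, but the only common source in this toy network is [k].
print(proto_candidates(["s", "h"]))
```

Counting steps is only a stand-in for the "preference transitions" mentioned above; a serious catalogue would need weights for how likely each change is, and it would have to cope with the systemic pressures on vowels and tones that make simple pathways unreliable.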

References
  • Bouchard-Côté, A., D. Hall, T. Griffiths, and D. Klein (2013): Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11. 4224–4229.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25.1. 1–9.
  • Saussure, F. (1879): Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.

Tuesday, March 7, 2017

Roundels and family trees


I have written before about the slow development of what has come to be known as the "family tree", including reducing human network relationships to a tree-like form (Reducing networks to trees), and presenting it as an actual tree (Drawing family trees as trees), rooted at the base (Does it matter which way up a tree is drawn?).

Most of the early representations of pedigrees had the people's names enclosed in a circle, called a "roundel", and it was these roundels that were connected to show the family relationships. One of the steps on the way to a tree was thus dropping this idea, so that the names could be connected directly.

Some of the diagrams with roundels that I have covered include:

c. 400 CE — The genealogy of Jesus Christ, Part I, Part II, Part III
c. 1000 CE — Genealogy of Cunigunde of Luxembourg
c. 1140 CE — Genealogy of the Carolingians
c. 1185 CE — Genealogy of the Welf dynasty
c. 1237 CE — Genealogy of the Ottonian dynasty

Interestingly, the earliest pedigrees that do not have roundels also date from this early period. As noted by Nathaniel Lane Taylor, the importance of this development is that: "the scribe relies on the power of the names themselves to anchor a diagram on the page, with lines simply taking the place of any syntax needed to describe the filiation." That is, no abstract iconography is needed.

I have already illustrated the earliest known example:
c. 1121 CE — The genealogy of Lambert of Saint-Omer


Taylor provides links to illustrations of the next known example:
   c. 1128, John of Worcester, Chronicle of World and English History (Corpus Christi College MS 157).
This book contains eight genealogies of Anglo-Saxon and Norman kings (pp. 47-54), one of which is shown above.

Taylor also refers to "one of the Arabic stemmata" illustrated in:
   Arthur Watson (1934) The Early Iconography of the Tree of Jesse. Oxford University Press.
I have not seen this book, but the illustrations are apparently confined to those from the 12th century, making the diagram contemporaneous with the two listed above. The Tree of Jesse normally appears in Medieval Christian art as a richly illustrated genealogy of Jesus in illuminated manuscripts, but apparently this one was an exception.

Wednesday, October 7, 2015

The Wave Theory: the predecessor of network thinking in historical linguistics


Dendrophilia

It has been mentioned in a couple of previous blogposts that tree-thinking started rather early in historical linguistics (Morrison 07/2013 and Morrison 11/2012).

Although he was not the first to draw language trees, it was August Schleicher (1821-1866) who made tree-thinking quite popular in linguistics with his two papers published in 1853 (1853a and 1853b). Note that there was no notable influence by Darwin here. It is more likely that Schleicher was influenced by stemmatics (manuscript comparison, Hoenigswald 1963: 8); and even today, historical linguistics has certain features that resemble manuscript comparison much more closely than evolutionary biology. It seems that Schleicher's enthusiasm for the drawing of language trees had quite an impact on Ernst Haeckel (1834-1919), since – as Schleicher pointed out himself (Schleicher 1863) – linguistic trees by then were concrete and not abstract like the one Darwin showed in his Origin of Species (Darwin 1859).


Dendrophobia

Schleicher's tree-thinking, however, did not last very long in the world of historical linguistics. By the beginning of the 1870s Hugo Schuchardt (1842-1927) and Johannes Schmidt (1843-1901) had published critical views, arguing that vertical descent is not all that language evolution is about (Schmidt 1872, Schuchardt 1870). Schuchardt was (at least in my opinion) really concrete and observant in his criticisms, especially pointing to the problem of borrowing between very closely related languages, which might deeply confuse the phylogenetic signal:
We connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree. (Schuchardt 1870: 11, my translation)
While Schuchardt's observations were based on his deep knowledge of the Romance languages, Schmidt drew his conclusions from a thorough investigation of shared homologous words in the major branches of Indo-European. What he found here were patterns of words in a strongly patchy distribution, with many gaps in certain languages and only a few patterns (if any) that could be found in all languages. One seemingly surprising fact was, for example, that Greek and Sanskrit shared about 39% of homologs (according to Schmidt's count, see Geisler and List 2013), Greek and Latin shared 53%, but Latin and Sanskrit only 8%. Assuming that Greek and Latin had a common ancestor, Schmidt found it very difficult to explain how the similarities of these two languages with Sanskrit could be so different (Schmidt 1872: 24). Furthermore, this pattern of patchy distributions seemed to be repeated in all branches of Indo-European that Schmidt compared in his investigation. Schmidt thus concluded:
No matter how we look at it, as long as we stick to the assumption that today's languages originated from their common proto-language via multiple furcation, we will never be able to explain all facts in a scientifically adequate way. (Schmidt 1872: 17, my translation).
Unfortunately, Schmidt did not stop with this conclusion but proposed another model of language divergence instead of the family tree model:
I want to replace [the tree] by the image of a wave that spreads out from the center in concentric circles becoming weaker and weaker the farther they get away from the center. (Schmidt 1872: 27, my translation)
Ever since then, this new model, the so-called wave theory (Wellentheorie in German), has lurked in textbooks on historical linguistics, confusing especially those who are not primarily trained in historical linguistics. What is the wave theory in the end? How could it replace the tree? While Schmidt did not give a visualization in his book from 1872, he gave one three years later (Schmidt 1875: 199):


What we can see from this figure is that we can't see anything: It displays languages in a pie-chart diagram in a quasi-geographic space. No information regarding ancestral states of the languages is given, and no temporal dynamics are shown. I find Schmidt's descriptions of the wave theory hard to understand in their core. He doesn't seem to ignore that evolution has a time dimension, but he seems to deliberately neglect it when drawing his waves.

Other scholars, like Hirt (1905), Bloomfield (1933), Meillet (1908), or Bonfante (1931), proposed similar and alternative ways to visualize Schmidt's wave, as shown in the image below. In contrast to the language trees which – after Schleicher's initial rather "realistic" tree drawings – quickly began to be schematized in historical linguistics, the correct way to draw a wave has remained a mystery up to today.


Problems with Waves and Trees

When reading Schmidt's book from 1872 and also inspecting his data, certain fallacies in his argumentation become obvious. Firstly, he claims that the low number of shared homologs between Sanskrit and Latin would be a problem for a family tree theory. This is of course no problem, as long as we do not assume that the loss of words follows an evolutionary clock. Furthermore, Schmidt underestimated the epistemological aspect of our knowledge. When comparing the three languages in alternative counts of more recent etymological databases (see Geisler and List 2013 for details), the scores change considerably, with Latin and Greek sharing 40%, Greek and Sanskrit sharing 39% and Latin and Sanskrit sharing (already) 21%. Although no complete account of Schmidt's data is available in digital form, I think we can assume that the data that forced Schmidt to assume that there is no tree behind the Indo-European languages would not scare off an evolutionary dendrophilist. Whether the tree that the different phylogenetic frameworks would infer from Schmidt's data corresponds to any reality of Indo-European language formation is another question, but the data may well be quite tree-like, despite what Schmidt saw in it.
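
The first point, that unequal sharing is no problem without an evolutionary clock, can be illustrated with a toy calculation. This is only a sketch under strong assumptions: the tree ((Greek, Latin), Sanskrit) is taken as given, words are only lost and never borrowed or innovated, and all retention probabilities are invented for the example. It does not use or reproduce Schmidt's counts.

```python
from math import prod

# Branches on the path from the proto-language to each language in the
# assumed tree ((Greek, Latin), Sanskrit). "GL" is the common Greek-Latin branch.
BRANCHES = {
    "Greek": ["GL", "G"],
    "Latin": ["GL", "L"],
    "Sanskrit": ["S"],
}

# Hypothetical probabilities that an inherited word survives along each branch.
# Latin is given a much higher loss rate than Greek: there is no clock.
RETENTION = {"GL": 0.9, "G": 0.8, "L": 0.3, "S": 0.7}


def expected_shared(lang_a, lang_b):
    """Expected fraction of proto-words that survive in both languages."""
    # A word is shared if it survives every branch leading to either language;
    # a branch common to both paths is only counted once.
    branches = set(BRANCHES[lang_a]) | set(BRANCHES[lang_b])
    return prod(RETENTION[branch] for branch in branches)


for pair in [("Greek", "Sanskrit"), ("Latin", "Sanskrit"), ("Greek", "Latin")]:
    print(pair, round(expected_shared(*pair), 3))
```

Under these made-up rates, Greek shares far more inherited words with Sanskrit than Latin does, although Greek and Latin are sisters in the assumed tree. The difference comes entirely from the unequal loss rates, so a gap of this kind is not by itself evidence against a tree.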

A further problem of the wave theory is that people contrast it with the family tree model. This does not seem to be justified, since -- as we can see from the visualizations shown above -- the wave theory ignores the temporal dimension of divergence and convergence. In this sense, it is a pure data display model, similar to a data-display network (Morrison 2011: 5-9) to which some geographical information has been added. As long as the wave theory shows only similarities between taxonomic units based on some kind of underlying data, it is neither a "theory" nor a hypothesis. It is no opponent of the family tree, since it serves a completely different purpose.

What Schuchardt already mentioned, and what Schmidt might have been looking for, was the idea of phylogenetic networks: if we cannot ignore the fact that languages exchange material laterally as well as inheriting it vertically, we "connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree" (Schuchardt 1870: 11).

References
  • Bloomfield, L. (1933 [1973]). Language. London: Allen & Unwin. 
  • Bonfante, G. (1931). “I dialetti indoeuropei”. Annali del R. Istituto Orientale di Napoli 4, 69–185.
  • Darwin, C. (1859). On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. Electronic resource. Online available under: http://www.nla.gov.au/apps/cdview/nla.gen-vn4591931. London: John Murray.
  • Geisler, H. and J.-M. List (2013). “Do languages grow on trees? The tree metaphor in the history of linguistics”. In: Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Ed. by H. Fangerau, H. Geisler, T. Halling and W. Martin. Stuttgart: Franz Steiner Verlag, 111–124.
  • Hirt, H. (1905). Die Indogermanen. Ihre Verbreitung, ihre Urheimat und ihre Kultur. Bd. 1. Strassburg: Trübner. Internet Archive: dieindogermaneni01hirtuoft.
  • Hoenigswald, H. M. (1963). “On the history of the comparative method”. Anthropological Linguistics 5.1, 1–11. URL: http://www.jstor.org/stable/30022394.
  • Meillet, A. (1922 [1908]). Les dialectes Indo-Européens. Paris: Librairie Ancienne Honoré Champion. Internet Archive: lesdialectesindo00meil.
  • Morrison, D. A. (2011). An introduction to phylogenetic networks. Uppsala: RJR Productions.
  • Schleicher, A. (1853a). “Die ersten Spaltungen des indogermanischen Urvolkes”. Allgemeine Monatsschrift für Wissenschaft und Literatur, 786–787.
  • Schleicher, A. (1853b). “O jazyku litevském, zvláště na slovanský. Čteno v posezení sekcí filologické král. České Společnosti Nauk dne 6. června 1853”. Časopis Českého Museum 27, 320–334. URL: http://books.google.de/books?id=cLMDAAAAYAAJ.
  • Schleicher, A. (1863). Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Weimar: Hermann Böhlau. ZVDD: urn:nbn:de:bvb:12-bsb10588615-5.
  • Schmidt, J. (1872). Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.
  • Schmidt, J. (1875). Zur Geschichte des Indogermanischen Vokalismus. Weimar: Hermann Böhlau.
  • Schuchardt, H. (1870 [1900]). Über die Klassifikation der romanischen Mundarten. Probe-Vorlesung, gehalten zu Leipzig am 30. April 1870. Graz. URL: http://schuchardt.uni-graz.at/cgi-bin/print.cgi?action=show&type=pdf&id=724.