
Monday, July 22, 2019

Two problems concerning the use of Ancient DNA


Last week I wrote a piece for The Wine Gourd blog, called The role of Wine Influencers — more of the same. I discussed the modern concern in the wine industry with social media Influencers, who use Facebook, Instagram, Twitter, YouTube, etc. to promote wine — when LeBron James drinks a wine it will sell a whole lot better (presumably on the principle that “You may not be able to play like LeBron, but you can drink like him”).

My conclusion was that the wine industry has always had what are now called Micro- or Nano-Influencers, involving endorsements from people and organizations who possess an expert level of knowledge as well as social influence. For example, professional wine critics have always fitted this bill, notably Robert M. Parker Jr.

So, the existence of social media Wine Influencers is nothing new — it is simply the modern equivalent of something old.


Well, my blog post here is about the same idea in Ancient-DNA phylogenetics — the idea that, in spite of the claim that modern techniques provide new advantages, we may in fact simply be repeating ourselves. Modern issues are simply modern versions of the same old issues.

First problem

The first issue that I would like to raise is that of molecular data. This is seen as the crucial element of modern studies of ancient remains. Even the recent re-creation of the vineyard of Leonardo da Vinci (La Vigna di Leonardo, in Milan) involved finding sufficient DNA in the vineyard land, which was bombed during World War II, to identify the grape cultivar that was grown by da Vinci (Inside Leonardo da Vinci's vineyard).

The issue is that DNA studies, based on direct studies of genotype, are subject to all of the same data-analysis issues as are studies of phenotype (such as morphology, anatomy and ultrastructure).

One classic example is the supposed discovery in the 1980s of the phenomenon of Long-branch Attraction (LBA) in molecular studies. Here, if many identical nucleotide changes occur independently on distantly related branches of a phylogenetic tree, then these branches may be erroneously reconstructed as sister lineages during the phylogenetic analysis. However, this is simply an example of parallelism, a phenomenon that had been known for decades in phylogenetic analyses of phenotype.
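To make this concrete, here is a minimal sketch of the effect, using Fitch parsimony on a quartet of taxa with invented site patterns (nothing here is real data): sites where taxa A and C have independently acquired the same state outnumber the sites carrying the true signal, so the tree that groups the two long branches receives the lower (better) parsimony score.

```python
# Minimal sketch of long-branch attraction under parsimony (toy data).
# The "true" tree is ((A,B),(C,D)), but A and C share many parallel
# changes, so parsimony prefers ((A,C),(B,D)).

def fitch(tree, site):
    """Return (possible states, number of changes) for a nested-tuple tree."""
    if isinstance(tree, str):               # leaf: its observed state
        return {site[tree]}, 0
    left, right = tree
    ls, lc = fitch(left, site)
    rs, rc = fitch(right, site)
    if ls & rs:                             # subtrees agree: no extra change
        return ls & rs, lc + rc
    return ls | rs, lc + rc + 1             # subtrees conflict: one change

def parsimony(tree, sites):
    return sum(fitch(tree, s)[1] for s in sites)

# two sites reflect the true history (A+B versus C+D) ...
signal = [{"A": "t", "B": "t", "C": "g", "D": "g"}] * 2
# ... but five sites show the same parallel change in A and in C
parallel = [{"A": "a", "B": "t", "C": "a", "D": "t"}] * 5
sites = signal + parallel

print(parsimony((("A", "B"), ("C", "D")), sites))  # true tree: 12 changes
print(parsimony((("A", "C"), ("B", "D")), sites))  # LBA tree: 9 changes (preferred)
```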

Many currently recognized practical problems in genotype studies, such as LBA and compositional biases, are merely specific examples of how analogy appears in molecular biology. Analogy will create convergences and parallelisms, and these will confound the attempt to detect homology.

So, reconstructing evolutionary history using molecular biology is a priori neither better nor worse than using any other source of data, because the same limitations apply. It is simply another type of data.

Second problem

The second issue that I would like to raise is that genome data are a type of Big Data, along with the idea that Big Data will apparently solve all ills of data analysis. The idea seems to be that, if you can collect enough data, then you must be led to "the truth".

This is nonsense — data are just numbers, and numbers can mislead, no matter how many there are. Data need to be interpreted by a human mind, if they are to tell that mind anything useful. The only thing that changes with the use of Big Data is the order in which the steps of the data analysis and interpretation occur.

In the Old Days (i.e. when I was a student), what we did was:
  1. develop an experimental question
  2. think about potential problems
  3. collect targeted data
  4. analyze the data
  5. interpret the data, to answer the question.
These days, with Big Data, what people do is:
  1. collect a very large amount of data
  2. analyze the data, and try to interpret it
  3. think of a question that the data might answer
  4. discover the potential problems later.
All that is really different is the order, along with which steps are confounded with which other steps.

I don't see that this is necessarily any better; it is just different. So, don't pin your hopes on Ancient DNA genome-scale data to solve problems with your work.

Other issues

Anyone working with Ancient DNA knows that there are oodles of other problems. Some of them are discussed for the general public by Gideon Lewis-Kraus, writing on January 17, 2019 for The New York Times Magazine: Is Ancient DNA Research revealing new truths — or falling into old traps? The answer is, of course, "both".

Monday, August 27, 2018

Regular cognates: A new term for homology relations in linguistics


The identification of homologous words between genealogically related languages is one of the crucial tasks in historical linguistics. In contrast to biology where, especially at the level of genetic sequences, we find a rather rich terminology contrasting different types of homology among genes and gene sequences, linguistic terminology is still not very precise. Most scholars seem to be content if they can claim that they have identified words that are cognate, which means that they are homologous but have not been borrowed throughout their history.

On various occasions in the past, I have tried to work on a more precise terminology for linguistic frameworks (see for example List 2014 and List 2016, or this earlier blogpost on homology in linguistics). In this context, I have often tried to emphasize that we need to be especially careful with the problem of partial cognacy in linguistics, since many words across related languages are not fully homologous, but show homology only in specific parts (List et al. 2016).

Thanks to an increase in accurately annotated linguistic data, resulting specifically from my very productive collaboration with Nathan W. Hill (SOAS, London) on the Burmish languages (see Hill and List 2017), my view has now again changed a bit, and I thought it would be useful to share it here.

Cognacy and homology

The starting point for my earlier proposals to refine the notion of cognacy in linguistics was the rather refined distinction between orthologs, paralogs, and xenologs in molecular biology (Fitch 2000). To account for the distinction between directly inherited (orthologs), duplicated (paralogs), and laterally transferred genes (xenologs), I proposed the terms direct cognates, indirect cognates (inspired by the term oblique cognates by Trask 2000), and indirectly etymologically related words or morphemes (word parts).

While the first and last terms are more or less straightforward with respect to linguistic processes, the notion of indirect cognates turned out to be insufficient, given that it is not clear which processes lead to indirect cognacy. Originally, I thought of morphological processes, that is, processes of word formation, by which a word is slightly modified to account for a slightly derived meaning (usually involving processes like suffixation or compounding). My idea was that words that have "experienced" these processes would behave similarly to genes that have been duplicated in biological evolution, and that it would be sufficient to just assign them to a common sub-class of cognates.

However, the research with Nathan W. Hill recently revealed that these terms are insufficient to capture the processes underlying lexical change in historical linguistics.

In order to understand this idea, it is useful to get back to the biological terms and have a closer look at how they distinguish the underlying processes. As far as I understand it, a directly inherited gene sequence may differ from its ancestral sequence due to processes of random mutation, by which the original gene sequence becomes modified throughout its history. In cases of paralogy, the original gene sequence is duplicated and both copies are subsequently inherited. The copies may, during this process, become more different from each other than would be expected when assuming direct inheritance and random mutation. Similarly, in cases of lateral transfer of genetic material, the changes may again be different from the ones introduced by "normal" random mutation.

If we adopt the view of "normal change", as it is employed for the biological processes, we find a counterpart in the process of sound change in linguistics. As I have mentioned earlier, sound change is a systemic process by which certain sounds in certain environments change regularly across all words in the lexicon of a given language. This process is definitely not comparable with random mutation in sequence evolution, since it involves a class of "letters" in the sound system of a language that are systematically turned into another sound. However, given the crucial role that sound change plays in language evolution, it seems that it is in some sense comparable with random mutation resulting in orthologous genes. Sound change is, in a sense, the baseline of what happens when languages change, and we have the means to identify its traces by searching for regular sound correspondence patterns across related languages (see my earlier blogpost on this matter).
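To give a concrete, if toy, picture of what searching for such correspondence patterns involves, the following sketch simply counts aligned segment pairs across a handful of English-German cognates; the segmentations are simplified by me for illustration:

```python
# Counting sound correspondences across toy English-German cognate pairs.
# Segmentations are deliberately simplified; real analyses use carefully
# aligned phonetic transcriptions.
from collections import Counter

aligned_pairs = [                            # gap-free toy alignments
    (["t", "ɛ", "n"],  ["ts", "eː", "n"]),   # ten  : zehn
    (["t", "uː"],      ["ts", "uː"]),        # to   : zu
    (["t", "eɪ", "m"], ["ts", "aː", "m"]),   # tame : zahm
]

corr = Counter()
for eng, ger in aligned_pairs:
    for e, g in zip(eng, ger):
        corr[e, g] += 1

# English t : German ts recurs across all three sets, making it a
# candidate regular correspondence (the High German consonant shift)
for (e, g), n in corr.most_common():
    print(f"English {e} : German {g} ({n}x)")
```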

Sound change is thus the default process, which can be handled with some confidence, while other processes are much harder to track: word formation, semantic change, and the notorious process of analogical leveling, by which complex paradigms are not only transformed to reduce complexity, but new complexities can also emerge (compare the irregular German plural Morgen-de "mornings", which is built on the template of Abend-e "evenings"). This is also the reason why Gévaudan (2007) does not include sound change among the major processes of lexical change. If we take sound change as the default process of language change and as our key evidence for homologous word relations, however, this means that we can no longer make the distinction between direct and indirect cognates following my earlier proposal, since indirect cognates do not necessarily reflect instances of irregular sound change.

This is in fact easy to illustrate. If we follow the former definition of indirect cognacy, the comparison of German Handschuh "glove" (lit. hand-shoe) with English hand would reflect indirect cognacy, since the German word is a compound of Hand "hand" and Schuh "shoe", and thus a derived word form. The morpheme Hand in this compound, however, is phonetically identical with the simple German word Hand, and the sound correspondences between the English word and the first element of the German compound are still regular by all means. In fact, only a small number of word-formation processes in language evolution also impact the pronunciation of the base forms.

This means, in turn, that any distinction of cognate word forms (and word parts, i.e., morphemes) into direct and indirect ones that is based on the absence or presence of morphological (= word formation) processes does not tell us much about the degree to which the sound change affecting these word forms was regular. We could state that direct cognates should always reflect regular sound change, since any irregularity would have to be accounted for by alternative explanations (e.g. shortening of a given word due to frequent use, assimilation of sounds serving the ease of pronunciation, etc.).

I wonder whether this would be useful for the initial idea behind the concept of direct cognacy. If we find direct cognates, that is, words that we assume have been passed on in a number of related languages without further modification, apart from regular sound change and potentially sporadic sound changes, it still seems useful to assume that these reflect vertical language history better than cognate sets with reflexes that were exposed to various morphological processes. Thus, when coding direct cognacy in linguistic datasets, sporadic sound change (if it can be illustrated properly) should not serve as an argument against direct cognacy.

The only way around this problem seems to be to establish a further shade of cognacy, which describes the relations among words and morphemes that have only been affected by sound change, in contrast to words whose history reflects various morphological derivations that impact directly on pronunciation, or processes of irregular sound change due to analogical leveling or assimilation. While I first thought that the biological term ortholog would be useful to describe these specific word relations in linguistics, I realized later that, judging from the Ancient Greek meaning of ortholog (ortho "straight, direct" + logos "relation"), the fact that differences are due to regular sound change is not that neatly reflected.

For now, I think that it should be sufficient to use the term regular cognates for those words or word parts for which we can demonstrate that their change followed the regular "laws" of sound change. Regular cognates are thus defined as words or word parts that have been affected only by sound change during their history. This notion deliberately excludes differences in meaning, frequency of use, or whether the word forms are only reflected in compounds or derived word forms. In fact, for some cases, we could even propose that only parts of a word form that no longer bear any meaning of their own (e.g. the first two sounds of a word form) are regular cognates, as long as we can propose good arguments for the regularity of the correspondences.

Note that our tools for alignment analyses in historical linguistics already account for this property. The EDICTOR (http://edictor.digling.org, List 2017), a web-based tool for editing, analyzing, and publishing etymological dictionaries, allows users to exclude those parts from an alignment that are assumed to be irregular, as can be seen in the following illustrative alignment of the reflexes of Proto-Germanic *bakanan "to bake". Scholars who want to be explicit about what parts of an alignment they consider to be regular can use this annotation framework to provide more refined analyses.

EDICTOR alignment of regular cognates for Proto-Germanic *bakanan "to bake"
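In code, the annotation idea boils down to marking the alignment columns that should not enter the comparison. Here is a toy sketch, with rough segmentations invented by me (not the actual EDICTOR data), for three reflexes of *bakanan:

```python
# Toy alignment of reflexes of Proto-Germanic *bakanan "to bake".
alignment = {
    "German":  ["b", "a", "k", "ə", "n"],    # backen
    "English": ["b", "eɪ", "k", "-", "-"],   # bake
    "Swedish": ["b", "a", "k", "a", "-"],    # baka
}
EXCLUDED = {3, 4}   # suffix columns judged not to reflect regular change

for language, row in alignment.items():
    kept = [seg for i, seg in enumerate(row) if i not in EXCLUDED]
    print(f"{language:8s}", " ".join(kept))   # only the regular core remains
```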

A crucial consequence of using only regularity in the sound correspondences as the criterion to distinguish regular from irregular cognates is that regular cognacy may also be found to hold for borrowings, since borrowings can likewise be shown to be regular, especially when the contact between the languages was intensive. Identifying regular cognates is furthermore the first and most important step of the classical comparative method (Weiss 2015) for historical language comparison, since (unless we have written evidence for the true relations between languages) regular cognates (as proven by readily aligned cognate sets) are the foundation upon which we build all our hypotheses regarding the external history of languages.

References
Fitch, W. (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5: 227-231.
Hill, N. and J.-M. List (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.
List, J.-M. (2014) Sequence Comparison in Historical Linguistics. Düsseldorf University Press: Düsseldorf.
List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2: 119-136.
List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association for Computational Linguistics 2016 (Volume 2: Short Papers). Association for Computational Linguistics, pp. 599-605.
List, J.-M. (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations, pp. 9-12.
Trask, R. (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh University Press: Edinburgh.
Weiss, M. (2015) The comparative method. In: Bowern, C. and N. Evans (eds.) The Routledge Handbook of Historical Linguistics. Routledge: New York, pp. 127-145.

Tuesday, January 31, 2017

Similarities and language relationship


There is a long-standing debate in linguistics regarding the best proof of deep relationships between languages. Scholars often break it down to the question of words vs. rules, or lexicon vs. grammar. However, this is essentially misleading, since it suggests that only one type of evidence could ever be used, whereas most of the time it is the accumulation of multiple pieces of evidence that helps to convince scholars. Even if this debate is misleading, it is interesting, since it reflects a general problem of historical linguistics: the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with similarities. Languages can be strikingly similar in various ways. They can share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can easily be created by compounding existing ones, and the word for 'train' is expressed by combining huǒ 火 'fire' and chē 車 'wagon'. The same can be done in languages like German and English, where the words Feuerwagen and fire wagon will be interpreted slightly differently by the speakers, but the constructions are nevertheless valid candidates for words in both languages. In Russian, on the other hand, it is not possible to just put two nouns together to form a new word; instead, one needs to say something like огненная машина (ognyonnaya mašína), which literally could be translated as 'fiery wagon'.

Neither German nor English is historically closely related to Chinese, but German, English, and Russian go back to the same relatively recent ancestral language. We can see that whether or not a language allows the compounding of two words to form a new one is not really indicative of its history, just as with the question of whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different sources, and not all of them are due to historical development. Similarities can be:
  1. coincidental (simply due to chance),
  2. natural (being grounded in human cognition),
  3. genealogical (due to common inheritance), and
  4. contact-induced (due to lateral transfer).
As an example for the first type of similarity, consider the Modern Greek word θεός [θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound similar, but this is a sheer coincidence. This becomes clear when comparing the oldest ancestor forms of the words that are reflected in written sources, namely Old Latin deivos, and Mycenaean Greek thehós (Meier-Brügger 2002: 57f).

As an example of the second type of similarity, consider the Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels like [a] and the nasal consonant [m] (Jakobson 1960).

An example of genealogical similarity is the German Zahn and the English tooth, both going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is reflected in the English mountain and the French montagne, since the former was borrowed from the latter.

We can display these similarities in the following decision tree, along with examples from the lexicon of different languages (see List 2014: 56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to indicate that they are historical similarities. They reflect individual language development, and allow us to investigate the evolutionary history of languages. Natural and coincidental similarities, on the other hand, are not indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial to first rule out the non-historical similarities, and then the contact-induced similarities. The non-historical similarities will only add noise to the historical signal, and the contact-induced similarities need to be separated from the genealogical similarities, in order to find out which languages share a common origin and which languages have merely influenced each other some time during their history.

Unfortunately, it is not trivial to disentangle these similarities. Coincidence, for example, seems to be easy to handle, but it is notoriously difficult to calculate the likelihood of chance similarities. Scholars have tried to model the probability of chance similarities mathematically, but their models are far too simple to provide us with good estimates, as they usually consider only the first consonant of a word in no more than 200 words of each language (Ringe 1992, Baxter and Manaster Ramer 2000, Kessler 2001).

The problem here is that everything that goes beyond word-initial consonants would have to take the probability of word structures into account. However, since languages differ greatly regarding their so-called phonotactic structure (that is, the sound combinations they allow to occur inside a syllable or a word), an account of chance similarities would need to include a probabilistic model of possible and language-specific word structures. So far, I am not aware of anybody who has tried to tackle this problem.
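For the word-initial case that the published models do handle, the arithmetic is simple enough to sketch. The consonant-class frequencies below are invented purely for illustration; the point is only to show how an expected number of chance matches, and a binomial tail probability, fall out of such frequencies:

```python
# Back-of-the-envelope model of chance matches of word-initial consonants
# (in the spirit of Ringe 1992). All frequencies below are invented.
from math import comb

freq_lang1 = {"t": 0.20, "k": 0.15, "s": 0.25, "m": 0.10, "other": 0.30}
freq_lang2 = {"t": 0.18, "k": 0.20, "s": 0.22, "m": 0.15, "other": 0.25}

# probability that the words for one concept match by chance
p = sum(freq_lang1[c] * freq_lang2[c] for c in freq_lang1)

n = 200                                     # concepts compared
print(f"expected chance matches: {n * p:.1f} of {n}")

# binomial tail: probability of at least k matches arising by chance
k = 60
tail = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
print(f"P(at least {k} matches by chance) = {tail:.4f}")
```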

Even more problematic is the second type of similarity. At first sight, it seems that one could capture natural similarities by searching for similarities that recur in very diverse locations of the world. If we compare, for example, which languages have tones, and we find that tones occur almost all over the world, we could argue that the existence of tone languages is not a good indicator of relatedness, since tonal systems can easily develop independently.

The problem with independent development, however, is again tricky, as we need to distinguish different aspects of independence. Independent development could be due to: human cognition (the fact that many languages all over the world denote the bark of a tree with a compound tree-skin is obviously grounded in our perception); or due to language acquisition (like the case of words for 'mother'); but potentially also due to environmental factors, such as the size of the population of speakers (Lupyan and Dale 2010), or the location where the languages are spoken (see Everett et al. 2015, but also compare the critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to denote similar development due to contact) is a very frequent phenomenon in language evolution, and can happen in all domains of language. Often we simply do not know enough to make a qualified assessment as to whether certain features that are similar among languages are inherited/borrowed or have developed independently.

Interestingly, this was first emphasized by Karl Brugmann (1849-1919), who is often credited as the "father of cladistic thinking" in linguistics. Linguists usually quote his paper from 1884, in order to emphasize the crucial role that Brugmann attributed to shared innovations (synapomorphies in cladistic terminology) for the purpose of subgrouping. When reading this paper thoroughly, however, it becomes obvious that Brugmann himself was much less concerned with the obscure and circular notion of shared innovations (a notion that is also problematic for cladistics in biology; see De Laet 2005) than with the fact that it is often impossible to actually find them, due to our incapacity to disentangle independent development, inheritance, and borrowing.

So far, most linguistic research has concentrated on the problem of distinguishing borrowed from inherited traits, and it is here that the fight over lexicon or grammar as primary evidence for relatedness primarily developed. Since certain aspects of grammar, like case inflection, are rarely transferred from one language to another, while words are easily borrowed, some linguists claim that only grammatical similarities are sufficient evidence of language relationship. This argument is not necessarily productive, since many languages simply lack grammatical structures like inflection, and will therefore not be amenable to any investigation, if we only accept inflectional morphology (grammar) as rigorous proof (for a full discussion, see Dybo and Starostin 2008). Luckily, we do not need to go that far. Aikhenvald (2007: 5) proposes the following borrowability scale:
Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second, right behind inflectional morphology. Pragmatically, we can thus say: if we have nothing but the words, it is better to compare words than anything else. Even more important is that, even when we compare what people label "grammar", we compare concrete form-meaning pairs (e.g., concrete plural endings), and we never compare abstract features (e.g., whether languages have an article). We do so in order to avoid the "homoplasy problem" that causes so many headaches in our research. No biologist would group insects, birds, and bats based on their wings; and no linguist would group Chinese and English based on their lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the problem of similarity is still creating a lot of confusion in the interdisciplinary dialogues involving linguistics and biology. David is right: similarity between linguistic traits is more like similarity in morphological traits in biology (phenotype), but too often, scholars draw the analogy with genes (genotype) (Morrison 2014).

Second, the problem of disentangling different kinds of similarities is not unique to linguistics, but is also present in biology (Gordon and Notar 2015), and comparing the problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a phylogeny?" When dealing with observed similarity patterns across different languages, and recalling that we do not have the luxury to assume monogenesis in language evolution, we might want to know what the null hypothesis for these data should be. I have to admit, however, that I really don't know the answer.

References
  • Aikhenvald, A. (2007): Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, A. and R. Dixon (eds.): Grammars in Contact. Oxford University Press: Oxford. 1-66.
  • Baxter, W. and A. Manaster Ramer (2000): Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, C., A. McMahon, and L. Trask (eds.): Time depth in historical linguistics. McDonald Institute for Archaeological Research: Cambridge. 167-188.
  • Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
  • Everett, C., D. Blasi, and S. Roberts (2015): Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5. 1322-1327.
  • Gordon, M. and J. Notar (2015): Can systems biology help to separate evolutionary analogies (convergent homoplasies) from homologies?. Progress in Biophysics and Molecular Biology 117. 19-29.
  • Hammarström, H. (2016): There is no demonstrable effect of desiccation. Journal of Language Evolution 1.1. 65–69.
  • Jakobson, R. (1960): Why ‘Mama’ and ‘Papa’?. In: Perspectives in psychological theory: Essays in honor of Heinz Werner. 124-134.
  • Kessler, B. (2001): The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications: Stanford.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Lupyan, G. and R. Dale (2010): Language structure is partly determined by social structure. PLoS ONE 5.1. e8559.
  • Meier-Brügger, M. (2002): Indogermanische Sprachwissenschaft. de Gruyter: Berlin and New York.
  • Morrison, D. (2014): Is the Tree of Life the best metaphor, model, or heuristic for phylogenetics?. Systematic Biology 63.4. 628-638.
  • Ringe, D. (1992): On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82.1. 1-110.

Tuesday, June 21, 2016

Alignments and phylogenetic reconstruction in linguistics and biology


In a very interesting article from 2009 (Morrison 2009), David discusses the question of why phylogeneticists would "ignore computerized sequence alignment". This article was really interesting to me for two reasons: First, the article provides some interesting statistics regarding the degree to which biologists manually adjust the alignments that were automatically produced by software. Second, the article points to the seemingly strange situation in biology in which tree-building is considered to be a task that can be entirely carried out by machines, while the majority of scholars would not trust their final sequence alignments to a computer (Morrison 2009: 150).

This situation finds a direct analogue in historical linguistics. Phylogenetic reconstruction is gaining more and more ground, with many scholars applying (mostly Bayesian) phylogenetic tools to analyze their data (Indo-European: Bouckaert et al. 2012, Tupí-Guaraní (South America): Michael et al. 2015, Japonic: Lee and Hasegawa 2011, Pama-Nyungan (Australian): Bowern and Atkinson 2012, Semitic: Kitchen et al. 2009, Bantu: Grollemund et al. 2015, etc.). Fully automated workflows involving automatic sequence comparison are also practiced (Holman et al. 2011, Jäger 2015, Wheeler and Whiteley 2015), but many linguists remain sceptical regarding their results.

One major difference between biology and linguistics is the selection of comparanda. Biological methods usually derive phylogenetic trees from multiply aligned sequences. Linguistic methods derive trees from sets of homologous (cognate) words (cognate sets) distributed across languages, whose evolution is modeled as a process of word gain and word loss (similar to gene-family gain-loss studies in biology). While biologists fiddle with their alignments, linguists fiddle with their cognate sets. Cognate identification is currently done exclusively by hand, and scholars use all kinds of information about word relations that they can get, be it etymological dictionaries, which have been published for more than 200 years, or the intuition of the expert who is annotating the data for cognacy.

Identification of cognate sets in linguistics is essentially a task of sequence comparison (List 2014), and algorithmic as well as manual procedures involve the multiple and the pairwise alignment of words (even if it is done only implicitly by human experts). Compared to biology, sequence comparison in historical linguistics is complicated by two factors:
  • alphabets (phoneme systems) in linguistics are themselves mutable (Geisler and List 2013), so that when aligning two words we need to find both a mapping between the two alphabets, translating one alphabet into the other, and a scoring function with which we can evaluate the alignment;
  • regular sound change (the process by which the phoneme system is changed) and sporadic sound change (the process by which a sound is sporadically assimilated, lost, or added) are not the only processes that contribute to change of words in the lexicon, and morphological change (by which whole blocks of meaningful parts of a word are re-arranged, exchanged, lost, or added) yields patterns that are essentially unalignable.
The problem of finding the correct mapping between two alphabets in linguistics is further exacerbated by language contact: if languages exchange words on a large scale, then this may have a huge impact on the sound systems of the languages, and it may even introduce new sounds to a language that were not there before (thanks to English, German now has the sound [dʒ], as in journalist or job). If borrowing is frequent enough, it may become close to impossible to judge, from comparing the words alone, whether two words in different languages have been transferred directly (vertically) from an ancestral language, or laterally.
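To make the first complication more tangible, here is a minimal Needleman-Wunsch sketch in which the scoring function works on crude, hand-made sound classes rather than on a fixed shared alphabet. The classes and scores are invented for illustration; real systems (for example the SCA approach implemented in LingPy) use much more carefully designed scoring schemes:

```python
# Toy pairwise alignment of two words with Needleman-Wunsch, scored by
# hand-made sound classes (S = sibilant, V = vowel, L = liquid/glide).
SOUND_CLASS = {"s": "S", "o": "V", "u": "V", "e": "V", "l": "L", "w": "L"}

def score(a, b):
    if a == b:
        return 2                                  # identical segments
    if SOUND_CLASS.get(a) == SOUND_CLASS.get(b):
        return 1                                  # same sound class
    return -1                                     # mismatch

def align(x, y, gap=-1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]     # dynamic-programming matrix
    for i in range(1, n + 1): D[i][0] = i * gap
    for j in range(1, m + 1): D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i-1][j-1] + score(x[i-1], y[j-1]),
                          D[i-1][j] + gap, D[i][j-1] + gap)
    ax, ay, i, j = [], [], n, m                   # traceback
    while i > 0 or j > 0:
        if i and j and D[i][j] == D[i-1][j-1] + score(x[i-1], y[j-1]):
            ax.append(x[i-1]); ay.append(y[j-1]); i -= 1; j -= 1
        elif i and D[i][j] == D[i-1][j] + gap:
            ax.append(x[i-1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j-1]); j -= 1
    return ax[::-1], ay[::-1]

# Italian sole vs Swedish sol aligns as: s o l e / s u l -
for row in align(["s", "o", "l", "e"], ["s", "u", "l"]):
    print(" ".join(row))
```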

As a result, it is probably understandable why linguists often refuse to carry out full alignments of the words in their data. An alignment itself does not necessarily tell us much, compared to all of those processes that an expert infers when comparing language data, which are not alignable.

As an example, let us consider the word for "sun" in six Indo-European languages. Since "sun" is a very basic concept, probably fundamental for all human cultures, experts assume that this word was present as *séh₂u̯el- in Indo-European (an asterisk indicates that the word is not reflected in written sources), and that it was retained as Russian солнце [sɔnʦə], Polish słońce [swɔnjʦɛ], French soleil [sɔlɛj], Italian sole [sole], German Sonne [sɔnə], and Swedish sol [suːl] (Wodtko et al. 2008). An obvious alignment, reflecting the surface similarity between all of these words, would be the following one (taken from List 2014: 135):

Alignment based on sequence similarity.

This alignment, however, is by no means correct. Russian [sɔnʦə] and Polish [swɔnʲʦɛ], for example, share a common suffix, which is reflected as [nʦə] in Russian and as [nʲʦɛ] in Polish, and which was innovated in the common ancestor of Russian and Polish, but is not present in any of the four other languages. So the [n] in German [sɔnə] is essentially not homologous with the [n] in Russian or the [nʲ] in Polish. The same applies to the [ɛj] in French [sɔlɛj], which reflects a diminutive suffix in Latin sol-iculus "small sun", the regular ancestor form of French soleil. Furthermore, the [w] in the Polish word regularly corresponds to the [l] in French, Italian, and Swedish, but it reflects a swap (metathesis) in the order of the vowel and the consonant in Polish ([sɔl] became [slɔ], which became [swɔ]).

Taking all (and more) of this into account, we need to modify our alignment to account more closely for the processes that experts have inferred from intensive language comparison, as shown in the next figure below (taken from List 2014: 135). In this alignment, the swap in Polish is reflected by the white font of the sounds involved, and gray-shaded columns are supposed to reflect the oldest layer of homology.

Historically informed alignment.

However, even this alignment is essentially misleading. The Indo-European word for "sun" supposedly had a complex paradigm, in which the word's stem alternated between the nominative (and accusative) case and the other cases (the oblique cases). Thus, the nominative and accusative used the stem *sóh₂u̯el-, while the other cases used the stem *sh₂én-. The Russian, Polish, French, Italian, and Swedish forms go back to the former, while the German form goes back to the latter, since it is further assumed (or it can be assumed) that the alternation was still preserved in the ancestor of Swedish and German.

This means, however, that our alignment above shrinks to an alignment in which only the first letter, the s, is still reflected in all languages! The following graphic (taken from List 2016) illustrates the processes that led to the current situation for four of our six languages:

Morphological processes of lexical change.

What does this example tell us? On the one hand, it gives some explanation for why linguists do not really want to align words (although the first alignments go back to the early 20th century, cf. Dixon and Kroeber 1919). It also explains why classical linguists have a very sceptical attitude towards the computerization of word comparisons, based on the (partially justified) assumption that computers cannot handle the complex patterns that are so characteristic of language change.

On the other hand, comparing the situation with biology as reported in Morrison (2009), we can find an interesting parallel between the two disciplines: both linguists and biologists do not really trust machines for comparing their sequences (albeit at different levels of analysis), but they do not seem to have many problems in trusting machines to reconstruct their trees.

However, this last point especially, the fact that we trust machines to grow our trees while we distrust them to prepare the seeds, should ring an alarm bell. First, we seem to lack clear guidelines (at least in linguistics) regarding the way the manual adjustment (of alignments in biology and cognate sets in linguistics) should be carried out, which has a clear impact on repeatability. Second, if we have processes in both fields that yield essentially unalignable patterns, such as duplications and other molecular processes in biology (Morrison 2009: 156), and morphological processes in linguistics, how can we assume that a phylogenetic tree analysis can sufficiently cope with them, even if we manually adjust everything?

References
  • Bouckaert, R., P. Lemey, M. Dunn, S. Greenhill, A. Alekseyenko, A. Drummond, R. Gray, M. Suchard, and Q. Atkinson (2012): Mapping the origins and expansion of the Indo-European language family. Science 337.6097. 957-960.
  • Bowern, C. and Q. Atkinson (2012): Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88. 817-845.
  • Dixon, R. and A. Kroeber (1919): Linguistic families of California. University of California Press: Berkeley.
  • Geisler, H. and J.-M. List (2013): Do languages grow on trees? The tree metaphor in the history of linguistics. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 111-124.
  • Grollemund, R., S. Branford, K. Bostoen, A. Meade, C. Venditti, and M. Pagel (2015): Bantu expansion shows that habitat alters the route and pace of human dispersals. Proceedings of the National Academy of Sciences 112.43. 13296–13301.
  • Holman, E., C. Brown, S. Wichmann, A. Müller, V. Velupillai, H. Hammarström, S. Sauppe, H. Jung, D. Bakker, P. Brown, O. Belyaev, M. Urban, R. Mailhammer, J.-M. List, and D. Egorov (2011): Automated dating of the world’s language families based on lexical similarity. Curr. Anthropol. 52.6. 841-875.
  • Jäger, G. (2015): Support for linguistic macrofamilies from weighted alignment. Proceedings of the National Academy of Sciences 112.41. 12752–12757.
  • Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
  • Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • List, J.-M. (2016): Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1. DOI: 10.1093/jole/lzw006.
  • Michael, L., N. Chousou-Polydouri, K. Bartolomei, E. Donnelly, V. Wauters, S. Meira, and Z. O’Hagan (2015): A Bayesian phylogenetic classification of Tupí-Guaraní. LIAMES 15.2. 193-221.
  • Morrison, D. (2009): Why would phylogeneticists ignore computerized sequence alignment? Syst. Biol. 58.1. 150-158.
  • Wheeler, W. and P. Whiteley (2015): Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2. 113-125.
  • Wodtko, D., B. Irslinger, and C. Schneider (2008): Nomina im Indogermanischen Lexikon [Nouns in the Indo-European lexicon]. Winter: Heidelberg.

Tuesday, May 17, 2016

Machine learning, the Go-game, and language evolution


I am not a hard-core science fiction fan. I have not even watched the new Star Wars movie yet. But I am quite interested in all kinds of issues involving artificial intelligence, duels between humans and machines, and also the ethical implications as they are discussed, for example, in the old Blade Runner movie. It is therefore no wonder that my interest was caught by the recent Go-Game human-machine challenge.

Silver et al. (2016) reported on a new Go program, called AlphaGo, that defeated other Go programs at a rate of 99.8%, and finally also defeated the European Go champion, Fan Hui, by 5 games to 0. They proudly report in their paper (p. 488):
This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go — a feat that was previously believed to be at least a decade away.
The secret of the success of the new Go program seems to lie in a smart workflow by which the neural networks of the program were trained. As a result, the program could afford to calculate "thousands of times fewer positions than Deep Blue did in its chess match against Kasparov" (Silver et al. 2016: 489).

I should say that I was never really interested in the game of Go before. My father played it once in a while when I was a child, but I never understood what one actually needs to do. From the articles in the media in which this fight between man and machine was reported, I learned, however, that Go was apparently considered to be much more challenging than chess, due to its larger number of positions and moves, and that nobody expected that the time was already ripe for machines to beat humans at this task.

When reading the article and reflecting on it, I wondered how complicated the task of finding homologous words in linguistic datasets might be, compared to the game of Go. I know quite a few colleagues who consider this task impossible to model; and I know that they have not only good reasons, but also a lot of experience in language comparison, so they would not say this without having given it some serious thought. But if it is impossible for computer programs to compete with humans in language comparison, does this mean that Go is a less challenging task?

On the other hand, I also know quite a few colleagues who consider automatic data-driven approaches in historical linguistics to be generally superior to the classical manual workflow of the comparative method (Meillet 1925). In fact, the algorithms for cognate detection that I developed during my PhD (List 2014) are often criticized as lacking a stochastic or machine-learning component, since they are based on a rather explicit attempt to model how historical linguists compare languages.

Among many classically oriented linguists there is a strong mistrust regarding all kinds of automated approaches in historical linguistics, while among many computationally oriented linguists and linguistically oriented computer scientists there is a strong belief that enough data will sooner or later solve the problems, and that all explicit frameworks with hard-coded parameters are inferior to data-driven frameworks. While classical linguists usually emphasize that the processes are just too complex to be modeled with simple approaches as they are used by computational linguists, the computational camp usually emphasizes the importance of letting "the data decide", or that "the data is robust enough to find signal even with simple models".

Given the success of AlphaGo, one could argue that the computational camp might be right, and that it will be just a matter of time until manual language comparison can be done in a fully automated manner. Our current situation in historical linguistics is somewhat similar to the situation in evolutionary biology during the 1960s and 1970s, when quantitative scholars prophesied (incorrectly, so far) that most classical taxonomists would soon be replaced by computers (Hull 1988: 121f).

However, since we are scientists, we should be really careful with any kind of orthodoxy, and I consider as problematic both the blind trust in machine learning techniques as well as the blind trust in the superiority of human experts over quantitative analyses. The problem with human experts is that they are necessarily less consistent and efficient than machines when it comes to tasks like counting and repeating. Given the increasing amount of digitally available data in historical linguistics, we simply lack the human resources to pursue classical research without trying to automatize at least parts of it.

The problem with computational approaches, and especially machine-learning techniques, however, is that they only provide us with the result of our analysis, not with an explanation that would tell us why the result was preferred over alternative possibilities. Apparently, Go players now have this problem with AlphaGo: in many cases they do not know why the program made a certain move; they only know that it turned out to be successful. This black-box aspect of many computational approaches does not necessarily constitute a problem in practical applications: when designing an application for automatic speech recognition, the users won't care how the application recognizes speech as long as it understands their demands and acts accordingly. In science, however, it is not just the results that matter, but the explanation.

This is especially important in the historical sciences, where we investigate what happened in the past, and we constantly revise our knowledge about the past events by adjusting our theories and our interpretation of the evidence. If a machine tells me that two words in different languages are homologous, it is not the statement which is interesting but the explanation. Without the explanation, the statement itself is worthless. Since we are dealing with statements about the past, we can never really prove any statement that has been made. But what we can do is investigate explanations and compare the evolution of explanations in the past, thereby selecting those explanations that we prefer, perhaps because they are more probable, more general, or less complicated. A black-box method for word homology prediction would only make sense if we could evaluate the prediction — but if we could evaluate the prediction, we would not need the black-box method any more.

This does not mean that black-box methods are generally useless. A well-trained homology prediction machine could still speed up the process of data annotation, or assist linguists by providing them with initial hints regarding remotely related language families. But as long as black-box methods remain black boxes, they won't be able to replace the only ones who could still interpret what they produce.

References
  • Hull, D. (1988): Science as a Process - An Evolutionary Account of the Social and Conceptual Development of Science. The University of Chicago Press: Chicago.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Meillet, A. (1925 [1954]): La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.
  • Silver, D., A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016): Mastering the game of Go with deep neural networks and tree search. Nature 529.7587. 484-489.

Wednesday, December 9, 2015

Lexicostatistics: the predecessor of phylogenetic analyses in historical linguistics


Phylogenetic approaches in historical linguistics are extremely common nowadays. Probabilistic models that treat lexical change as a birth-death process of cognate sets evolving along a phylogenetic tree (Pagel 2009) are especially popular (Lee and Hasegawa 2011, Kitchen et al. 2009, Bowern and Atkinson 2012), but splits networks are also frequently used (Ben Hamed 2005, Heggarty et al. 2010).

However, the standard procedure for producing a family tree or network with phylogenetic software in linguistics goes back to the method of lexicostatistics, which was developed in the 1950s by Morris Swadesh (1909-1967) in a series of papers (Swadesh 1950, 1952, 1955). Lexicostatistics was discarded by the linguistic community not long after it was proposed (Hoijer 1956, Bergsland and Vogt 1962). Since then, lexicostatistics has been considered a methodus non grata in classical circles of historical linguistics, and using it openly may drastically downgrade one's perceived credibility in certain parts of the community.

To avoid the conflicts, most linguists practicing modern phylogenetic approaches emphasize the fundamental differences between early lexicostatistics and modern phylogenetics. These differences, however, apply only to the way the data is analysed. The basic assumptions underlying the selection and preparation of data have not changed since the 1950s, and it is important to keep this in mind, especially when searching for appropriate phylogenetic models to analyse the data.

The Theory of Basic Vocabulary

Swadesh's basic idea was that in the lexicon of every human language there are words that are culturally neutral and functionally universal; he used the term "basic vocabulary" to refer to these words. Culturally neutral means that the meanings expressed by the words are used independently across different cultures. Functional universality means that the meanings are expressed by all human languages, independent of the time and place where they are spoken. The idea is that these meanings are so important for the functioning of a language as a tool of communication that every language needs to express them.

Cultural neutrality and functional universality guarantee two important aspects of basic words: their stability and their resistance to borrowing. Stability means that words expressing a basic concept are less likely to change their meaning or to be replaced by another word. An argument for this claim is the functional importance of the words — if the words are important for the functioning of a language, it would not make much sense to change them too quickly. Humans are good at changing the meanings of words, as we can see from daily conversations in the media, where new words tend to pop up seemingly on a daily basis, and old words often drastically change their meanings. But changing words that express basic meanings like "head", "stone", "foot", or "mountain" too often might give rise to confusion in communication. As a result, one can assume that words change at a different pace, depending on the meaning they express, and this is one of the core claims of lexicostatistics.

Resistance to borrowing also follows from stability, since the replacement of words expressing basic meanings may again have an impact on our daily communication, and we may thus assume that speakers avoid borrowing these words too quickly. Cultural neutrality of concepts is another important point guaranteeing resistance to borrowing. Words expressing concepts which play an important cultural role may easily be transferred from one language to another along with the culture. Thus, although it seems likely that every language has a word for "god" or "spirit" and the like (so the concept is to a certain degree functionally universal), the lack of cultural independence makes words expressing religious terms very likely candidates for borrowing, and it is probably no coincidence that words expressing religion and belief rank first in the scale of borrowability (Tadmor 2009: 232).

Lexical Replacement, Data Preparation, and Divergence Time Estimation

Swadesh had further ideas regarding the importance of basic vocabulary. He assumed that the process of lexical replacement follows universal rates as far as the basic vocabulary is concerned, and that this would allow us to date the divergence of languages, provided we are able to identify the shared cognates. In lexical replacement, a word w₁ expressing a given meaning x in a language is replaced by a word w₂ which then expresses the meaning x, while w₁ either shifts to express another meaning, or completely disappears from the language. For example, the older English thou was replaced by the plural form you, which now also expresses the singular. In order to search for cognates and determine the time when two languages diverged, Swadesh proposed a straightforward procedure, consisting of very concrete steps (compare Dyen et al. 1992):
  • compile a list of basic concepts (concepts that you think are culturally neutral and functionally universal; see here for a comparative collection of different lists that have been proposed and used in the past);
  • translate these concepts into the different languages you want to analyse;
  • search for cognates between the languages in each meaning slot; if words in two languages are not cognate for a given meaning, then this points to former processes of lexical replacement in at least one of the languages since their divergence;
  • count the number of shared cognates, and use some mathematics to calculate the divergence time (which has been independently calibrated using some test cases of known divergence times); a sketch of this last step is given below.
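The "some mathematics" of the last step is Swadesh's glottochronological formula t = log(c) / (2 log(r)), with c the proportion of shared cognates and r the calibrated retention rate per millennium (around 0.805 for the 200-item list). A minimal sketch, with invented counts:

```python
# Swadesh's glottochronological formula: t = log(c) / (2 * log(r)).
from math import log

def divergence_time(shared, total, r=0.805):
    """Estimated separation time in millennia; r is the assumed
    retention rate per millennium (historically calibrated)."""
    c = shared / total
    return log(c) / (2 * log(r))

# e.g. two languages sharing 140 cognates in a 200-item list
print(f"{divergence_time(140, 200):.2f} millennia")   # about 0.8
```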
As an example of such a wordlist with cognate judgments, compare the table in the first figure, where I have entered just a few basic concepts from Swadesh's standard concept list and translated them into four languages. Cognacy is assigned with the help of IDs in the column at the right of each language column, and is also further highlighted with different colors.

Classical cognate coding in lexicostatistics

Phylogenetic Approaches in Historical Linguistics

Modern phylogenetic approaches in historical linguistics basically follow the same workflow that Swadesh propagated for lexicostatistics, the only difference being the last step of the working procedure. Instead of Swadesh's formula, which compared lexical replacement with radioactive decay and was at its core based on aggregated distances, character-based methods are used to infer phylogenetic trees. Characters are retrieved from the data by extracting each cognate set from a lexicostatistical wordlist and annotating the presence or absence of each cognate set in each language.

Thus, while Swadesh's lexicostatistical data model would state that the words for "hand" in German and English are cognate, and likewise those in Italian and French, but that the Germanic and Romance words are not cognate with each other, the binary presence-absence coding states that the cognate set formed by words like English hand and German Hand is not present in the Romance languages, and that the cognate set formed by words like Italian mano and French main is absent from the Germanic languages. This is illustrated in the table below, where the same IDs and colors are used to mark the cognate sets as in the table shown above.

Presence-absence cognate coding for modern phylogenetic analyses
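The conversion itself is mechanical. A toy version, mirroring the invented "hand"/"tooth" coding of the tables rather than any real dataset, can be sketched as follows:

```python
# From meaning-slot cognate IDs to a binary presence-absence matrix.
coding = {                      # (language, concept) -> cognate-set ID
    ("German", "hand"): 1, ("English", "hand"): 1,
    ("Italian", "hand"): 2, ("French", "hand"): 2,
    ("German", "tooth"): 3, ("English", "tooth"): 3,
    ("Italian", "tooth"): 4, ("French", "tooth"): 4,
}

languages = sorted({lang for lang, _ in coding})
cognate_sets = sorted(set(coding.values()))

matrix = {}
for lang in languages:
    ids = {cid for (l, _), cid in coding.items() if l == lang}
    matrix[lang] = [int(cs in ids) for cs in cognate_sets]   # 1 = present

for lang, row in matrix.items():
    print(f"{lang:8s}", row)   # e.g. English [1, 0, 1, 0]
```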

The new way of cognate coding along with the use of phylogenetic software methods has brought, without doubt, many improvements compared to Swadesh's idea of dating divergence times by counting percentages of shared cognates. A couple of problems, however, remain, and one should not forget them when applying computational methods to originally lexicostatistic datasets.

First, we could ask whether the main assumptions of functional universality and cultural neutrality really hold. It seems to be true that words can be remarkably stable throughout the history of a language family. It is, however, also true that the most stable words are not necessarily the same across all language families. Ever since Swadesh established the idea of basic vocabulary, scholars have tried to improve the list of basic vocabulary items. Swadesh himself started from a list of 215 concepts (Swadesh 1950), which he then reduced to 200 concepts (1952) and later to 100 concepts (1955). Other scholars went further, like Dolgopolsky (1964 [1986]), who reduced the list to 16 concepts. The Concepticon is a resource that links many of the concept lists that have been proposed in the past. When comparing these lists, which all represent what some scholars would label "basic vocabulary items", it becomes obvious that the number of items that all scholars agree upon shrinks drastically, while the number of concepts that have been claimed to be basic increases.

An even greater problem than the question of the universality and neutrality of basic vocabulary, however, is the underlying model of cognacy in combination with the proposed process of change. Swadesh's model of cognacy controls for meaning. While this model of cognacy is consistent with Swadesh's idea of lexical replacement as the basic process of lexical change, it is by no means consistent with the birth-death models of cognate gain and cognate loss that are now applied to lexicostatistical data. In biology, birth-death models are usually used to model the evolution of homologous gene families distributed across whole genomes. If we take the traditional view, according to which words can be cognate regardless of meaning, the analogy holds, and birth-death processes seem adequate for analyzing datasets that are based on such root cognates (Starostin 1989) or etymological cognates (Starostin 2013). But if we control for meaning in the cognate judgments, we do not necessarily capture processes of gain and loss in our data. Instead, we capture processes in which the links between word forms and concepts are shifted, and we investigate these shifts through the very narrow "windows" of pre-defined slots of basic concepts, as I have tried to depict in the following graphic.

Looking at lexical replacement through the small windows of basic vocabulary

Conclusion

As David has mentioned before, we do not necessarily need realistic models in phylogenetic research to infer meaningful processes. The same can probably be said about the discrepancy between our lexicostatistical datasets (Swadesh's heritage, which we keep using for practical reasons) and the birth-death models we now use to analyze the data. Nevertheless, I cannot avoid an uncomfortable feeling when thinking that an algorithm is modeling the gain and loss of characters in a dataset that was not produced for this purpose. In order to model traditional lexicostatistical data consistently, we would either (i) need explicit multistate models in which each concept is a character and the forms represent the states (Ringe et al. 2002, Ben Hamed and Wang 2006), as sketched below, or (ii) turn directly to "root-cognate" methods. The latter have been discussed for some time now (Starostin 1989, Holm 2000), but there is only one recent approach, by Michael et al. (forthcoming), in which this is consistently tested.
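
As a toy contrast (with the same invented data as above), the binary presence-absence coding used today and the multistate coding of option (i) would encode the "hand" example as follows:

```python
languages = ["German", "English", "Italian", "French"]

# Binary presence-absence coding: one character per cognate set.
binary = {"hand-1": [1, 1, 0, 0], "hand-2": [0, 0, 1, 1]}

# Multistate coding (option (i) above): one character per concept,
# with the cognate set expressing it in each language as the state.
multistate = {"hand": [1, 1, 2, 2]}
```
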

References
  • Bergsland, K. and H. Vogt (1962): On the validity of glottochronology. Curr. Anthropol. 3.2. 115-153.
  • Bowern, C. and Q. Atkinson (2012): Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88. 817-845.
  • Dolgopolsky, A. (1964): Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2. 53-63.
  • Dyen, I., J. Kruskal, and P. Black (1992): An Indoeuropean classification. A lexicostatistical experiment. T. Am. Philos. Soc. 82.5. iii-132.
  • Ben Hamed, M. and F. Wang (2006): Stuck in the forest: Trees, networks and Chinese dialects. Diachronica 23. 29-60.
  • Hoijer, H. (1956): Lexicostatistics. A critique. Language 32.1. 49-60.
  • Holm, H. (2000): Genealogy of the main Indo-European branches applying the separation base method. J. Quant. Linguist. 7.2. 73-95.
  • Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
  • Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
  • Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews Genetics 10. 405-415.
  • Ringe, D., T. Warnow, and A. Taylor (2002): Indo-European and computational cladistics. T. Philol. Soc. 100.1. 59-129.
  • Starostin, S. (1989): Sravnitel'no-istoričeskoe jazykoznanie i leksikostatistika [Comparative-historical linguistics and lexicostatistics]. In: Kullanda, S., J. Longinov, A. Militarev, E. Nosenko, and V. Shnirel'man (eds.): Materialy k diskussijam na konferencii [Materials for the discussion on the conference]. 1. Institut Vostokovedenija: Moscow. 3-39.
  • Starostin, G. (2013): Lexicostatistics as a basis for language classification. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 125-146.
  • Swadesh, M. (1950): Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
  • Swadesh, M. (1952): Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philos. Soc. 96.4. 452-463.
  • Swadesh, M. (1955): Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.
  • Tadmor, U. (2009): Loanwords in the world’s languages. Findings and results. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world's languages. de Gruyter: Berlin and New York. 55-75.

Wednesday, August 12, 2015

The complexity of lexical change


Most computational approaches to historical linguistics, be they those producing networks or those producing trees, make use of lexical data. There are several reasons for this preference. Lexical data are much easier to handle than abstract grammatical data, and many linguists also think that lexical data are more representative of language evolution in general, and thus offer a much better starting point for inferences. Whether one likes the preference for lexical data or not, it seems worthwhile in this context to reflect a bit more on the nature of lexical data and the complexities of lexical change. This may help us to get a clearer picture of the differences between language history and biological evolution.

What Makes a Word?

In a very simple language model, the lexicon of a language can be seen as a bag of words. A word, furthermore, is traditionally defined by two aspects: its form and its meaning. Thus, the French word arbre can be defined by its written form arbre or its phonetic form [ɑʁbʁə], and its meaning "tree". This is reflected in the famous sign model of Ferdinand de Saussure (Saussure 1916), which I have reproduced in [A] in the graphic below. In order to emphasize the importance of the two aspects, linguists often say that the form and meaning of a word are like two sides of the same coin (see [B] in the graphic below). But we should not forget that a word is only a word if it belongs to a certain language! From the perspective of German or English, for example, the sound chain [ɑʁbʁə] is simply meaningless. So, instead of two major aspects of a word, we should rather talk of three major aspects: form, meaning, and language. As a result, our bilateral sign model becomes a trilateral one, as I have tried to illustrate in [C] in the graphic below.


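As a small data-structure sketch of this trilateral model (my own illustration; the class and field names are invented), a word can be represented as a triple of form, meaning, and language:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    form: str      # e.g. a phonetic form such as "[ɑʁbʁə]"
    meaning: str   # the concept the word denotes
    language: str  # the language the word belongs to

arbre = Word(form="[ɑʁbʁə]", meaning="tree", language="French")
print(arbre)
```
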
What is Lexical Change?

If there were no lexical change, the lexicon of a language would remain stable at all times. Words might change their forms by means of regular sound change, but there would always be an unbroken tradition of identical patterns of denotation. Since this is not the case, the lexicon of every language is constantly changing. Words are lost when speakers cease to use them, and new words enter the lexicon when new concepts arise, be they borrowed from other languages or created from native material via various morphological processes. Such processes of word loss and word gain are quite frequent, and can sometimes even be observed directly by the speakers of a language, when they compare their own speech with that of an older or a younger generation.

An even more important process of lexical change, especially in quantitative historical linguistics, is lexical replacement. Lexical replacement refers to the process by which a given word A, which is commonly used to express a certain meaning x, ceases to express this meaning, while at the same time another word B, which was formerly used to express a meaning y, comes to express the meaning x. The notion of lexical replacement is thus nothing other than a shift in perspective on semantic change (one major dimension of lexical change, see below). While semantic change is usually described from a semasiological perspective, i.e. from the perspective of the form, lexical replacement describes semantic change from an onomasiological perspective, i.e. from the perspective of the meaning.
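
The two perspectives on the same replacement event can be sketched as follows (A, B, x and y are the abstract placeholders from the paragraph above):

```python
# Onomasiological view (meaning -> form): the slot for x is re-linked.
before = {"x": "A", "y": "B"}
after = {"x": "B", "y": "B"}   # A ceased to express x; B took over

# Semasiological view of the same event (form -> meanings):
#   A: {"x"} -> set()          A lost the meaning x
#   B: {"y"} -> {"x", "y"}     B gained the meaning x
```
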

Three Dimensions of Lexical Change

Gévaudan (2007) distinguishes three dimensions of lexical change: the morphological dimension, the semantic dimension, and the stratic dimension. The morphological dimension covers changes in the outer form of words that are not due to regular sound change. As an example of this type of change, consider English birth and its ancestral form, Proto-Germanic *ga-burdi "birth" — while the meaning of the word did not change (or changed only slightly), the English word has lost the prefix ga-. This prefix is still present in German Geburt "birth", but it was lost without leaving a trace in English.

The loss of prefixes is not the only way in which the forms of words can change during language evolution. We also find that prefixes or suffixes are added, as, for example, in French soleil "sun", which goes back to Latin soliculus "small sun", itself a derivation of Latin sol "sun". The semantic dimension is illustrated by changes like the one from Proto-Germanic *sælig "happy" to English silly.

The stratic dimension refers to changes involving the exchange of words between languages, that is, processes of borrowing, in which a word is transferred from one stratum of a language to another. An example of this type of change is English mountain, which was borrowed from Old French montaigne "mountain".

Note that these three dimensions of lexical change correspond directly to the three major aspects constituting a linguistic sign (or word) that I mentioned above: the morphological dimension changes the form of a word, the semantic dimension changes its meaning, and the stratic dimension changes its language. Thus, the three dimensions of lexical change proposed by Gévaudan (2007) find their direct reflection in the major dimensions along which words can vary, as the sketch after this paragraph tries to show.
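
As a minimal sketch (my own encoding; the field names are invented, and the +/- notation anticipates the table discussed later in this post), a single change event can be tagged for continuity or change in each dimension:

```python
# A word as (form, meaning, language); one change event is tagged for
# continuity (+) or change (-) in each of Gévaudan's three dimensions.
event = {
    "source": ("*ga-burdi", "birth", "Proto-Germanic"),
    "target": ("birth", "birth", "English"),
    "morphological": "-",  # the prefix ga- was lost
    "semantic": "+",       # the meaning stayed (essentially) the same
    "stratic": "+",        # inherited, not borrowed
}
```
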


During language evolution, lexical change processes interact in all three dimensions, and yield complex patterns that may be very hard for historical linguists to uncover. As an example of this complexity, consider the development of Proto-Indo-European *bʰreu̯Hg̑- "to use", as depicted in the graphic below, which was originally designed by Hans Geisler (Heinrich-Heine University, Düsseldorf), who kindly allowed me to reproduce it here. In the graphic, changes in the stratic dimension are illustrated with the help of dotted arcs (labelled "borrowed from" in the legend), and changes in the morphological dimension are indicated by double arcs (labelled "derived from"). The semantic dimension is not specifically labelled as such, but one can easily detect it by comparing the meanings of the words.


Modeling Lexical Change

If we look at different historical relations from the perspective of the three dimensions of lexical change, it becomes obvious that the terminology we use in linguistics is rather fuzzy. I mentioned this in an earlier post, where I pointed to the different shades of cognacy, which have never really been settled in a satisfying way in historical linguistics. Looking at them again from the perspective of the three dimensions makes it much easier to see where these different historical relations between words come from.

If we investigate the different uses of the term "cognacy", for example, it becomes obvious that the differences result from controlling for one or more of the three dimensions of lexical change. The traditional Indo-Europeanist notion of cognacy, for example, controls the stratic dimension by requiring stratic continuity (no borrowing), but is indifferent regarding the other two dimensions. Cognacy à la Swadesh (especially Swadesh 1955), as we know it from the popular computational approaches that model lexical change as a process of cognate loss and gain, is indifferent regarding morphological continuity, but controls the semantic and stratic dimensions by considering only words that have the same meaning and have not been borrowed (at least in theory).

In the table below, I have attempted to illustrate the ways in which the different terms, including the biological terms homology, orthology, paralogy, and xenology, cover processes by each controlling for one or more of the three dimensions of lexical change (with "+" indicating that continuity is required, "-" indicating that change is required, and "+/-" indicating indifference). Contrasting the different dimensions of lexical change with the terminology used to refer to different relations between words shows not only the arbitrariness of the traditional linguistic terminology (why do we only cover two out of 3 × 3 × 3 = 27 different possible types? why do we only control by requiring continuity, never change? etc.), but also the fundamental difference between biological and linguistic terminology. The combinatorics are sketched below the table.


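As a quick combinatorial check of the 27 types mentioned above (a sketch; the dimension order and the two named entries follow the text, everything else is invented):

```python
from itertools import product

# Each relation between two words either requires continuity (+),
# requires change (-), or is indifferent (+/-) in each of the three
# dimensions (morphological, semantic, stratic).
values = ("+", "-", "+/-")
relations = list(product(values, repeat=3))
print(len(relations))  # 27 possible relation types

# The two types that traditional linguistic terminology actually names:
named = {
    ("+/-", "+/-", "+"): "cognacy (Indo-Europeanist)",
    ("+/-", "+", "+"): "cognacy (à la Swadesh)",
}
```
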
Concluding Remarks

So far, all computational methods that have been proposed for historical linguistics are based on the strict Swadesh-style wordlist encoding, which in the end controls for the semantic and stratic dimensions of lexical change while being indifferent regarding morphology. Such an encoding is per se inconsistent, since there is no reason to assume that morphological change would be less frequent, or less indicative of language history, than the other types.

Linguists tend to control for meaning when creating their datasets mostly because of problems of sampling: it is much easier to draw a set of words from a number of languages by starting from a given set of meanings. However, it may be useful to relax this criterion, since restricted sets of only about 200 meanings, on average, necessarily hide vivid and interesting processes of lexical change.

The reasons why linguists control for borrowing are largely historical, and such control is in many cases not even feasible, since our evidence for borrowing may be limited, especially where the majority of speakers are bilingual (which is more often the rule than the exception among the languages of the world). It seems much more fruitful to revive our network thinking in linguistics, and to invest in the development of both high-quality datasets with a less arbitrary exclusion of certain dimensions of lexical change, and transparent computational methods that do not stick exclusively to the tree model.

References

  • Gévaudan, P. (2007) Typologie des lexikalischen Wandels [Typology of lexical change]. Tübingen: Stauffenburg.
  • Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics. Vol. 21(2), pp. 121-137.
  • Saussure, F. de (1916) Cours de linguistique générale [Course on general linguistics]. Lausanne: Payot.

Wednesday, May 13, 2015

Homology and cognacy: fundamental historical relations between words


This is a guest blog post, following on from his previous post, by:

Johann-Mattis List

Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

Introduction

All languages change constantly. Words are lost when speakers cease to use them, new words are gained when new concepts evolve, and even the pronunciation of words changes slightly over time. Slight modifications that can barely be noticed during a person's lifetime add up to great changes in the system of a language over centuries. When the speakers of a language diverge, their speech keeps changing independently in the two communities, and at a certain point in time the independent changes are so great that the speakers can no longer communicate with each other — what was one language has become two.

Demonstrating that two languages once were one is one of the major tasks of historical linguistics. If no written documents of the ancestral language exist, one has to rely on specific techniques for linguistic reconstruction (see the examples in this previous post). These techniques require us to first identify those words in the descendant languages that presumably go back to a common word form in the ancestral language. In identifying these words, we infer historical relations between them. The most fundamental historical relation between words is the relation of common descent. However, similarly to evolutionary biology, where homology can be further subdivided into the more specific relations of orthology, paralogy, and xenology, more specific fundamental historical relations between words can be defined for historical linguistics, depending on the underlying evolutionary scenario.

Homology and Cognacy in Linguistics and Biology

In evolutionary biology, there is a rather rich terminological framework describing the fundamental historical relations between genes and morphological characters, and discussions regarding the epistemological and ontological aspects of these relations are still ongoing (see the overview in Koonin 2005, but also this recent post by David). Linguists, in contrast, have rarely addressed these questions directly. Rather, they have assumed that the fundamental historical relations between words are more or less self-evident, with only a few counter-examples, which have been largely ignored in the literature (Arapov and Xerc 1974; Holzer 1996; Katičić 1966). As a result, our traditional terminology for describing the fundamental historical relations between words is very imprecise, and it often leads to confusion, especially when it comes to computational applications based on software originally developed for evolutionary biology.

As an example, consider the fundamental concept of homology in evolutionary biology. According to Koonin (2005: 311), it "designates a relationship of common descent between any entities, without further specification of the evolutionary scenario". The terms orthology, paralogy, and xenology are used to address more specific relations. Orthology refers to "genes related via speciation" (Koonin 2005: 311); that is, genes related via direct descent. Paralogy refers to "genes related via duplication" (ibid.); that is, genes related via indirect descent. Xenology, a notion which was introduced by Gray and Fitch (1983), refers to genes "whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters" (Fitch 2000: 229); i.e. to genes related via descent involving lateral transfer.

In historical linguistics, the only relation that is explicitly defined is cognacy (also called cognation). Cognacy usually refers to words related via "descent from a common ancestor" (Trask 2000: 63), and it is strictly distinguished from descent involving lateral transfer (borrowing). The term cognacy itself, however, covers both direct and indirect descent. Hence, traditionally, German Zahn 'tooth' is cognate with English tooth, German selig 'blessed' with English silly, and German Geburt 'birth' with English birth, although the historical processes that shaped the present appearance of these three word pairs are quite different. Apart from regular changes in sound shape, Zahn and tooth have developed directly from Proto-Germanic *tanθ-; selig and silly both go back to Proto-Germanic *sæli- 'happy', but the meaning of the English word has changed greatly; and Geburt and birth stem from Proto-Germanic *ga-burdi-, but the English word has lost the prefix as a result of specific morphological processes during the development of the English language (all examples follow Kluge and Seebold 2002, with modifications to the pronunciation of Proto-Germanic). Thus, of the three examples of cognate words, only the first qualifies as having evolved by direct inheritance, while the inheritance of the latter two could be labelled as indirect, involving processes that are largely language-specific and irregular, such as meaning shift and morpheme loss. Trask (2000: 234) suggests the term oblique cognacy for these cases of indirect inheritance, but this term is rarely used: at least in the mainstream literature of historical linguistics, I could not find a single instance where it is employed (apart from the passage by Trask himself).


In the table above (with modifications, taken from List 2014: 39), I have tried to contrast the terminology used in evolutionary biology and historical linguistics, by comparing the degree to which they reflect the fundamental historical relations between words or genes. Here, common descent is treated as a basic relation, which can be further subdivided into relations of direct common descent, indirect common descent, and common descent involving lateral transfer. As one can easily see, historical linguistics lacks proper terms for at least half of these relations, offering no exact counterparts for the terms homology, orthology, and xenology in evolutionary biology.

Cognacy in historical linguistics is often deemed to be identical with homology in evolutionary biology, but this is true only if one ignores common descent involving lateral transfer. One may argue that the notion of xenology is not unknown to linguists, since the borrowing of words is a very common phenomenon in language history. However, the specific relation that is termed xenology in biology has no direct counterpart in historical linguistics: the term borrowing refers to a distinct process, not to a relation resulting from that process. There is no common term in historical linguistics for the specific relation between such words as German kurz 'short' and English short. These words are not cognate, since the German word was borrowed from Latin cŭrtus 'mutilated' (Kluge and Seebold 2002). They do, however, share a common history, since Latin cŭrtus and English short both (may) go back to Proto-Indo-European *(s)ker- 'cut off' (Vaan 2008: 158). The specific history behind these relations is illustrated in the following figure.


A specific advantage of the biological notion of homology as a basic relation covering any kind of historical relatedness, compared to the linguistic notion of cognacy as a basic relation covering only direct and indirect common descent, is that the former is much more realistic regarding the epistemological limits of historical research. Up to a certain point, it can be demonstrated fairly reliably that the basic entities of the respective disciplines (words, genes, or morphological characters) share a common history. Demonstrating that more detailed relations hold, however, is often much harder. The strict notion of cognacy has forced linguists to set goals for their discipline which may often be far too ambitious to achieve. We need to adjust our terminology accordingly, and bring our goals into balance with the epistemological limits of our discipline. In order to do so, I have proposed refining our current terminology in historical linguistics according to the schema shown in the table below (with modifications, taken from List 2014: 44):


Fifty Shades of Cognacy

In a recent blog post, David pointed to the relative character of homology in evolutionary biology, emphasizing that it "only applies locally, to any one level of the hierarchy of character generalization". Recalling his example of bat wings compared to bird wings, which are homologous when compared as forelimbs but analogous when compared as wings, we can find similar examples in historical linguistics.

If we consider the words for 'to give' in the four Romance languages Portuguese, Spanish, Provencal and French, then we can state that Portuguese dar and Spanish dar are homologous, as are Provencal douna and French donner. The former pair go back to the Latin word dare 'to give', and the latter pair to the Latin word donare 'to gift (give as a present)'. In the times when Latin was commonly spoken, dare and donare were clearly separate words, used in clearly separate contexts. The verb donare was itself derived from Latin donum 'present, gift'. Similarly to English, where nouns can easily be used as verbs, Latin allowed for such derivations via specific morphological processes; in contrast to English, however, these processes required that the form of the noun be modified (compare English gift vs. to gift with Latin donum vs. donare).

What the ancient Romans (who spoke Latin as their native tongue) were not aware of is that Latin donum 'gift' and Latin dare 'to give' themselves go back to a common word form. This was no longer evident in Latin, but it was in Proto-Indo-European, the ancestor of the Latin language. Thus, Latin dare goes back to Proto-Indo-European *deh3- 'to give', and Latin donum goes back to Proto-Indo-European *deh3-no- 'that which is given (the gift)' (Meiser 1999; what is written as *h3 in this context was probably pronounced as [x] or [h]). The word form *deh3-no- is a regular derivation from *deh3-, so at the Indo-European level both forms are homologous, since one is derived from the other. That means, in turn, that Latin dare and donum are also homologs, since they are the residual forms of the two homologous words in Proto-Indo-European. And since Latin donare is a regular derivation of donum, this means, again, that Latin dare and donare are also homologous, as are the words in the four descendant languages: Portuguese dar, Spanish dar, Provencal douna, and French donner. Depending on the time depth we apply, we will thus arrive at different homology decisions. I have tried to depict the complex history of these words in the following figure:


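Given this history, homology judgments become level-dependent. A toy sketch of this (the level labels and etymon IDs are my own invention for illustration):

```python
# At the Latin level, dare and donare are different etyma; at the
# Proto-Indo-European level, both go back to *deh3-.
words = {
    "Portuguese dar": {"latin": "dare", "pie": "*deh3-"},
    "Spanish dar": {"latin": "dare", "pie": "*deh3-"},
    "Provencal douna": {"latin": "donare", "pie": "*deh3-"},
    "French donner": {"latin": "donare", "pie": "*deh3-"},
}

def homologous(w1, w2, level):
    """Homology decisions depend on the chosen time depth (level)."""
    return words[w1][level] == words[w2][level]

print(homologous("Portuguese dar", "French donner", "latin"))  # False
print(homologous("Portuguese dar", "French donner", "pie"))    # True
```
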
Judging from their treatment in linguistic databases, many scholars do not regard these different "shades of homology" as a real problem. In most cases, scholars use a "lumping approach", and label as cognates all words that go back to a common root, no matter how far back in time that root goes (compare, for example, the cognate labeling for the reflexes of Proto-Indo-European *deh3- in the IELex).

This labeling practice, however, may be contrary to the models that are used to analyze the data afterwards. All current computational analyses model language evolution as a process of word gain and word loss. The words for the analyses are sampled from an initial set of concepts (such as 'give', 'hand', 'foot', 'stone', etc.), which are translated into the languages under investigation. If we did not know about the deeper history of Latin dare and donare, we would assume a regular process of language evolution here: at some point, the speakers of Gallo-Romance ceased to use the word dare to express the meaning 'to give', using the word donare instead, while the speakers of Ibero-Romance kept on using the word dare. This well-known process of lexical replacement (illustrated in the graphic below), which may provide strong phylogenetic signal, is lost in the current encoding practice, in which all four words are treated as homologs. Our current practice of cognate coding thus masks vital processes of language change.


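To make the coding contrast concrete, here is a toy sketch (the character names and the 0/1 matrix are my own illustration) of how the lumping approach erases the replacement signal that a stricter, meaning-based coding would retain:

```python
languages = ["Portuguese", "Spanish", "Provencal", "French"]

# Lumping all reflexes of PIE *deh3- into one cognate set yields a
# constant character, which carries no phylogenetic signal at all.
lumped = {"give": [1, 1, 1, 1]}

# Coding the Latin etyma dare and donare separately recovers the
# Ibero-Romance vs. Gallo-Romance replacement event.
split = {
    "give-dare": [1, 1, 0, 0],    # Portuguese dar, Spanish dar
    "give-donare": [0, 0, 1, 1],  # Provencal douna, French donner
}
```
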
Outlook

Historical linguistics needs a more serious analysis of the fundamental processes of language change, and of the fundamental historical relations resulting from these processes. In the last two decades, a large arsenal of quantitative methods has been introduced into historical linguistics, the majority of them coming from evolutionary biology. While we have quickly learned to adapt and apply these methods to questions of language classification and language evolution, we have forgotten to ask whether the processes these methods are supposed to model actually coincide with the fundamental processes of language evolution. Apart from adopting the methods of evolutionary biology, we should consider also adopting its habit of having deeper discussions regarding the very basics of our methodology.

References

Arapov MV, Xerc MM (1974) Математические методы в исторической лингвистике [Mathematical methods in historical linguistics]. Moscow: Nauka. German translation: Arapov, M. V. and M. M. Cherc (1983). Mathematische Methoden in der historischen Linguistik. Trans. by R. Köhler and P. Schmidt. Bochum: Brockmeyer.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5, 227-231.

Gray GS, Fitch WM (1983) Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Molecular Biology and Evolution 1.1, 57-66.

Holzer G (1996) Das Erschließen unbelegter Sprachen. Zu den theoretischen Grundlagen der genetischen Linguistik [The reconstruction of unattested languages. On the theoretical foundations of genetic linguistics]. Frankfurt am Main: Lang.

Katičić R (1966) Modellbegriffe in der vergleichenden Sprachwissenschaft [Model concepts in comparative linguistics]. Kratylos 11, 49-67.

Koonin EV (2005) Orthologs, paralogs, and evolutionary genomics. Annual Review of Genetics 39, 309-338.

Kluge F, Seebold E (2002) Etymologisches Wörterbuch der deutschen Sprache [Etymological dictionary of the German language]. 24th ed. Berlin: de Gruyter.

List J-M (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Meiser G (1999) Historische Laut- und Formenlehre der lateinischen Sprache [Historical phonology and morphology of Latin]. Darmstadt: Wissenschaftliche Buchgesellschaft.

Trask RL (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh: Edinburgh University Press.

Vaan M (2008) Etymological Dictionary of Latin and the Other Italic Languages. Leiden and Boston: Brill.