
Monday, November 25, 2019

Typology of semantic promiscuity (Open problems in computational diversity linguistics 10)


The final problem in my list of ten open problems in computational diversity linguistics touches upon a phenomenon that most linguists, let alone ordinary people, might not even have heard of. As a result, the phenomenon does not have an established name in linguistics, which makes it even more difficult to talk about.

Semantic promiscuity, in brief, refers to two empirical observations: (1) the words in the lexicon of human languages are often built from already existing words or word parts, and (2) the words that are frequently "recycled", ie. the promiscuous words (in a sense similar to that of promiscuous domains in biology, see Basu et al. 2008), denote very common concepts.

If it turns out to be true that the meaning of words decides, at least to some degree, their success in giving rise to new words, then it should be possible to derive a typology of promiscuous concepts, or some kind of cross-linguistic ranking of those concepts that turn out to be the most successful in the long run.

Our problem can thus (at least for the moment, since we still have trouble completely grasping the phenomenon, as can be seen from the next section) be stated as follows:
Assuming a certain pre-selection of concepts that we assume are expressed in as many languages as possible, can we find out which of the concepts in the sample give rise to the largest number of new words?
I am not completely happy with this problem definition, since a concept does not actually give rise to a new word; rather, a concept is expressed by a word that is then used to form a new word. But I have decided to leave the problem in this form for reasons of simplicity.

Background on semantic promiscuity

The basic idea of semantic promiscuity goes back to my time as a PhD student in Düsseldorf. My supervisor at the time was Hans Geisler, a Romance linguist with a special interest in sound change and sensory-motor concepts. Sensory-motor concepts are concepts that are thought to be grounded in sensory-motor processes. Concretely, scholars assume that many abstract concepts expressed by many, if not all, languages of the world originate in concepts that denote concrete bodily experience (Ströbel 2016).

Thus, we can "grasp an idea", we can "face consequences", or we can "hold a thought". In such cases, we express something that is abstract in nature by means of verbs that are originally concrete in their meaning and relate to our bodily experience ("to grasp", "to face", "to hold").

When I met Hans Geisler again in 2016 in Düsseldorf, he presented me with an article that he had recently submitted for an anthology that appeared two years later (Geisler 2018). This article, titled "Sind unsere Wörter von Sinnen?" (an approximate translation of this pun would be: "Are our words out of their senses?"), investigates concepts such as "to stand" and "to fall" and their importance for the lexicon of the German language. Geisler claims that it is due to the importance of the sensory-motor concepts of "standing" and "falling" that words built from stehen ("to stand") and fallen ("to fall") are among the most productive (or promiscuous) ones in the German lexicon.

Words built from fallen and stehen in German.

I found (and still find) this idea fascinating, since it may explain (if it turns out to hold true for a larger sample of the world's languages) the structure of a language's lexicon as a consequence of universal experiences shared among all humans.

Geisler did not have a term for the phenomenon at hand. However, I was working at the same time in a lab with biologists (led by Eric Bapteste and Philippe Lopez), who introduced me to the idea of domain promiscuity in biology during a longer discussion about similar processes in linguistics and biology. In the paper reporting our discussion of these similarities, we proposed that the comparison of word-formation processes in linguistics and protein-assembly processes in biology could provide fruitful analogies for future investigations (List et al. 2016: 8ff). But we did not (yet) use the term promiscuity in the linguistic domain.

Geisler's idea that the success of words in forming other words in the lexicon of a language may depend on the semantics of the original terms changed my view on the topic completely, and I began to search for a good term to denote the phenomenon. I did not want to use the term "promiscuity", because of its original meaning.

Linguistics has the term "productive", which is used for particular morphemes that can easily be attached to existing words to form new ones (eg. by turning a verb into a noun, or a noun into an adjective). However, "productivity" starts from the form and ignores the concepts, while concepts play a crucial role in Geisler's phenomenon.

At some point I gave up and began to use the term "promiscuity", for lack of a better term, first in a blogpost discussing Geisler's paper (List 2018, available here). Later in 2018, Nathanael E. Schweikhard, a doctoral student in our research group, developed the idea further, using the term semantic promiscuity (Schweikhard 2018, available here), which is the topic of my tenth and last open problem in computational diversity linguistics (at least for 2019).

In the very fruitful discussions with Schweikhard, we also learned that the idea of the expansion and attraction of concepts comes close to the idea of semantic promiscuity. This relates to Blank's (1997) idea that some concepts tend to frequently attract new words to express them (think of concepts subject to taboo, for simplicity), while other concepts tend to give rise to many new words ("head" is a good example, if you think of all the meanings it can have in different contexts). However, since Blank is interested in the form, while we are interested in the concept, I agree with Schweikhard in sticking with "promiscuity" instead of adopting Blank's term.

Why it is hard to establish a typology of semantic promiscuity

Assuming that certain cross-linguistic tendencies can be found that would confirm the hypothesis of semantic promiscuity, why is it hard to do so? I see three major obstacles: one related to the data, one related to annotation, and one related to comparison.

The data problem is one of sparseness. For most of the languages for which we have lexical data, the available data are so sparse that we often have trouble even compiling a list of 200 words. I know this well, since we struggled with exactly this in a phylogenetic study of Sino-Tibetan languages, where we ended up discarding many interesting languages because the sources did not provide enough lexical data to fill our wordlists (Sagart et al. 2019).

In order to investigate semantic promiscuity, we need substantially more data than we need for phylogenetic studies, since we ultimately want to investigate the structure of word families inside a given language and compare these structures cross-linguistically. It is not clear where to start here, although it is clear that we cannot be exhaustive in linguistics in the way that biologists can be when sequencing a whole gene or genome. I think that one would need at least 1,000 words per language in order to start looking into semantic promiscuity.

The second problem concerns the annotation and analysis that would be needed in order to investigate the phenomenon sufficiently. What Hans Geisler used in his study were larger dictionaries of German that are digitally available and readily annotated. For a cross-linguistic study of semantic promiscuity, however, all of the annotation work on word families would still have to be done from scratch.

Unfortunately, we have also seen that the algorithms for automated morpheme detection that have been proposed to date usually fail greatly when it comes to detecting morpheme boundaries. In addition, word families often have a complex structure, and the parts of words shared across other words are not necessarily identical, due to the numerous processes involved in word formation. So, a simple algorithm that splits words into potential morphemes would not be enough. A further algorithm that identifies language-internal cognate morphemes would be needed; and here, again, we are still waiting for convincing approaches to be developed by computational linguists.

The third problem is the comparison itself: how to compare word-family data across different languages. Since every language has its own structure of words and a very individual set of word families, it is not trivial to decide how one should compare annotated word-family data across multiple languages. While one could try to compare words with the same meaning in different languages, it is quite possible that one would miss many potentially interesting patterns, especially since we do not yet know how (and whether at all) promiscuity surfaces across languages.

Traditional approaches

Apart from the work by Geisler (2018), mentioned above, we find some interesting studies on word formation and compounding in which scholars have addressed similar questions. Steve Pepper has (as far as I know) recently submitted his PhD thesis on The Typology and Semantics of Binominal Lexemes (Pepper 2019, draft here), in which he looks into the structure of words that are frequently constructed from two nominal parts, such as "windmill", "railway", etc. In her master's thesis, titled Body Part Metaphors as a Window to Cognition, Annika Tjuka investigates how terms for objects and landscapes are created with the help of terms originally denoting body parts (such as the "foot" of the table; see Tjuka 2019).

Both of these studies touch on the idea of semantic promiscuity, since they look at the lexicon from a concept-based perspective, as opposed to a purely form-based one, and they also look for patterns that might emerge when examining more than one language. However, given their respective focus (Pepper looking at a specific type of compound, Tjuka at body-part metaphors), they do not address the typology of semantic promiscuity in general, although they provide very interesting evidence showing that lexical semantics plays an important role in word formation.

Computational approaches

The only study that I know of that comes close to studying the idea of semantic promiscuity computationally is by Keller and Schultz (2014). In this study, the authors analyze the distribution of morpheme family sizes in English and German across a time span of 200 years. Using birth-death-innovation models (explained in more detail in the paper), they try to measure the dynamics underlying the process of word formation. Their general finding (at least for the English and German data analyzed) is that new words tend to be built from word forms that appear less frequently across other words in a given language. If this holds true, it would mean that speakers tend to avoid words that are already too promiscuous as a basis for coining new words. What the study definitely shows is that any study of semantic promiscuity has to consider competing explanations.
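To illustrate what such a dynamic means, here is a toy sketch in which new words are preferentially built from morphemes with small family sizes. This is only a schematic rendering of the tendency reported by Keller and Schultz, not their actual birth-death-innovation model; the morphemes and weights are invented for illustration.

```python
import random

random.seed(3)

# Morpheme family sizes: how many words each morpheme currently forms.
family_size = {"stand": 1, "fall": 1, "light": 1, "house": 1}

for _ in range(50):
    morphemes = list(family_size)
    # Weight morphemes inversely by family size, so that rarely used
    # morphemes are preferred as bases for new words.
    weights = [1 / family_size[m] for m in morphemes]
    base = random.choices(morphemes, weights=weights)[0]
    family_size[base] += 1  # the chosen morpheme now forms one more word

print(family_size)
```

Under these inverse weights, family sizes stay comparatively even; replacing the weights with the family sizes themselves would instead produce a few highly promiscuous morphemes, which is the pattern the semantic-promiscuity hypothesis would need to explain.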

Initial ideas for improvement

If we accept that the corpus perspective cannot help us to dive deeply into semantics, since semantics cannot (yet) be automatically inferred from corpora to a degree that would allow us to compare the results across a sufficient sample of languages, then we need to address the question in smaller steps.

For the time being, the idea that a large proportion of the words in the lexicon of human languages are recycled from words that originally express specific meanings remains a hypothesis (whatever those meanings may be, since the idea of sensory-motor concepts is just one suggestion for a potential candidate semantic field). There are enough alternative explanations for what could drive the formation of new words, be it the frequency of recycled morphemes in a lexicon, as proposed by Keller and Schultz, or other factors that we still do not know, or that I do not know, because I have not yet read the relevant literature.

As long as the idea remains a hypothesis, we should first try to find ways to test it. A starting point could be the collection of larger wordlists for the languages of the world (eg. more than 300 words per language) which are already morphologically segmented. With such a corpus, one could easily create word families by checking which morphemes are re-used across words. By comparing the concepts that share a given morpheme, one could then check to what degree, for example, sensory-motor concepts form clusters with other concepts.
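To make this concrete, here is a minimal sketch of what such a check could look like. The wordlist, the segmentations, and the use of the number of words a morpheme helps to form as a proxy for promiscuity are all invented for illustration; a real study would additionally require the language-internal cognate-morpheme identification discussed above.

```python
from collections import defaultdict

# Toy morpheme-segmented wordlist: concept -> segmented word form.
# Data and segmentations are invented for illustration only.
wordlist = {
    "moon": ["moon"],
    "moonlight": ["moon", "light"],
    "light": ["light"],
    "lighthouse": ["light", "house"],
    "house": ["house"],
}

# Group concepts into word families via shared morphemes.
families = defaultdict(set)
for concept, morphemes in wordlist.items():
    for morpheme in morphemes:
        families[morpheme].add(concept)

# One crude proxy for promiscuity: the number of distinct words
# (here identified with their concepts) a morpheme helps to form.
for morpheme, concepts in sorted(families.items(), key=lambda f: -len(f[1])):
    print(f"{morpheme}: {len(concepts)} words -> {sorted(concepts)}")
```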

All in all, my idea is far from being concrete; but what seems clear is that we will need to work on larger datasets that offer word lists for a sufficiently large sample of languages in morpheme-segmented form.

Outlook

Whenever I try to think about the problem of semantic promiscuity, asking myself whether it is a real phenomenon or just a myth, and whether a typology in the form of a world-wide ranking is possible after all, I feel that my brain starts to itch. It feels as if there is something that I cannot really grasp (yet, hopefully), something I have not really understood.

If the readers of this post feel the same way afterwards, then there are two possible reasons why you might feel as I do: you could suffer from the same problem that I have whenever I try to get my head around semantics, or you could simply have fallen victim to a largely incomprehensible blogpost. I hope, of course, that none of you will suffer from anything; and I will be glad for any additional ideas that might help us to understand this matter more properly.

References

Basu, Malay Kumar and Carmel, Liran and Rogozin, Igor B. and Koonin, Eugene V. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Research 18: 449-461.

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen: Niemeyer.

Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter: Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen: Stauffenburg. 131-142.

Keller, Daniela Barbara and Schultz, Jörg (2014) Word formation is aware of morpheme family size. PLoS ONE 9.4: e93978.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis (2018) Von Wortfamilien und promiskuitiven Wörtern [Of word families and promiscuous words]. Von Wörtern und Bäumen 2.10. URL: https://wub.hypotheses.org/464.

Pepper, Steve (2019) The Typology and Semantics of Binominal Lexemes: Noun-noun Compounds and their Functional Equivalents. University of Oslo: Oslo.

Sagart, Laurent and Jacques, Guillaume and Lai, Yunfan and Ryder, Robin and Thouzeau, Valentin and Greenhill, Simon J. and List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Science of the United States of America 116: 10317-10322. DOI: https://doi.org/10.1073/pnas.1817972116

Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11. URL: https://calc.hypotheses.org/1169.

Ströbel, Liane (2016) Introduction: Sensory-motor concepts: at the crossroad between language & cognition. In: Ströbel, Liane (ed.) Sensory-motor Concepts: at the Crossroad Between Language & Cognition. Düsseldorf University Press, pp. 11-16.

Tjuka, Annika (2019) Body Part Metaphors as a Window to Cognition: a Cross-linguistic Study of Object and Landscape Terms. Humboldt Universität zu Berlin: Berlin. DOI: https://doi.org/10.17613/j95n-c998.

Monday, June 24, 2019

Simulation of lexical change (Open problems in computational diversity linguistics 5)


The fifth problem in my list of open problems in computational diversity linguistics is devoted to the simulation of lexical change. In a broad sense, lexical change refers to the way in which the lexicon of a human language evolves over time. In a narrower sense, it can be reduced to the major processes that constitute the changes affecting the words of human languages.

Following Gévaudan (2007: 15-17), we can distinguish three different dimensions along which words can change, namely:
  • the semantic dimension — a given word can change its meaning,
  • the morphological dimension — new words are formed from old words by combining existing words or deriving new words with the help of affixes, and
  • the stratic dimension — languages may acquire words from their neighbors and thus contain strata of contact.
If we take these three dimensions as the basis of any linguistically meaningful system that simulates lexical change (and I would strongly argue that we should), the task of simulating lexical change can thus be worded as follows:
Create a model of lexical change that simulates how the lexicon of a given language changes over time. This model may be simplifying, but it should account for change along the major dimensions of lexical change, including morphological change, semantic change, and lexical borrowing.
Note that the focus on the three dimensions along which a word can change deliberately excludes sound change (which I will treat as a separate problem in an upcoming blogpost). Excluding sound change is justified by the fact that, in the majority of cases, the process proceeds independently of semantic change, morphological change, and borrowing, while the latter three processes often interact.

There are, of course, cases where sound change may trigger the other three processes — for example, where sound change leads to homophonous words in a language that express contrary meanings, which is usually resolved by using another word form for one of the concepts. An example of this process can be found in Chinese, where shǒu (in modern pronunciation) came to mean both "head" and "hand" (written 首 and 手, respectively). Nowadays, shǒu remains only in expressions like shǒudū 首都 "capital", while tóu 头 is the regular word for "head".

Since the number of cases where we have sufficient evidence to infer that sound change triggered other changes is rather small, we will do better to ignore sound change when designing initial models of lexical change. Later models could, of course, combine sound change with lexical change in an overarching framework; but given that the modeling of lexical change is already complex with the three dimensions alone, it seems useful to put sound change aside for the moment and treat it as a separate problem.
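To give the task stated above a concrete shape, here is a minimal toy sketch of a simulator covering the three dimensions. All event types, probabilities, and data are invented for illustration; a serious model would need empirically motivated rates and far richer representations of form and meaning.

```python
import random

random.seed(42)

# Each word is a (form, meaning, stratum) triple, loosely following
# Gévaudan's three dimensions. All probabilities are invented.
MEANINGS = ["hand", "foot", "arm", "leg", "palm"]
DONOR_FORMS = ["mano", "pied"]

def step(lexicon):
    new_lexicon = []
    for form, meaning, stratum in lexicon:
        roll = random.random()
        if roll < 0.05:      # semantic dimension: the word shifts its meaning
            meaning = random.choice(MEANINGS)
        elif roll < 0.10:    # morphological dimension: derive a new word
            new_lexicon.append((form + "-er", random.choice(MEANINGS), stratum))
        new_lexicon.append((form, meaning, stratum))
    if random.random() < 0.10:  # stratic dimension: borrow from a neighbor
        new_lexicon.append((random.choice(DONOR_FORMS),
                            random.choice(MEANINGS), "borrowed"))
    return new_lexicon

lexicon = [("hand", "hand", "inherited"), ("foot", "foot", "inherited")]
for _ in range(20):
    lexicon = step(lexicon)
print(lexicon)
```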

Why simulating lexical change is hard

For historical linguists, it is obvious why it is hard to simulate lexical change in a computational model. The reason is that all three major processes of lexical change (semantic change, morphological change, and lexical borrowing) are already hard to model and to understand in themselves.

Morphological change is not only difficult to understand as a process, it is even difficult to infer; and it is for this reason that morphological segmentation appears as the first example in my list of open problems. The same holds for lexical borrowing, which I discussed as the second example in my list of open problems. The problem of common pathways of semantic change will be discussed in a later post, devoted to the general typology of semantic change processes.

If each of the individual processes that constitute lexical change is itself hard to model or hard to infer, it is no wonder that the simulation of lexical change as a whole is also hard.

Traditional insights into the process of lexical change

Important work on lexical change goes back at least to the 1950s, when Morris Swadesh (1909-1967) proposed his theory of lexicostatistics and glottochronology (Swadesh 1952, 1955, Lees 1953). What was important in this context was not the idea that one could compute the divergence time of languages, but the data model that Swadesh introduced. This data model is represented by a word-list in which a particular list of concepts is translated into a particular range of languages. While former work on semantic change had been mostly semasiological — ie. form-based, taking the word as the basic unit and asking how it would change its meaning over time — the new model used concepts as the comparandum, investigating how word forms replaced each other in expressing specific concepts over time. This onomasiological or concept-based perspective has the great advantage of drastically facilitating the sampling of language data from different languages.

When comparing only specific word forms for cognacy, it is difficult to learn something about the dynamics of lexical change through time, since it is never clear how to sample those words that one wants to investigate more closely in a given study. With Swadesh's data model, the sampling process is reduced to the selection of concepts, regardless of whether one knows how many concepts one can find in a given sample of languages. Swadesh was by no means the first to propose this perspective, but he was the one who promulgated it.

Swadesh's data model does not directly measure lexical change; instead, it measures the results of lexical change, given that these results surface in the distribution of cognate sets across lexicostatistical word-lists. While historical linguists had previously focused mostly on sound change processes, often ignoring morphological and semantic change, the lexicostatistical data model moved semantic change, lexical borrowing, and (to a lesser degree) morphological change into the spotlight of linguistic endeavors. As an example, consider the following quote from Lees (1953), discussing the investigation of change in vocabulary under the label of morpheme decay:
The reasons for morpheme decay, ie. for change in vocabulary, have been classified by many authors; they include such processes as word tabu, phonemic confusion of etymologically distinct items close in meaning, change in material culture with loss of obsolete terms, rise of witty terms or slang, adoption of prestige forms from a superstratum language, and various gradual semantic shifts such as specialization, generalization, and pejoration. [Lees 1953: 114]
In addition to lexicostatistics and the discussions arising from it (including those that criticized the method harshly), I consider the aforementioned model of three dimensions of lexical change by Gévaudan (2007) to be very useful in this context, since it constitutes one of the few attempts to approach the question of lexical change in a formal (or formalizable) way.

Computational approaches

Among the most frequently used models in the historical linguistics literature are those in which lexical change is modeled as a process of cognate gain and cognate loss. Modeling lexical change as a process of word gain and word loss, or root gain and root loss, is in fact straightforward. We know well that languages may cease to use certain words during their evolution, either because the things the words denote no longer exist (think of the word walkman, and then try to project the future of the word ipad), or because a specific word form is no longer used to denote a concept and therefore drops out of the language at some point (think of thorp, which meant something like "village", as a comparison with German Dorf "village" shows, but now survives only as a suffix in place names).

Since the gain-loss (or birth-death) model finds a direct counterpart in evolutionary biology, where genome evolution is often modeled as a process involving the gain and loss of gene families (Cohen et al. 2008), it is also very easy to apply it to linguistics. The major work on the stochastic description of different gain-loss models has already been done, and we can find very stable software that helps us employ gain-loss models to reconstruct phylogenetic trees (Ronquist and Huelsenbeck 2003).
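For readers who want to see the principle rather than the stochastic details, here is a minimal sketch of a gain-loss simulation along a fixed tree. The tree, the per-branch event probabilities, and the initial cognate sets are all invented; real implementations use proper continuous-time models, as in the software cited above.

```python
import random

random.seed(1)

# Tree as (name, children); cognate inventories evolve root-to-tips.
TREE = ("proto", [("west", []), ("east", [("north", []), ("south", [])])])
P_LOSS, P_GAIN = 0.2, 0.3   # invented per-branch event probabilities
counter = [0]               # running ID for newly gained cognate sets

def evolve(cognates, node, result):
    name, children = node
    cognates = {c for c in cognates if random.random() > P_LOSS}  # losses
    if random.random() < P_GAIN:                                  # gains
        counter[0] += 1
        cognates.add(f"new-{counter[0]}")
    result[name] = cognates
    for child in children:
        evolve(cognates, child, result)

result = {}
evolve({"cog-1", "cog-2", "cog-3"}, TREE, result)
for language, cognates in sorted(result.items()):
    print(language, sorted(cognates))
```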

It is therefore not surprising that gain-loss models are very popular in computational approaches to historical linguistics. Starting from the pioneering work by Gray and Jordan (2000) and Gray and Atkinson (2003), they have now been used on many language families, including Austronesian (Gray et al. 2007), Australian languages (Bowern and Atkinson 2012), and most recently also Sino-Tibetan (Sagart et al. 2019). Although scholars (including myself) have expressed skepticism about their usefulness (List 2016), the gain-loss model can be seen as the quasi-standard of phylogenetic reconstruction in contemporary quantitative historical linguistics.

Despite their popularity for phylogenetic reconstruction, gain-loss models have been used only sporadically in simulation studies. The only attempts that I know of so far are a study by Greenhill et al. (2009), where the authors used the TraitLab software (Nicholls et al. 2013) to simulate language change along with horizontal transfer events, and a study by Murawaki (2015), in which (if I understand the study correctly) a gain-loss model is used to model language contact.

Another approach is reflected in the more "classical" work on lexicostatistics, where lexical change is modeled as a process of lexical replacement within previously selected concept slots. I will call this model the concept-slot model. In this model (and potential variants of it), a language is not a bag of words whose contents change over time, but rather a chest of drawers, in which each drawer represents a specific concept and the contents of a drawer represent the words that can be used to express that concept. In such a model, lexical change proceeds as a replacement process: a word within a given concept drawer is replaced by another word.

This model represents the classical way in which Morris Swadesh viewed the evolution of a given language. It is still present in the work of scholars working in the original framework of lexicostatistics (Starostin 2000), but it is used almost exclusively within distance-based frameworks, since a character-based account of the model would require a potentially large number of character states, usually exceeding the number of character states allowed in the classical software packages for phylogenetic reconstruction.

Similar to the gain-loss model, there have not been many attempts to test the characteristics of this model in simulation studies. The only one known to me is a posthumously published letter from Sergei Starostin (1953-2005) to Murray Gell-Mann (Starostin 2007), in which he describes an attempt to account, in a computer simulation, for his theory that a word's replacement rate increases with the word's age (Starostin 2000).
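The following sketch renders the concept-slot model in this spirit, including a crude age-dependent replacement probability. The linear age term is merely a placeholder inspired by Starostin's proposal, not his actual formula, and all rates and concepts are invented.

```python
import random

random.seed(7)

# Each concept "drawer" holds the word currently expressing it plus
# that word's age. All rates below are invented for illustration.
slots = {concept: {"word": concept, "age": 0}
         for concept in ["head", "hand", "stone", "mountain"]}

BASE_RATE = 0.002    # baseline replacement probability per time step
AGE_FACTOR = 0.0001  # crude stand-in for age-dependent replacement

for step in range(5000):
    for concept, slot in slots.items():
        if random.random() < BASE_RATE + AGE_FACTOR * slot["age"]:
            slot["word"] = f"{concept}-{step}"  # replacement event
            slot["age"] = 0
        else:
            slot["age"] += 1

print(slots)
```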

Problems with current models of lexical change

Neither the gain-loss model nor the concept-slot model seems to be misleading as a description of the process of lexical change. However, both obviously ignore specific and crucial aspects of lexical change that (according to the task stated above) any ambitious simulation of lexical change should try to account for. The gain-loss model, for example, deliberately ignores semantic change and morphological change. It can account for borrowings, which can easily be included in a simulation by allowing contemporary languages to exchange words with each other, but it cannot tell us (since it ignores the meaning of word forms) how the meanings of words change over time, or how word forms change their shape due to morphological change.

The concept-slot model can, in theory, account for semantic change, but only as far as the concept slots allow: the number of concepts in this model is fixed, and one usually does not assume that it would change. Furthermore, while borrowing can be included in this model, morphological change processes cannot.

In phylogenetic approaches, both models also have clear disadvantages. The main problem of the gain-loss model is the sampling procedure. Since one cannot sample all the words of a language, scholars usually derive the cognate sets they use to reconstruct phylogenies from cognate-coded lexicostatistical word-lists. As I have tried to show earlier, in List (2016), this sampling procedure can lead to problems when homology is defined in a loose way. The problem of the concept-slot model is that it cannot easily be applied in phylogenetic inference based on likelihood models (such as maximum likelihood or Bayesian inference), since the only straightforward way to handle it would be multi-state models, which are generally difficult to handle.

Initial ideas for improvement

For the moment, I have no direct idea of how to model morphological change, and more research will be needed before we are able to handle it in models of lexical change. The failure of the gain-loss and concept-slot models to account for semantic change, however, can be overcome by turning to bipartite graph models of lexical change (see Newman 2010: 32f for details on bipartite graphs). In such a model, the lexicon of a human language is represented by a bipartite graph consisting of concepts as one type of node and word forms as another type of node. The association strength of a given word node and a given concept node (its "reference potential", see List 2014: 21f), ie. the likelihood of a word being used by a speaker to denote a given concept, can be modeled with the help of weighted edges. This model naturally accounts for synonymy (a meaning can be expressed by multiple words) and polysemy (a word can express multiple meanings). Lexical change in such a model would consist of the re-arrangement of the weights in the network. Word loss and word gain would occur when a new word node is introduced into the network or an existing node becomes dissociated from all concepts.
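A minimal sketch of such a bipartite model might look as follows. The forms, concepts, and weights are invented, and the shift operation is just one of the many possible re-arrangements of weights.

```python
# Bipartite form-concept network: weighted edges encode how strongly
# a form is associated with a concept. All weights are invented.
network = {
    ("head", "HEAD"): 0.9,     # polysemy: one form, two concepts
    ("head", "LEADER"): 0.3,
    ("chief", "LEADER"): 0.7,  # synonymy: one concept, two forms
}

def shift_weight(network, form, source, target, amount):
    """Re-arrange the association weight of a form between two concepts."""
    network[(form, source)] = network.get((form, source), 0.0) - amount
    network[(form, target)] = network.get((form, target), 0.0) + amount
    # A form dissociated from all of its concepts would count as word loss.
    if network[(form, source)] <= 0:
        del network[(form, source)]

shift_weight(network, "head", "HEAD", "LEADER", 0.2)
print(network)
```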


Sankoff's (1969) bipartite model of the lexicon of human languages

We can find this idea of a bipartite model of a language's lexicon in the early linguistic work of Sankoff (1969: 28-53), as reflected in the figure above, taken from his dissertation (Figure 5, p. 36). Similarly, Smith (2004) used bipartite form-concept networks (which he describes as a matrix) in order to test the mechanisms by which vocabularies are transmitted from the perspective of different theories of cultural evolution.

As I have never actively tried to review the large amount of literature devoted to simulation studies in historical linguistics, biology, and cultural evolution, it is quite possible that this blogpost lacks references to important studies devoted to the problem. Despite this possibility, we can clearly say that simulation studies are lacking in historical linguistics. I am furthermore convinced that the problem of handling lexical change in simulation studies is a difficult one, and that we may well have to acquire more knowledge of the key processes involved in lexical change in order to address it sufficiently in the future.

While I understand the popularity of gain-loss models in recent work on phylogenetic reconstruction in historical linguistics, I hope that it might be possible to develop more realistic models in the future. It is quite possible that such studies will confirm the superiority of gain-loss models over alternative approaches. But instead of assuming this axiomatically, as we seem to be doing for the time being, I would rather see some proof of this in simulation studies, or in studies where the data fed to the gain-loss algorithms is sampled differently.

References

Bowern, Claire and Atkinson, Quentin D. (2012) Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88: 817-845.

Cohen, Ofir and Rubinstein, Nimrod D. and Stern, Adi and Gophna, Uri and Pupko, Tal (2008) A likelihood framework to analyse phyletic patterns. Philosophical Transactions of the Royal Society B 363: 3903-3911.

Gévaudan, Paul (2007) Typologie des lexikalischen Wandels. Bedeutungswandel, Wortbildung und Entlehnung am Beispiel der romanischen Sprachen. Tübingen: Stauffenburg.

Gray, Russell D. and Jordan, Fiona M. (2000) Language trees support the express-train sequences of Austronesian expansion. Nature 405: 1052-1055.

Gray, Russell D. and Atkinson, Quentin D. (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray, Russell D. and Greenhill, Simon J. and Ross, Malcolm D. (2007) The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2: 360-375.

Greenhill, S. J. and Currie, T. E. and Gray, R. D. (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London, Series B 276: 2299-2306.

Lees, Robert B. (1953) The basis of glottochronology. Language 29: 113-127.

List, Johann-Mattis (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.

Murawaki, Yugo (2015) Spatial structure of evolutionary models of dialects in Contact. PLoS One 10: e0134335.

Newman, M. E. J. (2010) Networks: An Introduction. Oxford: Oxford University Press.

Nicholls, Geoff K and Ryder, Robin J and Welch, David (2013) TraitLab: A MatLab package for fitting and simulating binary tree-like data.

Ronquist, Fredrik and Huelsenbeck, J. P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572-1574.

Sagart, Laurent, Jacques, Guillaume, Lai, Yunfan, Ryder, Robin, Thouzeau, Valentin, Greenhill, Simon J., List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Science of the United States of America 116: 10317–10322. DOI: 10.1073/pnas.1817972116

Sankoff, David (1969) Historical Linguistics as Stochastic Process. McGill University: Montreal.

Smith, Kenny (2004) The evolution of vocabulary. Journal of Theoretical Biology 228: 127-142.

Starostin, Sergej Anatolévič (2000) Comparative-historical linguistics and lexicostatistics. In: Renfrew, Colin, McMahon, April, Trask, Larry (eds.): Time Depth in Historical Linguistics: 1. Cambridge:McDonald Institute for Archaeological Research, pp. 223-265.

Starostin, Sergej A. (2007) Computer-based simulation of the glottochronological process (Letter to M. Gell-Mann). In: S. A. Starostin: Trudy po yazykoznaniyu [S. A. Starostin: Works in Linguistics]. LRC Publishing House, pp. 854-861.

Swadesh, Morris (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96: 452-463.

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

Tuesday, February 28, 2017

Models and processes in phylogenetic reconstruction


Since I started doing interdisciplinary work (linguistics and phylogenetics), I have repeatedly heard the expression "model-based". This expression often occurs in the context of parsimony vs. maximum likelihood and Bayesian inference, and it is usually embedded in statements like "the advantage of ML is that it is model-based", or "but parsimony is not model-based". By now, I assume that I get the gist of these sentences, but I am afraid that I often still do not get their point. The problem is the ambiguity of the word "model", in biology but also in linguistics.

What is a model? For me, a model is usually a formal way to describe a process that we deal with in our respective sciences, nothing more. If we talk about the phenomenon of lexical borrowing, for example, there are many distinct processes by which borrowing can happen.

A clear-cut case is Chinese kāfēi 咖啡 "coffee". This word was obviously borrowed from some Western language not too long ago. I do not know the exact details (which would require a rather lengthy literature review and an inspection of older sources), but it is obvious that the word is not very old in Chinese. The fact that the pronunciation comes close to the word for coffee in the largest European languages (French, English, German) is a further hint, since the longer a word survives after having been transplanted into another language, the more it comes to resemble other words of that language in its phonological structure; and the syllables of kāfēi do not occur in other words of Chinese. We can depict the process with the help of the following visualization:


Lexical borrowing: direct transfer
The visualization summarizes a very rough and very basic idea of how the borrowing of words proceeds in linguistics: each word has a form and a function, and direct borrowing, as we could call this specific subprocess, proceeds by transferring both the form and the function from the donor language to the target language. This is a very specific type of borrowing, and many borrowing processes do not directly follow this pattern.

In the Chinese word xǐnǎo 洗脑 "to brainwash", for example, the form (the pronunciation) has not been transferred. But if we look at the morphological structure of xǐnǎo, a compound consisting of the verb xǐ "to wash" and nǎo "brain", it is clear that here Chinese borrowed only the meaning. We can visualize this as follows:
Lexical borrowing: meaning transfer

Unfortunately, I am already starting to simplify here. Chinese did not simply borrow the meaning: it borrowed the expression, that is, the motivation to express this specific meaning in a way analogous to the expression in English. However, when borrowing meanings instead of full words, it is by no means guaranteed that the speakers will borrow exactly the same structure of expression they find in the donor language. The German equivalent of skyscraper, for example, is Wolkenkratzer, which literally translates as "cloud-scraper".

There are many different ways to coin a good equivalent for "brainwash" in any language of the world that are not analogous to the English expression. One could, for example, also call it "head-wash", "empty-head", "turn-head", or "screw-mind"; and the only reason we say "brainwash" (instead of these others) is that this word was chosen at some point when people felt the need to express this specific meaning, and the expression turned out to be successful (for whatever reason).

Thus, instead of just distinguishing between "form transfer" and "meaning transfer", as my visualizations above suggest, we can easily find many more fine-grained ways to describe the processes of lexical borrowing in language evolution. Long ago, I took the time to visualize the different types of borrowing processes mentioned in the work of Weinreich (1953 [1974]) in the following graphic:

Lexical borrowing: hierarchy following Weinreich (1953[1974])

From my colleagues in biology, I know well that we find a similar situation in bacterial evolution, with different types of lateral gene transfer (Nelson-Sathi et al. 2013). We are not even sure whether the account by Weinreich displayed in the graphic is actually exhaustive; and the same holds for evolutionary biology and bacterial evolution.

But it may be time to get back to the models, as I assume that some of you who have read this far have begun to wonder why I am spending so many words and graphics on borrowing processes when I promised to talk about models. The reason is that, in my usage of the term "model" in scientific contexts, I usually have in mind exactly what I have described above. For me (and, I suppose, not only for me, but for many linguists, biologists, and scientists in general), models are attempts to formalize processes by classifying and distinguishing them; and flow-charts, typologies, descriptions, and the identification of distinctions are an informal way to communicate them.

If we use the term "model" in this broad sense and look back at the discussion about parsimony, maximum likelihood, and Bayesian inference, it also becomes clear that it does not make immediate sense to say that parsimony lacks a model, while the other approaches are model-based. I understand why one may want to make this strong distinction between parsimony and methods based on likelihood-thinking, but I do not understand why the term "model" needs to be employed in this context.

Nearly all recent phylogenetic analyses in linguistics use binary characters and describe their evolution with the help of simple birth-death processes. The only difference between parsimony and likelihood-based methods is how the birth-death processes are modelled stochastically. Unfortunately, we know very well that neither lexical borrowing nor "normal" lexical change can be realistically described as a birth-death process. We even know that these birth-death processes are essentially misleading (for details, see List 2016). Instead of investing our time in enhancing and discussing the stochastic models driving birth-death processes in linguistics, doesn't it seem worthwhile to have a closer look at the real processes we want to describe?

References
  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2. 119-136.
  • Nelson-Sathi, S., O. Popa, J.-M. List, H. Geisler, W. Martin, and T. Dagan (2013) Reconstructing the lateral component of language history and genome evolution using network approaches. In: Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 163-180.
  • Weinreich, U. (1974) Languages in contact. With a preface by André Martinet. Mouton: The Hague and Paris.

Wednesday, December 9, 2015

Lexicostatistics: the predecessor of phylogenetic analyses in historical linguistics


Phylogenetic approaches in historical linguistics are extremely common nowadays. Probabilistic models that treat lexical change as a birth-death process of cognate sets evolving along a phylogenetic tree (Pagel 2009) are especially popular (Lee and Hasegawa 2011, Kitchen et al. 2009, Bowern and Atkinson 2012), but splits networks are also frequently used (Ben Hamed 2005, Heggarty et al. 2010).

However, the standard procedure for producing a family tree or network with phylogenetic software in linguistics goes back to the method of lexicostatistics, which was developed in the 1950s by Morris Swadesh (1909-1967) in a series of papers (Swadesh 1950, 1952, 1955). Lexicostatistics was discarded by the linguistic community not long after it was proposed (Hoijer 1956, Bergsland and Vogt 1962). Since then, lexicostatistics has been considered a methodus non grata in classical circles of historical linguistics, and using it openly may drastically downgrade one's perceived credibility in certain parts of the community.

To avoid these conflicts, most linguists practicing modern phylogenetic approaches emphasize the fundamental differences between early lexicostatistics and modern phylogenetics. These differences, however, apply only to the way the data are analysed. The basic assumptions underlying the selection and preparation of the data have not changed since the 1950s, and it is important to keep this in mind, especially when searching for appropriate phylogenetic models to analyse the data.

The Theory of Basic Vocabulary

Swadesh's basic idea was that the lexicon of every human language contains words that are culturally neutral and functionally universal; and he used the term "basic vocabulary" to refer to these words. Culturally neutral here means that the meanings expressed by the words are used independently across different cultures. Functionally universal means that the meanings are expressed by all human languages, independent of the time and place where they are spoken. The idea is that these meanings are so important for the functioning of a language as a tool of communication that every language needs to express them.

Cultural neutrality and functional universality guarantee two important properties of basic words: their stability and their resistance to borrowing. Stability means that words expressing a basic concept are less likely to change their meaning or to be replaced by another word. An argument for this claim is the functional importance of the words: if the words are important for the functioning of a language, it would not make much sense to change them too quickly. Humans are good at changing the meanings of words, as we can see from daily conversations and the media, where new words seem to pop up on a daily basis, and old words often drastically change their meanings. But changing words that express basic meanings like "head", "stone", "foot", or "mountain" too often might give rise to confusion in communication. As a result, one can assume that words change at a different pace, depending on the meaning they express; and this is one of the core claims of lexicostatistics.

Resistance to borrowing also follows from stability, since the replacement of words expressing basic meanings may again have an impact on our daily communication, and we may thus assume that speakers avoid borrowing these words too quickly. Cultural neutrality of concepts is another important point guaranteeing resistance to borrowing. Words expressing concepts that play an important cultural role may easily be transferred from one language to another along with the culture. Thus, although it seems likely that every language has a word for "god" or "spirit" and the like (so the concept is to a certain degree functionally universal), the lack of cultural independence makes words expressing religious terms very likely candidates for borrowing, and it is probably no coincidence that words expressing religion and belief rank first in the scale of borrowability (Tadmor 2009: 232).

Lexical Replacement, Data Preparation, and Divergence Time Estimation

Swadesh had further ideas regarding the importance of basic vocabulary. He assumed that the process of lexical replacement follows universal rates as far as the basic vocabulary is concerned, and that this would allow us to date the divergence of languages, provided we are able to identify the shared cognates. In lexical replacement, a word w₁ expressing a given meaning x in a language is replaced by a word w₂ which then expresses the meaning x, while w₁ either shifts to express another meaning or completely disappears from the language. For example, the older thou in English was replaced by the plural form you, which now also expresses the singular. In order to search for cognates and determine the time when two languages diverged, Swadesh proposed a straightforward procedure, consisting of very concrete steps (compare Dyen et al. 1992):
  • Compile a list of basic concepts (concepts that you think are culturally neutral and functionally universal; see here for a comparative collection of different lists that have been proposed and used in the past),
  • translate these concepts into the different languages you want to analyse,
  • search for cognates between the languages in each meaning slot (if words in two languages are not cognate for a given meaning, this points to former processes of lexical replacement in at least one of the languages since their divergence), and
  • count the number of shared cognates, and use some mathematics to calculate the divergence time (which has been independently calibrated using test cases of known divergence times; a minimal sketch of this last step follows after this list).
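As a minimal sketch of the last step, the classical glottochronological formula computes the divergence time from the proportion of shared cognates and an assumed constant retention rate (the value of roughly 0.805 per millennium used below is the classical calibration for the 200-item list; the input value of 0.70 is invented for illustration):

```python
from math import log

def divergence_time(shared, retention=0.805):
    """Glottochronological divergence time in millennia.

    shared: proportion of shared cognates between two languages;
    retention: assumed retention rate per millennium.
    """
    return log(shared) / (2 * log(retention))

# Two languages sharing 70% of cognates on the basic list:
print(round(divergence_time(0.70), 2), "millennia")
```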
As an example of such a wordlist with cognate judgments, compare the table in the first figure, where I have entered just a few basic concepts from Swadesh's standard concept list and translated them into four languages. Cognacy is assigned with the help of IDs in the column to the right of each language column, and further highlighted with different colors.

Classical cognate coding in lexicostatistics

Phylogenetic Approaches in Historical Linguistics

Modern phylogenetic approaches in historical linguistics basically follow the same workflow that Swadesh propagated for lexicostatistics, the only difference being the last step of the working procedure. Instead of Swadesh's formula, which compared lexical replacement with radioactive decay and was at its core based on aggregated distances, character-based methods are used to infer phylogenetic trees. Characters are retrieved from the data by extracting each cognate set from a lexicostatistical wordlist and annotating the presence or absence of each cognate set in each language.

Thus, while Swadesh's lexicostatistical data model would state that the words for "hand" in German and English are cognate, as are those in Italian and French, but not across Germanic and Romance, the binary presence-absence coding states that the cognate set formed by words like English hand and German Hand is absent from the Romance languages, and that the cognate set formed by words like Italian mano and French main is absent from the Germanic languages. This is illustrated in the table below, where the same IDs and colors are used to mark the cognate sets as in the table shown above.

Presence-absence cognate coding for modern phylogenetic analyses
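A minimal sketch of this recoding step, with invented cognate-set IDs standing in for the colored cells of the tables:

```python
# Meaning-slot cognate coding: language -> {concept: cognate-set ID}.
# The IDs are invented stand-ins for the colored cells of the tables.
wordlist = {
    "German":  {"hand": 1, "head": 3},
    "English": {"hand": 1, "head": 3},
    "French":  {"hand": 2, "head": 4},
    "Italian": {"hand": 2, "head": 4},
}

# Presence-absence coding: one binary character per cognate set.
cognate_sets = sorted({cid for cogs in wordlist.values()
                       for cid in cogs.values()})
matrix = {language: [1 if cid in cogs.values() else 0
                     for cid in cognate_sets]
          for language, cogs in wordlist.items()}

print("characters:", cognate_sets)
for language, row in matrix.items():
    print(f"{language:8s}", row)
```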

The new way of cognate coding, along with the use of phylogenetic software, has without doubt brought many improvements compared to Swadesh's idea of dating divergence times by counting percentages of shared cognates. A couple of problems, however, remain, and one should not forget them when applying computational methods to originally lexicostatistical datasets.

First, we could ask whether the main assumptions of functional universality and cultural neutrality really hold. It seems to be true that words can be remarkably stable throughout the history of a language family. It is, however, also true that the most stable words are not necessarily the same across all language families. Ever since Swadesh established the idea of basic vocabulary, scholars have tried to improve the list of basic vocabulary items. Swadesh himself started from a list of 215 concepts (Swadesh 1950), which he first reduced to 200 concepts (Swadesh 1952) and later to 100 concepts (Swadesh 1955). Other scholars went further, like Dolgopolsky (1964), who reduced the list to 16 concepts. The Concepticon is a resource that links many of the concept lists that have been proposed in the past. When comparing these lists, which all represent what some scholars would label "basic vocabulary items", it becomes obvious that the number of items that all scholars agree upon shrinks drastically, while the number of concepts that have been claimed to be basic increases.

An even greater problem than the question of the universality and neutrality of basic vocabulary, however, is the underlying model of cognacy in combination with the proposed process of change. Swadesh's model of cognacy controls for meaning. While this model of cognacy is consistent with Swadesh's idea of lexical replacement as the basic process of lexical change, it is by no means consistent with birth-death models of cognate gain and cognate loss if these are fed with lexicostatistical data. In biology, birth-death models are usually used to model the evolution of homologous gene families distributed across whole genomes. If we use the traditional view, according to which words can be cognate regardless of meaning, the analogy holds, and birth-death processes seem adequate for analyzing datasets based on such root cognates (Starostin 1989) or etymological cognates (Starostin 2013). But if we control for meaning in the cognate judgments, we do not necessarily capture processes of gain and loss in our data. Instead, we capture processes in which the links between word forms and concepts are shifted, and we investigate these shifts through the very narrow "windows" of pre-defined slots of basic concepts, as I have tried to depict in the following graphic.

Looking at lexical replacement through the small windows of basic vocabulary

Conclusion

As David has mentioned before: we do not necessarily need realistic models in phylogenetic research to infer meaningful processes. The same can probably be said about the discrepancy between our lexicostatistical datasets (Swadesh's heritage, which we keep using for practical reasons) and the birth-death models we now use to analyse the data. Nevertheless, I cannot avoid an uncomfortable feeling when thinking that an algorithm is modeling the gain and loss of characters in a dataset that was not produced for this purpose. In order to model traditional lexicostatistical data consistently, we would either (i) need explicit multi-state models in which each concept is a character and the forms represent the states (Ringe et al. 2002, Ben Hamed and Wang 2006), or (ii) turn directly to "root-cognate" methods. These methods have been discussed for some time now (Starostin 1989, Holm 2000), but there is only one recent approach, by Michael et al. (forthcoming), in which this is consistently tested.

References
  • Bergsland, K. and H. Vogt (1962): On the validity of glottochronology. Curr. Anthropol. 3.2. 115-153.
  • Bowern, C. and Q. Atkinson (2012): Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88. 817-845.
  • Dolgopolsky, A. (1964): Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2. 53-63.
  • Dyen, I., J. Kruskal, and P. Black (1992): An Indoeuropean classification. A lexicostatistical experiment. T. Am. Philos. Soc. 82.5. iii-132.
  • Ben Hamed, M. and F. Wang (2006): Stuck in the forest: Trees, networks and Chinese dialects. Diachronica 23. 29-60.
  • Hoijer, H. (1956): Lexicostatistics. A critique. Language 32.1. 49-60.
  • Holm, H. (2000): Genealogy of the main Indo-European branches applying the separation base method. J. Quant. Linguist. 7.2. 73-95.
  • Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
  • Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
  • Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews. Genetics 10. 405-415.
  • Ringe, D., T. Warnow, and A. Taylor (2002): Indo-European and computational cladistics. T. Philol. Soc. 100.1. 59-129.
  • Starostin, S. (1989): Sravnitel'no-istoričeskoe jazykoznanie i leksikostatistika [Comparative-historical linguistics and lexicostatistics]. In: Kullanda, S., J. Longinov, A. Militarev, E. Nosenko, and V. Shnirel'man (eds.): Materialy k diskussijam na konferencii [Materials for the discussion at the conference]. 1. Institut Vostokovedenija: Moscow. 3-39.
  • Starostin, G. (2013): Lexicostatistics as a basis for language classification. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 125-146.
  • Swadesh, M. (1950): Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
  • Swadesh, M. (1952): Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philol. Soc. 96.4. 452-463.
  • Swadesh, M. (1955): Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.
  • Tadmor, U. (2009): Loanwords in the world’s languages. Findings and results. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world's languages. de Gruyter: Berlin and New York. 55-75.