Showing posts with label Anthropology. Show all posts

Monday, November 11, 2019

A new playground for networks and exploratory data analysis


[This is a post by Guido with some help from David]

There tend to be two types of studies of inheritance and evolution. First, there is evolution of organisms, either of the phenotype (morphology, anatomy, cell ultrastructure, etc) or genotype (chromosome, nucleotides). The latter involves direct inheritance, but it is often treated as including all molecules, although it is the nucleotides (and chromosomes) that get inherited, not amino acids, for example.

Second, there are studies of the evolution of behaviour, which have focused mainly on humans, of course, but can include all species. For humans, this includes socio-cultural phenomena, particularly language (written as well as spoken), but also cultural advancements such as social organization, tool use, agriculture, etc., which are inherited indirectly, by learning.

However, we rarely see studies that are multi-disciplinary in the sense of combining both physical and behavioural evolution. It is therefore very interesting to note the just-published preprint by:
Fernando Racimo, Martin Sikora, Hannes Schroeder, Carles Lalueza-Fox. 2019. Beyond broad strokes: sociocultural insights from the study of ancient genomes. arXiv.
These authors provide a review about the extent to which the analysis of ancient human genomes has provided new insights into socio-cultural evolution. This provides a platform for interesting future cross-disciplinary research.

The authors comment:
In this review, we summarize recent studies showcasing these types of insights, focusing on the methods used to infer sociocultural aspects of human behaviour. This work often involves working across disciplines that have, until recently, evolved in separation. We argue that multidisciplinary dialogue is crucial for a more integrated and richer reconstruction of human history, as it can yield extraordinary insights about past societies, reproductive behaviours and even lifestyle habits that would not have been possible to obtain otherwise.
Multi-disciplinary dialogue is a focal point here at the Genealogical World of Phylogenetic Networks. Since our blog embraces non-biological data, we have done a little brainstorming, to put forward some ideas based on Racimo et al.'s comments. The four figures contain some extra discussion, with some visual representations of the ideas.

Why it's important to correlate genetic, linguistic and socio-cultural data. The doodle shows a simple free expansion model of a founder population with three genotypes (yellow, green, blue), a shared language (L) and two major cultural innovations (white stars). Because of drift and stochastic intra-population processes (size represents the size of the actively reproducing populace), the first expansion (light gray arrows) leads to 'tribes' that already show some variation. The smaller ones close to the founder population still spoke the same language, the ones further away used variants (dialects) of L (L', still close to L; L'', more distinct). Because of bottlenecks, geographic distance and differing levels of inbreeding (the smaller a population and the farther away from the source, the more likely are changes in genotype frequency), each population has a different genotype composition. The second expansion (mid-gray arrows), mixing two sources, leads to a grandchild that evolved a new language M and lost the blue genotype. Because the cultural innovations are beneficial, we find them in the entire group. In extreme cases of genetic sorting and linguistic evolution, such shared cultural innovations may be the only evidence clearly linking all these populations.

Social-cultural character matrices

Correlating different sets of data and (cross-)exploring the signal in these data can be facilitated by creating suitable character matrices. In phylogenetics, we primarily use characters that are (ideally) subject to neutral evolution, such as nucleotide sequences and their transcripts, amino-acid sequences. When using matrices scoring morphological traits, we relax the requirement of neutral evolution, but we are still scoring traits that are the product of biological evolution. However, we don't need to stop there. Phylo-linguistics is an active field, even though languages involve different evolutionary constraints and processes than we meet in biology. Data-wise there are nonetheless many analogies, and phylogenetic methods seem to work fine.

So, why not also score socio-cultural traits in a character matrix? For instance, we can characterize cultures and populations by basic features including: the presence of agriculture, which crops were cultivated, which animals were domesticated, which technological advances were available, whether it was a stone-age, bronze-age, iron-age culture, etc. Linguistically, we could also develop matrices of local populations, with regional accents or dialects, etc.

Creating such a matrix should, of course, be informed by available objective information. As in the case of morphological matrices or non-biological matrices in general, we should not be concerned about character independence. We don't need to infer a phylogenetic tree from these matrices, as their purpose is just to sum up all available characteristics of a socio-cultural group.

Second phase: stabilization of differentiation pattern. While the close-by tribes are still in contact with the mother population, the most distant have lost contact. As a consequence, the gene pools of the L/L'-speaking communities will become more similar, and new innovations acquired by the founder population (black star) are readily propagated within its cultural sphere. Re-migration from the larger M-speaking tribe to the struggling L''-speakers (small population with high inbreeding levels) leads to the extinction of the blue genotype in the latter and increased 'borrowing' of M-words and concepts.

Distance calculations

Pairwise distance matrices are most versatile for comparing data across different data sets.

First, any character matrix can be quickly transformed into a distance matrix, and the right distance transformation can handle any sort of data: qualitative, categorical data as well as quantitative, continuous data.
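As a rough illustration of such a transformation, here is a minimal Python sketch of a Gower-style mean character difference, which handles categorical and continuous characters in one matrix. The three cultures, their traits and all function names are hypothetical, and this is only one of many possible distance transformations, not the one used in any of the papers mentioned here.

```python
def gower_distance(a, b, ranges):
    """Mean per-character difference between two OTUs.

    a, b   : lists of states (str = categorical, number = continuous, None = missing)
    ranges : per-character range for continuous characters, None for categorical ones
    """
    total, compared = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip characters that are missing in either OTU
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # continuous: range-scaled
        compared += 1
    return total / compared if compared else float("nan")


def distance_matrix(matrix):
    """Turn {otu: [states...]} into {(otu1, otu2): distance}."""
    names = list(matrix)
    n_chars = len(matrix[names[0]])
    ranges = []
    for j in range(n_chars):
        col = [matrix[n][j] for n in names if matrix[n][j] is not None]
        if col and all(isinstance(v, (int, float)) for v in col):
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)
    return {(p, q): gower_distance(matrix[p], matrix[q], ranges)
            for p in names for q in names}


# Hypothetical socio-cultural matrix: agriculture, metal technology,
# number of domesticated animal species.
cultures = {
    "A": ["yes", "bronze", 4],
    "B": ["yes", "iron", 6],
    "C": ["no", "stone", 0],
}
D = distance_matrix(cultures)
print(D[("A", "B")], D[("A", "C")], D[("B", "C")])
```

The resulting dictionary of pairwise distances can be written out in whatever format a network program expects.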

Second, the signal in any distance matrix can be quickly visualized using Neighbor-nets. This blog has a long list of posts showing Neighbor-nets based on all sorts of sociological data that don't follow any strict pattern of evolution, and are heavily biased by socio-cultural constraints (eg. bikability, breast sizes, German politics, gun legislation, happiness, professional poker, spare-time activities). We have even included celestial bodies.

Third, distance matrices can be tested for correlation as-is, without any prior inference, using simple statistics, such as the Pearson correlation coefficient. To give just one example from our own research: in Göker and Grimm (BMC Evol. Biol. 2008), this coefficient was used to test the performance of character and distance transformations for cloned ITS data covering substantial intra-genomic diversity, by correlating the resulting individual-based distances with species-level morphological data matrices. (The internal transcribed spacers are multi-copy, nuclear-encoded, non-coding gene regions; in the simplest case each individual has two sets of copies (arrays), one inherited from the father, the other from the mother, which may differ between but also within individuals.)

In the context of Racimo et al.'s paper, one could construct a genetic, a socio-cultural, a linguistic and a geographical matrix, determine the pairwise distances between what in phylogenetics are called OTUs (the operational taxonomic units), and test how well these data (or parts of them) correlate. The OTUs would be local human groups sharing the same culture and (if known) language.
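A correlation test of this kind could be sketched as follows, assuming two distance matrices over the same set of OTUs. The four tribes and all distance values are invented for illustration. Note that pairwise distances are not independent observations, so a significance test would need a permutation approach (a Mantel test), which this sketch does not include.

```python
from itertools import combinations
from math import sqrt


def upper_triangle(dist, otus):
    """Collect each unordered OTU pair once, in a fixed order."""
    return [dist[(p, q)] for p, q in combinations(otus, 2)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical pairwise distances between four local groups.
otus = ["T1", "T2", "T3", "T4"]
genetic = {("T1", "T2"): 0.10, ("T1", "T3"): 0.30, ("T1", "T4"): 0.50,
           ("T2", "T3"): 0.25, ("T2", "T4"): 0.45, ("T3", "T4"): 0.20}
linguistic = {("T1", "T2"): 0.05, ("T1", "T3"): 0.40, ("T1", "T4"): 0.70,
              ("T2", "T3"): 0.35, ("T2", "T4"): 0.60, ("T3", "T4"): 0.30}

r = pearson(upper_triangle(genetic, otus), upper_triangle(linguistic, otus))
print(f"Pearson r between genetic and linguistic distances: {r:.3f}")
```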

Alternatively, one can just map the scored socio-cultural traits onto trees based on genetic data or linguistics.

A new culture with its own language (Λ), genotype (red) and innovations (ruby-red pentagon) migrates close to the settling area of the L-people. Because of raids, genotypes and innovations from the L-people get incorporated into the Λ-culture.

How to get the same set of OTUs

The Göker & Grimm paper mentioned above tested several options for character and distance transformations, because we faced a similar problem to what researchers will face when trying to correlate socio-cultural data with genetic profiles of our ancestors: a different set of leaves (the OTUs). We were interested in phylogenetic relationships between individuals using data representing the genetic heterogeneity within these individuals.

Genetic studies of human (ancient or modern) DNA use data from individuals, but socio-cultural and linguistic data can only be compiled at a (much) higher level: societies, or other groups of many individuals. In addition, these groups may also span a larger time frame. Since humans love to migrate, we are even more of a genetic mess than were the ITS data that we studied.

One potential alternative is to use the host-associate analysis framework of Göker & Grimm. Instead of using the individual genetic profiles (the associate data), one sums them across a socio-cultural unit (serving as host). The simplest method is to create a consensus of the data (in Göker & Grimm, we tested strict and modal consensuses). This produces sequences with a lot of ambiguity codes: genetic diversity within the population will be represented by intra-unit sequence polymorphism (IUSP). Standard distance and parsimony implementations do not deal with ambiguities, but maximum likelihood, as implemented in RAxML, does to some degree. A stopgap is the recoding of ambiguities as discrete states for phylogenetic analysis (tree and network inference), as done by Potts et al. (Syst. Biol. 2014) for 2ISPs ('twisps'), intra-individual site polymorphisms. It can't hurt to try out whether this works for IUSPs, too.
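To make the consensus idea concrete, here is a small Python sketch that collapses the aligned sequences of several individuals into one host sequence using IUPAC ambiguity codes. The reading of "strict" (every state observed at a site is kept) and "modal" (only the most frequent state or states are kept) is our assumption for this sketch, and the sequences are invented; it assumes gap-free alignments containing only A, C, G and T.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
_CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
          "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
          "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}
IUPAC = {frozenset(bases): code for code, bases in _CODES.items()}


def consensus(seqs, modal=False):
    """Collapse aligned sequences of one socio-cultural unit into one sequence.

    modal=False: every base seen at a site goes into the ambiguity code.
    modal=True : only the most frequent base(s) at a site are kept.
    """
    out = []
    for column in zip(*seqs):
        counts = Counter(column)
        if modal:
            top = max(counts.values())
            states = {base for base, c in counts.items() if c == top}
        else:
            states = set(counts)
        out.append(IUPAC[frozenset(states)])
    return "".join(out)


# Hypothetical individuals from one local group.
individuals = ["ACGT", "ACGA", "ATGA"]
strict_seq = consensus(individuals)
modal_seq = consensus(individuals, modal=True)
print(strict_seq, modal_seq)
```

The strict version keeps all within-unit polymorphism as ambiguity codes, whereas the modal version discards rare variants, which is the trade-off to keep in mind when choosing between them.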

Since humans (tribes, local groups) often differ in the frequency of certain genotypes, it would be straightforward to use these frequencies directly when putting up a host matrix. Instead of, for example, nucleotides or their ambiguity codes, the matrix would have the frequency of the different haplotypes. We can't infer trees from such a matrix (we need categorical data), but we can still calculate the distance matrix and infer a Neighbor-net.

The 'phylogenetic Bray-Curtis' (distance) transformation introduced in Göker & Grimm (2008) also keeps the information about within-host diversity when determining inter-host distances (see Reticulation at its best ...)
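For frequency data of the kind described above, a distance can be computed directly from the genotype frequencies. The sketch below uses the ordinary Bray-Curtis dissimilarity as a stand-in; it is not the 'phylogenetic Bray-Curtis' transformation of Göker & Grimm (2008), which additionally takes the distances between the associates into account. Population names and frequencies are hypothetical.

```python
def bray_curtis(p, q):
    """Ordinary Bray-Curtis dissimilarity between two frequency profiles.

    p, q: dicts mapping genotype (or haplotype) name to its frequency.
    """
    keys = set(p) | set(q)
    diff = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    total = sum(p.get(k, 0.0) + q.get(k, 0.0) for k in keys)
    return diff / total if total else 0.0


# Hypothetical genotype frequencies in two populations.
founder = {"yellow": 0.4, "green": 0.3, "blue": 0.3}
m_speakers = {"yellow": 0.6, "green": 0.4}  # blue genotype lost

d = bray_curtis(founder, m_speakers)
print(f"Bray-Curtis dissimilarity: {d:.2f}")
```

A full matrix of such pairwise values is all that is needed as input for a Neighbor-net.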


Transformations for genetic data from smaller to larger, more-inclusive units are implemented in the software package POFAD by Joly et al. (Methods in Ecology & Evolution, 2015). Their paper also provides a comparison of different methods, including the ones tested in Göker & Grimm (2008; also implemented in the tiny executables g2cef and pbc, compiled for any platform).

The process of assimilation. The Λ-people subdued the L-culture, with the consequence that all innovations are shared in their influence sphere. Having a much smaller total population size, the language of the invaders is largely lost but the new common language L* still includes some Λ-elements (in a phylogenetic tree analysis, L* would be part of the L/M clade; using networks, L* would share edges with Λ in contrast to L and M). The L''/M-speaking remote population is re-integrated. The invaders' genotype (red) becomes part of the L-people's gene pool. Re-migration (forced or not) introduces L-genotypes into the original Λ-population. Only by comparing all available data, ideally covering more than one time period, can we deduce that the M-speakers represent an early isolated subpopulation of the L-people that was not affected by the Λ-invasion. With only the genetic data at hand, one may identify the M-speakers as one source and the Λ-tribe as another source for the L*-people, and infer that all L/M and Λ-tribes share a common origin (since the yellow genotype is found in both the M- and the original Λ-population).

Conclusion

It therefore seems to us that there is enormous potential for multi-disciplinary work that truly combines organismal and socio-cultural evolution. We have provided a few practical suggestions here about how this might be done. We encourage you all to try some of these ideas, to see where they lead us all.

Tuesday, January 9, 2018

False reports of US women's breast sizes


The role of the social media in spreading fake news has recently been in the headlines; and it is becoming recognized as a major global risk, unique to the 21st century (the first known examples apparently date from 2010). For example, Chengcheng Shao et al. (The spread of fake news by social bots) note:
If you get your news from social media, you are exposed to a daily dose of false or misleading content - hoaxes, rumors, conspiracy theories, fabricated reports, click-bait headlines, and even satire. We refer to this misinformation collectively as false or fake news ... Even in an ideal world where individuals tend to recognize and avoid sharing low-quality information, information overload and finite attention limit the capacity of social media to discriminate information on the basis of quality. As a result, online misinformation is just as likely to go viral as reliable information.
However, an equally problematic issue occurs when the professional media indulge in the same practice — disseminating fake news online. A good example of this appeared during June-July 2016. It involved the presence online of this so-called research paper:
Scientific analysis reveals major differences in the breast size of women in different countries. The Journal of Female Health Sciences.

On the face of it, the paper seems very doubtful:
  • The concept itself is preposterous: although different genetic groups might have differences in breast size, on average, many countries have a mix of different genetic groups, and thus should have a mix of breast sizes. There isn't an Olympics of breast dimensions!
  • The paper first appeared online in mid 2015, at a location not directly associated with any known journal.
  • The alleged journal's home page contains no references to any other published papers, nor to any mechanism for accessing or subscribing to it.
  • The alleged society publishing the journal has no internet presence, other than the journal homepage.
  • The alleged institutions from which the authors hail have no internet presence, other than the paper.
  • The alleged authors also have no internet presence, other than the paper.
It thus takes only a few minutes of effort to confidently identify this paper as a hoax. One therefore has to wonder why so much of the professional media did not make this effort. Instead, they enthusiastically listed the results, which proclaim the USA as having women with the largest breast size, on average, and the Philippines as having the smallest.

A Google search results in 755 hits to the paper's title, many of them internet commentaries. However, consider the following list of professional publications that took the paper seriously in mid 2016:
  • The Sun — The breast in the world: the countries where women have the biggest natural boobs in the world … and the smallest
  • The Telegraph — US women have the biggest breasts in the world — study reveals
  • The Mirror — The countries boasting the women with the biggest natural boobs revealed - where does Britain rank?
  • Daily Mail — Land of the free and home of the busty! American women revealed as having the biggest natural breasts in the world, while Brits come in fifth and Filipinos are last
  • The Irish Sun — Women in Ireland have the third biggest natural boobs in the world
  • New York Daily News — Red, white and boobs: American women boast the biggest breasts in the world
  • Seventeen — American women apparently have the biggest boobs in the world
  • Teen Vogue — U.S. women have the biggest boobs in the world, says science
  • FHM — Pinays have the smallest breasts in the world, study finds
  • Philippine Star — Study: Filipino women have the smallest breast size in the world
  • ABS-CBN — Study: PH women have smallest breasts in the world
  • South Africa Times — Where boobs grow biggest
Importantly, there were a number of commentators who did point out the hoax almost immediately the news reports started appearing:
  • Media Equalizer — Fake breast size study fools publications around the world
  • Manila Times — Fake research on women’s breast sizes is trite and boring
  • Daily Caller — Study showing America has world’s biggest boobs is a hoax but let’s rejoice anyway
  • Jose Carillo — Open letter on news stories that Filipinas have the world’s smallest breasts
Why, then, has the data subsequently been taken seriously in these places:
  • Radiation Oncology Journal 35: 121-128 (2017) In vivo dosimetry and acute toxicity in breast cancer patients undergoing intraoperative radiotherapy as boost.
  • Answers.com — Which country's women have biggest breasts in the world?


It is instructive to look at whether the perpetrators went to any trouble to produce their data. We can do this with a phylogenetic network, as usual on this blog. The network above is a NeighborNet based on the Euclidean distance: countries near each other in the network have similar breast sizes, and the further apart they are, the less similarity they have. Only the 20 largest breast sizes are labeled.

You can see that the biggest breast sizes come preferentially from women with European backgrounds. You can also see just how extreme the breast sizes are claimed to be in North America. Both claims are actually doubtful.

Obviously, I do not know the origin of the paper and its data, but there is a somewhat similar presentation dating from March 2011, this time with a world map of bra sizes:
  • Target Map — Average breast cup size in the world
No source is identified for the latter data, but note that, in this case, it is the Nordic countries plus Russia that are reported to have the largest bra sizes. Indeed, the Spearman rank correlation between the paper and map bra-size datasets is 0.71, so that only 50% of the variation in data is shared between the two datasets.
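The step from a correlation of 0.71 to "50% shared variation" is simply the square of the coefficient (0.71² ≈ 0.50). For readers who want to repeat this kind of check on their own data, here is a minimal Python sketch of the Spearman rank correlation with average ranks for ties. The function names are ours, and the numbers in the example are toy values, not the bra-size data.

```python
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Toy values only, to show the calculation.
rho = spearman([1, 2, 3, 4], [10, 20, 25, 100])
shared = round(0.71 ** 2, 2)  # proportion of shared (rank) variation for rho = 0.71
print(rho, shared)
```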

Finally, if you really do feel the need to read a scientific report about female breast morphology, then try this real one, which at least makes sense:
Evolution and Human Behavior 38: 217-226 (2017) Men's preferences for women's breast size and shape in four cultures.

Wednesday, May 3, 2017

On stemmatics and phylogenetic methods

No se publica un libro sin alguna divergencia entre cada uno de los ejemplares. Los escribas prestan juramento secreto de omitir, de interpolar, de variar. [No book is published without some divergence between each of the copies. Scribes take a secret oath to omit, to interpolate, to change.] (Jorge Luis Borges, La lotería en Babilonia, in Ficciones, 1962)
This is the first in a series of posts on stemmatics, a field just as much in love with trees and networks as are phylogenetics and historical linguistics. Being an introduction, I explain what the field does, present the most important jargon, and offer a list of references that, while suitable for the audience of this blog, is denser than what one might expect for a blog post.

Thank you to Mattis and David for inviting me to write!

Textual criticism

Textual criticism (or, less precisely, "philology") is a discipline concerned with the investigation of the history of literary, legal, and religious texts for explaining how differences among the copies of a text (its "witnesses") arose, and with the production of "critical editions", either scholarly curated versions of a text that aim to reconstruct the lost original or corrected versions of an existing copy.

The problem of divergence between copies of text, with the accumulation of involuntary and deliberate errors, as well as the need for a systematic study of such differences, is as old as writing itself. For example, our current editions for the epic poems of Homer descend from Ancient philological attempts to restore an uncontaminated original (see the first two figures). These include the edition of Pisistratus (VI century BCE, which determined what was to be sung at the Panathenaic Games), and the so-called VMK (Viermännerkommentar, "commentary of the four men") of the Alexandrian School (I-II century BCE), which is generally assumed to be the root of the witnesses that we have.

Van der Valk's reconstruction of the sources for Venetus A, one of the most
important manuscripts of Homer's Iliad (source: Wikipedia).

Erbse's reconstruction of the sources for Venetus A, one of the most important
manuscripts of the Iliad (source: Wikipedia).

Before stemmatics, an edition could be based on a "good copy" (a version considered to be less contaminated or more faithful than others), on a "majority reading" (in which the most attested variant would be chosen), or on a principle of "eclecticism" (with each best reading individually selected by the editor's judgment). Each new version, as expected, contributed even more to the confusion, particularly when changes were voluntary.

Among the texts with long and complex traditions, objects of countless and sometimes bloody disputations on the "correct" readings, are the Bible and codes of laws, for which it was not uncommon to have a different version in each city, with predictable consequences. For example, the first published textual tree, as already covered in this blog (The first Darwinian evolutionary tree), was authored by Carl Johan Schlyter in 1827 in a study precisely on the multiple and conflicting copies of Swedish law.

As such, it is no surprise that objective approaches were soon developed (Homer's VMK edition being one of the first examples), culminating with the development of stemmatics, with its study of the genealogical relationship between witnesses, and its representation of such relationships by means of trees.

Stemmatics

As a scientific approach to textual criticism, stemmatics established itself from the beginning of the 19th century as an alternative to emendations based on the opinions and wishes of editors, possibly inspiring both Charles Darwin and August Schleicher (for a general discussion on the development and significance of this method, see Timpanaro 2005). However, more than a "source", we should consider it a branch equally stemming from the "cultural framework" (Macé and Baret 2006: 91) that also gave us Darwinism and historical linguistics.

As was true for these latter disciplines, stemmatics was at first opposed, because of the revolution it brought to its field, along with its genealogical trees. However, just as in these sister disciplines, the results of the new mindset introduced by the explanation of evolution with trees could not be ignored, and this approach is so central to textual criticism that the latter can be divided into periods before and after the work of Karl Lachmann, the "father" of stemmatics, in particular the publication of his edition of Lucretius' De rerum natura (1850). In his commentaries, besides demonstrating the number of lines per page in the lost manuscript at the root of the tradition, Lachmann was even able to demonstrate the kind of script used to write it (Lachmann 1850).

The work he chose, given the importance of Lucretius in the development of the scientific mindset (and, as we should remember when dealing with cultural evolution, of Darwin's theories), is unlikely to be a coincidence, but this is a matter for a different blog post.

Trees

Genealogical trees are so central to the stemmatic method that the field itself is actually named after them. The main goal of an editor is to produce a stemma codicum ("family tree of manuscripts"), or simply stemma, a tree-like structure that supports the textual emendation and represents the "tradition" (the witnesses' genealogy), in analogy with the family trees of Roman families that figured in many texts reviewed by 19th century philologists. Stemma, in fact, is a Greek word meaning garland or wreath, that was incorporated in Imperial Latin to designate a family tree (and, figuratively, nobility itself), as family trees were drawn with a stemma at their top.

In short, stemmatics begins with a recensio, which is an investigation of all total and partial copies of a work. This review is followed by a collatio, a systematic scrutiny of the manuscripts' contents, when readings are aligned and compared. The results of this alignment are used to produce the stemma, following the principle that "community of errors implies community of origin". By analyzing the stemma and the errors, editors finally proceed to the emendatio, which is a reconstruction that explains the known variants, and is intended to represent the "archetype" (a lost witness at the root of the ramification, assumed to be closer to the original than any other copy).

A stemma is conventionally drawn top-to-bottom, with vertical placements roughly indicating the date of the manuscript (the higher, the older). Solid edges ("arrows") indicate descent, while dashed ones imply contamination (scribes using more than one source). Witnesses are usually labeled with abbreviated names or Latin letters, when the manuscript is available, or with Greek letters, when it is missing (with α usually reserved for the archetype and ω for the original). Below is a reproduction of Petrocchi's partial stemma for the tradition of Dante Alighieri's Divine Comedy, which I will cover in a future post. Note that the genealogy is actually a reticulating network rather than a simple tree.

Petrocchi's partial stemma for the Divine Comedy, presented in the
introduction to his critical edition (1965).

The example stemma offered by Maas (1958), adapted below, is still useful to demonstrate the principles of stemmatics. In this example, for a textual emendation manuscript H should be eliminated (as it descends from F), as well as I and J (copies of G). Manuscript C shows a contamination from its collateral D, something which should be considered when weighting errors. Sub-archetypes β and γ are to be inferred from the available witnesses of their branches, and their readings will have the same weight as K, the only member of the third family branching from the archetype (even though it is a recent manuscript), in establishing the "lesson" of α. Errors might be presumed in α itself, or even in the original ω, and in both cases a corrected "lesson" might be offered by the editor on the basis of internal and external evidence.

Exemplary stemma adapted from Maas (1958).
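The elimination step described for this example can be sketched in a few lines of Python: a preserved witness is dropped when one of its ancestors in the stemma is itself preserved. Only the relationships stated in the text are encoded; where F and G attach to the sub-archetypes is not given there and is a placeholder assumption, the remaining witnesses are left out, and the sketch ignores contamination.

```python
# Child -> list of sources. Edges stated in the text, except where marked.
parents = {
    "α": ["ω"],
    "β": ["α"], "γ": ["α"], "K": ["α"],
    "F": ["γ"],  # placement under γ is an assumption made for this sketch
    "G": ["γ"],  # placement under γ is an assumption made for this sketch
    "H": ["F"],
    "I": ["G"], "J": ["G"],
}

# Witnesses that are preserved (Latin letters); Greek letters are lost.
extant = {"F", "G", "H", "I", "J", "K"}


def ancestors(witness, parents):
    """All sources above a witness, following every line of descent."""
    seen = set()
    stack = list(parents.get(witness, []))
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.add(source)
            stack.extend(parents.get(source, []))
    return seen


def eliminate_descripti(extant, parents):
    """Keep only preserved witnesses with no preserved ancestor."""
    return {w for w in extant if not (ancestors(w, parents) & extant)}


kept = eliminate_descripti(extant, parents)
print(sorted(kept))
```

With contamination (a witness drawing on more than one source), this simple rule would be too aggressive, since a copy of a preserved manuscript may still carry independent readings from its second source.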

Adoption and practice

Stemmatics has been criticized and confronted since Lachmann's time. It requires very specialized knowledge, for example in distinguishing between monogenetic and polygenetic errors, i.e. those that arose once and those that emerged independently more than once (and that, as such, are not disjunctive). A number of its suppositions are routinely called into question, such as the idea that each copy always derives from a single source (accepting contamination, at most), that each copy has at least as many errors as its source, and, fundamentally, that traditions have one and only one archetype.

Many measures tend to be adopted to reduce the editorial effort. These include eliminating manuscripts considered to be descripti (i.e. proved to descend from a preserved witness, in theory sharing all the errors of their sources), and only performing the collatio in a set of critical passages (loci critici). While a complete stemma and a full collatio are desirable, such compromises might be unavoidable for long texts with ample traditions. For example, in the case of Dante Alighieri's Divine Comedy, after considering the time employed by scholars such as Petrocchi, Sanguineti, and Shaw for their editions, Trovato (2016) estimated the length of a full stemmatic approach at 400 man-years.

An alternative to stemmatic methods and suppositions, which also reduces the editorial effort, is found in scholars who follow the work of Joseph Bédier, who successfully challenged the limits of stemmatics by adopting a renewed version of the method of the "good copy" for his editions of medieval texts. The Bédierian method does not reject a scientific approach or methods such as the recensio, the collatio, or even the production of a stemma, but these are used to support the editor's judgment in selecting and curating a bon manuscrit: a good edition of the text, to be corrected only where errors can be proved beyond reasonable doubt. In short, trees (and networks) have been central to textual criticism even when stemmatics itself, as a method, is being challenged.

Considering the editorial effort and the analogies with linguistics and biology, it is no surprise that digital workflows have been proposed, along with the development of computer resources and phylogenetic methods. Ideas for new approaches were explored by Froger (1969), and formal phylogenetic methods were attempted by Platnick and Cameron (1977). Recently, the number of editions supported by formal phylogenetic methods and software has increased (see, for example, Barbrook et al. 1998; Stolz 2003; and Lantin, Baret and Macé 2004), also in the face of scientific evaluations of performance (Roos and Heikkilä 2009).

Besides advances in speed and replicability, the new technologies are allowing us to expand the goals of the discipline, moving from electronic editing to computational philology. In fact, while the field has for centuries been defined by the production of critical editions, digital approaches have been shown to support a reduction in the importance of "authorial intention", allowing researchers to focus on the reception of texts by the public, in line with developments of literary theory (Jauss 1982), and with the goals established by the "New Philology" (Cerquiglini 1989). Manuscripts with readings that differ from a supposed original, traditionally described as "corrupted", are changing from copies that were meant to be discarded into data points that contribute to an investigation of human history that is assisted by quantitative data and methods.

References

Barbrook A.C., Howe C.J., Blake N., Robinson P. (1998) The phylogeny of the Canterbury Tales. Nature 394 (6696): 839.

Cerquiglini B. (1989) Éloge de la variante: histoire critique de la philologie. Aux Travaux. Paris: Éditions du Seuil.

Froger J. (1969) La critique des textes et son automatisation. Bulletin De L’Association Guillaume Budé 1(1): 125–129.

Jauss H.-R. (1982) Toward an Aesthetic of Reception. Minneapolis: University of Minnesota Press.

Lachmann C. (1850) De Rerum Natura. Commentarius. Berolini: Impensis Georgii Reimeri.

Lantin A.-C., Baret P.V., Macé C. (2004) Phylogenetic analysis of Gregory of Nazianzus’ Homily 27. 7èmes Journées Internationales d’Analyse statistique des Données Textuelles, pp. 700-707.

Maas P. (1958) Textual Criticism. Translated by Barbara Flower. Oxford: Oxford University Press.

Macé C., Baret P.V. (2006) Why phylogenetic methods work: the theory of evolution and textual criticism. Linguistica Computazionale. The Evolution of Texts: Confronting Stemmatological and Genetical Methods 24: 89–108.

Platnick N.I., Cameron H.D. (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380–385.

Roos T., Heikkilä T. (2009) Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets. Literary and Linguistic Computing fqp002.

Stolz M. (2003) New philology and new phylogeny: aspects of a critical electronic edition of Wolfram’s Parzival. Literary and Linguistic Computing 18(2): 139–150.

Timpanaro S. (2005) The Genesis of Lachmann's Method. Translated and edited by G. W. Most. Chicago: University of Chicago Press.

Trovato P. (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. See Youtube; date of access: March 19, 2017.

Tuesday, April 18, 2017

Multimedia phylogeny?


Evolutionary concepts have often been transferred to other fields of study, or derived independently in them, especially in anthropology in the broadest sense, covering all cultural products of the human mind. This includes phylogenetic studies of languages, texts, tales, artifacts, and so on — you will find many examples of such studies in this blog. One of the more recent applications has been to what is sometimes called multimedia phylogeny — the research field that "studies the problem of discovering phylogenetic dependencies in digital media".

I have noted before that phylogenetics in the biological sense is an analogy when applied to other fields, because only in biology is genetic information physically transferred between generations — in the other fields, cultural information transfer is all in the minds of the people, not in their genes (see False analogies between anthropology and biology). This analogy often becomes problematic when applied to other fields, because the practical application of bioinformatics techniques separates the informatics from the bio, and the mathematical analyses focus on trying to implement the informatics without any biological justification.


A recent paper that discusses the application of bioinformatics to multimedia phylogeny exemplifies the potential problems:
Guilherme D Marmerola, Marina A Oikawa, Zanoni Dias, Siome Goldenstein, Anderson Rocha (2017) On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS One 11(12): e0167822.
The authors described their background information thus:
Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works, are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance.
However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework.
So, their solution to the separation of bio from informatics is to try a range of techniques, none of which are based on any particular model of how phylogenetic changes might occur in text documents. All of these methods involve distance-based tree-building.

The essential problem, as I see it, is that without a model of change there is no reliable way to separate phylogenetic information from any other type of information. For example, similarity can arise from many sources, only some of which provide information about phylogenetic history; phylogenetic similarity is a form of "special similarity". In biology, other sources of similarity are usually lumped together as chance similarities, such as convergence, parallelism, etc. Without this basic separation of phylogenetic and chance similarity, it does not matter how many distance measures you use, or how many tree-building methods you employ: if you can't separate phylogeny from chance then you are wasting your time constructing a hypothetical evolutionary history.

The authors' only saving grace is their claim that: "In text phylogeny, unlike stemmatology [the analysis of hand-written rather than digital texts], the fundamental aim is to find the relationships among near-duplicate text documents through the analysis of their transformations over time." The expectation, then, is that the phylogenetic similarity of the texts will be high, which will thus reduce the possibility of chance similarities. Sadly, it will also reduce the probability that the similarities will contain any phylogenetic information at all — this is the classic short-branches-are-hard-to-reconstruct problem in phylogenetics.

For digital texts, the authors employ three distance measures: edit distance, normalized compression distance, and cosine similarity. None of these are model-based in any phylogenetic sense (although the first one is used in alignment programs such as Clustal) — I have discussed this in the post on Non-model distances in phylogenetics. Their tree-building methods include: parsimony, support vector machines (a machine-learning form of classification), and random forests (a decision-tree form of classification). Once again, none of these is model-based in terms of textual changes.
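To make the three measures concrete, here is a minimal Python sketch of each. It is not the code used by Marmerola et al.; the choice of zlib as the compressor and of whitespace-separated word counts as the text vectors are assumptions made purely for illustration. Note that none of the three functions encodes any idea of how one text changes into another, which is the point made above.

```python
import math
import zlib
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def ncd(a: str, b: str) -> float:
    """Normalized compression distance (assumption: zlib as compressor)."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine between word-count vectors (assumption: whitespace tokens)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

For example, `edit_distance("kitten", "sitting")` counts the three character edits separating the two words, while `cosine_similarity` returns a similarity (higher means more alike) that would need to be converted to a dissimilarity before any tree-building.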

A final issue is the insistence on trees as the model of a phylogeny. In stemmatology, for example, a network is a more obvious phylogenetic model, because hand-written texts can be copied from multiple sources. Indeed, this distinction plays an important role in the first application of phylogenetics to stemmatology (see the post on An outline history of phylogenetic trees and networks). Perhaps this is not an issue for "near-duplicate text documents", but it does seem like an unnecessary restriction. Moreover, one of the empirical examples used in the paper actually has a network history, which therefore does not match the authors' reconstructed tree.

Tuesday, October 18, 2016

The Genome Cellar is no such thing


In an earlier blog post, I noted that The Music Genome Project is no such thing. The use of the word "genome" in this context is an analogy, in which the musical characteristics are seen as producing a sort of genetic fingerprint. However, this is a false analogy, because the data used for the Music Genome Project are actually phenotypic, not genotypic. Indeed, music has no analog of a genotype.

In a similar vein, the data used for The Genome Cellar are phenotypic, not genotypic, and so this is also a false analogy.


The Genome Cellar is the database used by the Next Glass app. This app was released in November 2014, and a concurrent press release explained the concept:
Next Glass is the breakthrough app that uses science and machine learning software to provide accurate, personalized recommendations to consumers. Next Glass has analyzed tens of thousands of bottles of wine and beer with a mass spectrometer and stores the "DNA" of each product in its Genome Cellar™, which combines with users' Taste Profiles™ to provide product-specific recommendations.
So, the beer / wine data in the Genome Cellar are peaks in a mass spectrometer output. This is made clear in another press release:
Next Glass has developed the world’s first Genome Cellar, an extensive database that contains the chemical makeup – or "DNA" – of tens of thousands of wines and beers. By looking at each bottle on a molecular level, Next Glass defines a unique taste profile for every bottle by analyzing thousands of chemical elements.
This procedure will, indeed, provide a unique fingerprint for each alcoholic product, but it will be a phenotypic one not a genotypic one. Genetics is often chemistry but not all chemistry is genetics.

The idea of the Next Glass app is the same as that for the Music Genome Project — to use the fingerprint of currently liked products (music or wines / beers) to make recommendations for other products that might appeal to the customer. This approach can be expected to work for alcoholic beverages, because the subjective preferences will be based to some extent on the sensory components of the chemical makeup. If you document enough of the chemistry then you are bound to include a large proportion of the sensory part.

Anyway, you can see a short video about the laboratory here.

Finally, you might like to compare this approach with that of WineFriend, which tries to assess your taste in wine with multiple-choice questions, instead of complex chemistry. WineFriend:
uses a simple eight question taste survey that gives insights into a customer's thresholds for sweet, sour, bitterness and intensity of flavour. It then creates a profile which enables it to select wines that are tailored to the individual customer's tastes.
No mention of genomes here.

Tuesday, October 11, 2016

Changes in Playboy's women through 60 years


It has long been known that ideas about female attractiveness, and concern with body weight among young women, are closely related to exposure to mass media images (see the review by Spettigue & Henderson 2004). The print media are particularly involved in this issue, not least the so-called "men's magazines", such as Playboy. It therefore created a great deal of media interest when it was announced in October 2015 that Playboy would no longer feature nude centerfolds (known as Playmates).

Indeed, Playboy has often been claimed as a purveyor of the US society's image of the "ideal woman", although this is surely media exaggeration. Playboy, whether we love it or hate it, has simply portrayed females that the editors thought would sell magazines at the time. Nevertheless, the magazine's choice of models has been used in the professional medical and psychological literature as representative of a prevalent cultural idealization of an ultra-slender female body shape (eg. Garner et al. 1980; Wiseman et al. 1992; Szabo 1996; Spitzer et al. 1999; Katzmarzyk & Davis 2001; Pettijohn & Jungeberg 2004).

It therefore comes as no surprise that the magazine's database of model statistics was subjected to scrutiny in the online media after the 2015 announcement, particularly with regard to how things had changed during the magazine's 62 years (for an earlier analysis, see The girls next door: Life in the centerfold). Sadly, some of this recent analysis was quite poor (eg. Playboy's image of the ideal woman sure has changed). Here, I try to correct this by presenting a more thorough study of the available data.


The data I have used cover all of the Playmates of the Month that have appeared in the US edition of the magazine since its inception. They are contained in a searchable version of the pmstats.txt file that has been maintained by Jim Dean, Johnny Corvin and Doug Ewell, as currently available on Peggy Wilkins' website. This file is an updated compilation of the so-called "vital statistics" of the Playmates from December 1953 to February 2016, inclusive, as reported in Playboy, sometimes supplemented from other available sources.

Note, especially, that the data are basically self-reported by the Playmates. Some of the information has been questioned at various times, notably where it seems to contradict the associated photographic evidence. As a reputable scientist, I should probably have personally checked all of this evidence, but I have not done so (you can do so yourself, based on whatever photos you can find on the internet, or the book edited by Gretchen Edgren 2006). I have simply assumed that, at a minimum, the information presents whatever the Playmates thought was a desirable public image at the time of publication.

There are 753 records in the dataset, separately including twins and triplets appearing in the same magazine issue, as well as multiple appearances by the same woman in different issues. The data include: magazine issue month; Playmate name, birth date and birth location; height in inches and weight in pounds; breast, waist and hip dimensions in inches; and photographer name. From this information, for each Playmate I calculated their age at the time of publication, along with standard measurements for determining whether a body is healthy or not: Body Mass Index (BMI), for body size (ie. underweight, normal weight, overweight, obese), and Waist to Hip Ratio (WHR), for body curvaceousness.
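The derived variables can be sketched in a few lines of Python. This is an illustration rather than the script actually used for the post: the function names are hypothetical, the factor 703 is the usual conversion for BMI from pounds and inches, and counting age in completed years is an assumption.

```python
from datetime import date


def bmi_imperial(weight_lb: float, height_in: float) -> float:
    """Body Mass Index from pounds and inches (703 converts to kg/m^2)."""
    return 703.0 * weight_lb / height_in ** 2


def waist_to_hip_ratio(waist_in: float, hip_in: float) -> float:
    """Waist to Hip Ratio; smaller values indicate a more curvaceous shape."""
    return waist_in / hip_in


def age_at_publication(birth: date, issue: date) -> int:
    """Age in completed years on the issue date (assumption: whole years)."""
    years = issue.year - birth.year
    if (issue.month, issue.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the issue year
    return years
```

As a worked case with made-up numbers, a reported 115 lb at 66" gives a BMI of about 18.6, and a 24" waist with 36" hips gives a WHR of about 0.67.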

Analysis

As is usual in this blog, the data can be summarized using a phylogenetic network as a form of exploratory data analysis (see How to interpret splits graphs).

I first range-standardized the data (so that all of the measurements are compared on the same scale), and log-transformed the BMI and WHR measurements (because otherwise these ratios will have non-linear relationships to the other variables). I then used the Manhattan distance to quantify how dissimilar the different publication years and birth locations are, based on the Playmates' body dimensions. This was followed by a neighbor-net analysis to display the between-year and the between-location similarities as two phylogenetic networks.
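The pre-processing can be sketched as follows. This is a minimal illustration under stated assumptions, not the original analysis: I assume that each year (or location) is represented by one row of average measurements, and that the log-transformation is applied before the range-standardization. The neighbor-net step itself is not shown.

```python
import math


def range_standardize(values):
    """Rescale a list of numbers to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to rescale
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def manhattan(u, v):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(rows, log_columns=()):
    """Pairwise Manhattan distances between labelled rows.

    rows maps a label (e.g. a year) to a list of measurements;
    log_columns lists the column indices to log-transform first.
    """
    labels = list(rows)
    n_cols = len(rows[labels[0]])
    columns = []
    for j in range(n_cols):
        col = [rows[label][j] for label in labels]
        if j in log_columns:
            col = [math.log(x) for x in col]
        columns.append(range_standardize(col))
    scaled = {label: [columns[j][i] for j in range(n_cols)]
              for i, label in enumerate(labels)}
    return {(a, b): manhattan(scaled[a], scaled[b])
            for a in labels for b in labels}
```

Because every column is rescaled to the same 0 to 1 range, each measurement contributes at most 1 to the distance between any two rows, so no single variable dominates simply because of its units.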

The network of relationships among the years is shown first. Years that are closely connected in the network are similar to each other based on the body dimensions of their Playmates, and those that are further apart are progressively more different from each other.


The network shows that there has been a strong and consistent change in Playmate age, size and shape through time. In the graph there is a simple gradient through time from top-right to bottom-left: the 1950s and 1960s are intermingled at the top, with the 1970s below them, the 1980s and 1990s below that, and the 2000s and 2010s intermingled at the bottom.

So, it will be worth looking at time graphs of the individual measurements. Let's start with age.


This does not show a particularly consistent trend, but the average age of the models does increase from 21 to 24 years from beginning to end of the time period.

The next graph shows that the reported height of the Playmates also increases across the 62 years, by 2.5" on average. There is almost no change in average weight across the decades (and so the graph is not shown).


However, far more notable is the relationship between height and weight, as expressed by the BMI, which is shown in the next graph. This does not show a linear trend at all, but a distinctly curved one. That is, the size of Playmates definitely changed through time, becoming thinner for the first 40 years, but then thickening up again for the next 20 years.


This trend has not been discussed in the professional literature, as far as I can determine, perhaps because previous assessments have been based only on a relatively short period of time, not the full 6 decades. Note that the bottom point of the curve occurs in c. 1997, and that by 2016 the BMI measurements had returned to the 1975 level (40 years earlier). I wonder whether they would return to the 1950s level in another 20 years?

More importantly, given that Playmates are to one degree or another reflecting a contemporary societal image of a desirable woman, we can note that 48% of these models are classified as being underweight. The lower limit of a healthy BMI is 18.5, as shown in the next graph, which also shows the boundaries between Mild thinness (17-18.5), Moderate thinness (16-17) and Severe thinness (<16).
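The thinness categories just listed translate directly into a small classifier. This is a sketch only; the function names are hypothetical, and treating each lower boundary as belonging to the healthier category is an assumption about how the thresholds are applied.

```python
def bmi_category(bmi: float) -> str:
    """Classify a BMI value using the thinness thresholds quoted above."""
    if bmi < 16.0:
        return "severe thinness"
    if bmi < 17.0:
        return "moderate thinness"
    if bmi < 18.5:
        return "mild thinness"
    return "not underweight"


def underweight_fraction(bmis) -> float:
    """Proportion of BMI values below the healthy lower limit of 18.5."""
    return sum(1 for b in bmis if b < 18.5) / len(bmis)
```

Applied to the full list of reported BMI values, `underweight_fraction` is the kind of calculation that yields the underweight percentage quoted above.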


Clearly, during the period 1975-1995 the vast majority of the models reported being underweight, while in the 1950s and 1960s very few of them did. This situation has improved recently, with roughly half being underweight during the past 20 years. Also, several of the reported body sizes are very unhealthy. However, perhaps the BMI values below 16 are unreliable, in the sense that such a person is not likely to be very photogenic.

We can now move on to the circumferences of the models. The next graph shows the time trend for the reported circumference at breast level. This shows the biggest and most consistent change of all, with a dramatic reduction in bustiness.


Indeed, chest sizes of >36" have hardly been reported since the start of 1990, and yet in the early years a buxom 36-24-36 figure was the most common claim by the Playmates. Interestingly, very few of the models have claimed a chest size of 33" (as opposed to 32" or 34"); is this some sort of superstition?

The other large and consistent change in circumference is for waist size, as shown in the next graph. This shows the opposite trend, with an increase in average reported size of 2" across the 60 years.


There was a slight but not consistent reduction in hip circumference over time (and so the graph is not shown). This means that the WHR, the measure of curvaceousness, changed greatly through time, as shown in the next graph. So, with the waists reportedly becoming larger, there was apparently a very large reduction in the curvaceousness of the models through time.


Note that the reduction in BMI was apparently achieved in spite of an increase in waist size — the BMI reduction seems to be related to the increase in average reported height without an increase in weight, and partly to the decrease in chest size.

When combined with the reduction in breast circumference, this means that the Playmates of the 21st century have been a very different shape from those of the mid 20th century. They were taller, with smaller breasts and larger waists, and thus had fewer curves.

We can end this discussion by considering where these Playmates were born. Most of them reported being born in the USA (83%). This means that we can consider how the various states compare in producing nude models. Obviously, more models are likely to come from the most populous states, and so we need to standardize the data by dividing by the population size of each state (as estimated for 2015 in Wikipedia), to yield the number of Playmates per million people in each state.
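The standardization is a simple rate calculation, sketched here in Python. The function name and the example figures are hypothetical; the real inputs would be the Playmate counts per state and the 2015 population estimates.

```python
def per_million(counts, populations):
    """Convert raw counts to a rate per million inhabitants.

    counts and populations are dicts keyed by the same place names.
    """
    return {place: counts[place] * 1_000_000 / populations[place]
            for place in counts}


# Made-up numbers: a small population inflates the rate from few models.
rates = per_million({"Smallville": 2, "Bigstate": 40},
                    {"Smallville": 500_000, "Bigstate": 20_000_000})
# Smallville: 4.0 per million; Bigstate: 2.0 per million
```

The example shows why places with small populations need cautious interpretation: two models from half a million people give a higher rate than forty models from twenty million.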


Apparently, Hawaii and California are more likely than the other states to produce models who are prepared to take their clothes off in public, while Delaware and Vermont have not yet done so, at least as far as Playboy is concerned. The apparently large value for Washington DC represents only 2 models from a relatively small population.

We can also consider whether the dimensions of the models vary in any consistent way between the states. This can be done with a phylogenetic network, as discussed above. In the following network, states that are closely connected are similar to each other based on the body dimensions of their Playmates, and those that are further apart are progressively more different from each other.


There appear to be no consistent patterns here.

So, we can finish by considering the countries from which the remaining 17% of the models originated. Once again, the data are standardized, to yield the number of Playmates per million people in each country (or province, for Canada). The apparently large value for Malta represents one set of twins from a relatively small population.


There have been a relatively large number of models from Scandinavia (Norway, Denmark and Sweden). This presumably represents the number of females whose body shape matches the image required by the Playboy editors, as much as the willingness of Scandinavians to disrobe publicly. However, it is notable that the rate of models from Norway is double those for Denmark and Sweden.

References

Edgren G (ed.) (2006) The Playmate Book: Six Decades of Centerfolds. Taschen.

Garner DM, Garfinkel P, Schwartz D, Thompson M (1980) Cultural expectations of thinness in women. Psychological Reports 47: 484-491.

Katzmarzyk PT, Davis C (2001) Thinness and body shape of Playboy centerfolds from 1978 to 1998. International Journal of Obesity 25: 590-592.

Pettijohn TF, Jungeberg BJ (2004) Playboy Playmate curves: changes in facial and body feature preferences across social and economic conditions. Personality and Social Psychology Bulletin 30: 1186-1197.

Spettigue W, Henderson KA (2004) Eating disorders and the role of the media. Canadian Child and Adolescent Psychiatry Review 13: 16-19.

Spitzer BL, Henderson KA, Zivian MT (1999) Gender differences in population versus media body sizes: a comparison over four decades. Sex Roles 40: 545-565.

Szabo CP (1996) Playboy centrefolds and eating disorders - from male pleasure to female pathology. South African Medical Journal 86: 838-839.

Wiseman CV, Gray JJ, Mosimann JE, Ahrens AH (1992) Cultural expectations of thinness in women: an update. International Journal of Eating Disorders 11: 85-89.

Tuesday, October 4, 2016

The practical limits of networks?


Network techniques are becoming more widespread in biology and anthropology. However, the data in both of these disciplines can form very complicated patterns, indeed; and there must be practical limits to what one can do with a network analysis. This post discusses an example that covers both disciplines, and which may well exceed those limits.

The data come from:
Pugach I, Matveev R, Spitsyn V, Makarov S, Novgorodov I, Osakovsky V, Stoneking M, Pakendorf B (2016) The complex admixture history and recent southern origins of Siberian populations. Molecular Biology and Evolution 33: 1777-1795.

The authors note:
Siberia is an extensive geographical region of North Asia stretching from the Ural Mountains in the west to the Pacific Ocean in the east, and from the Arctic Ocean in the north to the Kazakh and Mongolian steppes in the south. This vast territory is inhabited by a relatively small number of indigenous peoples, with most populations numbering only in the hundreds or few thousands. These indigenous peoples speak a variety of languages belonging to the Turkic, Tungusic, Mongolic, Uralic, Yeniseic, Chukotko-Kamchatkan, and Aleut-Yupik-Inuit families, as well as a few isolates. There is also variation in traditional subsistence patterns ... This linguistic and cultural diversity suggests potentially different origins and historical trajectories of the Siberian peoples.
Previous studies of the genetic history of Siberian populations were hampered by the extensive admixture that appears to have taken place among these populations, because commonly used methods assume a tree-like population history and at most single admixture events.
This suggests the use of network techniques, instead of tree-based ones. However, under the circumstances described here it may be unwise to try to produce a phylogenetic network. The situation, as described, does not resemble a "tree with reticulations" but more of an "anastomosing plexus". The latter may be more confusing than helpful, when visualized as a network.

So, the authors do not mention the word "network" nor even "reticulation". Instead:
Here we analyze geogenetic maps and use other approaches to distinguish the effects of shared ancestry from prehistoric migrations and contact, and develop a new method based on the covariance of ancestry components, to investigate the potentially complex admixture history. We furthermore adapt a previously devised method of admixture dating for use with multiple events of gene flow, and apply these methods to whole-genome genotype data [genome-wide SNPs] from over 500 individuals belonging to 20 different Siberian ethnolinguistic groups [plus 9 reference populations].
The results of these analyses indicate that there have been multiple layers of admixture detectable in most of the Siberian populations, with considerable differences in the admixture histories of individual populations.
The admixture (or introgression) patterns among the populations are illustrated using a map. Each bar represents a population, with the colors denoting the different ethnolinguistic groups. Note that every population shows admixture.


The reconstructed migration relationships among the populations are also illustrated using a map. This time, the colors of the arrows represent the different ethnolinguistic groups.


I would not like to have to represent these patterns using a network, and make that network comprehensible. So, this dataset may exceed the practical limits of networks.

Tuesday, September 27, 2016

Inheritance in cultural evolution


I recently reviewed a book anthology devoted to the application of phylogenetic methods in archaeology (see List 2016, PDF here). This book, entitled Cultural Phylogenetics: Concepts and Applications in Archaeology, edited by Larissa Mendoza Straffon (2016), assembles eight articles by scholars who discuss or illustrate the application of phylogenetic approaches in different fields of anthropology and archaeology.

The volume presents a rich collection of different approaches, covering various topics ranging from the evolution of skateboards (Prentiss et al.) to the spread of the potter's wheel (Knappett). The articles dealing with theoretical questions range from historical accounts of tree-thinking in biology and anthropology (Kressing and Krischel) to an overview of the impact of Darwinian thinking on archaeology and anthropology (Rivero). Although I missed a golden thread when reading the eight articles of the volume, it is definitely worth a read for those interested in evolutionary approaches in a broader sense, as most articles explicitly reflect differences and commonalities between biological and cultural evolution, providing concrete insights into the challenges that archaeologists face when trying to promulgate quantitative approaches.

It is clear that evolution in the general sense is much broader than merely evolution in biology, as I have often tried to illustrate in this blog when showing how phylogenetic approaches can be applied in linguistics. Provided that descent with modification also holds, in a broader sense, for cultural artifacts, it is natural to search for fruitful analogies between biological and cultural evolution, in order to profit from methodological transfer in disciplines like anthropology and archaeology. It is also clear, however, that certain analogies between biological evolution and evolution in other fields should be considered with great care. Even in linguistics, this is clearly evident, and I have pointed to this problem in the past (see Productive and unproductive analogies...). The goal cannot be to try to press biological methods into the anthropological template. Instead, we have to rigorously test our proposed analogies, and adapt the biological methods to our needs if necessary.

What surprised me when reading the book was that the majority of the articles did not really seem to care about the crucial differences between biological and cultural evolution, but rather tried to fit the feet and heels of cultural evolution into biology's shoes. Tree thinking dominated most of the articles (with Knappett as a notable exception), and the scholars tried hard to find a clear distinction between vertical and lateral inheritance in cultural evolution. While it is clear that this distinction is the basis for phylogenetic tree applications, where patterns that do not fit a tree are explained as instances of homoplasy or lateral transfer, it is by no means clear why one would go through all the pain to identify these patterns in cultural evolution.

Consider, as an example, the evolution of skateboards. At some point in the history of mankind (some late point!), people decided to put wheels on a board and to do artistic tricks with it. Later, other people merchandised this idea, and started to sell those boards with wheels. Later on, other companies jumped on the bandwagon and started to produce their own brands, thus instigating a fight for the "best" model for a certain kind of clientele. In all of these cases, ideas for design were clearly passed among groups of people, further modified by specific needs or trends, until the current variety of skateboards arose. But which of these ideas were transferred vertically, and which ideas were transferred laterally? Can we identify processes of "speciation" in skateboard evolution, during which new brands were born?

In biology and linguistics we have the clear-cut criteria of interfertility and intelligibility. They cause us enough problems, given that we have ring species in biology and dialect chains in linguistics, but at least they give us some idea how to classify a given exemplar as belonging to a certain group. But what is the counterpart in the evolution of skateboards? Their brand? Their shape? Their users? The analogy simply does not hold. We have neither vertical nor lateral transfer in topics such as skateboard evolution. All we have is a before and an after: a complex network in which objects were constantly recreated and modified, be it based on ideas that were inspired by other objects or people, or independently developed. It seems completely senseless to search for a distinction between vertical and lateral patterns here, as it is not even clear to what degree we are actually dealing with descent with modification.

It seems to me that the problem of inheritance needs to be addressed in cultural evolution before any further quantitative applications using tree-building methods are carried out. Given that ideas can easily be developed independently, the crucial question for studies of cultural evolution is whether similar ideas can be shown to share a common history. It is (as David mentioned earlier in a blog post on False analogies between anthropology and biology) the general problem of homology that does not seem to be solved in most studies on cultural evolution. Here, linguistics has generally fewer problems, given that linguists have developed methods to test whether two words are homologous. In cultural evolution, however, the assessment of homology is far from being obvious.

I think that cultural evolution studies such as the ones presented in the book would generally profit from network approaches. By network approaches, I do not necessarily mean evolutionary networks (in the sense of Morrison 2011), as the problem of inheritance is difficult to solve. Instead, I am thinking of exploratory data analysis using phylogenetic networks (Morrison 2011), or some version of similarity networks (Bapteste et al. 2012). Phylogenetic network approaches are frequently used in biology, and are now also very popular in linguistics. Similarity networks are more common in biology, but we have carried out some promising studies of linguistic data (List et al. 2016). As all of these approaches are exploratory and very flexible regarding the data that is fed to them, they might offer new possibilities for exploratory studies on cultural evolution.
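To give an idea of what a similarity network involves, here is a minimal Python sketch: objects become nodes, and two nodes are linked whenever their similarity reaches a chosen threshold, after which the connected components can be inspected. Everything specific here is an assumption for illustration, including the Jaccard index on sets of traits, the threshold, and the function names; it is not the procedure of Bapteste et al. (2012) or List et al. (2016).

```python
from itertools import combinations


def jaccard(a, b):
    """Shared traits divided by all traits observed in either object."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(traits, threshold):
    """Link every pair of objects whose similarity is >= threshold."""
    edges = {name: set() for name in traits}
    for x, y in combinations(traits, 2):
        if jaccard(traits[x], traits[y]) >= threshold:
            edges[x].add(y)
            edges[y].add(x)
    return edges


def connected_components(edges):
    """Group nodes that are linked directly or through other nodes."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(component)
    return components
```

Note that such a network makes no claim about inheritance: a link only says that two objects are similar, which is exactly why it suits exploratory work when homology has not been established.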

References
  • Bapteste, E., P. Lopez, F. Bouchard, F. Baquero, J. McInerney, and R. Burian (2012) Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proceedings of the National Academy of Sciences 109(45): 18266-18272.
  • Knappett, C. (2016) Resisting Innovation? Learning, Cultural Evolution and the Potter’s Wheel in the Mediterranean Bronze Age. In: Mendoza Straffon, L. (ed.) Cultural Phylogenetics: Concepts and Applications in Archaeology. Springer International Publishing: Cham and Heidelberg and New York and Dordrecht, pp. 97-111.
  • List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers), pp. 599-605.
  • List, J.-M. (2016) [Review of] Cultural Phylogenetics: Concepts and Applications in Archaeology; edited by Larissa Mendoza Straffon. Systematic Biology (published online before print).
  • Morrison, D. (2011) An Introduction to Phylogenetic Networks. RJR Productions: Uppsala.
  • Prentiss, A., M. Walsh, R. Skelton, and M. Mattes (2016) Mosaic evolution in cultural frameworks: skateboard decks and projectile points. In: Mendoza Straffon, L. (ed.) Cultural Phylogenetics: Concepts and Applications in Archaeology. Springer International Publishing: Cham and Heidelberg and New York and Dordrecht, pp. 113-130.
  • Rivero, D. (2016) Darwinian archaeology and cultural phylogenetics. In: Mendoza Straffon, L. (ed.) Cultural Phylogenetics: Concepts and Applications in Archaeology. Springer International Publishing: Cham and Heidelberg and New York and Dordrecht, pp. 43-72.
  • Mendoza Straffon, L. (2016) Cultural Phylogenetics: Concepts and Applications in Archaeology. Springer International Publishing: Cham.

Tuesday, September 20, 2016

Network of who marries whom, by profession


This blog is supposed to be about phylogenetic networks, not social networks. However, this post is a blatant exception.

Earlier this year, Adam Pearce and Dorothy Gambrell released this interesting web page:
This chart shows who marries CEOs, doctors, chefs and janitors
It is an interactive interface to a database of who marries whom. It is well known that people in certain professions tend to marry others with a given profession, and this database quantifies this pattern. The data are from the United States Census Bureau’s 2014 American Community Survey, which covers 3.5 million households. However, much of the dataset clearly also applies to many countries in the "western world".


The infographic is a matrix of professions organized left to right by more male-dominated to more female-dominated (as determined from the data in the database). If you move the mouse-pointer over any profession (or use the search box) then lines link the most common professions that the focus profession tends to marry, with line thickness indicating quantity. The pink and blue color gradients indicate the sexes of the two spouses.

You could try well-known marriage links like those for veterinarians (who tend to marry other veterinarians) and nurses (who tend to marry medical doctors), but more interesting ones for readers of this blog might be: biologists, mathematicians and statisticians (shown in the image above), computer programmers, or information professionals.

However, if you want to get really confused, try looking at "waitresses", "cooks" and "chefs", which seem to offer intransitive relationships.

Tuesday, August 16, 2016

Networks of music history


Networks are currently popular in studies of music. However, they tend to be unrooted similarity networks, showing some form of alleged commonality among artists or their music, as shown in the first graph. This example displays phenotypic similarity among the named artists, although how the "similarity" is measured is not always clear (the post on The Music Genome Project is no such thing briefly discusses this).


[Note: For an alternative approach, Glenn McDonald's Every Noise at Once has a two-dimensional scatter-plot of 1,491 music genres.]

Of more interest to us is the use of a network to study the historical development of music genres, for which we need a rooted network. Clearly, music history will be reticulate rather than tree-like, given the obvious transfers of musical modes between and within cultures, and even the possible resurrection of earlier styles at a later time and even place. A similar argument applies to musical instruments, of course (see Cornets: from a tree to a network; Guitars and networks).

Music networks appear in a previous post, on Reconstructing ancestors in a splits network. That post discusses the paper by J. Miguel Díaz-Báñez, Giovanna Farigu, Francisco Gómez, David Rappaport & Godfried T. Toussaint (2004) El Compás flamenco: a phylogenetic analysis. Proceedings of BRIDGES Conference: Mathematical Connections in Art, Music and Science, pp. 61-70.

The authors provide an analysis of the hand-clapping patterns of the flamenco music of Andalucia, in southern Spain. There are four recognized patterns, plus the fandango pattern, and the authors use two different distance measures to assess their rhythmic similarities. They produce unrooted phylogenetic networks based on each of these distances, using NeighborNet, one of which is shown in the second graph.
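For readers who want to see what a distance matrix of this kind looks like, here is a minimal sketch in Python. It uses the plain Hamming distance, which is an assumption of mine and not one of the two measures used by the authors. The three 12-pulse patterns labelled A, B and C are made up for illustration and are not the actual flamenco rhythms.

```python
from itertools import combinations

# Hypothetical 12-pulse clapping patterns: "x" is an accented pulse,
# "." is an unaccented one. These are placeholders, not real rhythms.
patterns = {
    "A": "x..x..x..x..",
    "B": "x..x..x.x.x.",
    "C": "..x..x.x.x.x",
}


def hamming(p, q):
    """Count the pulses at which two equal-length patterns differ."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same number of pulses")
    return sum(a != b for a, b in zip(p, q))


# Symmetric distance matrix stored as a nested dictionary.
distances = {name: {name: 0} for name in patterns}
for a, b in combinations(patterns, 2):
    d = hamming(patterns[a], patterns[b])
    distances[a][b] = d
    distances[b][a] = d
```

A matrix like this is the input that a program such as SplitsTree takes in order to compute a NeighborNet.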


The authors ignore the fact that "it is well established that the fountain of flamenco music is the fandango", which would make the fandango the outgroup for rooting if we did wish to treat the networks as rooted. Instead, they try to "reconstruct the 'ancestral' rhythms corresponding to the nodes" by using mid-point rooting. This is a tricky business for networks, because there are multiple paths through the graph, and so the mid-point is not necessarily unique.
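To see why this works for a tree but gets tricky for a network, here is a minimal sketch of mid-point rooting on a tree, written in Python. In a tree there is only one path between any two leaves, so the mid-point of the longest leaf-to-leaf path is well defined. The small tree, its node names and its branch lengths are all hypothetical, and the code makes no attempt to handle networks.

```python
def path_between(tree, start, goal):
    """Return the unique path between two nodes of a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for neighbour in tree[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    return None


def path_length(tree, path):
    """Sum the branch lengths along a path."""
    return sum(tree[a][b] for a, b in zip(path, path[1:]))


def midpoint_root(tree):
    """Return (node_a, node_b, offset): the root lies on the branch
    a-b, at distance `offset` from a, halfway along the longest
    leaf-to-leaf path."""
    leaves = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    longest = max(
        (path_between(tree, a, b)
         for i, a in enumerate(leaves) for b in leaves[i + 1:]),
        key=lambda p: path_length(tree, p),
    )
    half = path_length(tree, longest) / 2
    travelled = 0.0
    for a, b in zip(longest, longest[1:]):
        if travelled + tree[a][b] >= half:
            return a, b, half - travelled
        travelled += tree[a][b]


# Hypothetical unrooted tree with four leaves (A-D) and two
# internal nodes (u, v); values are branch lengths.
tree = {
    "A": {"u": 1.0},
    "B": {"u": 3.0},
    "u": {"A": 1.0, "B": 3.0, "v": 2.0},
    "v": {"u": 2.0, "C": 1.0, "D": 4.0},
    "C": {"v": 1.0},
    "D": {"v": 4.0},
}

root_edge = midpoint_root(tree)
```

Here the longest path runs from B to D (length 9), so the root falls on the branch between u and v. In a network, `path_between` would no longer return a unique path, and different paths between the same two leaves can have different lengths and mid-points.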

A similar NeighborNet analysis had previously been provided by Godfried Toussaint (2003) Classification and phylogenetic analysis of African ternary rhythm timelines. Proceedings of BRIDGES Conference: Mathematical Connections in Art, Music and Science, pp. 25-36. This involved an analysis of the 12/8 time bell rhythms in African and Afro-American music. The distances were based on "measures of rhythmic oddity and off-beatness" (this is briefly discussed in Hunting for rhythm’s DNA).


Very few people seem to be interested in producing rooted phylogenetic diagrams directly, except when their model is a tree rather than a network. Perhaps the most ambitious of these is by Victor Grauer (2011) Sounding the Depths: Tradition and the Voices of History. This is available as a paperback or for Kindle. The audio-visual examples are available as a blog page, as are the figures.

His tree is shown in the next graph, including the characters on which it is based. Note that group B3. "Social Unison" is associated with a historical bottleneck, so that the prior history appears to be uncertain.


Finally, not everyone agrees about the importance of the obvious reticulation patterns in music history, notably Sylvie Le Bomin, Guillaume Lecointre, Evelyne Heyer (2016) The evolution of musical diversity: the key role of vertical transmission. PLoS One 11: e0151570. These authors study the music of groups of farmer and hunter-gatherer Bantu and Ubanguian speakers from Gabon, in western Africa. Their music characters are from three groups: repertoire (set of pieces including circumstance and social or symbolic implicit information), performative (polyphonic process, form, instruments and vocal techniques), and intrinsic (metrics, rhythm and melodic).


The authors present a rooted phylogenetic tree, but there is also a "filtered" NeighborNet tucked away in an appendix. It seems to contradict any claim for the data being particularly tree-like.

Finally, to return to where I started, you could take a look at Musicmap, which allegedly covers The Genealogy and History of Popular Music Genres from Origin till Present (1870-2016). To quote from the info:
Musicmap attempts to provide the ultimate genealogy of popular music genres, including their relations and history. It is the result of more than seven years of research with over 200 listed sources and cross examination of many other visual genealogies. Its aim is to focus on the delicate balance between comprehensibility, accuracy and accessibility.

You need to zoom in a long way to appreciate the complexity of the network, covering 230 music genres. There is nominally a timeline from top to bottom (starting in 1870), although the network connections are not strictly time-consistent. As the (mostly Belgian) creators (led by Kwinten Crauwels) note:
The ideal genealogy is not only complete and correct, but also easy to understand despite its complexity. This is a utopian balance that can never be achieved but only approached. By choosing the right amount of genres, determining forms of hierarchy and analogy and ordering everything in a logical but authentic manner, a satisfactory balance can be obtained ... Musicmap is a platform in search for the perfect balance of popular music genres to provide a powerful tool for educational means or a complementary framework in the field of music metadata and automatic taxonomy.