Saturday, July 29, 2023

The Hybrid Model for Indo-European languages

New paper - Paul Heggarty et al., Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages. Science 381, eabg0818 (2023). DOI: 10.1126/science.abg0818

*Note: in the paper the authors give dates in BP (before present), where present = 2000 CE. While I appreciate the strictly secular nomenclature, I find BCE dates easier to comprehend (for now), so I convert them by subtracting 2000, e.g. 6000 BP = 4000 BCE. Do keep this in mind while reading. Thank you.
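For anyone who wants to redo the conversion while reading the paper, here is a minimal sketch of the rule described in the note. The function name and the default for "present" are my own choices for illustration. The rule is the simple subtraction used in this post and ignores the fact that the calendar has no year zero, so results can be off by one year in strict terms.

```python
def bp_to_bce(years_bp: int, present_ce: int = 2000) -> int:
    """Convert a date in years BP to a BCE year using this post's convention.

    Assumes "present" is 2000 CE and applies plain subtraction.
    A result of zero or less would mean the date falls in the CE era.
    """
    return years_bp - present_ce


if __name__ == "__main__":
    # The example from the note: 6000 BP corresponds to 4000 BCE.
    print(bp_to_bce(6000))
```

Dates in this post, such as ~6100 BCE, correspond to roughly 8100 BP in the paper under the same rule.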

Brief Overview of the Method


The authors stress that their method, Bayesian phylogenetic inference, is distinct from both lexicostatistics and glottochronology, which they consider deeply flawed.

The paper's Bayesian phylogenetic analysis is based on a new, improved database, IE-CoR 1.0. It records relationships of cognacy (shared word origin) between 161 Indo-European languages across a reference set of 170 basic meanings. Languages new to this database include the Nuristani branch, extinct Iranic languages from Central Asia, and Gaulish, which represents a sub-branch of Celtic that was missing from previous databases. The coverage prioritizes non-modern languages, which provides a deeper phylogenetic signal and better chronological estimation. To maximize data accuracy, the database was compiled with contributions from 80 experts on different language sub-families.

The authors state that they improved the cognate encoding by keeping one lexeme per meaning in each language, rather than the many synonyms used in previous databases, which created lots of cognate sets per meaning. That practice, for example, artificially elongated the branch length of modern Greek and the age of old Greek. The IE-CoR data set has highly consistent counts of cognate sets across all languages, very close to the target of 1 cognate set per meaning, per language. They also removed the constraint, imposed in earlier analyses, that ancient languages must be directly ancestral to modern languages, which need not be the case. That constraint forced a branch length of 0 (and therefore no divergence) on the ancient language, which pushed its changes onto the next branch and artificially elongated branch lengths.
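To make the "1 cognate set per meaning, per language" target concrete, here is a small sketch. Everything in it is hypothetical: the language names, meanings, and cognate set IDs are toy placeholders, not IE-CoR data, and the function names are mine. It counts cognate sets per language and meaning, then builds a binary presence/absence matrix, which is a common way cognate data is prepared for phylogenetic software. I am assuming that general format here, not describing the paper's actual pipeline.

```python
from collections import defaultdict

# Hypothetical toy records: (language, meaning, cognate_set_id).
# These are placeholders for illustration, not real IE-CoR entries.
records = [
    ("LangA", "water", "water_1"),
    ("LangB", "water", "water_1"),
    ("LangC", "water", "water_2"),
    ("LangA", "fire", "fire_1"),
    ("LangB", "fire", "fire_2"),
    ("LangC", "fire", "fire_2"),
]


def sets_per_slot(records):
    """Count distinct cognate sets for each (language, meaning) pair."""
    slots = defaultdict(set)
    for lang, meaning, cset in records:
        slots[(lang, meaning)].add(cset)
    return {slot: len(csets) for slot, csets in slots.items()}


def binary_matrix(records):
    """Return (sorted cognate set IDs, {language: 0/1 row}) for the records."""
    languages = sorted({lang for lang, _, _ in records})
    cognate_sets = sorted({cset for _, _, cset in records})
    present = {(lang, cset) for lang, _, cset in records}
    matrix = {
        lang: [1 if (lang, cset) in present else 0 for cset in cognate_sets]
        for lang in languages
    }
    return cognate_sets, matrix


if __name__ == "__main__":
    # With one lexeme per meaning, every count should be exactly 1.
    print(sets_per_slot(records))
    columns, matrix = binary_matrix(records)
    print(columns)
    for lang, row in matrix.items():
        print(lang, row)
```

If a language listed two synonyms from different cognate sets for the same meaning, its count for that slot would be 2 and its row would gain an extra 1. That is the kind of inflation the one-lexeme rule is meant to avoid.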

The database also solves the loanword problem in computational cladistics. "IE-CoR introduced the concept of loanword event, through which it has become possible to encode correctly both non-cognacy to the source lexeme, and subsequent cognacy between vertical descendants of that lexeme, once borrowed and fully integrated into the borrower language."

The IE-CoR database can be found here: https://iecor.clld.org/


Important Discussion and Conclusions from the Paper


Heggarty et al. reaffirm the position of the earliest Indo-European speakers south of the Caucasus at ~6100 BCE. They support a hybrid model in which the steppe was a secondary staging ground for European languages. Notably, the beginning of the split of Indo-Iranian into Indo-Aryan and Iranic is dated to ~3500 BCE, a finding wholly incompatible with the Andronovo hypothesis.

DensiTree of the final IE tree from the paper, showing the probability distribution of the various topologies. Orphan branches are sampled ancient languages in the database (some examples are marked with red and yellow boxes).