Published April 3, 2026 | Version v1.0
Data paper Open

A Phylogenetic Tree of 3,397 World Languages via Multiple Sequence Alignment

  • 1. University of Tübingen

Description

I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 world languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM), trained to discriminate word pairs from linguistically proximate language pairs against random pairs, together with a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: I find an optimal subset of k = 103 concepts (out of 210 candidates) that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade due to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, character matrix, and all per-concept alignments as a replication package.
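To make the evaluation metric concrete, the following is a minimal Python sketch of a generalised quartet distance on toy trees. It is not the code used to produce the reported figures. It assumes the common definition in which only quartets resolved in the gold-standard tree are counted, and the distance is the share of those quartets that the estimated tree resolves differently or leaves unresolved. The nested-tuple tree encoding, the function names (`clades`, `quartet_split`, `gqd`), and the six-leaf example trees are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def clades(tree):
    """Return the leaf set under every internal node of a nested-tuple tree.

    Leaves are strings; internal nodes are tuples of children (illustrative
    encoding, not a format used by the replication package).
    """
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found  # post-order, so the root clade comes last


def quartet_split(clade_list, quartet):
    """Return the 2|2 split a tree induces on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def gqd(gold, estimate):
    """Share of gold-resolved quartets that the estimate does not reproduce."""
    gold_clades, est_clades = clades(gold), clades(estimate)
    leaves = sorted(gold_clades[-1])  # root clade holds all leaves
    resolved = different = 0
    for quartet in combinations(leaves, 4):
        gold_split = quartet_split(gold_clades, quartet)
        if gold_split is None:
            continue  # gold standard is agnostic about this quartet
        resolved += 1
        if quartet_split(est_clades, quartet) != gold_split:
            different += 1
    return different / resolved if resolved else 0.0


# Toy gold standard: three two-language families under an unresolved root.
gold = (("a", "b"), ("c", "d"), ("e", "f"))
# Toy estimate: family (c, d) is broken up, as happens with areal signal.
estimate = ((("a", "b"), "c"), (("d", "e"), "f"))

distance = gqd(gold, estimate)
print(f"GQD = {distance:.3f}, quartet consistency = {1 - distance:.1%}")
```

Under this definition, quartet consistency is simply 1 minus the GQD, which is how a GQD of 0.036 corresponds to 96.4%. Enumerating all quartets is only feasible for small trees; a tree with 3,397 taxa would need a dedicated quartet-counting tool or sampling.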

Files

worldtree_2026_jaeger.pdf
Files (341.6 MiB)
Name Size
md5:f9d083c1c0004e4c54a478f66c8c3361
341.1 MiB
md5:81d795742b5e5a42e1950b63aa7de261
521.5 KiB

Additional details

Created:
April 8, 2026
Modified:
April 8, 2026