A Phylogenetic Tree of 3,397 World Languages via Multiple Sequence Alignment
Description
I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 world languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM), trained to discriminate word pairs drawn from linguistically proximate language pairs from random word pairs, together with a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: I find an optimal subset of k = 103 concepts (out of 210 candidates) that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade, owing to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, character matrix, and all per-concept alignments as a replication package.
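The two headline figures above can be illustrated with a small sketch. It is not the evaluation code from the replication package: the toy tree, the language labels, and the helper names `clade_leaf_sets` and `is_monophyletic` are hypothetical. It also assumes a rooted reading of monophyly (a family counts as recovered when some clade contains exactly its members), which the description does not spell out.

```python
# Toy rooted tree as nested tuples; leaves are hypothetical language labels.
toy_tree = ((("german", "dutch"), "hindi"), (("thai", "mandarin"), "tibetan"))

# Hypothetical reference classification (stand-in for Glottolog families).
families = {
    "Indo-European": {"german", "dutch", "hindi"},
    "Sino-Tibetan": {"mandarin", "tibetan"},
}


def clade_leaf_sets(tree):
    """Return (leaves under this node, leaf sets of all clades below and including it)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def is_monophyletic(tree, members):
    """True if some clade of the tree contains exactly the given members."""
    _, clades = clade_leaf_sets(tree)
    return frozenset(members) in clades


recovered = {name: is_monophyletic(toy_tree, m) for name, m in families.items()}
print(recovered)

# Arithmetic behind the reported summary figures.
gqd = 0.036
quartet_consistency = 1.0 - gqd          # 0.964, i.e. 96.4%
family_recovery = 74 / 113               # about 0.655, i.e. 65.5%
print(round(quartet_consistency * 100, 1), round(family_recovery * 100, 1))
```

In the toy tree, the intrusion of `thai` into the group containing `mandarin` and `tibetan` breaks the monophyly of the second family, a miniature analogue of the MSEA misplacement described above.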
Files
Additional details
- Created
2026-04-03
- Accuracy
The phylogenetic tree was evaluated against the Glottolog expert classification as an external reference. The main source of inaccuracy is areal lexical signal in mainland Southeast Asia, where contact-induced vocabulary interferes with genealogical signal. Additional inaccuracies arise from encoding artefacts in the ASJP alphabet and chance similarities in short word forms. Ultrametric branch lengths are in arbitrary units and do not correspond to absolute time.
- Completeness
The dataset covers 3,397 languages drawn from 185 Lexibank datasets. Languages were included only if they had attested forms for at least 40 of the 210 candidate concepts; the median concept coverage is 88 concepts per language. All 210 concept alignments are provided, though only the top 103 concepts (by phylogenetic signal) were used for the final tree.
- Conformity
Lexical data follows the ASJP sound class encoding. Language identifiers conform to Glottolog codes. The tree is provided in standard Newick format; the character matrix in standard PHYLIP format.
- Consistency
All data files were derived from a single pipeline applied uniformly to the Lexibank source data. Sound class conversion, alignment, binarisation, and phylogenetic inference follow consistent procedures throughout, with no manual intervention in the character matrix construction.
- Credibility
The tree is inferred from primary lexical data using established phylogenetic methods (RAxML-NG) and evaluated against the independently produced Glottolog expert classification. All source data, intermediate files, and analysis scripts are included in the replication package, allowing full reproduction of the results.
- Processability
All files are provided in standard, machine-readable formats: Newick (.nwk) for trees, PHYLIP (.phy) for the character matrix, and CSV for per-concept alignments. Language identifiers are Glottolog codes, enabling automated linking to external databases.
- Relevance
The dataset is intended for researchers in computational linguistics, historical linguistics, and linguistic typology. It provides a large-scale phylogenetic tree and the underlying character matrix suitable for comparative studies, typological analyses, and benchmarking of phylogenetic methods on linguistic data.
- Timeliness
The tree is based on Lexibank data and Glottolog 4.6, both current as of the time of analysis (2025–2026). The replication package includes all intermediate files, so results can be reproduced or updated as source databases are revised.
- Understandability
The replication package includes a README with step-by-step instructions for reproducing the analysis. File formats (Newick, PHYLIP, CSV) are standard and widely supported by existing tools in the field. The accompanying paper describes all processing steps, parameter choices, and evaluation methods in detail.
- Aggregation method
- other
- Analysis unit
- other
- Character set
- other
- Data File format
- zip, pdf (pdf/a)
- Data source type
- published research data
- Application category
- data analytics and processing software
- Build instructions
see README
- Copyright holder
- Gerhard Jäger
- Copyright year
- 2026
- Development status
- active
- Is accessible for free
- Yes
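As a rough illustration of how the PHYLIP character matrix mentioned under Conformity and Processability might be loaded, the following sketch parses a small matrix from a string using only the standard library. It assumes a relaxed sequential PHYLIP layout (a header line with taxon and character counts, then one `name sequence` line per taxon) and `-` or `?` as missing-data symbols. The labels and values are invented, and the actual file in the replication package may differ in detail, so check the README first.

```python
def parse_phylip(text):
    """Parse relaxed sequential PHYLIP text into a dict: taxon label -> character string."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(x) for x in lines[0].split())
    matrix = {}
    for ln in lines[1:]:
        name, seq = ln.split(None, 1)
        seq = "".join(seq.split())  # drop any spacing inside the sequence
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        matrix[name] = seq
    if len(matrix) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, got {len(matrix)}")
    return matrix


# Invented example: three taxa with placeholder labels, six binary characters.
example = """3 6
abcd1234 0101-1
efgh1234 1100-0
ijkl1234 0--1-1
"""

matrix = parse_phylip(example)

# Per-taxon counts of present (1) and missing (- or ?) characters.
present = {name: seq.count("1") for name, seq in matrix.items()}
missing = {name: seq.count("-") + seq.count("?") for name, seq in matrix.items()}
print(present)
print(missing)
```

For the full matrix (3,397 taxa, 93,504 characters), a dedicated phylogenetics library is likely a better choice than a hand-written parser.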