Published April 3, 2026 | Version v1.0
Data paper Open

A Phylogenetic Tree of 3,397 World Languages via Multiple Sequence Alignment

  • 1. University of Tübingen

Description

I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 world languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM), trained to discriminate word pairs drawn from linguistically proximate language pairs from random word pairs, combined with a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: out of 210 candidate concepts, I find an optimal subset of k = 103 that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade, owing to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, the character matrix, and all per-concept alignments as a replication package.
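The two headline percentages follow directly from the reported figures. The sketch below reproduces that arithmetic; the function names are illustrative and are not taken from the replication package.

```python
def quartet_consistency_pct(gqd: float) -> float:
    """Quartet consistency in percent, taken as the complement of the GQD."""
    return 100.0 * (1.0 - gqd)


def share_pct(count: int, total: int) -> float:
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


# Figures as reported in the description.
consistency = round(quartet_consistency_pct(0.036), 1)  # GQD of 0.036
monophyly_rate = round(share_pct(74, 113), 1)           # 74 of 113 families

print(consistency, monophyly_rate)
```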

Files

worldtree_2026_jaeger.pdf

Files (358.2 MB)

Name Size
md5:81d795742b5e5a42e1950b63aa7de261 534.0 kB
md5:f9d083c1c0004e4c54a478f66c8c3361 357.7 MB

Additional details

Dates

Created
2026-04-03

Data quality

Accuracy

The phylogenetic tree was evaluated against the Glottolog expert classification as an external reference. The main source of inaccuracy is areal lexical signal in mainland Southeast Asia, where contact-induced vocabulary interferes with genealogical signal. Additional inaccuracies arise from encoding artefacts in the ASJP alphabet and chance similarities in short word forms. Ultrametric branch lengths are in arbitrary units and do not correspond to absolute time.
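To illustrate how nested contact languages can break family-level recovery, here is a minimal sketch of a monophyly test on a toy rooted tree written as nested tuples. It assumes the rooted-clade definition (a family is monophyletic if some clade contains exactly its members); the record does not state which definition the evaluation used, and all taxon labels are hypothetical.

```python
def clade_leaf_sets(tree):
    """Return (leaves of `tree`, leaf sets of every clade inside it)."""
    if isinstance(tree, str):            # a leaf is a plain label
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:                   # an inner node is a tuple of children
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def is_monophyletic(tree, family):
    """True if some clade of `tree` contains exactly the members of `family`."""
    _, clades = clade_leaf_sets(tree)
    return frozenset(family) in clades


# Toy tree: family A forms a clade; family B has an outsider (C1) nested inside.
toy_tree = ((("A1", "A2"), "A3"), (("B1", "C1"), "B2"))

print(is_monophyletic(toy_tree, {"A1", "A2", "A3"}))  # family A
print(is_monophyletic(toy_tree, {"B1", "B2"}))        # family B
```

In the toy tree, family B fails the test only because one foreign taxon sits inside it, which is the same pattern as the MSEA languages placed inside the Sino-Tibetan clade.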

Completeness

The dataset covers 3,397 languages drawn from 185 Lexibank datasets. Languages were included only if they had attested forms for at least 40 of the 210 candidate concepts; the median concept coverage is 88 concepts per language. All 210 concept alignments are provided, though only the top 103 concepts (by phylogenetic signal) were used for the final tree.
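The inclusion criterion can be sketched as a simple coverage filter. The data below is a toy stand-in: the language and concept labels are hypothetical, and the median is assumed here to be taken over the languages that pass the filter.

```python
from statistics import median

MIN_CONCEPTS = 40  # inclusion threshold stated above


def filter_by_coverage(attested, min_concepts=MIN_CONCEPTS):
    """Keep languages with attested forms for at least `min_concepts` concepts."""
    return {
        lang: concepts
        for lang, concepts in attested.items()
        if len(concepts) >= min_concepts
    }


# Toy input: language label -> set of concepts with an attested form.
toy_attested = {
    "lang_a": {f"concept_{i}" for i in range(120)},
    "lang_b": {f"concept_{i}" for i in range(39)},   # one short of the threshold
    "lang_c": {f"concept_{i}" for i in range(40)},   # exactly at the threshold
}

kept = filter_by_coverage(toy_attested)
median_coverage = median(len(concepts) for concepts in kept.values())

print(sorted(kept), median_coverage)
```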

Conformity

Lexical data follows the ASJP sound class encoding. Language identifiers conform to Glottolog codes. The tree is provided in standard Newick format; the character matrix in standard PHYLIP format.
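For readers who want to inspect the character matrix without extra dependencies, the following is a minimal reader sketch. It assumes a sequential (non-interleaved) PHYLIP layout with one taxon per line and whitespace between name and sequence; whether the released file uses exactly this layout, and which symbols mark missing data, are assumptions to verify against the README. The toy matrix is made up.

```python
def read_phylip(text):
    """Parse sequential PHYLIP text into a dict mapping taxon name to sequence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = map(int, lines[0].split())   # header: "<taxa> <characters>"
    matrix = {}
    for ln in lines[1:]:
        name, seq = ln.split(None, 1)              # name, then the character string
        seq = "".join(seq.split())                 # drop any internal spacing
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        matrix[name] = seq
    if len(matrix) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, got {len(matrix)}")
    return matrix


# Toy matrix; '-' and '?' stand in for missing data (an assumption).
toy_phylip = """3 5
taxon_a 01-10
taxon_b 0011?
taxon_c 11000
"""

matrix = read_phylip(toy_phylip)
print(len(matrix), matrix["taxon_a"])
```

On the released matrix, the header would be expected to report 3,397 taxa and 93,504 characters.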

Consistency

All data files were derived from a single pipeline applied uniformly to the Lexibank source data. Sound class conversion, alignment, binarisation, and phylogenetic inference follow consistent procedures throughout, with no manual intervention in the character matrix construction.

Credibility

The tree is inferred from primary lexical data using established phylogenetic methods (RAxML-NG) and evaluated against the independently produced Glottolog expert classification. All source data, intermediate files, and analysis scripts are included in the replication package, allowing full reproduction of the results.

Processability

All files are provided in standard, machine-readable formats: Newick (.nwk) for trees, PHYLIP (.phy) for the character matrix, and CSV for per-concept alignments. Language identifiers are Glottolog codes, enabling automated linking to external databases.
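As a sketch of the automated linking mentioned above, the snippet below pulls leaf labels out of a Newick string and checks them against the Glottocode shape (four lowercase alphanumerics followed by four digits). It assumes unquoted leaf labels that are bare Glottocodes, which should be confirmed against the released tree; the toy tree and its labels are invented.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")".
LEAF_LABEL = re.compile(r"[(,]\s*([^\s(),:;]+)")
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def newick_leaf_labels(newick):
    """Return leaf labels in order of appearance (unquoted labels only)."""
    return LEAF_LABEL.findall(newick)


def split_by_glottocode_shape(labels):
    """Partition labels into (well-formed, malformed) by Glottocode shape."""
    good = [lab for lab in labels if GLOTTOCODE.fullmatch(lab)]
    bad = [lab for lab in labels if not GLOTTOCODE.fullmatch(lab)]
    return good, bad


# Toy tree with made-up labels; the last one deliberately breaks the pattern.
toy_newick = "((abcd1234:0.5,efgh5678:0.5):1.0,(ijkl9012:0.7,not_a_code:0.7):0.8);"

labels = newick_leaf_labels(toy_newick)
well_formed, malformed = split_by_glottocode_shape(labels)
print(labels, malformed)
```

A shape check only confirms that a label looks like a Glottocode; confirming that it exists requires a lookup against a Glottolog release.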

Relevance

The dataset is intended for researchers in computational linguistics, historical linguistics, and linguistic typology. It provides a large-scale phylogenetic tree and the underlying character matrix suitable for comparative studies, typological analyses, and benchmarking of phylogenetic methods on linguistic data.

Timeliness

The tree is based on Lexibank data and Glottolog 4.6, both current as of the time of analysis (2025–2026). The replication package includes all intermediate files, so results can be reproduced or updated as source databases are revised.

Understandability

The replication package includes a README with step-by-step instructions for reproducing the analysis. File formats (Newick, PHYLIP, CSV) are standard and widely supported by existing tools in the field. The accompanying paper describes all processing steps, parameter choices, and evaluation methods in detail.

Study design and Methodology

Aggregation method
other
Analysis unit
other
Character set
other
Data source type
published research data

Software documentation

Application category
data analytics and processing software
Build instructions

see README

Copyright holder
Gerhard Jäger
Copyright year
2026
Development status
active
Is accessible for free
Yes