A Phylogenetic Tree of 3,397 World Languages via Multiple Sequence Alignment

Jäger, Gerhard

doi:10.57754/FDAT.ccfpp-z0113

Published April 3, 2026 | Version v1.0

Data paper | Open access

Jäger, Gerhard (Contact person)¹

1. University of Tübingen

I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 world languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM), trained to discriminate word pairs from linguistically proximate language pairs against random pairs, and a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: out of 210 candidate concepts, I identify an optimal subset of k = 103 that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade, owing to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, the character matrix, and all per-concept alignments as a replication package.
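The family-level figure above rests on a monophyly test: a family counts as recovered if some clade of the tree contains exactly its member languages. The following is a minimal sketch of such a test, not the author's implementation. It assumes a rooted tree, and the Newick string, the taxon labels, and the function names `parse_newick`, `clades`, and `is_monophyletic` are all hypothetical; the criterion actually used is documented in the replication package and may differ.

```python
def parse_newick(s):
    """Parse a Newick string into nested tuples; leaves are label strings.

    Handles unquoted labels and branch lengths only (a simplification).
    """
    s = s.strip().rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        return s[start:pos]

    def skip_branch_length():
        nonlocal pos
        if pos < len(s) and s[pos] == ":":
            pos += 1
            while pos < len(s) and s[pos] not in ",()":
                pos += 1

    def read_node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume "("
            children = [read_node()]
            while s[pos] == ",":
                pos += 1  # consume ","
                children.append(read_node())
            pos += 1  # consume ")"
            read_label()  # internal node labels are ignored
            skip_branch_length()
            return tuple(children)
        label = read_label()
        skip_branch_length()
        return label

    return read_node()


def clades(node, out):
    """Add the leaf set of every clade below `node` to `out`; return this node's leaf set."""
    if isinstance(node, str):
        leaves = frozenset([node])
    else:
        leaves = frozenset().union(*(clades(child, out) for child in node))
    out.add(leaves)
    return leaves


def is_monophyletic(newick, taxa):
    """True if some clade of the rooted tree has exactly `taxa` as its leaves."""
    found = set()
    clades(parse_newick(newick), found)
    return frozenset(taxa) in found


# Hypothetical toy tree: family A forms a clade, family B is split by C1.
tree = "((A1:1,A2:1):2,((B1:1,C1:1):1,B2:2):1);"
family_a_ok = is_monophyletic(tree, {"A1", "A2"})
family_b_ok = is_monophyletic(tree, {"B1", "B2"})

# Arithmetic behind the reported summary figures.
family_share = round(100 * 74 / 113, 1)           # share of monophyletic families
quartet_consistency = round(100 * (1 - 0.036), 1)  # 1 - GQD, in percent
```

In this toy tree family A is recovered and family B is not, because C1 sits inside the smallest clade containing B1 and B2. The recursive parser is adequate for small examples; for a tree with 3,397 tips an established tree library would be the safer choice.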

Files

Files (358.2 MB)

Name	Size	MD5
worldtree_2026_jaeger.pdf	534.0 kB	81d795742b5e5a42e1950b63aa7de261
worldtree_replication.zip	357.7 MB	f9d083c1c0004e4c54a478f66c8c3361

Additional details

Created: 2026-04-03

Accuracy: The phylogenetic tree was evaluated against the Glottolog expert classification as an external reference. The main source of inaccuracy is areal lexical signal in mainland Southeast Asia, where contact-induced vocabulary interferes with genealogical signal. Additional inaccuracies arise from encoding artefacts in the ASJP alphabet and chance similarities in short word forms. Ultrametric branch lengths are in arbitrary units and do not correspond to absolute time.
Completeness: The dataset covers 3,397 languages drawn from 185 Lexibank datasets. Languages were included only if they had attested forms for at least 40 of the 210 candidate concepts; the median concept coverage is 88 concepts per language. All 210 concept alignments are provided, though only the top 103 concepts (by phylogenetic signal) were used for the final tree.
Conformity: Lexical data follows the ASJP sound class encoding. Language identifiers conform to Glottolog codes. The tree is provided in standard Newick format; the character matrix in standard PHYLIP format.
Consistency: All data files were derived from a single pipeline applied uniformly to the Lexibank source data. Sound class conversion, alignment, binarisation, and phylogenetic inference follow consistent procedures throughout, with no manual intervention in the character matrix construction.
Credibility: The tree is inferred from primary lexical data using established phylogenetic methods (RAxML-NG) and evaluated against the independently produced Glottolog expert classification. All source data, intermediate files, and analysis scripts are included in the replication package, allowing full reproduction of the results.
Processability: All files are provided in standard, machine-readable formats: Newick (.nwk) for trees, PHYLIP (.phy) for the character matrix, and CSV for per-concept alignments. Language identifiers are Glottolog codes, enabling automated linking to external databases.
Relevance: The dataset is intended for researchers in computational linguistics, historical linguistics, and linguistic typology. It provides a large-scale phylogenetic tree and the underlying character matrix suitable for comparative studies, typological analyses, and benchmarking of phylogenetic methods on linguistic data.
Timeliness: The tree is based on Lexibank data and Glottolog 4.6, both current as of the time of analysis (2025–2026). The replication package includes all intermediate files, so results can be reproduced or updated as source databases are revised.
Understandability: The replication package includes a README with step-by-step instructions for reproducing the analysis. File formats (Newick, PHYLIP, CSV) are standard and widely supported by existing tools in the field. The accompanying paper describes all processing steps, parameter choices, and evaluation methods in detail.
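As an illustration of how the PHYLIP character matrix might be read without specialised tools, here is a minimal sketch. It assumes the relaxed sequential PHYLIP layout (a header line with the number of taxa and characters, then one line per taxon with a label followed by its character string); the layout of the released file should be checked against the README. The function name `read_phylip`, the toy labels, and the toy data are hypothetical.

```python
def read_phylip(text):
    """Read a relaxed sequential PHYLIP matrix into a dict: label -> character string."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(x) for x in lines[0].split())

    matrix = {}
    for ln in lines[1:]:
        # Label and characters are separated by whitespace (relaxed PHYLIP assumption).
        name, seq = ln.split(maxsplit=1)
        seq = "".join(seq.split())  # drop any spacing inside the character string
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        matrix[name] = seq

    if len(matrix) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, got {len(matrix)}")
    return matrix


# Hypothetical toy matrix: 3 taxa, 5 binary characters; "?" and "-" mark missing data.
toy = """3 5
taxon001 01?10
taxon002 0110-
taxon003 11010
"""
matrix = read_phylip(toy)

# Number of characters coded as present ("1") per taxon.
ones_per_taxon = {name: seq.count("1") for name, seq in matrix.items()}
```

For the released matrix, the header would be expected to report 3,397 taxa and 93,504 characters, which gives a quick integrity check after download. The file name inside the archive is not stated here, so it is deliberately left out of the sketch.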

Aggregation method: other
Analysis unit: other
Character set: other
Data source type: published research data

Application category: data analytics and processing software
Build instructions: see README
Copyright holder: Gerhard Jäger
Development status: active
Is accessible for free: Yes

	All versions	This version
Views	16	16
Downloads	15	15
Data volume	1.1 GB	1.1 GB
