Code and data accompanying "Tracking the change that leads to typological variation: Word order universals and the phylogenetic comparative method"
Files
wordorder-pcm-fdat.zip (24.8 MB, md5:0c7c82fda76f17bfc11498dc10b01233)
Additional details
Data quality
- Accuracy
The word-order data derive from Gell-Mann & Ruhlen (2011) as the primary source. Glottolog matching proceeded in three stages: exact match after Unicode normalisation, rule-based pre-normalisation corrections for known name variants, and fuzzy matching (difflib SequenceMatcher, cutoff 0.82). All fuzzy matches were exhaustively reviewed by hand across multiple systematic passes, with family-level plausibility verified against Wikipedia; incorrect matches were explicitly excluded rather than accepted. Duplicate Glottocodes were resolved by union-merging attested word orders; entries without a valid Glottocode were discarded.
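The three matching stages described above can be sketched as follows. This is a minimal illustration and not the repository's code: the function names, the lowercasing step, the choice of NFC as the normalisation form, and the toy catalogue with placeholder Glottocodes are all assumptions. Only the use of difflib and the 0.82 cutoff come from the description.

```python
import difflib
import unicodedata

def normalise(name):
    """NFC-normalise, trim, and lowercase a language name (assumed normal form)."""
    return unicodedata.normalize("NFC", name).strip().lower()

def match_name(name, catalogue, corrections, cutoff=0.82):
    """Return (glottocode, match_type), or (None, "unmatched").

    catalogue:   normalised Glottolog name -> Glottocode
    corrections: normalised known variant  -> normalised catalogue name
    """
    key = normalise(name)
    # Stage 1: exact match after Unicode normalisation.
    if key in catalogue:
        return catalogue[key], "exact"
    # Stage 2: rule-based correction of known name variants.
    corrected = corrections.get(key)
    if corrected in catalogue:
        return catalogue[corrected], "rule"
    # Stage 3: fuzzy match; get_close_matches ranks by SequenceMatcher ratio.
    close = difflib.get_close_matches(key, list(catalogue), n=1, cutoff=cutoff)
    if close:
        return catalogue[close[0]], "fuzzy"
    return None, "unmatched"

# Toy data: the Glottocodes below are placeholders, not real identifiers.
CATALOGUE = {"german": "xxxx0001", "swahili": "xxxx0002", "basque": "xxxx0003"}
CORRECTIONS = {"kiswahili": "swahili"}

results = {
    name: match_name(name, CATALOGUE, CORRECTIONS)
    for name in ["Swahili", "Kiswahili", "Swahilli", "Zzzz"]
}
```

In this sketch a fuzzy hit is only a candidate. The record states that every fuzzy match was reviewed by hand and that incorrect ones were excluded, which no automatic cutoff replaces.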
Phylogenetic trees were inferred with MrBayes under strict convergence criteria: average standard deviation of split frequencies (ASDSF) < 0.01 and maximum potential scale reduction factor (PSRF) ≤ 1.1 across all parameters. All 107 families reached full convergence. Each family's posterior is represented by 50 evenly subsampled post-burnin trees.
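As an illustration of the two steps just described, the sketch below checks the stated convergence thresholds and draws an evenly spaced subsample of 50 trees. The function names are hypothetical, and the spacing rule (keeping the first and last post-burnin tree) is an assumption; the archive's own scripts may space the sample differently.

```python
def converged(asdsf, psrf_values, asdsf_max=0.01, psrf_max=1.1):
    """True if ASDSF < asdsf_max and the largest PSRF is <= psrf_max."""
    return asdsf < asdsf_max and max(psrf_values) <= psrf_max

def evenly_subsample(items, k=50):
    """Pick k items at evenly spaced positions from a post-burnin sample.

    Assumption: the first and last items are always kept. Integer
    arithmetic avoids floating-point rounding surprises.
    """
    items = list(items)
    n = len(items)
    if n <= k:
        return items
    if k == 1:
        return [items[0]]
    return [items[(i * (n - 1)) // (k - 1)] for i in range(k)]

# Stand-in for a list of post-burnin trees: integers 0..999.
post_burnin = list(range(1000))
sample = evenly_subsample(post_burnin, k=50)
```

A family would enter the posterior sample only if `converged(...)` holds for its MrBayes run, for example `converged(0.005, [1.00, 1.05])`.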
The CTMC posterior is based on 4 independent chains (500 warmup + 1,500 post-warmup samples each, i.e. 6,000 post-warmup samples in total). Sensitivity of results to prior hyperparameters was assessed on a grid of 8 prior configurations (4 Dirichlet concentration values × 2 LogNormal rate-mean values), that is, 7 configurations in addition to the one used for the main analysis; equilibrium distributions are consistent across all configurations.
- Completeness
The Gell-Mann & Ruhlen (2011) supplementary appendix lists 2,136 language entries. Of these, 520 could not be matched to any Glottolog languoid and were discarded. The remaining 1,616 matched entries were reduced to 1,567 by removing Glottolog Bookkeeping entries (unclassifiable or spurious languoids) and by collapsing 2 pairs of duplicate Glottocodes through union-merging of their attested word orders. The final sample of 1,567 languages is therefore complete with respect to the identifiable and classifiable portion of the Gell-Mann & Ruhlen list.
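The counts above, and the union-merge of duplicate Glottocodes, can be illustrated with a short sketch. The `union_merge` helper and the placeholder Glottocodes are hypothetical; only the entry counts are taken from the text.

```python
from collections import defaultdict

def union_merge(entries):
    """Merge (glottocode, word_orders) pairs that share a Glottocode.

    Entries without a Glottocode are dropped; word orders attested for
    the same Glottocode are combined as a set union.
    """
    merged = defaultdict(set)
    for code, orders in entries:
        if not code:
            continue
        merged[code].update(orders)
    return {code: sorted(orders) for code, orders in merged.items()}

# Placeholder Glottocodes, for illustration only.
entries = [
    ("xxxx0001", ["SOV"]),
    ("xxxx0001", ["SVO", "SOV"]),  # duplicate Glottocode: orders are unioned
    ("xxxx0002", ["VSO"]),
    (None, ["SVO"]),               # no valid Glottocode: discarded
]
merged = union_merge(entries)

# Bookkeeping of the sample sizes reported in the text.
listed, unmatched, final = 2136, 520, 1567
matched = listed - unmatched             # entries with a Glottolog match
dropped_after_matching = matched - final # Bookkeeping removals and duplicate collapses combined
```

Under these figures, `matched` is 1,616 and 49 entries are lost between matching and the final sample.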
All 107 multi-member families have a full posterior tree sample (50 trees each) and are represented in the CTMC input. All generated figures, the main CTMC posterior, and the sensitivity posteriors for all 8 prior configurations are included.
The following inputs to the pipeline are not included in this archive: the Gell-Mann & Ruhlen (2011) supplementary PDF (under journal copyright), the Lexibank pruned wordlist, the Glottolog v5.3 languoid table, and the pair-HMM parameters. All of these are downloaded automatically by code/get_data.r from publicly available sources. The conda environment (env/) is excluded but fully reproducible from environment.yml.
- Conformity
Languages are identified throughout by Glottocodes, the standard identifiers of the Glottolog reference catalogue (v5.3). Phylogenetic tree files use standard formats: posterior tree samples are stored as APE-compatible NEXUS multiPhylo objects (.trees); consensus trees are in Newick format (.nwk); MrBayes input files follow the NEXUS standard. Lexical character matrices are in PHYLIP format (.phy), the standard input format for phylogenetic inference software. Tabular data are stored as plain CSV with a header row. The CTMC model input is stored as JSON. All file encodings are UTF-8. The repository is licensed under the MIT License.
- Consistency
Language identifiers (Glottocodes) are used consistently across all files: gellmann_ruhlen_filtered.csv, families.csv, the per-family directory names, and the CTMC model input. The six word-order states (SOV, SVO, VSO, VOS, OVS, OSV) are encoded identically in the source data, the tip likelihood vectors in ctmc_wordorder_data.json, and the CTMC posterior column labels. Branch lengths are consistently expressed in kiloyears (kya) across all posterior tree files. The sensitivity posteriors follow the same column structure as the main posterior (ctmc_nuts_posterior.csv) and differ only in the prior hyperparameters used during inference.
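To make the idea of a tip likelihood vector over the six states concrete, here is a hedged sketch. It assumes a 0/1 indicator encoding in the state order listed above, with every attested order set to 1; the actual layout of ctmc_wordorder_data.json is documented in the repository and may differ.

```python
# State order as listed in the text; the order used in the JSON file is an assumption.
STATES = ("SOV", "SVO", "VSO", "VOS", "OVS", "OSV")

def tip_likelihood(attested):
    """Indicator vector over STATES: 1.0 for each attested order, else 0.0."""
    attested = set(attested)
    if not attested:
        raise ValueError("at least one attested word order is required")
    unknown = attested - set(STATES)
    if unknown:
        raise ValueError(f"unknown word-order labels: {sorted(unknown)}")
    return [1.0 if state in attested else 0.0 for state in STATES]

single = tip_likelihood({"SOV"})
mixed = tip_likelihood({"SVO", "VSO"})
```

A language with several attested orders, such as one produced by union-merging, thus contributes a vector with more than one non-zero entry, so no single state is privileged at that tip.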
- Processability
All data are stored in open, non-proprietary formats readable with standard scientific software. Tabular data (CSV) can be read with any spreadsheet application or data analysis environment (R, Python, Julia). Posterior tree samples (NEXUS multiPhylo) and consensus trees (Newick) can be processed with standard phylogenetics packages such as R's ape, phytools, or phangorn. PHYLIP character matrices are accepted by all major phylogenetic inference programs. The CTMC model input (JSON) and posterior (CSV) can be read directly in R, Python, or Julia without custom parsers. All analysis code is included in the code/ directory; software requirements and exact execution commands are documented in workflow.md and README.md. The R environment is fully specified in environment.yml and installable via conda/mamba; Julia dependencies are pinned in code/Project.toml and code/Manifest.toml.
- Relevance
The dataset supports research on cross-linguistic word-order universals and their diachronic dynamics. It provides a large-scale Bayesian phylogenetic analysis of word-order change covering 1,567 languages from 107 genealogical families and 124 singletons, making it one of the most comprehensive quantitative treatments of word-order evolution to date. The data are relevant to linguistic typologists, historical linguists, and researchers applying phylogenetic comparative methods to language. Beyond the specific findings reported in the accompanying paper, the posterior tree samples, lexical character matrices, and CTMC posterior are reusable for further analyses of word-order change, ancestral state reconstruction, or methodological comparisons. The sensitivity posteriors allow independent assessment of the robustness of the results to prior specification.
- Timeliness
The word-order classifications derive from Gell-Mann & Ruhlen (2011) and reflect the state of typological knowledge at that time. Language classifications and family memberships are based on Glottolog v5.3 (2024). Lexical data used for phylogenetic inference are drawn from the Lexibank pruned wordlist as distributed in the worldtree-replication v1.0 release. Root age calibrations follow Bouckaert et al. (2022). The data processing, phylogenetic inference, and statistical analysis were carried out in 2025–2026. For the 141 languages in the sample with a known last year of documentation (ranging from 3000 BCE to 2024 CE), the Last_Attested field in gellmann_ruhlen_filtered.csv records this information as sourced from Glottolog.
- Understandability
The repository includes three levels of documentation. README.md describes the overall study, repository structure, data availability, software requirements, and the high-level reproduction sequence. workflow.md provides a detailed step-by-step execution guide for every pipeline stage, including the inputs, outputs, and purpose of each script. supplement.md contains full technical documentation of the statistical methodology, model parameterisation, prior specification, convergence diagnostics, and sensitivity analysis results. The compiled manuscript (paper/jaeger_zs_2026.pdf) provides the scientific context for all data and results. Column names in tabular files are self-documenting (e.g. SOV, SVO, Glottocode, Match_Type); the structure of the CTMC posterior (r1–r30, pi1–pi6, tree_idx*, chain, iter) is explained in workflow.md. Variable encodings and filtering decisions are documented in the source code via inline comments.
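The posterior column layout mentioned above (r1–r30, pi1–pi6, tree_idx*, chain, iter) can be handled with the standard library alone. The sketch below groups columns by name and averages the pi columns over draws. It runs on a synthetic two-row table built in memory; the values and the column name `tree_idx1` are invented for illustration, and the authoritative description of the columns remains workflow.md. Note that 30 equals 6 × 5, which is consistent with one rate per ordered pair of distinct word-order states, though that reading is an inference here.

```python
import csv
import io
import re
from statistics import fmean

def classify_columns(header):
    """Group posterior column names by the naming pattern given in the text."""
    groups = {"rates": [], "equilibrium": [], "tree_idx": [], "other": []}
    for name in header:
        if re.fullmatch(r"r\d+", name):
            groups["rates"].append(name)
        elif re.fullmatch(r"pi\d+", name):
            groups["equilibrium"].append(name)
        elif name.startswith("tree_idx"):
            groups["tree_idx"].append(name)
        else:
            groups["other"].append(name)
    return groups

def column_means(rows, columns):
    """Mean over draws for each requested column."""
    return {c: fmean(float(row[c]) for row in rows) for c in columns}

# Synthetic stand-in for ctmc_nuts_posterior.csv (two invented draws).
header = ([f"r{i}" for i in range(1, 31)]
          + [f"pi{i}" for i in range(1, 7)]
          + ["tree_idx1", "chain", "iter"])
draw_a = [1.0] * 30 + [0.5, 0.1, 0.1, 0.1, 0.1, 0.1] + [1, 1, 1]
draw_b = [2.0] * 30 + [0.3, 0.2, 0.2, 0.1, 0.1, 0.1] + [2, 1, 2]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows([draw_a, draw_b])
buffer.seek(0)

reader = csv.DictReader(buffer)
rows = list(reader)
groups = classify_columns(reader.fieldnames)
pi_means = column_means(rows, groups["equilibrium"])
```

To apply this to the real file, the in-memory buffer would be replaced by `open(path, newline="")` on the posterior CSV.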