Phylogenetic decorrelation of ordinal cultural variables from the Ethnographic Atlas
Description
Code and data for phylogenetic decorrelation of ordinal cultural variables prior to statistical modelling. For each of seven variables from the Ethnographic Atlas (Murdock 1967), a Bayesian latent-variable model is fitted in Stan. The model decomposes the underlying continuous trait into a Brownian motion component, which captures phylogenetically structured variation along language-family trees inferred from ASJP lexical data, and an independent individual-level residual. The posterior means of the residuals serve as phylogenetically decorrelated predictors for downstream analyses. Pre-computed phylogenetic trees and fitted Stan models are included so that the full workflow can be reproduced without re-running the computationally intensive inference steps.
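The decomposition described above can be written down schematically. The following is only a sketch under stated assumptions: the notation, the Gaussian residual, and the threshold (ordered-probit style) link are illustrative choices and are not taken from the deposit; the accompanying paper is the authoritative description of the model.

```latex
\begin{aligned}
z_i &= u_i + \varepsilon_i
  && \text{latent continuous trait of society } i \\
\mathbf{u} &\sim \mathcal{N}\!\left(\mathbf{0},\; \sigma^2 \mathbf{C}\right)
  && \text{Brownian motion component} \\
\varepsilon_i &\overset{\text{iid}}{\sim} \mathcal{N}\!\left(0,\; \tau^2\right)
  && \text{independent residual} \\
y_i = k &\iff c_{k-1} < z_i \le c_k,
  \quad k = 1, \dots, K
  && \text{observed ordinal category}
\end{aligned}
```

Here $C_{ij}$ is assumed to be the branch length shared by societies $i$ and $j$ between the root and their most recent common ancestor, so that $C_{ij} = 0$ for societies in different language-family trees, and $-\infty = c_0 < c_1 < \dots < c_K = +\infty$ are cutpoints. In a model of this kind the overall scale of $z_i$ is not identified from ordinal data, so one scale is typically fixed. Under this notation, the decorrelated predictor for society $i$ would be the posterior mean $\mathbb{E}[\varepsilon_i \mid \mathbf{y}]$.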
Files
| Name | Size | MD5 |
|---|---|---|
| ordinal_decorrelation_fdat.zip | 5.2 GB | 35b59adfc63ffc88a551cc3aab27cc6f |
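Since the archive is large, it may be worth checking the download against the MD5 checksum listed above before unpacking. The following is a minimal sketch, assuming GNU `md5sum` is available (as on most Linux systems); the helper name `verify_md5` is hypothetical and not part of the deposit.

```shell
#!/usr/bin/env bash
# Sketch: verify a downloaded file against a published MD5 checksum.
# Assumes GNU coreutils `md5sum`; `verify_md5` is a hypothetical helper.

verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<hash>  <filename>"; keep only the hash.
    actual="$(md5sum "$file" | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got '$actual')" >&2
        return 1
    fi
}

# File name and checksum as listed for this deposit.
ARCHIVE="ordinal_decorrelation_fdat.zip"
EXPECTED="35b59adfc63ffc88a551cc3aab27cc6f"

# Only run the check if the archive is in the current directory.
if [ -f "$ARCHIVE" ]; then
    verify_md5 "$ARCHIVE" "$EXPECTED"
fi
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.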
Additional details
Data quality
- Accuracy
- The input data (Ethnographic Atlas variables, ASJP lexical data, Glottolog classifications) are drawn from established, curated, publicly available sources. The derived outputs (phylogenetic trees, decorrelated residuals) have been validated by comparison with the fitted Stan models. The pre-computed Stan model fits reproduce the stored residuals to within Monte Carlo error.
- Completeness
- Complete: all files necessary to reproduce the analysis are included (input data, pre-computed phylogenetic trees, fitted Stan models, and derived outputs, i.e. the decorrelated residuals). The only step that cannot be fully reproduced from the deposit alone is the MrBayes phylogenetic inference: the ASJP character matrices it requires are included, but the run takes several days on HPC hardware, so pre-computed trees are provided as a substitute.
- Conformity
- The deposit conforms to the FAIR data principles. All input data are drawn from publicly available, versioned sources (D-PLACE, ASJP v19, Glottolog) with documented licenses. File formats are open and non-proprietary (CSV, Newick, R binary via RDS). The replication workflow is fully documented in the README, and all software dependencies are specified in a machine-readable environment file (environment.yml).
- Consistency
- The stored derived outputs (phylogenetic trees, decorrelated residuals) are internally consistent with the fitted Stan models included in the deposit. This was verified by re-extracting posterior means from the model files and comparing them against the output CSV. Differences are within Monte Carlo error for all seven variables.
- Credibility
- The method rests on established Bayesian phylogenetic inference (MrBayes) and probabilistic latent-variable modelling (Stan). Model selection was performed via leave-one-out cross-validation. The phylogenetic trees are inferred from ASJP lexical data, a widely used resource in computational historical linguistics. The decorrelation approach is described in full in the accompanying paper (Mertner & Jäger, in preparation).
- Processability
- All data files are in standard, open formats readable without specialist software: CSV for tabular data, Newick for phylogenetic trees. The computational environment is fully specified in environment.yml and can be recreated with a single Mamba command. R and Julia scripts are provided for each step of the workflow, with pre-computed intermediate outputs included so that individual steps can be run in isolation.
- Relevance
- The deposit supports research on phylogenetic non-independence in cross-linguistic data, a methodological issue relevant to linguistic typology, cultural evolution, and the comparative study of human societies. In the accompanying work, the decorrelated residuals are used as predictors in a spatial regression model of linguistic diversity in Africa. The method is, however, applicable to any study that combines ordinal cultural or linguistic variables from the Ethnographic Atlas or similar cross-societal databases with phylogenetic information.
- Timeliness
- The deposit accompanies a paper currently in preparation (Mertner & Jäger). The data and code are made available prior to publication to support open science practices and to allow independent verification of the results.
- Understandability
- The README provides a step-by-step description of the replication workflow, documenting the purpose, expected runtime, inputs, and outputs of each step. The statistical model is described in the accompanying paper's supplementary materials. Variable names and column definitions are documented in the data schema file (data_with_latent_variables.csv.schema.json). A companion website with additional methodological detail is available at https://profgerhard.de/ordinal_decorrelation/.
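The phrase "within Monte Carlo error", used under Accuracy and Consistency above, can be made concrete in the following way. This is one common way of stating such a check, not necessarily the criterion applied for this deposit; the notation and the tolerance factor $\kappa$ are assumptions introduced for illustration.

```latex
\bar{\varepsilon}_i = \frac{1}{S} \sum_{s=1}^{S} \varepsilon_i^{(s)},
\qquad
\mathrm{MCSE}_i = \frac{\widehat{\mathrm{sd}}\!\left(\varepsilon_i\right)}{\sqrt{\mathrm{ESS}_i}},
\qquad
\left| \bar{\varepsilon}_i - \hat{\varepsilon}_i^{\,\mathrm{stored}} \right|
\;\le\; \kappa \cdot \mathrm{MCSE}_i
```

Here $\varepsilon_i^{(s)}$ is the $s$-th of $S$ posterior draws of the residual for society $i$, $\widehat{\mathrm{sd}}$ is the posterior standard deviation estimated from those draws, $\mathrm{ESS}_i$ is the effective sample size, $\hat{\varepsilon}_i^{\,\mathrm{stored}}$ is the value in the output CSV, and $\kappa$ is a small constant such as 2 or 3.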
Software documentation
- Application category
- data analytics and processing software
- Build instructions
- See environment.yml for all software dependencies. The computational environment can be recreated with:
      mamba env create -f environment.yml
      mamba activate ordinal-decorr
  The Julia environment is set up with:
      cd code
      julia --project=. -e "using Pkg; Pkg.instantiate()"
  Detailed replication instructions are in the README.
- Code repository
- The code is maintained in this FDAT deposit. A companion website documenting the method is available at https://profgerhard.de/ordinal_decorrelation/.
- Copyright holder
- Gerhard Jäger
- Copyright year
- 2026
- Is accessible for free
- Yes
- Maintainer
- 0000-0002-9642-9359
- Operating system
- linux platform
- Programming language
- julia, r