Published May 25, 2026 | Version v1
Dataset Open

Phylogenetic decorrelation of ordinal cultural variables from the Ethnographic Atlas

  • 1. University of Tübingen

Description

Code and data for phylogenetic decorrelation of ordinal cultural variables prior to statistical modelling. For each of seven variables from the Ethnographic Atlas (Murdock 1967), a Bayesian latent-variable model is fitted in Stan. The model decomposes the underlying continuous trait into a Brownian motion component, which captures phylogenetically structured variation along language-family trees inferred from ASJP lexical data, and an independent individual-level residual. The posterior means of the residuals serve as phylogenetically decorrelated predictors for downstream analyses. Pre-computed phylogenetic trees and fitted Stan models are included so that the full workflow can be reproduced without re-running the computationally intensive inference steps.
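
As a reading aid, the decomposition can be sketched in equations. The notation is an assumption made here for illustration, not taken from the deposit: a cumulative-threshold link with cutpoints c_k, variance parameters sigma_b and sigma_epsilon, and a phylogenetic covariance matrix C whose entry C_ij is the shared branch length between societies i and j (taken to be zero for societies in different language families). The exact parameterisation is the one in the Stan code and the accompanying paper.

```latex
\begin{align}
z_i &= b_i + \varepsilon_i, \\
\mathbf{b} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_b^2\, \mathbf{C}\right), \\
\varepsilon_i &\overset{\text{iid}}{\sim} \mathcal{N}\!\left(0,\, \sigma_\varepsilon^2\right), \\
y_i = k &\iff c_{k-1} < z_i \le c_k,
  \qquad -\infty = c_0 < c_1 < \dots < c_K = +\infty, \\
\hat{\varepsilon}_i &= \mathbb{E}\!\left[\varepsilon_i \mid \mathbf{y}\right].
\end{align}
```

Here y_i is the observed ordinal category for society i, z_i the latent continuous trait, b_i the Brownian motion (phylogenetic) component, and the posterior mean of epsilon_i the decorrelated predictor passed to downstream analyses.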

 

Files

ordinal_decorrelation_fdat.zip (5.2 GB)
md5:35b59adfc63ffc88a551cc3aab27cc6f
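
The MD5 digest listed above can be used to check that the archive was downloaded intact. The following is a minimal sketch, assuming a Linux system with GNU coreutils (md5sum); the helper name verify_md5 is hypothetical and not part of the deposit.

```shell
#!/bin/sh
# Hypothetical helper (not part of the deposit): succeeds only if the MD5
# digest of the file given as $1 equals the expected hex digest given as $2.
verify_md5() {
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Usage, run from the directory containing the downloaded archive:
#   verify_md5 ordinal_decorrelation_fdat.zip 35b59adfc63ffc88a551cc3aab27cc6f \
#       && echo "checksum OK" || echo "checksum mismatch"
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again before unpacking.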

Additional details

Data quality

Accuracy

The input data (Ethnographic Atlas variables, ASJP lexical data, Glottolog classifications) are drawn from established, curated, publicly available sources. The derived outputs (phylogenetic trees, decorrelated residuals) have been validated by comparison with the fitted Stan models. The pre-computed Stan model fits reproduce the stored residuals to within Monte Carlo error.

Completeness

Complete: all files necessary to reproduce the analysis are included, namely input data, pre-computed phylogenetic trees, fitted Stan models, and derived outputs (decorrelated residuals). The only step that cannot readily be reproduced from the deposit alone is the MrBayes phylogenetic inference. The ASJP character matrices it requires are included, but the inference takes several days on HPC hardware, so pre-computed trees are provided as a substitute.

Conformity

The deposit conforms to the FAIR data principles. All input data are drawn from publicly available, versioned sources (D-PLACE, ASJP v19, Glottolog) with documented licenses. File formats are open and non-proprietary (CSV, Newick, and R's serialized RDS format). The replication workflow is fully documented in the README, and all software dependencies are specified in a machine-readable environment file (environment.yml).

Consistency

The stored derived outputs (phylogenetic trees, decorrelated residuals) are internally consistent with the fitted Stan models included in the deposit, verified by re-extracting posterior means from the model files and comparing them against the output CSV. Differences are within Monte Carlo error for all seven variables.
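
One way to make "within Monte Carlo error" concrete is sketched below. This is an assumed operationalisation, not the deposit's documented criterion: s_i denotes the posterior standard deviation of the residual for society i, ESS_i its effective sample size, and kappa a tolerance multiplier chosen by the analyst.

```latex
\begin{align}
\mathrm{MCSE}_i &= \frac{s_i}{\sqrt{\mathrm{ESS}_i}}, \\
\left|\hat{\varepsilon}_i^{\,\text{csv}} - \hat{\varepsilon}_i^{\,\text{fit}}\right|
  &\le \kappa \cdot \mathrm{MCSE}_i .
\end{align}
```

With purely illustrative numbers, s_i = 0.8 and ESS_i = 1600 give MCSE_i = 0.8 / 40 = 0.02, so with kappa = 2 a stored value would count as consistent if it lies within 0.04 of the re-extracted posterior mean.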

Credibility

The method rests on established Bayesian phylogenetic inference (MrBayes) and probabilistic latent-variable modelling (Stan). Model selection was performed via leave-one-out cross-validation. The phylogenetic trees are inferred from ASJP lexical data, a widely used resource in computational historical linguistics. The decorrelation approach is described in full in the accompanying paper (Mertner & Jäger, in preparation).

Processability

All data files are in standard, open formats readable without specialist software: CSV for tabular data, Newick for phylogenetic trees. The computational environment is fully specified in environment.yml and can be recreated with a single Mamba command. R and Julia scripts are provided for each step of the workflow, with pre-computed intermediate outputs included so that individual steps can be run in isolation.

Relevance

The deposit supports research on phylogenetic non-independence in cross-linguistic data, a methodological issue relevant to linguistic typology, cultural evolution, and the comparative study of human societies. The decorrelated residuals are used as predictors in a spatial regression model of linguistic diversity in Africa, but the method is applicable to any study using ordinal cultural or linguistic variables from the Ethnographic Atlas or similar cross-societal databases alongside phylogenetic information.

Timeliness

The deposit accompanies a paper currently in preparation (Mertner & Jäger). The data and code are made available prior to publication to support open science practices and to allow independent verification of the results.

Understandability

The README provides a step-by-step description of the replication workflow, with each step documented including its purpose, expected runtime, inputs, and outputs. The statistical model is described in the accompanying paper's supplementary materials. Variable names and column definitions are documented in the data schema file (data_with_latent_variables.csv.schema.json). A companion website with additional methodological detail is available at https://profgerhard.de/ordinal_decorrelation/.

Software documentation

Application category
data analytics and processing software
Build instructions

See environment.yml for all software dependencies. The computational environment can be recreated with:

mamba env create -f environment.yml
mamba activate ordinal-decorr

The Julia environment is set up with:

cd code
julia --project=. -e "using Pkg; Pkg.instantiate()"

Detailed replication instructions are in the README.

Code repository
The code is maintained in this FDAT deposit. A companion website documenting the method is available at https://profgerhard.de/ordinal_decorrelation/.
Copyright holder
Gerhard Jäger
Copyright year
2026
Is accessible for free
Yes
Maintainer
0000-0002-9642-9359
Operating system
linux platform
Programming language
julia, r