Published June 2, 2025 | Version 1.0.0
Software Open

Soil mapping data for spatial modeling of soil organic carbon with meaningless predictors

  • 1. University of Tübingen

Contributors

Contact person:

  • 1. University of Tübingen

Description

This dataset contains training data for machine learning regression models from 668 hypothetical case studies in 334 study areas across Europe, an R script for data analysis, and the random forest models that reproduce the spurious correlation between the spatial distribution of soil organic carbon (SOC) and meaningless predictors.

The 334 study areas are square tiles of 200 × 200 km, with SOC from SoilGrids 2.0 (Poggio et al., 2021; https://doi.org/10.5194/soil-7-217-2021) as the model outcome. 250 tiles are distributed randomly and 84 tiles are distributed regularly to account for bias towards areas covered by multiple randomly distributed tiles. Each model uses portrait images of researchers as independent covariates and SOC as the outcome; it is trained with 500 random samples and validated with another 1000 randomly selected samples. The portrait images were reduced to greyscale in two ways, with principal component analysis (PCA) and with conversion from sRGB to linear luminance, which yields two series of 334 hypothetical case studies each and 668 hypothetical case studies in total.
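The two greyscale reductions mentioned above can be illustrated with a minimal sketch. It is not the authors' R code: the function names are hypothetical, channel values are assumed to be scaled to [0, 1], the luminance variant assumes the standard sRGB transfer curve with Rec. 709 weights, and the PCA variant assumes the first principal component of the RGB channels is used as the grey value.

```python
import math

def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_luminance(r, g, b):
    """Relative luminance of one sRGB pixel (Rec. 709 weights, assumed here)."""
    return (0.2126 * srgb_to_linear(r)
            + 0.7152 * srgb_to_linear(g)
            + 0.0722 * srgb_to_linear(b))

def first_pc_scores(pixels, iterations=100):
    """Project RGB pixels onto their first principal component.

    The leading eigenvector of the 3 x 3 channel covariance matrix is found
    by power iteration. The start vector points along the grey axis, which
    assumes the dominant variation is not orthogonal to it. The sign of the
    component is arbitrary, as with any PCA.
    """
    n = len(pixels)
    mean = [sum(p[k] for p in pixels) / n for k in range(3)]
    centred = [[p[k] - mean[k] for k in range(3)] for p in pixels]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(3)]
           for i in range(3)]
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # constant image: no direction of variation
            break
        v = [x / norm for x in w]
    return [sum(c[k] * v[k] for k in range(3)) for c in centred]
```

Both functions return one grey value per pixel, so either result could serve as a single-band covariate raster of the kind described above.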

The original portrait images are not included to protect personal rights and copyright. We thank Alexandre M. J.-C. Wadoux (https://doi.org/10.1111/ejss.12909) for providing the portrait images.

Files

Files (5.0 GB)

Checksum                              Size
md5:2a717c6a4c09ee3256d818dbb68e83a6  13.1 kB
md5:36bc382eacdf6f42228c914c3ecd4c3d  4.5 GB
md5:f94024c539c929a9c02f4637394805de  159.7 kB
md5:8a58d9f37207a6da8efc08bfc443121e  192.5 kB
md5:daf6ddaf59ba2189c615b8d5af67b554  371.6 MB
md5:dd88cfa1953b953d313b712e8b89add9  124.8 MB
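
The MD5 checksums listed above can be used to confirm that a download is complete and unmodified. The following is a minimal sketch using Python's standard hashlib module. The helper names are hypothetical, and any file path passed to it is a placeholder, since the file names are not shown in this listing.

```python
import hashlib

def md5_hex(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def file_chunks(path, chunk_size=1 << 20):
    """Yield a file's content in 1 MiB chunks so large files fit in memory."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk

def matches_listing(listed, chunks):
    """Compare data against a listed checksum such as 'md5:2a71...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_hex(chunks) == expected
```

For example, matches_listing("md5:36bc382eacdf6f42228c914c3ecd4c3d", file_chunks(path)) would check the 4.5 GB file, where path is wherever that file was saved locally.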

Additional details

Funding

Deutsche Forschungsgemeinschaft
ResourceCultures - CRC 1070: Resource Cultures (grant no. 215859406)

Data quality

Accuracy

Model accuracy ranges from 0.16 to 0.91 for the concordance correlation coefficient (CCC), from -2.33 to 3.95 g kg-1 SOC for the mean error (ME) and from 5.4 to 41.2 g kg-1 SOC for the root mean square error (RMSE), while SOC in the case studies had an arithmetic mean of 100.4 g kg-1.
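
For readers unfamiliar with the three accuracy measures, the sketch below shows how they are commonly defined. It is an illustration, not the R script shipped with this record: the function names are hypothetical, ME is assumed to be predicted minus observed, and the concordance correlation coefficient follows Lin's definition with population (1/n) moments, which may differ slightly from the implementation the authors used.

```python
import math

def mean_error(observed, predicted):
    """Mean error (bias); positive values mean over-prediction (assumed sign)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error, in the units of the outcome (g kg-1 for SOC)."""
    return math.sqrt(
        sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def concordance_cc(observed, predicted):
    """Lin's concordance correlation coefficient with 1/n moments."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)
```

Unlike the Pearson correlation, the concordance correlation coefficient also penalises a systematic offset between predictions and observations, so perfectly correlated but biased predictions score below 1.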

Completeness

not relevant

Conformity

not relevant

Consistency

not relevant

Credibility

The SOC data are model output obtained from SoilGrids 2.0 (Poggio et al., 2021; https://doi.org/10.5194/soil-7-217-2021), which has an RMSE of 40 g kg-1.

Processability

not relevant

Relevance

not relevant

Timeliness

not relevant

Understandability

The data are understandable to data scientists in general and to soil scientists with a focus on pedometrics in particular.

Study design and Methodology

Character set
utf-8
Software package
R

Software documentation

Application category
data analytics and processing software
Is accessible for free
Yes