Published March 14, 2017 | Version v1
Dataset Restricted

German Noun-Noun Compounds Dataset for Compositionality Tests

  • 1. University of Tübingen

Description

The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml. 

As specified on the GermaNet page, the list of compound data is free for academic research as defined in GermaNet's academic research licence agreement (http://www.sfs.uni-tuebingen.de/lsd/licenses.shtml). For any other intended purposes, please contact the GermaNet team.

The following paper describes the automatic compound splitting that was performed before the manual post-correction. If you use the split compounds in scientific or research work, please cite the paper:

Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426.  [Download paper: http://www.aclweb.org/anthology/R11-1058]

The initial compound list was filtered to keep only compounds with a frequency of at least 500 in the DECOW14AX corpus (https://webcorpora.org/). This yielded 34497 compounds, which were divided into train, test and dev splits (24147, 6901 and 3449 compounds respectively). The words in the dataset are lowercased; the original casing can be recovered, if needed, by consulting the GermaNet database.
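
The frequency filter described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used to build the dataset: the function name `filter_by_frequency`, the in-memory `freq` mapping and the toy counts are all hypothetical, and real frequencies would have to be obtained from the DECOW14AX corpus itself.

```python
MIN_FREQUENCY = 500  # threshold stated in the dataset description


def filter_by_frequency(compounds, freq, min_frequency=MIN_FREQUENCY):
    """Keep compounds whose corpus frequency is at least min_frequency.

    `freq` is a hypothetical mapping from a lowercased compound to its
    corpus count. Compounds missing from the mapping count as frequency 0.
    """
    kept = []
    for compound in compounds:
        key = compound.lower()  # the dataset stores lowercased forms
        if freq.get(key, 0) >= min_frequency:
            kept.append(key)
    return kept


# Toy counts, invented purely for illustration.
toy_freq = {"basistunnel": 1200, "apfelbaum": 500, "seltenwort": 12}
toy_compounds = ["Basistunnel", "Apfelbaum", "Seltenwort"]

filtered = filter_by_frequency(toy_compounds, toy_freq)
print(filtered)
```

With the toy counts, the first two compounds pass the threshold (a count of exactly 500 is kept, matching "minimum frequency of 500") and the third is dropped.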

The dataset was then further filtered to keep only noun-noun compounds, using the part-of-speech information available in GermaNet. The filtering was done per split and resulted in 18796, 5410 and 2655 compounds in the training, test and dev splits (26861 compounds in total).
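
As a quick sanity check, the split sizes reported in this description can be summed and turned into proportions. The numbers below are copied from the text; the variable and function names are only illustrative. Both versions of the dataset come out at roughly 70/20/10 for train/test/dev.

```python
# Split sizes as reported in the description.
all_compounds = {"train": 24147, "test": 6901, "dev": 3449}
nn_only = {"train": 18796, "test": 5410, "dev": 2655}


def proportions(counts):
    """Return each split's share of the total, rounded to 3 decimals."""
    total = sum(counts.values())
    return {name: round(n / total, 3) for name, n in counts.items()}


total_all = sum(all_compounds.values())
total_nn = sum(nn_only.values())

print(total_all, total_nn)          # 34497 26861
print(proportions(all_compounds))
print(proportions(nn_only))
```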

The train/test/dev files have the following format: modifier head compound (e.g. basis tunnel basistunnel)
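
A minimal sketch of reading this format in Python is shown below. It assumes one entry per line, whitespace-separated fields and UTF-8 encoding, none of which is confirmed by this record; `CompoundEntry`, `parse_line` and `read_split` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class CompoundEntry(NamedTuple):
    modifier: str
    head: str
    compound: str


def parse_line(line: str) -> CompoundEntry:
    """Parse one 'modifier head compound' line (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return CompoundEntry(*fields)


def read_split(path: str) -> List[CompoundEntry]:
    """Read a whole split file, skipping blank lines (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]


# The example entry given in the description.
entry = parse_line("basis tunnel basistunnel")
print(entry.modifier, entry.head, entry.compound)
```

The parser deliberately does not check that the compound equals modifier plus head, since German compounds can contain linking elements between the two constituents.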

The dataset is referred to as the "nn-only" dataset in chapter 5 of Dima (2019).

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Funding

Deutsche Forschungsgemeinschaft
SFB 833:  Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358
