German Nominal Compounds Dataset for Compositionality Tests (deu-nn)

Dima, Corina

doi:10.57754/FDAT.54vmb-80e89

Published May 1, 2019 | Version v1

Dataset Open

Dima, Corina (Researcher), University of Tübingen

If you use this dataset for research purposes, please cite the following sources:

- Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]

- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The nominal compounds in this dataset were extracted from the list of 54,759 German compounds provided by the lexical database GermaNet, version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.

As specified on the GermaNet page: "The list of compound data is free for academic research as defined in GermaNet's academic research licence agreement (http://www.sfs.uni-tuebingen.de/lsd/licenses.shtml). For any other intended purposes, please contact the GermaNet team."

Henrich and Hinrichs (2011) describe the automatic compound splitting that is performed before the manual post-correction.

The initial compound list was filtered to retain only those compounds for which the compound and its constituents each have a minimum frequency of 50 in the TüBa-D/DP treebank. This resulted in a list of 32,246 compounds, which were divided into train, test and dev splits of 22,591, 6,442 and 3,213 compounds respectively.

Each line of the train/test/dev files holds three space-separated fields: modifier, head and compound (e.g. Apfel Baum Apfelbaum).
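The line format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution: the names `CompoundEntry`, `parse_line` and `read_split` are hypothetical, and UTF-8 encoding of the split files is an assumption.

```python
from pathlib import Path
from typing import List, NamedTuple


class CompoundEntry(NamedTuple):
    """One dataset line: modifier, head and the full compound."""
    modifier: str
    head: str
    compound: str


def parse_line(line: str) -> CompoundEntry:
    """Parse a single 'modifier head compound' line."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return CompoundEntry(*parts)


def read_split(path: str) -> List[CompoundEntry]:
    """Read a whole split file, skipping blank lines.

    UTF-8 is assumed here; the readme does not state the encoding.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# Usage (file name taken from the file listing of this record):
# train = read_split("train_text.txt")
```

If the files match the description, the three splits should contain 22,591, 6,442 and 3,213 entries, which is a quick sanity check after download.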

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word embeddings for all constituents and compounds in this dataset are stored in the binary word2vec format in the file twe-lemmas.bin.

This format can be loaded by several packages, e.g. gensim (Řehůřek and Sojka, 2010).
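With gensim, the usual route is `KeyedVectors.load_word2vec_format(path, binary=True)`. For readers who want to inspect the file without third-party packages, the sketch below reads the binary word2vec layout using only the standard library. It assumes the conventional layout (an ASCII header line with vocabulary size and dimension, then each word followed by a space and little-endian float32 values); the function names are hypothetical, and reading byte by byte is slow compared with gensim.

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def _read_word(f: BinaryIO) -> str:
    """Read bytes up to the next space; skip the newline left by the previous entry."""
    chars = bytearray()
    while True:
        ch = f.read(1)
        if not ch:
            raise EOFError("unexpected end of file while reading a word")
        if ch == b" ":
            break
        if ch != b"\n":
            chars.extend(ch)
    return chars.decode("utf-8")


def iter_word2vec_binary(path: str) -> Iterator[Tuple[str, Tuple[float, ...]]]:
    """Yield (word, vector) pairs from a binary word2vec file.

    Assumes little-endian float32 values, as written by word2vec on
    common hardware; this is not stated in the readme.
    """
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        fmt = f"<{dim}f"
        nbytes = struct.calcsize(fmt)
        for _ in range(vocab_size):
            word = _read_word(f)
            data = f.read(nbytes)
            if len(data) != nbytes:
                raise EOFError(f"truncated vector for word {word!r}")
            yield word, struct.unpack(fmt, data)


# Usage:
# for word, vector in iter_word2vec_binary("twe-lemmas.bin"):
#     ...
```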

The embeddings for the constituents and compounds were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al. 2013).

The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018 and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012); it comprises 64.9M sentences and 1.3B tokens.

The word embeddings were trained using the skip-gram model with negative sampling, with an embedding dimension of 200, a symmetric window of 10, 25 negative samples per positive training instance and a sample probability threshold of 0.0001. The minimum frequency cut-off was set to 50 for all words. The total vocabulary size amounts to 403,030.
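For readers who want to train comparable embeddings on their own data, the hyperparameters listed above can be written down using the keyword names of gensim's `Word2Vec` class. This mapping is an assumption for illustration only: the embeddings in this dataset were trained with the word2vec package, not gensim, and settings not mentioned in the description (such as the number of epochs) are left out rather than guessed.

```python
# Hyperparameters from the description, keyed by gensim Word2Vec keyword names.
# The mapping to gensim is an assumption; the authors used the word2vec package.
SKIPGRAM_PARAMS = {
    "sg": 1,             # skip-gram model (rather than CBOW)
    "hs": 0,             # no hierarchical softmax, negative sampling only (assumed)
    "negative": 25,      # negative samples per positive training instance
    "vector_size": 200,  # embedding dimension
    "window": 10,        # symmetric context window
    "sample": 1e-4,      # sample probability threshold
    "min_count": 50,     # minimum frequency cut-off
}

# Usage, assuming gensim >= 4 is installed and `sentences` is an iterable
# of lemmatized token lists:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=sentences, **SKIPGRAM_PARAMS)
```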

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Files (328.3 MB)

Name	MD5	Size
CMDI.xml	931951fd556af941dfce12972d5e9501	26.0 kB
deu-nn-readme.txt	3926eef86e9f51f943fe339891967d46	3.1 kB
dev_text.txt	7c3e0279edae0dbcee4fc2b008ab1256	92.4 kB
test_text.txt	712a9d4711d2c9632499a0f923c8b784	183.8 kB
train_text.txt	8ef505d6daff5d21e7a186820d1dd3ce	648.1 kB
twe-lemmas.bin	cc6f3131c534b0370d3b41b2729b917b	327.3 MB
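The MD5 checksums in the file listing can be used to confirm that a download is complete, which is worth doing for the large embedding file in particular. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the checksum values are copied from the listing above.

```python
import hashlib
from typing import Dict

# Checksums as published in the file listing of this record.
EXPECTED_MD5: Dict[str, str] = {
    "CMDI.xml": "931951fd556af941dfce12972d5e9501",
    "deu-nn-readme.txt": "3926eef86e9f51f943fe339891967d46",
    "dev_text.txt": "7c3e0279edae0dbcee4fc2b008ab1256",
    "test_text.txt": "712a9d4711d2c9632499a0f923c8b784",
    "train_text.txt": "8ef505d6daff5d21e7a186820d1dd3ce",
    "twe-lemmas.bin": "cc6f3131c534b0370d3b41b2729b917b",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, name: str) -> bool:
    """Return True if the file at `path` has the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


# Usage:
# assert matches_listing("twe-lemmas.bin", "twe-lemmas.bin")
```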

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358
