Published May 1, 2019 | Version v1
Dataset Open

German Nominal Compounds Dataset for Compositionality Tests (deu-nn)

  • 1. University of Tübingen

Description

If you want to use this dataset for research purposes, please refer to the following sources:

  - Verena Henrich and Erhard Hinrichs. 2011. Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. http://www.aclweb.org/anthology/R11-1058

  - Daniël de Kok and Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).

  - Corina Dima, Daniël de Kok, Neele Witte and Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The nominal compounds in this dataset were extracted from the list of 54,759 German compounds provided by the lexical database GermaNet, version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.

As specified on the GermaNet page, "The list of compound data is free for academic research as defined in GermaNet's academic research licence agreement (http://www.sfs.uni-tuebingen.de/lsd/licenses.shtml). For any other intended purposes, please contact the GermaNet team."

Henrich and Hinrichs (2011) describe the automatic compound splitting that is performed before the manual post-correction.

The initial compound list was filtered to keep only those entries whose compound and constituents each occur at least 50 times in the TüBa-D/DP treebank. This yielded 32,246 compounds, which were divided into train, test and dev splits of 22,591, 6,442 and 3,213 compounds respectively. Each line of the train, test and dev files holds three space-separated fields: modifier, head and compound (e.g. Apfel Baum Apfelbaum).
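The three-field line format described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset distribution: the names `CompoundEntry`, `parse_split` and `SPLIT_SIZES` are illustrative, and the only assumption about the files is the space-separated modifier/head/compound layout stated in the description.

```python
from typing import Iterable, List, NamedTuple


class CompoundEntry(NamedTuple):
    """One dataset line: modifier, head and the full compound."""
    modifier: str
    head: str
    compound: str


def parse_split(lines: Iterable[str]) -> List[CompoundEntry]:
    """Parse 'modifier head compound' lines, skipping blank lines."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        entries.append(CompoundEntry(*fields))
    return entries


# Split sizes as stated in the description; useful as a sanity check
# after loading each file.
SPLIT_SIZES = {"train": 22_591, "test": 6_442, "dev": 3_213}
TOTAL_COMPOUNDS = sum(SPLIT_SIZES.values())

# The example line given in the description.
sample = parse_split(["Apfel Baum Apfelbaum\n"])
```

To read a real split, pass an open file object (opened with `encoding="utf-8"`) to `parse_split` and compare the length of the result with the matching entry in `SPLIT_SIZES`.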

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word embeddings for all constituents and compounds in this dataset are stored in the binary word2vec format in the file twe-lemmas.bin.

This format can be loaded by several packages, e.g. the gensim package (Řehůřek and Sojka, 2010).
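In practice the file would typically be loaded with gensim via `KeyedVectors.load_word2vec_format("twe-lemmas.bin", binary=True)`. To show what the binary word2vec layout looks like, the sketch below reads it with the standard library only. It assumes the common layout written by the word2vec tool (a text header with vocabulary size and dimension, then each word followed by a space and little-endian float32 values); the function name `read_word2vec_binary` and the tiny in-memory demo file are illustrative, not part of the dataset.

```python
import io
import struct
from typing import BinaryIO, Dict, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, Tuple[float, ...]]:
    """Read embeddings from a stream in binary word2vec format."""
    # Header line: "<vocab_size> <dimension>\n"
    header = stream.readline().decode("ascii").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        # The word is terminated by a single space.
        word_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                word_bytes.extend(ch)
        # The vector is stored as `dim` little-endian 32-bit floats.
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word_bytes.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


# A two-word, three-dimensional file built in memory for demonstration.
demo = io.BytesIO(
    b"2 3\n"
    + b"Apfel " + struct.pack("<3f", 1.0, 0.5, -2.0) + b"\n"
    + b"Baum " + struct.pack("<3f", 0.0, 0.25, 4.0) + b"\n"
)
demo_vectors = read_word2vec_binary(demo)
```

Holding every vector as a Python tuple is memory-hungry for a vocabulary of this size, so this reader is meant for inspection rather than for production use.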

The embeddings for the constituents and compounds were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al. 2013).

The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018 and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012). It comprises 64.9M sentences and 1.3B tokens. The word embeddings were trained using the skip-gram model with negative sampling, with an embedding dimension of 200, a symmetric window of 10, 25 negative samples per positive training instance and a sample probability threshold of 0.0001. The minimum frequency cut-off was set to 50 for all words. The total vocabulary size amounts to 403,030 words.
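The hyperparameters listed above map directly onto command-line flags of the original word2vec tool. The sketch below assembles such a command as a list of arguments. The exact invocation used for this dataset is not given in the description, so this is an assumption-laden illustration: the executable name `word2vec`, the input path `corpus-lemmas.txt` and the choice of `-binary 1` (inferred from the binary output format) are all hypothetical.

```python
from typing import Dict, List

# Values taken from the description; flag names are those of the
# original word2vec command-line tool.
HYPERPARAMETERS: Dict[str, str] = {
    "-cbow": "0",        # 0 selects skip-gram instead of CBOW
    "-negative": "25",   # negative samples per positive instance
    "-size": "200",      # embedding dimension
    "-window": "10",     # context window on each side
    "-sample": "1e-4",   # subsampling threshold
    "-min-count": "50",  # minimum frequency cut-off
}


def build_word2vec_command(train_path: str, output_path: str) -> List[str]:
    """Assemble an argument list for the word2vec tool (not executed here)."""
    command = [
        "word2vec",              # hypothetical executable name
        "-train", train_path,
        "-output", output_path,
        "-binary", "1",          # assumption: binary output format
    ]
    for flag, value in HYPERPARAMETERS.items():
        command += [flag, value]
    return command


# Hypothetical input path; the output name matches the distributed file.
command = build_word2vec_command("corpus-lemmas.txt", "twe-lemmas.bin")
```

The resulting list can be passed to `subprocess.run` once the paths point at real files.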

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI.xml

Files (328.3 MB)

Checksum                               Size
md5:931951fd556af941dfce12972d5e9501   26.0 kB
md5:3926eef86e9f51f943fe339891967d46   3.1 kB
md5:7c3e0279edae0dbcee4fc2b008ab1256   92.4 kB
md5:712a9d4711d2c9632499a0f923c8b784   183.8 kB
md5:8ef505d6daff5d21e7a186820d1dd3ce   648.1 kB
md5:cc6f3131c534b0370d3b41b2729b917b   327.3 MB
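The MD5 checksums in the file listing can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_hex` and `matches_listing` are illustrative. Which checksum belongs to which file has to be taken from the record itself (the 327.3 MB entry is presumably the embeddings file, but that pairing is an assumption).

```python
import hashlib
import io
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(stream: BinaryIO, listed: str) -> bool:
    """Compare a stream's digest with a listing entry such as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_hex(stream) == expected


# Demonstration on in-memory data rather than a downloaded file.
empty_digest = md5_hex(io.BytesIO(b""))
```

For a downloaded file, open it in binary mode (`open(path, "rb")`) and pass the file object together with the matching `md5:` string from the listing to `matches_listing`.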

Additional details

Related works

Is part of
Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy, Completeness, Conformity, Consistency, Credibility, Processability, Relevance, Timeliness, Understandability: Not specified.