Published May 1, 2019 | Version v1
Dataset Open

German Adjective-Noun Phrase Dataset for Compositionality Tests (deu-adj-n)

  • 1. University of Tübingen

Description

If you use this dataset for research purposes, please cite the following sources:

                 - Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).

                 - Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

The 119,434 German adjective-noun phrases in this dataset (splits: 83,603 train, 23,887 test, 11,944 dev instances) were extracted automatically from the TüBa-D/DP treebank. The treebank consists of three parts: 1) articles from the German newspaper taz; 2) the German Wikipedia dump from January 20, 2018; 3) the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012). In total, it contains 64.9M sentences and 1.3B tokens.

Each line of the train/test/dev files has three space-separated fields: the adjective, the noun, and the adjective-noun phrase, in which the adjective and the noun are joined by the string _adj_n_ (e.g. kritisch Film kritisch_adj_n_Film).
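The line format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution: the names AdjNounEntry, parse_line and read_split are hypothetical, UTF-8 encoding is assumed, and the strict check (phrase equals adjective + _adj_n_ + noun) is an assumption based on the single example given here.

```python
from typing import List, NamedTuple

SEPARATOR = "_adj_n_"  # joins adjective and noun in the phrase field


class AdjNounEntry(NamedTuple):
    adjective: str
    noun: str
    phrase: str


def parse_line(line: str, strict: bool = True) -> AdjNounEntry:
    """Parse one 'adjective noun adjective_adj_n_noun' line."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(fields)}")
    adjective, noun, phrase = fields
    # Assumption: the phrase field is exactly adjective + SEPARATOR + noun.
    if strict and phrase != f"{adjective}{SEPARATOR}{noun}":
        raise ValueError(f"phrase field does not match its parts: {line!r}")
    return AdjNounEntry(adjective, noun, phrase)


def read_split(path: str, strict: bool = True) -> List[AdjNounEntry]:
    """Read a whole train/test/dev file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line, strict) for line in handle if line.strip()]
```

Passing strict=False skips the consistency check, which may be useful if some phrase fields turn out not to follow the pattern of the example.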

The phrases were extracted using the part-of-speech tags provided by the treebank. For results of different composition models on this dataset, see Dima et al. (2019).

The embeddings for all words and phrases in this dataset are stored in the word2vec format in twe-adj-n.bin. This format can be loaded by several packages, e.g. gensim (Řehůřek and Sojka, 2010).
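With gensim installed, KeyedVectors.load_word2vec_format(path, binary=True) is the usual way to load such a file. For readers who want to see what the file contains, the sketch below reads the format using only the standard library. It assumes the binary word2vec layout (a text header "vocab_size dim", then for each entry the word, a space, and dim 32-bit floats), little-endian floats, and UTF-8 words; none of this is stated on this page, and the function name iter_word2vec_binary is hypothetical. The demo vectors are made up.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_word2vec_binary(stream: BinaryIO) -> Iterator[Tuple[str, Tuple[float, ...]]]:
    """Yield (word, vector) pairs from a binary word2vec stream."""
    header = stream.readline().decode("ascii").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vector_bytes = 4 * dim  # assumption: 32-bit floats
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("unexpected end of file while reading a word")
            if byte == b" ":
                break
            if byte != b"\n":  # skip the newline that may follow the previous vector
                chars.extend(byte)
        raw = stream.read(vector_bytes)
        if len(raw) != vector_bytes:
            raise EOFError("unexpected end of file while reading a vector")
        # assumption: little-endian byte order
        yield chars.decode("utf-8"), struct.unpack(f"<{dim}f", raw)


# Tiny in-memory demo with two made-up 3-dimensional vectors.
demo = io.BytesIO(
    b"2 3\n"
    + b"kritisch " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"Film " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n"
)
vectors = dict(iter_word2vec_binary(demo))
```

For the real file, the same function could be applied to open("twe-adj-n.bin", "rb"); reading byte by byte is slow for a file of this size, so this is meant for inspection rather than production use.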

The embeddings for the adjectives, nouns and phrases were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al., 2013).

The word embeddings were trained with the skip-gram model with negative sampling, using a symmetric context window of 10, 25 negative samples per positive training instance, and a subsampling threshold of 0.0001.

The resulting embeddings have 200 dimensions, and the vocabulary contains 476,137 words in total. The minimum frequency cut-off was set to 50 for all words.

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI.xml

Files (393.5 MB)

Checksum Size
md5:9daa657b5974b59a3aa43771e50aa10d 23.2 kB
md5:648c7ab665449372ff2949ecf3dd0714 2.4 kB
md5:b32ebb8e002a90faff0ed24d7f43b1b1 520.8 kB
md5:324836625a20986ae7937325f044dad7 1.0 MB
md5:9ef1c9b22904c047f16ecdb2578410a8 3.6 MB
md5:32f0e0b96a9f4612837aea12d21e3a52 388.3 MB
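A downloaded file can be checked against the MD5 values listed above. The following is a minimal sketch using Python's hashlib; the function names md5_hex and matches_listing are hypothetical, and because the listing here does not show which file name belongs to which checksum, any path passed in is the reader's own choice.

```python
import hashlib
import io
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a local file with a listed value such as 'md5:9daa...'."""
    expected = listed[4:] if listed.startswith("md5:") else listed
    with open(path, "rb") as handle:
        return md5_hex(handle) == expected.lower()


# Self-check on in-memory data; the real files are not needed for this.
demo_digest = md5_hex(io.BytesIO(b"abc"))
```

Reading in chunks keeps memory use constant, which matters for the largest file in the listing.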

Additional details

Related works

Is part of
Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy, completeness, conformity, consistency, credibility, processability, relevance, timeliness, understandability: not specified.