German Adjective-Noun Phrase Dataset for Compositionality Tests (deu-adj-n)

de Kok, Daniël

doi:10.57754/FDAT.mqb3c-rmj69

Published May 1, 2019 | Version v1

Dataset Open

de Kok, Daniël (Researcher)¹

1. University of Tübingen

If you use this dataset for research purposes, please cite the following sources:

- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The 119,434 German adjective-noun phrases in this dataset (splits: 83,603 train, 23,887 test, 11,944 dev instances) were extracted automatically from the TüBa-D/DP treebank. The treebank is composed of three parts: 1) articles from the German newspaper taz; 2) the German Wikipedia dump from January 20, 2018; 3) German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012). It consists of 64.9M sentences and 1.3B tokens.

Each line of the train/test/dev files contains three space-separated fields: the adjective, the noun, and the adjective-noun phrase, in which the adjective and the noun are joined by the string _adj_n_ (e.g. kritisch Film kritisch_adj_n_Film).
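The file format described above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the names AdjNounPhrase, parse_line, split_phrase and read_split are hypothetical, and UTF-8 encoding of the text files is an assumption.

```python
from typing import Iterator, NamedTuple, Tuple

SEPARATOR = "_adj_n_"  # string that joins adjective and noun in the third field


class AdjNounPhrase(NamedTuple):
    adjective: str
    noun: str
    phrase: str


def parse_line(line: str) -> AdjNounPhrase:
    """Split one data line into its three space-separated fields."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return AdjNounPhrase(*fields)


def split_phrase(phrase: str) -> Tuple[str, str]:
    """Recover (adjective, noun) from a phrase such as kritisch_adj_n_Film."""
    adjective, separator, noun = phrase.partition(SEPARATOR)
    if not separator:
        raise ValueError(f"separator {SEPARATOR!r} not found in {phrase!r}")
    return adjective, noun


def read_split(path: str) -> Iterator[AdjNounPhrase]:
    """Yield one record per non-empty line (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

For instance, list(read_split("train_text.txt")) would collect the training instances into memory, assuming the file has been downloaded to the working directory.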

The phrases were extracted with the part-of-speech tag information provided by the treebank. For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.
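To illustrate what a compositionality test on this data can look like, the sketch below composes an adjective vector and a noun vector and compares the result with the phrase vector. It uses simple vector addition as a stand-in baseline; this is not the transformation weighting model of Dima et al. (2019), and cosine similarity is used here only for illustration, not as the evaluation protocol of that paper. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def compose_additive(adjective: Sequence[float], noun: Sequence[float]) -> List[float]:
    """Additive baseline: the composed vector is the element-wise sum."""
    if len(adjective) != len(noun):
        raise ValueError("vectors must have the same dimension")
    return [a + n for a, n in zip(adjective, noun)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)
```

Given the vectors for kritisch, Film and kritisch_adj_n_Film, cosine_similarity(compose_additive(adj_vec, noun_vec), phrase_vec) would indicate how close the composed representation is to the phrase embedding.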

The embeddings for all words and phrases in this dataset are stored in the word2vec format in twe-adj-n.bin. This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
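With gensim, a file of this kind is typically loaded with KeyedVectors.load_word2vec_format(path, binary=True). For readers who want to see what the format contains, the following is a dependency-free sketch of a reader. It assumes that twe-adj-n.bin uses the binary variant of the word2vec format (a text header with vocabulary size and dimension, then each word followed by a space and its float32 values) and that the floats are little-endian; neither detail is stated in the description above. The function name is hypothetical.

```python
import struct
from typing import BinaryIO, Iterator, List, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a word2vec binary stream."""
    # Header line: "<vocabulary size> <dimension>\n"
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vector_bytes = 4 * dim  # float32 values

    for _ in range(vocab_size):
        word = bytearray()
        while True:
            char = stream.read(1)
            if not char:
                raise EOFError("unexpected end of file while reading a word")
            if char == b" ":
                break
            if char != b"\n":  # skip the newline that ends the previous entry
                word.extend(char)
        data = stream.read(vector_bytes)
        if len(data) != vector_bytes:
            raise EOFError("unexpected end of file while reading a vector")
        # Assumption: little-endian float32
        yield word.decode("utf-8"), list(struct.unpack(f"<{dim}f", data))
```

The generator can be consumed with open("twe-adj-n.bin", "rb") as the stream; at 388.3 MB, streaming avoids holding the raw file and the parsed vectors in memory at the same time.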

The embeddings for the adjectives, nouns and phrases were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al., 2013).

The word embeddings were trained with the skip-gram model with negative sampling, using a symmetric context window of 10, 25 negative samples per positive training instance, and a subsampling threshold of 0.0001.

The resulting embeddings have 200 dimensions and the vocabulary contains 476,137 words in total. The minimum frequency cut-off was set to 50 for all words.
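As a rough guide for reproducing a comparable setup, the sketch below maps the hyperparameters stated above onto the flags of the original word2vec command-line tool. This is not necessarily the invocation used for this dataset: the executable name and the corpus and output paths are hypothetical, and the -hs 0 and -binary 1 settings are assumptions (hierarchical softmax is not mentioned above, and binary output is inferred from the .bin file name).

```python
from typing import List

# Flag values taken from the description, except where marked as assumptions.
HYPERPARAMETERS = {
    "cbow": 0,        # 0 selects the skip-gram model
    "hs": 0,          # assumption: no hierarchical softmax
    "negative": 25,   # negative samples per positive instance
    "window": 10,     # symmetric context window
    "sample": 0.0001, # subsampling threshold
    "size": 200,      # embedding dimension
    "min-count": 50,  # minimum frequency cut-off
    "binary": 1,      # assumption: binary output format
}


def word2vec_command(corpus: str, output: str, executable: str = "word2vec") -> List[str]:
    """Build an argument list suitable for subprocess.run (paths are hypothetical)."""
    command = [executable, "-train", corpus, "-output", output]
    for flag, value in HYPERPARAMETERS.items():
        command += [f"-{flag}", str(value)]
    return command
```

Calling word2vec_command("lemmatized-corpus.txt", "embeddings.bin") returns the argument list without running anything, so the settings can be inspected before training is started.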

Other (English)

Research carried out in work package A03 of the SFB 833.

Files (393.5 MB)

Name	MD5	Size
CMDI.xml	9daa657b5974b59a3aa43771e50aa10d	23.2 kB
deu-adj-n-readme.txt	648c7ab665449372ff2949ecf3dd0714	2.4 kB
dev_text.txt	b32ebb8e002a90faff0ed24d7f43b1b1	520.8 kB
test_text.txt	324836625a20986ae7937325f044dad7	1.0 MB
train_text.txt	9ef1c9b22904c047f16ecdb2578410a8	3.6 MB
twe-adj-n.bin	32f0e0b96a9f4612837aea12d21e3a52	388.3 MB
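The MD5 checksums listed for each file can be used to verify a download. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the files are assumed to sit in a single local directory under the names shown in the listing.

```python
import hashlib
import os
from typing import Dict

# Checksums copied from the file listing.
EXPECTED_MD5 = {
    "CMDI.xml": "9daa657b5974b59a3aa43771e50aa10d",
    "deu-adj-n-readme.txt": "648c7ab665449372ff2949ecf3dd0714",
    "dev_text.txt": "b32ebb8e002a90faff0ed24d7f43b1b1",
    "test_text.txt": "324836625a20986ae7937325f044dad7",
    "train_text.txt": "9ef1c9b22904c047f16ecdb2578410a8",
    "twe-adj-n.bin": "32f0e0b96a9f4612837aea12d21e3a52",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory: str) -> Dict[str, bool]:
    """Map each file found in the directory to whether its checksum matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if os.path.exists(path):
            results[name] = md5_of_file(path) == expected
    return results
```

Files that are missing from the directory are skipped, so a partial download can still be checked.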

Additional details

Related works

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.
