German Adverb-Adjective Phrase Dataset for Compositionality Tests (deu-adv-adj)

de Kok, Daniël

doi:10.57754/FDAT.7casr-x0p36

Published May 1, 2019 | Version v1

Dataset Open

de Kok, Daniël (Researcher)¹

1. University of Tübingen

If you use this dataset for research purposes, please cite the following sources:

- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The 23,488 German adverb-adjective phrases (split into 16,441 train, 4,701 test, and 2,346 dev instances) were extracted from the TüBa-D/DP treebank. The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018, and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012), and comprises 64.9M sentences and 1.3B tokens.

The dataset was constructed with the help of the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:

- the head is an attributive or predicative adjective and governs the dependent with the adverb relation

- the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.
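The two requirements above can be sketched as a filter over dependency-annotated tokens. This is a minimal illustration, not the extraction code used for the dataset: the `Token` class, the STTS tags `ADJA`/`ADJD` and the relation label `ADV` are assumptions made for the example, and the actual labels in TüBa-D/DP may differ.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# Hypothetical labels: the tag and relation names actually used in
# TüBa-D/DP may differ from these placeholders.
ADJECTIVE_TAGS = {"ADJA", "ADJD"}  # attributive / predicative adjective
ADVERB_RELATION = "ADV"


@dataclass
class Token:
    index: int      # 1-based position in the sentence
    lemma: str
    tag: str        # part-of-speech tag
    head: int       # index of the governing token, 0 for the root
    relation: str   # dependency relation to the head


def adverb_adjective_pairs(sentence: List[Token]) -> Iterator[Tuple[str, str]]:
    """Yield (dependent, head) lemma pairs that meet both requirements."""
    by_index = {tok.index: tok for tok in sentence}
    for dep in sentence:
        head = by_index.get(dep.head)
        if head is None:
            continue  # the root has no governing token
        if (head.tag in ADJECTIVE_TAGS
                and dep.relation == ADVERB_RELATION
                and dep.index + 1 == head.index):  # dependent directly precedes head
            yield dep.lemma, head.lemma


# "Das Glas ist immer leer" with an illustrative (assumed) analysis.
sentence = [
    Token(1, "der", "ART", 2, "DET"),
    Token(2, "Glas", "NN", 3, "SUBJ"),
    Token(3, "sein", "VAFIN", 0, "ROOT"),
    Token(4, "immer", "ADV", 5, "ADV"),
    Token(5, "leer", "ADJD", 3, "PRED"),
]
pairs = list(adverb_adjective_pairs(sentence))
```

Because the filter checks the dependency relation rather than the part-of-speech tag of the dependent, it also accepts adjectives that function as adverbs.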

Each line of the train/test/dev files contains three fields separated by spaces:

adverb adjective phrase

In the phrase field, the adverb and the adjective are joined by the string _adv_adj_ (e.g. immer leer immer_adv_adj_leer).
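A line in this format can be parsed as follows. This is a minimal sketch: the names `Instance`, `parse_line` and `read_instances` are hypothetical, UTF-8 encoding is assumed, and the check that the phrase equals the two parts joined by `_adv_adj_` is an assumption drawn from the description above rather than a documented guarantee.

```python
from typing import List, NamedTuple

SEPARATOR = "_adv_adj_"


class Instance(NamedTuple):
    adverb: str
    adjective: str
    phrase: str


def parse_line(line: str) -> Instance:
    """Split one line into its three space-separated fields."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    adverb, adjective, phrase = fields
    # Assumed invariant: the phrase is the two parts joined by the separator.
    if phrase != adverb + SEPARATOR + adjective:
        raise ValueError(f"phrase does not match its parts: {line!r}")
    return Instance(adverb, adjective, phrase)


def read_instances(path: str) -> List[Instance]:
    """Read a train/test/dev file, skipping blank lines (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]


example = parse_line("immer leer immer_adv_adj_leer")
```

With the files from this record, `read_instances("train_text.txt")` would return one `Instance` per line.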

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word representations were trained on the lemmatized TüBa-D/DP treebank with the word2vec package. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013).

The embedding size is 200, the context is a symmetric window of 10 words, 25 negative samples were used, and the sample probability was 0.0001.

Representations were only trained for words and phrases with a minimum frequency of 30 occurrences. The final vocabulary contains 615,908 words.

The resulting embeddings are stored in the binary word2vec format in twe-adv-adj.bin, which can be loaded by several packages, e.g. gensim (Řehůřek and Sojka, 2010).
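In gensim, the usual route is `KeyedVectors.load_word2vec_format(path, binary=True)`. For readers without gensim, the sketch below reads the same format with the standard library only. It assumes the common word2vec binary layout (a text header with vocabulary size and dimension, then one record per word: the word, a space, and little-endian 32-bit floats) and UTF-8 encoded words; the function name is hypothetical, and the sample bytes are made up for illustration.

```python
import io
import struct
from typing import BinaryIO, Dict, List


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, List[float]]:
    """Read word2vec binary data: a text header, then word/vector records."""
    vocab_size, dim = (int(field) for field in stream.readline().split())
    vectors: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            char = stream.read(1)
            if not char:
                raise EOFError("file ended inside a word")
            if char == b" ":
                break
            if char != b"\n":  # skip the newline that ends the previous record
                word.extend(char)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("file ended inside a vector")
        vectors[word.decode("utf-8")] = list(struct.unpack(f"<{dim}f", raw))
    return vectors


# Tiny in-memory sample with two 3-dimensional vectors (made-up values).
sample = io.BytesIO(
    b"2 3\n"
    + b"immer " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"leer " + struct.pack("<3f", 0.5, 0.0, -1.0) + b"\n"
)
vectors = read_word2vec_binary(sample)
```

For the real file, pass `open("twe-adv-adj.bin", "rb")` instead of the sample. Storing 615,908 vectors as Python lists is memory-hungry, so a NumPy-backed loader such as gensim's is the better choice in practice.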

Other (English)

Research carried out in work package A03 of the SFB 833.

Files (501.7 MB)

Name	MD5	Size
CMDI.xml	922986f955e38a1874d6e7049c498cc0	23.9 kB
deu-adv-adj-readme.txt	061edf7bce2d64a30969dee3e0024ea5	2.6 kB
dev_text.txt	313337d4cd62a30eefcd61bdd98d60e5	99.6 kB
test_text.txt	0dbdbc2bbef60a8ce94d8af125599b8e	199.3 kB
train_text.txt	8ce871b9fa247f4ac67b89baa4e19e38	697.9 kB
twe-adv-adj.bin	931f86813142b3059ee38dbc0d6509b4	500.7 MB
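The MD5 checksums listed for the files can be used to check a download. The helper below is a minimal sketch with a hypothetical function name; it reads the file in chunks so that the 500.7 MB embedding file does not have to fit in memory at once.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("dev_text.txt")` should equal `313337d4cd62a30eefcd61bdd98d60e5` if the file was downloaded intact.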

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358
