Published May 1, 2019 | Version v1
Dataset Open

German Adverb-Adjective Phrase Dataset for Compositionality Tests (deu-adv-adj)

  • 1. University of Tübingen

Description

If you use this dataset for research purposes, please cite the following sources:

                 - Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).

                 - Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

The 23,488 German adverb-adjective phrases (split into 16,441 train, 4,701 test, and 2,346 dev instances) were extracted from the TüBa-D/DP treebank. The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018, and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012); it comprises 64.9M sentences and 1.3B tokens.

The dataset was constructed using the dependency annotations of the treebank. To collect the adverb-adjective phrases, all head-dependent pairs that fulfilled the following requirements were extracted:

                 - the head is an attributive or predicative adjective and governs the dependent with the adverb relation

                 - the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.
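The two requirements above can be sketched as a filter over one dependency-parsed sentence. This is a minimal illustration, not the authors' extraction code: the `Token` class, the tag set `ADJA`/`ADJD`, and the relation label `ADV` are assumptions made for the example, and the actual labels used in TüBa-D/DP may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical token record; not the treebank's actual data model."""
    index: int    # 1-based position in the sentence
    lemma: str
    pos: str      # part-of-speech tag
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency relation to the head


# Assumed labels for this sketch; check the treebank stylebook for the real ones.
ADJECTIVE_TAGS = {"ADJA", "ADJD"}   # attributive / predicative adjective
ADVERB_RELATION = "ADV"


def extract_pairs(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (dependent lemma, head lemma) pairs that satisfy both requirements."""
    by_index = {tok.index: tok for tok in sentence}
    pairs = []
    for dep in sentence:
        head = by_index.get(dep.head)
        if head is None:
            continue  # the root has no head token
        if (head.pos in ADJECTIVE_TAGS             # head is an adjective
                and dep.deprel == ADVERB_RELATION  # attached with the adverb relation
                and dep.index + 1 == head.index):  # dependent immediately precedes head
            pairs.append((dep.lemma, head.lemma))
    return pairs


# Toy sentence "der Saal ist immer leer" with made-up annotations.
sentence = [
    Token(1, "der", "ART", 2, "DET"),
    Token(2, "Saal", "NN", 3, "SUBJ"),
    Token(3, "sein", "VAFIN", 0, "ROOT"),
    Token(4, "immer", "ADV", 5, "ADV"),
    Token(5, "leer", "ADJD", 3, "PRED"),
]
pairs = extract_pairs(sentence)
print(pairs)
```

Because the filter checks only the relation label and not the part of speech of the dependent, it also admits adjectives that function as adverbs.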

The train/test/dev files have the following format; each line holds three fields separated by a space:

                 adverb adjective phrase, where the adverb and the adjective in the phrase field are joined by the string _adv_adj_ (e.g. immer leer immer_adv_adj_leer).
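A line in this format could be parsed as follows. This is a minimal sketch: the names `PhraseInstance` and `parse_line` are hypothetical, and the consistency check assumes that the phrase field is always exactly the adverb and adjective joined by the separator.

```python
from typing import NamedTuple

SEPARATOR = "_adv_adj_"


class PhraseInstance(NamedTuple):
    adverb: str
    adjective: str
    phrase: str


def parse_line(line: str) -> PhraseInstance:
    """Split one line into its three space-separated fields."""
    adverb, adjective, phrase = line.split()
    # Assumption: the phrase field is adverb + separator + adjective.
    if phrase != f"{adverb}{SEPARATOR}{adjective}":
        raise ValueError(f"inconsistent instance: {line!r}")
    return PhraseInstance(adverb, adjective, phrase)


instance = parse_line("immer leer immer_adv_adj_leer")
print(instance.adverb, instance.adjective, instance.phrase)
```

The three fields can then serve as lookup keys for the word and phrase embeddings.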

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word representations were trained on the lemmatized TüBa-D/DP treebank with the word2vec package. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013).

The embedding size is 200, the context is a symmetric window of 10 words, 25 negative samples were used, and the subsampling threshold was 0.0001.

Representations were trained only for words and phrases that occur at least 30 times. The final vocabulary contains 615,908 words.
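For reference, the hyperparameters above can be written down as an argument list for the original word2vec command-line tool. This is a sketch under assumptions: the record does not say which word2vec implementation or invocation was used, and the corpus file name `corpus.lemmas.txt` is hypothetical.

```python
from typing import List

# Hyperparameters stated in this record, expressed as word2vec CLI flags.
HYPERPARAMETERS = {
    "-cbow": "0",        # 0 selects the skip-gram model
    "-size": "200",      # embedding size
    "-window": "10",     # symmetric context window
    "-negative": "25",   # negative samples
    "-sample": "1e-4",   # subsampling threshold
    "-min-count": "30",  # minimum frequency
    "-binary": "1",      # write the binary word2vec format
}


def word2vec_command(corpus_path: str, output_path: str) -> List[str]:
    """Build an argument list suitable for subprocess.run (not executed here)."""
    args = ["word2vec", "-train", corpus_path, "-output", output_path]
    for flag, value in HYPERPARAMETERS.items():
        args += [flag, value]
    return args


# Hypothetical input file name; the output name is the one given in this record.
command = word2vec_command("corpus.lemmas.txt", "twe-adv-adj.bin")
print(" ".join(command))
```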

The resulting embeddings are stored in the binary word2vec format in twe-adv-adj.bin, which can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
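To illustrate what the binary word2vec format contains, here is a minimal standard-library reader. It is a sketch, not a replacement for an established loader: it assumes little-endian 32-bit floats and UTF-8 encoded words, and the function name `read_word2vec_binary` is hypothetical. The demo runs on a tiny in-memory file with made-up vectors rather than on twe-adv-adj.bin.

```python
import io
import struct
from typing import BinaryIO, Iterator, List, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a binary word2vec stream."""
    # Header line: "<vocabulary size> <dimensions>\n"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    for _ in range(vocab_size):
        # Each entry: word bytes, one space, then dim float32 values.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of stream while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may follow the previous vector
                word.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of stream while reading a vector")
        yield word.decode("utf-8"), list(struct.unpack(f"<{dim}f", raw))


# Tiny in-memory example with two 3-dimensional, made-up vectors.
demo = io.BytesIO(
    b"2 3\n"
    + b"immer " + struct.pack("<3f", 0.5, 0.25, 0.0) + b"\n"
    + b"immer_adv_adj_leer " + struct.pack("<3f", 1.0, -1.0, 2.0) + b"\n"
)
vectors = dict(read_word2vec_binary(demo))
print(vectors["immer_adv_adj_leer"])
```

For the full file, a dedicated loader is the more practical route; with gensim installed, `KeyedVectors.load_word2vec_format("twe-adv-adj.bin", binary=True)` reads the same format into memory-efficient arrays.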

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI.xml

Files (501.7 MB)

Checksum                              Size
md5:922986f955e38a1874d6e7049c498cc0  23.9 kB
md5:061edf7bce2d64a30969dee3e0024ea5  2.6 kB
md5:313337d4cd62a30eefcd61bdd98d60e5  99.6 kB
md5:0dbdbc2bbef60a8ce94d8af125599b8e  199.3 kB
md5:8ce871b9fa247f4ac67b89baa4e19e38  697.9 kB
md5:931f86813142b3059ee38dbc0d6509b4  500.7 MB

Additional details

Related works

Is part of
Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833:  Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy

Not specified.

Completeness

Not specified.

Conformity

Not specified.

Consistency

Not specified.

Credibility

Not specified.

Processability

Not specified.

Relevance

Not specified.

Timeliness

Not specified.

Understandability

Not specified.