German Adverb-Adjective Phrase Dataset for Compositionality Tests (deu-adv-adj)
Description
If you use this dataset for research purposes, please refer to the following sources:
- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.
The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.
The 23,488 German adverb-adjective phrases (split into 16,441 train, 4,701 test, and 2,346 dev instances) were extracted from the TüBa-D/DP treebank. The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018, and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012). It comprises 64.9M sentences and 1.3B tokens.
The dataset was constructed using the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:
- the head is an attributive or predicative adjective and governs the dependent with the adverb relation
- the dependent immediately precedes the head
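The two requirements above can be illustrated with a minimal sketch. It is not the authors' extraction code: the `Token` layout, the STTS-style tags `ADJA`/`ADJD`, and the relation label `ADV` are assumptions made for illustration, and the treebank's actual labels may differ.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical token record; the real treebank format may differ."""
    index: int      # 1-based position in the sentence
    lemma: str
    pos: str        # part-of-speech tag
    head: int       # index of the governing token (0 = root)
    relation: str   # dependency relation to the head


# Assumed labels, for illustration only.
ADJECTIVE_TAGS = {"ADJA", "ADJD"}  # attributive / predicative adjective
ADVERB_RELATION = "ADV"


def extract_adv_adj_pairs(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (dependent lemma, head lemma) pairs meeting both requirements."""
    by_index = {tok.index: tok for tok in sentence}
    pairs = []
    for dep in sentence:
        head = by_index.get(dep.head)
        if head is None:  # the root has no head token
            continue
        if (head.pos in ADJECTIVE_TAGS              # head is an adjective
                and dep.relation == ADVERB_RELATION  # attached as an adverb
                and dep.index + 1 == head.index):    # immediately precedes head
            pairs.append((dep.lemma, head.lemma))
    return pairs


# Toy sentence "Der Saal ist immer leer" with made-up annotations.
example = [
    Token(1, "der", "ART", 2, "DET"),
    Token(2, "Saal", "NN", 3, "SUBJ"),
    Token(3, "sein", "VAFIN", 0, "ROOT"),
    Token(4, "immer", "ADV", 5, "ADV"),
    Token(5, "leer", "ADJD", 3, "PRED"),
]
print(extract_adv_adj_pairs(example))  # → [('immer', 'leer')]
```

The filter checks the relation rather than the dependent's part of speech, so a dependent tagged as an adjective is still collected when it is attached with the adverb relation.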
The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.
The train, test, and dev files share the following format; the fields of each line are separated by a space:
adverb adjective phrase
where the adverb and the adjective in the phrase are joined by the string _adv_adj_ (e.g. immer leer immer_adv_adj_leer).
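A line in this format could be parsed as in the sketch below. The names `Instance`, `parse_line`, and `read_split` are hypothetical, and the consistency check assumes that the phrase field is always the adverb and adjective joined by `_adv_adj_`, as in the example above.

```python
from typing import Iterator, NamedTuple

SEPARATOR = "_adv_adj_"


class Instance(NamedTuple):
    adverb: str
    adjective: str
    phrase: str


def parse_line(line: str) -> Instance:
    """Split one line into its three space-separated fields."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    adverb, adjective, phrase = fields
    # Sanity check (an assumption about the data, not a documented guarantee).
    if phrase != f"{adverb}{SEPARATOR}{adjective}":
        raise ValueError(f"phrase does not match its parts: {line!r}")
    return Instance(adverb, adjective, phrase)


def read_split(path: str) -> Iterator[Instance]:
    """Yield one Instance per non-empty line of a train/test/dev file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


print(parse_line("immer leer immer_adv_adj_leer"))
```

The phrase field doubles as the vocabulary key under which a phrase would be looked up in the embeddings, so keeping it alongside the two parts avoids rebuilding the string later.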
For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.
The word representations were trained on the lemmatized TüBa-D/DP treebank with the word2vec package. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013).
The embedding size is 200, the context is a symmetric window of 10 words, and training used 25 negative samples and a sample probability of 0.0001.
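For orientation, the skip-gram objective with negative sampling from Mikolov et al. (2013) is reproduced below, with the number of negative samples set to the value reported above. Reading the "sample probability" of 0.0001 as the subsampling threshold t of that paper is an assumption; the word2vec implementation may apply a slightly different subsampling formula than the one published.

```latex
% Objective for an input word w_I and an observed context word w_O,
% with k = 25 negative samples drawn from the noise distribution P_n(w):
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right],
\qquad k = 25

% Subsampling of frequent words as given in the paper, where f(w_i) is the
% relative corpus frequency of w_i and t is the threshold (assumed t = 10^{-4}):
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```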
Representations were only trained for words and phrases with a minimum frequency of 30 occurrences. The final vocabulary contains 615,908 words.
The resulting embeddings are stored in the binary word2vec format in twe-adv-adj.bin, which can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
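To show what the binary word2vec format contains, here is a minimal standard-library reader. It is a sketch under stated assumptions: the layout follows the original word2vec tool (a text header with vocabulary size and dimensionality, then for each entry the word, a space, and the raw 32-bit floats), and the floats are assumed to be little-endian. The function name `read_word2vec_binary` is hypothetical.

```python
import io
import struct
from typing import BinaryIO, Dict, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, Tuple[float, ...]]:
    """Read all vectors from a stream in binary word2vec format."""
    vocab_size, dim = (int(field) for field in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            char = stream.read(1)
            if char == b"":
                raise EOFError("unexpected end of data while reading a word")
            if char == b" ":
                break
            if char != b"\n":  # skip the newline that ends the previous entry
                word.extend(char)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        # Assumption: 32-bit floats stored in little-endian byte order.
        vectors[word.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


# Tiny in-memory example with two made-up 3-dimensional vectors.
sample = (
    b"2 3\n"
    + b"immer " + struct.pack("<3f", 1.0, 2.0, 0.5) + b"\n"
    + b"leer " + struct.pack("<3f", 0.0, -1.0, 4.0) + b"\n"
)
print(read_word2vec_binary(io.BytesIO(sample))["immer"])  # → (1.0, 2.0, 0.5)
```

For the full file, a dictionary of Python tuples would be memory-hungry; with gensim installed, `KeyedVectors.load_word2vec_format("twe-adv-adj.bin", binary=True)` is the more practical route.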
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
CMDI.xml
Total size: 501.7 MB

| MD5 checksum | Size |
|---|---|
| 922986f955e38a1874d6e7049c498cc0 | 23.9 kB |
| 061edf7bce2d64a30969dee3e0024ea5 | 2.6 kB |
| 313337d4cd62a30eefcd61bdd98d60e5 | 99.6 kB |
| 0dbdbc2bbef60a8ce94d8af125599b8e | 199.3 kB |
| 8ce871b9fa247f4ac67b89baa4e19e38 | 697.9 kB |
| 931f86813142b3059ee38dbc0d6509b4 | 500.7 MB |
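A downloaded file can be checked against its listed MD5 checksum with a short script such as the sketch below. The helper names are hypothetical. Pairing the file name twe-adv-adj.bin with the checksum of the 500.7 MB entry is an assumption, since the listing does not show which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumed file/checksum pairing, see above):
# matches_checksum("twe-adv-adj.bin", "931f86813142b3059ee38dbc0d6509b4")
```

Reading in chunks keeps memory use constant, which matters for the embedding file of roughly 500 MB.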
Additional details
Related works
- Is part of: Data paper, DOI 10.57754/FDAT.721tn-jef87