Published May 1, 2019 | Version v1
Dataset Open

Dutch Adverb-Adjective Phrase Dataset for Compositionality Tests (nld-adv-adj)

  • 1. University of Tübingen

Description

 If you use this dataset for research purposes, please cite the following sources:

                 - Gertjan Van Noord, Gosse Bouma, Frank Van Eynde, Daniël De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, pages 147–164. Springer.

                 - Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

This dataset contains 4,540 Dutch adverb-adjective phrases (3,183 train, 907 test, 450 dev) extracted from the Lassy Large treebank (Van Noord et al., 2013), which consists of written texts (Wikipedia, newspapers) and texts from the medical domain.

The dataset was constructed using the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:

                 - the head is an attributive or predicative adjective and governs the dependent via the adverb relation

                 - the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.

The train, test and dev files are tab-separated, with three fields per line: adverb, adjective and adverb-adjective phrase (e.g. zeer moeizaam zeer_adv_adj_moeizaam).
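The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset distribution: the names `Entry` and `read_entries` are hypothetical, and the consistency check assumes that the third field is always the adverb and adjective joined by `_adv_adj_`, as in the example line.

```python
import io
from typing import Iterator, NamedTuple, TextIO

SEPARATOR = "_adv_adj_"  # separator named in the dataset description


class Entry(NamedTuple):
    adverb: str
    adjective: str
    phrase: str


def read_entries(handle: TextIO) -> Iterator[Entry]:
    """Yield one Entry per non-empty, tab-separated line."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        entry = Entry(*fields)
        # Assumption: the phrase field is adverb + separator + adjective.
        if entry.phrase != entry.adverb + SEPARATOR + entry.adjective:
            raise ValueError(f"line {line_no}: phrase does not match its parts")
        yield entry


# Usage with the example line from the description.
sample = io.StringIO("zeer\tmoeizaam\tzeer_adv_adj_moeizaam\n")
entries = list(read_entries(sample))
```

For a real file, pass an open text handle instead, for example `open(path, encoding="utf-8")`, where `path` points to one of the train, test or dev files.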

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

Word embeddings for all adverbs, adjectives and phrases are stored in the binary word2vec format in lassy-adv-adj.bin, which can be loaded by several packages (e.g. gensim; Řehůřek and Sojka, 2010).
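In practice the file would normally be loaded with gensim, via `KeyedVectors.load_word2vec_format("lassy-adv-adj.bin", binary=True)`. To illustrate what the binary word2vec layout looks like, the following standard-library sketch reads it directly. It assumes little-endian 32-bit floats and UTF-8 encoded words; the function name `read_word2vec_binary` and its `limit` parameter are hypothetical, and the demo bytes are synthetic rather than taken from the dataset.

```python
import io
import struct
from typing import BinaryIO, Dict, Optional, Tuple


def read_word2vec_binary(
    handle: BinaryIO, limit: Optional[int] = None
) -> Dict[str, Tuple[float, ...]]:
    """Read up to `limit` vectors from a binary word2vec stream."""
    # Header line: "<vocabulary size> <dimensions>\n"
    header = handle.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    count = vocab_size if limit is None else min(limit, vocab_size)

    vectors: Dict[str, Tuple[float, ...]] = {}
    for _ in range(count):
        word = bytearray()
        while True:
            ch = handle.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that follows the previous vector
                word.extend(ch)
        raw = handle.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        # Assumption: little-endian float32 values.
        vectors[word.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


# Synthetic two-word file with three dimensions, for illustration only.
demo = (
    b"2 3\n"
    + b"zeer " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"zeer_adv_adj_moeizaam " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n"
)
vectors = read_word2vec_binary(io.BytesIO(demo))
```

Storing every vector as a Python tuple is memory-hungry for a vocabulary of this size, so the `limit` argument is meant for quick inspection; a NumPy-backed loader such as gensim is the better choice for full use.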

The word embeddings were trained on the lemmatized Lassy Large treebank with the word2vec package. Representations for the adjectives, adverbs and phrases were trained jointly. For the phrase representations, the adverb and the adjective were concatenated into a single unit using the separator _adv_adj_. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013). The embedding size is 200, the context is a symmetric window of 10, the number of negative samples is 25, and the subsampling probability is 0.0001.

Representations were only trained for words and phrases with a minimum frequency of 30 occurrences. The total vocabulary size is 290,704.
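The training setup described above can be summarized in code. The sketch below is an illustration under stated assumptions, not the authors' pipeline: `HYPERPARAMS` maps the reported settings onto gensim 4.x `Word2Vec` argument names (the original embeddings were trained with the word2vec package), and `merge_phrases` is a hypothetical helper showing one plausible way to turn an adverb-adjective pair into a single token. It matches pairs by adjacency only, whereas the dataset selected them through the dependency annotations.

```python
from typing import Iterable, List, Set, Tuple

SEPARATOR = "_adv_adj_"  # separator named in the dataset description

# Settings reported in the description, keyed by gensim 4.x Word2Vec
# argument names (an assumption; the authors used the word2vec package).
HYPERPARAMS = {
    "sg": 1,             # skip-gram model
    "vector_size": 200,  # embedding size
    "window": 10,        # symmetric context window
    "negative": 25,      # negative samples
    "sample": 1e-4,      # subsampling probability
    "min_count": 30,     # minimum frequency
}


def merge_phrases(lemmas: Iterable[str], pairs: Set[Tuple[str, str]]) -> List[str]:
    """Replace adjacent (adverb, adjective) lemma pairs with one joined token."""
    tokens = list(lemmas)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + SEPARATOR + tokens[i + 1])
            i += 2  # skip both members of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Usage with a made-up lemmatized sentence.
sentence = ["het", "gaan", "zeer", "moeizaam"]
merged = merge_phrases(sentence, {("zeer", "moeizaam")})
```

With gensim installed, the settings could be passed as `Word2Vec(sentences, **HYPERPARAMS)`, where `sentences` is an iterable of token lists.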

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI.xml

Files (237.3 MB)

Checksum Size
md5:9c7713080bccb49740324352987b6f40 27.4 kB
md5:8b08c4282d77aa0d3f90e52d72c7a191 16.4 kB
md5:4b4e14137d86dc3e0ec3fd42607b78e9 237.1 MB
md5:353909fa3a8e8ccc161144302dc02494 2.8 kB
md5:a323a02bd2bb3f0d61049554213ea761 32.9 kB
md5:3eea61cb64fe41407fdc5fc9bf142d07 115.1 kB
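The MD5 checksums in the file listing can be used to verify a download. The following is a minimal sketch using only the standard library; the function names are hypothetical, and since the listing here does not show which checksum belongs to which file name, the pairing has to be taken from the repository page itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:9c77...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the 237.1 MB file.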

Additional details

Related works

Is part of
Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen (75650358)

Data quality

Accuracy

Not specified.

Completeness

Not specified.

Conformity

Not specified.

Consistency

Not specified.

Credibility

Not specified.

Processability

Not specified.

Relevance

Not specified.

Timeliness

Not specified.

Understandability

Not specified.