Dutch Adverb-Adjective Phrase Dataset for Compositionality Tests (nld-adv-adj)

de Kok, Daniël

doi:10.57754/FDAT.k84a2-rpj39

Published May 1, 2019 | Version v1

Dataset Open

de Kok, Daniël (Researcher)¹

1. University of Tübingen

If you want to use this dataset for research purposes, please refer to the following sources:

- Gertjan Van Noord, Gosse Bouma, Frank Van Eynde, Daniël De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, pages 147–164. Springer.

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

This dataset contains 4,540 Dutch adverb-adjective phrases (3,183 train, 907 test, 450 dev) extracted from the Lassy Large treebank (Van Noord et al., 2013), which consists of written texts (Wikipedia, newspapers) and texts from the medical domain.

The dataset was constructed using the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:

- the head is an attributive or predicative adjective and governs a dependent with the adverb relation

- the dependent immediately precedes the head
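
The two requirements above can be illustrated with a minimal sketch. The `Token` structure and the labels `ADJ_TAG` and `ADV_RELATION` are hypothetical stand-ins, not the treebank's actual annotation scheme, and the sketch does not distinguish attributive from predicative adjectives.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical labels; the treebank's real tag and relation names may differ.
ADJ_TAG = "ADJ"
ADV_RELATION = "ADV"


@dataclass
class Token:
    position: int         # 0-based position in the sentence
    word: str
    pos: str              # part-of-speech tag
    head: Optional[int]   # position of the governing token, None for the root
    relation: str         # dependency relation to the head


def extract_pairs(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (dependent, head) pairs that meet both requirements."""
    pairs = []
    for dep in sentence:
        if dep.head is None or dep.relation != ADV_RELATION:
            continue
        head = sentence[dep.head]
        # Requirement 1: the head is an adjective.
        # Requirement 2: the dependent immediately precedes the head.
        if head.pos == ADJ_TAG and dep.position + 1 == head.position:
            pairs.append((dep.word, head.word))
    return pairs
```

The dependent's own part of speech is deliberately not checked, so an adjective attached with the adverb relation would also be collected.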

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.

The train, test, and dev files share the same format: each entry has three tab-separated fields, namely the adverb, the adjective, and the adverb-adjective phrase (e.g. zeer moeizaam zeer_adv_adj_moeizaam).
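
As an illustration of the three-field format, the following sketch parses such files into named tuples. It assumes one entry per line and UTF-8 encoding, neither of which is stated explicitly here; `PhraseEntry`, `parse_entries`, and `read_split` are names made up for this example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class PhraseEntry(NamedTuple):
    adverb: str
    adjective: str
    phrase: str


def parse_entries(lines: Iterable[str]) -> Iterator[PhraseEntry]:
    """Yield one PhraseEntry per non-empty, tab-separated line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        yield PhraseEntry(*fields)


def read_split(path: str) -> List[PhraseEntry]:
    """Read a whole split file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_entries(handle))
```

For example, `read_split("train_text.txt")` would return the training entries as a list.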

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

Word embeddings for all adverbs, adjectives, and phrases are stored in the binary word2vec format in lassy-adv-adj.bin, which can be loaded by several packages (e.g. the gensim package of Řehůřek and Sojka (2010)).
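
For readers who only need a handful of vectors, the sketch below reads selected entries from a binary word2vec file using the standard library alone. It assumes the usual layout of that format (an ASCII header with vocabulary size and dimensionality, then each word followed by a space and its float32 values) and little-endian byte order; `read_word2vec_binary` is a hypothetical helper, not part of any package.

```python
import struct
from typing import BinaryIO, Dict, Iterable, List


def read_word2vec_binary(
    stream: BinaryIO, wanted: Iterable[str]
) -> Dict[str, List[float]]:
    """Return the vectors of the wanted words found in the stream."""
    wanted_set = set(wanted)
    # Header line: "<vocabulary size> <dimensionality>\n"
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vector_struct = struct.Struct(f"<{dim}f")  # little-endian is an assumption
    found: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        # Read the word byte by byte up to the separating space,
        # skipping the newline that may follow the previous vector.
        chars = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("unexpected end of file while reading a word")
            if byte == b" ":
                break
            if byte != b"\n":
                chars.extend(byte)
        raw = stream.read(vector_struct.size)
        if len(raw) != vector_struct.size:
            raise EOFError("unexpected end of file while reading a vector")
        word = chars.decode("utf-8")
        if word in wanted_set:
            found[word] = list(vector_struct.unpack(raw))
            if len(found) == len(wanted_set):
                break  # everything requested has been read
    return found
```

To load the complete vocabulary instead, gensim's `KeyedVectors.load_word2vec_format("lassy-adv-adj.bin", binary=True)` is the more convenient route.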

The word embeddings were trained on the lemmatized Lassy Large treebank with the word2vec package. Representations for the adjectives, adverbs, and phrases were trained jointly. For the phrase representations, the adverb and the adjective were concatenated into a single unit using the separator _adv_adj_. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013). The embedding size is 200, the context is a symmetric window of size 10, 25 negative samples were used, and the subsampling threshold is 0.0001.

Representations were only trained for words and phrases with a minimum frequency of 30 occurrences. The total vocabulary size is 290,704.
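
The phrase naming scheme and the training settings reported above can be summarized in a short sketch. The separator and the numeric values come from the description; the function name `phrase_key` and the dictionary key names are illustrative choices and do not correspond to the options of any particular tool.

```python
SEPARATOR = "_adv_adj_"

# Values as reported in the description; the key names are hypothetical.
TRAINING_SETTINGS = {
    "model": "skip-gram with negative sampling",
    "vector_size": 200,
    "window": 10,              # symmetric context window
    "negative_samples": 25,
    "subsampling": 0.0001,
    "min_frequency": 30,
}


def phrase_key(adverb: str, adjective: str) -> str:
    """Build the vocabulary entry under which a phrase vector is stored."""
    return adverb + SEPARATOR + adjective


print(phrase_key("zeer", "moeizaam"))  # zeer_adv_adj_moeizaam
```

A key built this way only has a vector if the phrase reached the minimum frequency, so lookups should be prepared for missing entries.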

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Files (237.3 MB)

Name	MD5	Size
CMDI.xml	9c7713080bccb49740324352987b6f40	27.4 kB
dev_text.txt	8b08c4282d77aa0d3f90e52d72c7a191	16.4 kB
lassy-adv-adj.bin	4b4e14137d86dc3e0ec3fd42607b78e9	237.1 MB
nld-adv-adj-readme.txt	353909fa3a8e8ccc161144302dc02494	2.8 kB
test_text.txt	a323a02bd2bb3f0d61049554213ea761	32.9 kB
train_text.txt	3eea61cb64fe41407fdc5fc9bf142d07	115.1 kB

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.
