Dutch Adjective-Noun Phrase Dataset for Compositionality Tests (nld-adj-n)

Dima, Corina

doi:10.57754/FDAT.n659d-e8r84

Published May 1, 2019 | Version v1

Dataset Open

Dima, Corina (Researcher)¹

1. University of Tübingen

If you use this dataset for research purposes, please cite the following sources:

- Gertjan Van Noord, Gosse Bouma, Frank Van Eynde, Daniël De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, pages 147–164. Springer.

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The dataset contains 83,392 Dutch adjective-noun phrases (58,347 train, 16,669 test, 8,376 dev) extracted from the Lassy Large treebank (Van Noord et al., 2013), which consists of written texts (Wikipedia, newspapers) and texts from the medical domain.

Each line of the train, test, and dev files has three tab-separated fields: the adjective, the noun, and the adjective-noun phrase, in which the adjective and the noun are joined by the string _adj_n_ (e.g. politiek, verlof, and politiek_adj_n_verlof).
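The three-field line format described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset distribution: the names `AdjNounEntry`, `parse_line`, and `read_split` are hypothetical, and the consistency check assumes that the phrase field is always exactly the adjective, the separator, and the noun concatenated.

```python
from typing import Iterable, Iterator, NamedTuple

SEPARATOR = "_adj_n_"  # separator string named in the dataset description


class AdjNounEntry(NamedTuple):
    adjective: str
    noun: str
    phrase: str


def parse_line(line: str) -> AdjNounEntry:
    """Split one tab-separated line into (adjective, noun, phrase)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    adjective, noun, phrase = fields
    # Assumption: the phrase field is adjective + separator + noun.
    if phrase != f"{adjective}{SEPARATOR}{noun}":
        raise ValueError(f"phrase field does not match its parts: {line!r}")
    return AdjNounEntry(adjective, noun, phrase)


def read_split(lines: Iterable[str]) -> Iterator[AdjNounEntry]:
    """Parse every non-empty line of an open file or any iterable of lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)
```

Passing an open file handle, for example `open("train_text.txt", encoding="utf-8")`, to `read_split` yields one entry per phrase; the UTF-8 encoding is an assumption and is not stated in the description.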

For results of different composition models on this dataset, see Dima et al. (2019).

The word embeddings were trained on the same treebank; the training corpus consists of 47.6M sentences and 700M tokens. Because an adjective and its noun are separate words in the text, each phrase was concatenated into a single unit (using the separator _adj_n_) so that phrase representations could be trained.

The embeddings were learned with the skip-gram model with negative sampling (Mikolov et al., 2013) from the word2vec package, using an embedding size of 200, a symmetric context window of 10, 25 negative samples, and a sample probability of 0.0001.

Representations were trained only for words and phrases that occur at least 30 times. The embeddings are stored in the binary word2vec format in lassy-adjn-lemmas.bin, which can be loaded by several packages (e.g. gensim; Řehůřek and Sojka, 2010). The vocabulary contains 355,236 words.
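In practice the file is most easily loaded with gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`. To illustrate what the binary word2vec format contains, the following dependency-free sketch reads it with the standard library only. It assumes the layout written by the original word2vec tool (a text header with vocabulary size and dimension, then each entry as the word, a space, and the raw vector), little-endian 32-bit floats, and UTF-8 encoded words; the function name `read_word2vec_binary` is hypothetical.

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def read_word2vec_binary(
    stream: BinaryIO,
) -> Iterator[Tuple[str, Tuple[float, ...]]]:
    """Yield (word, vector) pairs from a binary word2vec stream."""
    # Header line: "<vocabulary size> <dimension>\n"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    fmt = f"<{dim}f"  # assumption: little-endian float32 values
    vector_bytes = struct.calcsize(fmt)
    for _ in range(vocab_size):
        # Read the word byte by byte up to the separating space,
        # skipping the newline that may follow the previous vector.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended before all entries were read")
            if ch == b" ":
                break
            if ch != b"\n":
                word += ch
        raw = stream.read(vector_bytes)
        if len(raw) != vector_bytes:
            raise EOFError("file ended inside a vector")
        yield word.decode("utf-8"), struct.unpack(fmt, raw)
```

Opened with `open("lassy-adjn-lemmas.bin", "rb")`, the stream should, going by the description above, report 355,236 entries of dimension 200, with phrase entries keyed in the form politiek_adj_n_verlof.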

Files

Files (292.5 MB)

Name	MD5	Size
CMDI.xml	fa3b6d0b46481e516d154bc06795f7e4	25.2 kB
dev_text.txt	b7445b27c4973078cd8fd8a8542973e4	338.0 kB
lassy-adjn-lemmas.bin	530a738ac3d135e34befe8e005ca48bb	289.1 MB
nld-adj-n-readme.txt	b45fe7775ecd6ec35d53ed96dfd22e46	2.4 kB
test_text.txt	a50d71c685fc6d352fa5202e38d84d97	672.5 kB
train_text.txt	775e6ef01db9c6cef0393e3bba0bd295	2.4 MB
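The MD5 checksums in the file listing can be used to confirm that a download is complete. The sketch below computes the checksum of a file in chunks, so that the 289.1 MB embedding file does not have to be held in memory; the function name `md5_hex` and the chunk size are illustrative choices, not part of the dataset.

```python
import hashlib
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream."""
    digest = hashlib.md5()
    # Read fixed-size chunks until the stream is exhausted.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

For example, the result of `md5_hex(open("lassy-adjn-lemmas.bin", "rb"))` would be compared against the listed value 530a738ac3d135e34befe8e005ca48bb.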

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358
