English Adverb-Adjective Phrase Dataset for Compositionality Tests (eng-adv-adj)

de Kok, Daniël

doi:10.57754/FDAT.myevg-4a328

Published May 1, 2019 | Version v1

Dataset Open

de Kok, Daniël (Researcher)¹

1. University of Tübingen

If you use this dataset for research purposes, please cite the following sources:

- Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.

- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

This dataset contains 238,975 English adjective-noun phrases (split into 167,292 train, 47,803 test, 23,880 dev instances) that were automatically extracted from the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015).
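
As a quick sanity check on the figures above, the following sketch confirms that the three splits add up to the stated total and shows their relative shares (roughly 70/20/10). The variable names are illustrative only; the counts are taken from the description.

```python
# Split sizes as stated in the dataset description.
splits = {"train": 167_292, "test": 47_803, "dev": 23_880}

# The three splits should add up to the stated total of 238,975 phrases.
total = sum(splits.values())

# Relative share of each split.
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} instances ({shares[name]:.1%})")
print(f"total: {total}")
```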

The phrases were extracted with the help of the part-of-speech tag information provided by the treebank.

Each line of the train/test/dev files consists of three tab-separated fields: the adjective, the noun, and the adjective-noun phrase, in which the adjective and the noun are joined by the string _adj_n_ (e.g. good networking good_adj_n_networking).
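
A minimal sketch of reading this format in Python is given below. It assumes exactly three tab-separated fields per line, as described above. The names `PhraseRecord` and `parse_phrase_lines` are hypothetical helpers introduced for illustration, and the joining string is a parameter in case the distributed files use a different marker.

```python
from typing import Iterable, Iterator, NamedTuple


class PhraseRecord(NamedTuple):
    """One line of a train/test/dev file (hypothetical helper type)."""
    adjective: str
    noun: str
    phrase: str


def parse_phrase_lines(
    lines: Iterable[str], joiner: str = "_adj_n_"
) -> Iterator[PhraseRecord]:
    """Yield one record per non-empty line of tab-separated text.

    Raises ValueError if a line does not have three fields or if the
    phrase field is not the two words joined by `joiner`.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        adjective, noun, phrase = fields
        if phrase != adjective + joiner + noun:
            raise ValueError(f"line {line_no}: phrase does not match its parts")
        yield PhraseRecord(adjective, noun, phrase)


# Example using the line quoted in the description.
records = list(parse_phrase_lines(["good\tnetworking\tgood_adj_n_networking\n"]))
print(records[0].phrase)
```

To read one of the distributed files, an open text file handle can be passed in place of the list, for instance `open("train_text.txt", encoding="utf-8")`; the UTF-8 encoding is an assumption.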

For results of different composition models on this dataset, see Dima et al. (2019).

The word embeddings were trained on ENCOW16AX, which contains crawled web data from different sources. To avoid noisy data, the training corpus was filtered to contain only sentences with a document quality of a or b.

To ensure that trained embeddings are available for a sufficient number of adjective-noun phrases, the embeddings were trained on word forms rather than lemmas. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens.

The embeddings for the adjectives, nouns and phrases were trained jointly with the word2vec package (Mikolov et al., 2013), using the skip-gram model with negative sampling, a symmetric context window of 10, 25 negative samples per positive training instance and a subsampling threshold of 0.0001. The resulting embeddings have 200 dimensions and the vocabulary size is 478,372.

The minimum frequency cut-off was set to 50 for all words and phrases.
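
The training settings listed above map onto command-line options of the original word2vec tool. The sketch below assembles such a command as an argument list. The exact invocation used by the authors is not given in this description, so the file paths, the executable name and the explicit `-hs 0` (hierarchical softmax disabled, since negative sampling is used) are assumptions.

```python
from typing import List

# Settings from the description, expressed as word2vec command-line options.
HYPERPARAMETERS = {
    "-cbow": "0",        # 0 selects the skip-gram model
    "-negative": "25",   # negative samples per positive training instance
    "-hs": "0",          # assumption: hierarchical softmax switched off
    "-window": "10",     # context window on each side of the target word
    "-sample": "1e-4",   # subsampling threshold for frequent words
    "-size": "200",      # embedding dimension
    "-min-count": "50",  # minimum frequency cut-off
    "-binary": "1",      # write the binary output format
}


def build_word2vec_command(
    corpus_path: str, output_path: str, executable: str = "word2vec"
) -> List[str]:
    """Return the argument list for a word2vec training run (not executed here)."""
    command = [executable, "-train", corpus_path, "-output", output_path]
    for flag, value in HYPERPARAMETERS.items():
        command += [flag, value]
    return command


# Hypothetical paths, for illustration only.
command = build_word2vec_command("corpus.txt", "embeddings.bin")
print(" ".join(command))
```

The list could be handed to `subprocess.run(command, check=True)` if a word2vec binary is installed; this sketch only builds it.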

The embeddings are stored in the binary word2vec format in encow-adj-n.bin. This format can be loaded by several packages, e.g. gensim (Řehůřek and Sojka, 2010).
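
With gensim, the usual route is `KeyedVectors.load_word2vec_format(path, binary=True)`. To illustrate what the binary format contains, the sketch below is a small standard-library reader. It assumes the common layout written by the word2vec tool: a text header with vocabulary size and dimension, then each word followed by a space and its vector as little-endian 32-bit floats. The function name `read_word2vec_binary` and the toy data are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Dict, List, Optional


def read_word2vec_binary(
    stream: BinaryIO, limit: Optional[int] = None
) -> Dict[str, List[float]]:
    """Read up to `limit` entries from a word2vec binary stream."""
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    count = vocab_size if limit is None else min(limit, vocab_size)
    vectors: Dict[str, List[float]] = {}
    for _ in range(count):
        # The word is terminated by a space; newlines left over from the
        # previous entry are skipped.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":
                chars.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        # Assumption: vectors are stored as little-endian float32 values.
        vectors[chars.decode("utf-8", errors="replace")] = list(
            struct.unpack(f"<{dim}f", raw)
        )
    return vectors


# Toy in-memory file with two 3-dimensional entries (made-up values).
demo_bytes = (
    b"2 3\n"
    + b"good " + struct.pack("<3f", 1.0, 2.0, 0.5) + b"\n"
    + b"good_adj_n_networking " + struct.pack("<3f", -1.0, 0.25, 4.0) + b"\n"
)
demo_vectors = read_word2vec_binary(io.BytesIO(demo_bytes))
print(sorted(demo_vectors))
```

For the distributed file, the stream would be opened with `open(path, "rb")`. Passing a small `limit` is advisable here, since Python lists are a memory-hungry representation for a vocabulary of this size.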

Files (226.6 MB)

Name	MD5	Size
CMDI.xml	11dc7932aa27e1aecbc4af250ced72fe	27.1 kB
dev_text.txt	ad980682e1708ef7cb8afd3984bf158e	90.2 kB
encow-adv-adj.bin	9f095e919bb6280ced2e66a44c2897be	225.7 MB
eng-adv-adj-reame.txt	e252c10b972be5ff0b5c4312a7525842	3.5 kB
test_text.txt	1401381526c7c1ac3748526b6dea6ad8	180.2 kB
train_text.txt	6b69d0fdd725f9762315f424994ad2a2	631.2 kB

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding: Deutsche Forschungsgemeinschaft, SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen (75650358)

