English Adjective-Noun Phrase Dataset for Compositionality Tests (eng-adj-n)
Description
If you want to use this dataset for research purposes, please refer to the following sources:
- Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.
- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.
The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.
This dataset contains 238,975 English adjective-noun phrases (split into 167,292 train, 47,803 test, 23,880 dev instances) that were automatically extracted from the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015).
The phrases were extracted using the part-of-speech tags provided by the treebank.
Each line of the train/test/dev files has three tab-separated fields: the adjective, the noun, and the adjective-noun phrase. In the phrase field, the adjective and the noun are joined by the string _adj_n_ (e.g. good networking good_adj_n_networking).
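The layout above can be sketched with a minimal parser; the example line is taken from the description, and the sanity check simply restates the stated relation between the three fields:

```python
# Parse one line of the train/test/dev files: adjective, noun, and the
# combined phrase token in which the two parts are joined by "_adj_n_".
def parse_line(line):
    adjective, noun, phrase = line.rstrip("\n").split("\t")
    # Sanity check: the phrase token should equal adjective + "_adj_n_" + noun.
    assert phrase == adjective + "_adj_n_" + noun
    return adjective, noun, phrase

example = "good\tnetworking\tgood_adj_n_networking"
adjective, noun, phrase = parse_line(example)
```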
For results of different composition models on this dataset, see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.
The word embeddings were trained on ENCOW16AX, which contains crawled web data from different sources. The training corpus was filtered to only contain sentences with a document quality of a or b to avoid noisy data.
To ensure that embeddings are available for enough adjective-noun phrases, the embeddings were trained on word forms instead of lemmas. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens.
The embeddings for the adjectives, nouns, and phrases were trained jointly with the word2vec package (Mikolov et al., 2013), using the skip-gram model with negative sampling, a symmetric context window of 10, 25 negative samples per positive training instance, and a subsampling threshold of 0.0001. The resulting embeddings have 200 dimensions, and the vocabulary size is 478,372.
The minimum frequency cut-off was set to 50 for all words and phrases.
The embeddings are stored in the binary format of word2vec in encow-adj-n.bin. This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
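The word2vec binary format consists of a text header line "vocab_size dim" followed by, for each vocabulary entry, a space-terminated word and dim little-endian float32 values. A minimal reader is sketched below; it is demonstrated on a small in-memory buffer with illustrative values, not the real data. For the actual file you would pass an open handle to encow-adj-n.bin, or simply use gensim's KeyedVectors.load_word2vec_format(..., binary=True).

```python
import io
import struct

def read_word2vec_bin(f):
    """Parse the word2vec binary format: a 'vocab_size dim' header line,
    then for each entry a space-terminated word and dim float32 values."""
    vocab_size, dim = (int(x) for x in f.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = f.read(1)
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that precedes each entry
                word += ch
        vec = struct.unpack("<%df" % dim, f.read(4 * dim))
        vectors[word.decode("utf-8")] = vec
    return vectors

# Tiny in-memory buffer in the same layout (illustrative values only):
buf = io.BytesIO()
buf.write(b"2 3\n")
buf.write(b"good_adj_n_networking " + struct.pack("<3f", 0.1, 0.2, 0.3) + b"\n")
buf.write(b"good " + struct.pack("<3f", 0.4, 0.5, 0.6) + b"\n")
buf.seek(0)
vecs = read_word2vec_bin(buf)
```

Note that phrase vectors are looked up with the same _adj_n_ key convention used in the train/test/dev files.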
Research carried out in work package A03 of the SFB 833.
Files
Size | MD5 checksum |
---|---|
1.8 MiB | 3cbdc5fad50b3e49f015fd998a58a7d3 |
900.0 KiB | e6bd54aec5f35e6fb917bc0054ce701e |
2.9 KiB | d2c7a67ad9458501a860f1d7e60251ef |
6.2 MiB | ba2142066543aaecfb2e155efb1ece3e |
25.9 KiB | b14ac4b09ed14aff10c67a485a111c72 |
372.5 MiB | 5d59d00b17b1bdf6813cadcfff9a5d9c |