English Adjective-Noun Phrase Dataset for Compositionality Tests
-----------------------------------------------------------------

If you want to use this dataset for research purposes, please refer to the following sources:
- Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.
- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).
- Corina Dima, Daniël de Kok, Neele Witte, and Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.
The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

This dataset contains 238,975 English adjective-noun phrases (split into 167,292 train, 47,803 test, and 23,880 dev instances) that were automatically extracted from the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015).
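As a quick sanity check (a sketch, not part of the distribution), the split sizes add up to the reported total and correspond to a roughly 70/20/10 train/test/dev split:

```python
# Sanity check on the split sizes reported above.
train, test, dev = 167_292, 47_803, 23_880
total = train + test + dev

assert total == 238_975
# Proportions of each split (roughly 70/20/10)
print(round(train / total, 2), round(test / total, 2), round(dev / total, 2))  # 0.7 0.2 0.1
```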
The phrases were extracted with the help of the part-of-speech tag information provided by the treebank. 
The train/test/dev files have the following format, with the individual parts separated by spaces:

adjective noun adj-noun phrase

where, in the phrase, the adjective and the noun are joined by the string _adj_n_ (e.g. good networking good_adj_n_networking).
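Reading such a line can be sketched as follows (the helper name `parse_line` is my own, not part of the distribution):

```python
def parse_line(line):
    """Split one dataset line into (adjective, noun, phrase)."""
    adjective, noun, phrase = line.rstrip("\n").split(" ")
    # The phrase is the adjective and the noun joined by "_adj_n_".
    assert phrase == f"{adjective}_adj_n_{noun}"
    return adjective, noun, phrase

print(parse_line("good networking good_adj_n_networking"))
# ('good', 'networking', 'good_adj_n_networking')
```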
For results of different composition models on this dataset, see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word embeddings were trained on ENCOW16AX, which contains crawled web data from different sources. To avoid noisy data, the training corpus was filtered to contain only sentences from documents with a quality rating of 'a' or 'b'.
To ensure that trained word embeddings are available for enough adjective-noun phrases, the embeddings were trained on word forms instead of lemmas. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens.
The embeddings for the adjectives, nouns and phrases were trained jointly with the word2vec package (Mikolov et al., 2013), using the skip-gram model with negative sampling, a symmetric context window of 10, 25 negative samples per positive training instance, and a sample probability threshold of 0.0001. The resulting embeddings have 200 dimensions, and the vocabulary size is 478,372.
The minimum frequency cut-off was set to 50 for all words and phrases. 
The embeddings are stored in the binary format of word2vec in encow-adj-n.bin. This format can be loaded by several packages, e.g. the gensim package (Řehůřek and Sojka, 2010).
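With gensim, for instance, the file can be loaded via `KeyedVectors.load_word2vec_format("encow-adj-n.bin", binary=True)`. For illustration, below is a stdlib-only sketch of the word2vec binary layout itself: a header line "<vocab_size> <dim>", then, per word, the token, a space, and dim little-endian float32 values. The helper names and the toy vocabulary are my own, not part of the dataset:

```python
import io
import struct

def write_word2vec_bin(vectors, fh):
    """Write {word: [floats]} in word2vec binary format (a sketch)."""
    dim = len(next(iter(vectors.values())))
    fh.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vec in vectors.items():
        # Token, a space, dim little-endian float32 values, then a newline.
        fh.write(word.encode("utf-8") + b" ")
        fh.write(struct.pack(f"<{dim}f", *vec))
        fh.write(b"\n")

def read_word2vec_bin(fh):
    """Read the format back into {word: tuple_of_floats}."""
    vocab_size, dim = map(int, fh.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        token = bytearray()
        while (ch := fh.read(1)) != b" ":
            if ch != b"\n":  # skip the newline that ends the previous entry
                token.extend(ch)
        vectors[token.decode("utf-8")] = struct.unpack(f"<{dim}f", fh.read(4 * dim))
    return vectors

# Round trip with a toy 3-dimensional vocabulary.
buf = io.BytesIO()
write_word2vec_bin({"good": [1.0, 0.0, 0.5], "good_adj_n_networking": [0.5, 0.5, 0.5]}, buf)
buf.seek(0)
restored = read_word2vec_bin(buf)
print(sorted(restored))  # ['good', 'good_adj_n_networking']
```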