Published May 1, 2019 | Version v1
Dataset Open

English Adjective-Noun Phrase Dataset for Compositionality Tests (eng-adj-n)

  • University of Tübingen

Description

 If you use this dataset for research purposes, please cite the following sources:

                 - Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.

                 - Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).

                 - Corina Dima, Daniël de Kok, Neele Witte, and Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

This dataset contains 238,975 English adjective-noun phrases (split into 167,292 train, 47,803 test, and 23,880 dev instances) that were automatically extracted from the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015).

The phrases were extracted with the help of the part-of-speech tag information provided by the treebank.

Each line of the train/test/dev files contains three tab-separated fields: the adjective, the noun, and the adjective-noun phrase. In the phrase field, the adjective and the noun are joined by the string _adj_n_ (e.g. good, networking, good_adj_n_networking).
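As an illustration, a split file can be read as in the following minimal sketch. It assumes UTF-8 plain text, and the file name train.txt is a hypothetical placeholder for the actual file in the download:

    # Minimal sketch for reading one split file ("train.txt" is a hypothetical name).
    # Each line holds three tab-separated fields: adjective, noun, joined phrase.
    with open("train.txt", encoding="utf-8") as f:
        for line in f:
            adjective, noun, phrase = line.rstrip("\n").split("\t")
            # Based on the format described above, the phrase field joins the
            # adjective and the noun with "_adj_n_", e.g. "good_adj_n_networking".
            assert phrase == adjective + "_adj_n_" + noun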

For the results of different composition models on this dataset, see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word embeddings were trained on ENCOW16AX, which contains crawled web data from different sources. The training corpus was filtered to only contain sentences with a document quality of a or b to avoid noisy data.

To ensure that trained word embeddings for enough adjective-noun phrases are available, the embeddings were trained on word forms instead of lemmas. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens.

The embeddings for the adjectives, nouns and phrases were trained jointly with the word2vec package (Mikolov et al., 2013), using the skip-gram model with negative sampling, a symmetric window of 10 as context size, 25 negative samples per positive training instance and a sample probability threshold of 0.0001. The resulting embeddings have a dimension of 200 and the vocabulary size is 478,372.

The minimum frequency cut-off was set to 50 for all words and phrases.
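For readers who prefer gensim over the original word2vec package, the sketch below shows a roughly equivalent configuration of the hyperparameters listed above. The corpus path is a hypothetical placeholder, and this is not the exact command used to produce the released embeddings:

    # Sketch of an equivalent skip-gram setup in gensim (assumes the gensim 4.x API);
    # the released embeddings were trained with the word2vec package of Mikolov et al.
    from gensim.models import Word2Vec

    model = Word2Vec(
        corpus_file="encow16ax_filtered.txt",  # hypothetical path to the tokenized corpus
        vector_size=200,  # embedding dimension
        sg=1,             # skip-gram model
        negative=25,      # negative samples per positive instance
        window=10,        # symmetric context window of 10
        sample=1e-4,      # sub-sampling threshold of 0.0001
        min_count=50,     # minimum frequency cut-off
        workers=4,
    )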

The embeddings are stored in the binary format of word2vec in encow-adj-n.bin. This format can be loaded by several packages (e.g. the gensim package of Řehůřek and Sojka, 2010).
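For example, with gensim the binary file can be loaded roughly as follows; this is a sketch that assumes the example keys from above are in the vocabulary:

    # Sketch: loading the released embeddings with gensim.
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("encow-adj-n.bin", binary=True)
    print(kv["good_adj_n_networking"][:5])                 # first dimensions of the phrase vector
    print(kv.similarity("good", "good_adj_n_networking"))  # adjective vs. phrase similarity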

Other (English)

Research carried out in work package A03 of the SFB 833.

Files (381.3 MiB)

CMDI.xml

MD5 checksum                             Size
md5:3cbdc5fad50b3e49f015fd998a58a7d3     1.8 MiB
md5:e6bd54aec5f35e6fb917bc0054ce701e     900.0 KiB
md5:d2c7a67ad9458501a860f1d7e60251ef     2.9 KiB
md5:ba2142066543aaecfb2e155efb1ece3e     6.2 MiB
md5:b14ac4b09ed14aff10c67a485a111c72     25.9 KiB
md5:5d59d00b17b1bdf6813cadcfff9a5d9c     372.5 MiB

Additional details

Created:
August 6, 2024
Modified:
August 8, 2024