English Nominal Compounds Dataset for Compositionality Tests (eng-nn)
Description
If you want to use this dataset for research purposes, please refer to the following sources:
- Stephen Tratz. 2011. Semantically-enriched parsing for natural language understanding. Ph.D. thesis, University of Southern California.
- Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.
- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).
- Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.
The dataset comprises 16,978 English nominal compounds (11,824 train, 3,481 test, 1,673 dev). It contains data from an existing compound dataset (Tratz, 2011), which is available here: [https://www.isi.edu/publications/licensed-sw/fanseparser/index.html] and is provided under the Apache License 2.0.
Additionally, a selection of nominal compounds from the English WordNet 3.1 (Fellbaum, 1998) was added. Each line of the train/test/dev files holds three fields separated by a single space: modifier head compound (e.g. space center space_center).
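The three-field line format described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset distribution: the names `Compound`, `parse_compound_line` and `read_compounds` are hypothetical, and UTF-8 encoding of the split files is an assumption.

```python
from typing import List, NamedTuple


class Compound(NamedTuple):
    """One dataset entry: modifier, head and the underscore-joined compound."""
    modifier: str
    head: str
    compound: str


def parse_compound_line(line: str) -> Compound:
    """Parse one 'modifier head compound' line."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 space-separated fields, got {len(parts)}: {line!r}"
        )
    return Compound(*parts)


def read_compounds(path: str) -> List[Compound]:
    """Read a train/test/dev file, skipping blank lines (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_compound_line(line) for line in handle if line.strip()]


# The example line given in the description.
example = parse_compound_line("space center space_center")
print(example.modifier, example.head, example.compound)
```

Raising on malformed lines instead of skipping them makes it easier to notice a file that does not follow the expected layout.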
For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.
The word embeddings were trained on a subcorpus of the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015) that contains only sentences with a document quality of a or b. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens. Compounds written as two space-separated words were merged into a single unit for embedding training by joining the two constituents with an underscore. The embeddings for the individual words were trained on the remaining occurrences of the constituents.
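The merging step can be illustrated as follows. This is only a sketch of the idea: the description does not specify how compound occurrences were matched in the corpus, so the greedy left-to-right scan, the exact string match and the function name `merge_compounds` are all assumptions.

```python
from typing import List, Set, Tuple


def merge_compounds(tokens: List[str], compounds: Set[Tuple[str, str]]) -> List[str]:
    """Join adjacent (modifier, head) pairs listed in `compounds` with an underscore.

    Assumption: pairs are matched greedily from left to right on exact strings.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in compounds:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2  # both constituents are consumed by the merged unit
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "the space center opened a new center".split()
print(merge_compounds(sentence, {("space", "center")}))
```

In the example sentence, the first two constituents become the single token `space_center`, while the later stand-alone `center` is left untouched and therefore still counts as an occurrence of the single word.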
If you're using the dataset or word embeddings for research purposes, please refer to the sources mentioned above.
The word embeddings for all constituents and compounds in this dataset are stored in the word2vec format in encow-sample-compounds.bin. This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010). The embeddings for the constituents and compounds were trained with the word2vec package using the skip-gram model with negative sampling (Mikolov et al., 2013), with an embedding dimension of 200, a symmetric window of 10, 25 negative samples per positive training instance and a subsampling threshold of 0.0001. The minimum frequency cut-off was set to 50 for all words, and the vocabulary size amounts to 270,940 words.
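In practice the file is most easily loaded with gensim, for instance via `KeyedVectors.load_word2vec_format(path, binary=True)`. To show what the binary word2vec layout looks like, here is a dependency-free sketch of a reader. It assumes the usual layout written by the word2vec tool (an ASCII header line with vocabulary size and dimension, then for each entry the word, a space, and the vector as 32-bit floats) and little-endian byte order. The function name `read_word2vec_binary` is hypothetical, and the demo uses a tiny synthetic file with made-up values, not the real embeddings.

```python
import io
import struct
from typing import BinaryIO, Dict, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, Tuple[float, ...]]:
    """Read vectors from a word2vec binary stream (little-endian float32 assumed)."""
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(field) for field in header.split())
    vectors: Dict[str, Tuple[float, ...]] = {}
    for _ in range(vocab_size):
        word_bytes = bytearray()
        while True:
            char = stream.read(1)
            if not char:
                raise EOFError("unexpected end of stream while reading a word")
            if char == b" ":
                break
            if char != b"\n":  # skip the newline that may follow the previous vector
                word_bytes.extend(char)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of stream while reading a vector")
        vectors[word_bytes.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


# Synthetic two-word, three-dimensional example (values are made up).
demo = io.BytesIO(
    b"2 3\n"
    + b"space " + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n"
    + b"space_center " + struct.pack("<3f", 1.0, 0.25, -0.5) + b"\n"
)
vectors = read_word2vec_binary(demo)
print(sorted(vectors))
```

For the real file, with 270,940 entries of dimension 200, storing vectors as Python tuples is memory-hungry; an array-based loader such as the one in gensim is the more practical choice.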
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
CMDI.xml
Files (220.0 MB)
| MD5 checksum | Size |
|---|---|
| af0383cf3f0c9970b0e7ebca6eba8962 | 31.8 kB |
| f286331b3c691a088be3f93759f8912e | 48.6 kB |
| d4b9e464c53710efea5c122e13132c34 | 219.5 MB |
| f24425e6ed7afe120e37c82a3e79cc28 | 3.0 kB |
| e7a845c8ef0e9b384e9c856720d7df9d | 100.8 kB |
| 8d5ffa90864db58851fce35a09781ada | 344.6 kB |
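A downloaded file can be checked against its MD5 checksum from the listing above. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical. Because the listing does not show which checksum belongs to which file name, the path and the expected value must be supplied by the user.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for the large embedding file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected value, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `matches_checksum(local_path, "d4b9e464c53710efea5c122e13132c34")` returns `True` only if the local copy is byte-identical to the file with that checksum in the listing.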
Additional details
Related works
- Is part of
- Data paper: 10.57754/FDAT.721tn-jef87 (DOI)