English Nominal Compounds Dataset for Compositionality Tests (eng-nn)

Dima, Corina

doi:10.57754/FDAT.4kzrx-mf324

Published May 1, 2019 | Version v1

Data paper Open

English Nominal Compounds Dataset for Compositionality Tests (eng-nn)

Dima, Corina (Researcher)¹

1. University of Tübingen

If you want to use this dataset for research purposes, please refer to the following sources:

- Stephen Tratz. 2011. Semantically-enriched parsing for natural language understanding. Ph.D. thesis, University of Southern California.

- Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.

- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).

- Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The 16,978 English nominal compounds (11,824 train, 3,481 test, 1,673 dev) contains data from an existing compound dataset (Tratz, 2011), which is available here: [https://www.isi.edu/publications/licensed-sw/fanseparser/index.html] and which is provided under the Apache License 2.0.

Additionally, a selection of nominal compounds from the English WordNet 3.1 was added. The train/test/dev files have the following format, single parts separated by space: modifier head compound (e.g. space center space_center).

For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word embeddings were trained on a subcorpus of the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015), which contains only sentences with a document quality of a or b. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens. The compounds that were separated by a space were merged into a single unit for the embedding training, by artificially connecting the two constituents via an underscore. The embeddings for the single words were trained on the remaining occurrences of the constituents.

If you're using the dataset or word embeddings for research purposes, please refer to the sources mentioned above.

The word embeddings for all constituents and compounds in this dataset are stored in the word2vec format in encow-sample-compounds.bin. This format can be loaded by several packages (e.g. the gensim package of Řehůřek, Radim and Petr Sojka (2010)). The embeddings for the constituents and compounds were trained with the word2vec package using the skipgram model with negative sampling (Mikolov et al. 2013) with an embedding dimension of 200, symmetric window of 10, 25 negative samples per positive training instance and a sample probability threshold of 0.0001. The minimum frequency cut-off was set to 50 for all words and the vocabulary size amounts 270,940 words.

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI.xml

Files (220.0 MB)

Name	Size	Download all
CMDI.xml md5:af0383cf3f0c9970b0e7ebca6eba8962	31.8 kB	Preview Download
dev_text.txt md5:f286331b3c691a088be3f93759f8912e	48.6 kB	Preview Download
encow-sample-compounds.bin md5:d4b9e464c53710efea5c122e13132c34	219.5 MB	Download
eng-nn-readme.txt md5:f24425e6ed7afe120e37c82a3e79cc28	3.0 kB	Preview Download
test_text.txt md5:e7a845c8ef0e9b384e9c856720d7df9d	100.8 kB	Preview Download
train_text.txt md5:8d5ffa90864db58851fce35a09781ada	344.6 kB	Preview Download

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.

	All versions	This version
Views	0	0
Downloads	0	0
Data volume	0 Bytes	0 Bytes

English Nominal Compounds Dataset for Compositionality Tests (eng-nn)

Other (English)

Files

CMDI.xml

Files (220.0 MB)

Additional details

Related works

Funding

Data quality

English Nominal Compounds Dataset for Compositionality Tests (eng-nn)

Creators

Description

Other (English)

Files

CMDI.xml

Files (220.0 MB)

Additional details

Related works

Funding

Data quality