Published May 1, 2019 | Version v1
Dataset Open

English Adverb-Adjective Phrase Dataset for Compositionality Tests (eng-adv-adj)

  • 1. University of Tübingen

Description

If you use this dataset for research purposes, please cite the following sources:

                 - Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.

                 - Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).

                 - Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

This dataset contains 238,975 English adjective-noun phrases (split into 167,292 train, 47,803 test, and 23,880 dev instances) that were automatically extracted from the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015).
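As a quick sanity check on the figures above, the following minimal sketch verifies that the three split sizes add up to the stated total and computes each split's share. The names `SPLITS`, `TOTAL`, and `split_shares` are illustrative and not part of the dataset.

```python
# Split sizes and total exactly as stated in the dataset description.
SPLITS = {"train": 167_292, "test": 47_803, "dev": 23_880}
TOTAL = 238_975


def split_shares(splits):
    """Return each split's share of all instances, in percent (one decimal)."""
    total = sum(splits.values())
    return {name: round(100 * count / total, 1) for name, count in splits.items()}


# The three splits account for every phrase in the dataset.
assert sum(SPLITS.values()) == TOTAL

print(split_shares(SPLITS))  # {'train': 70.0, 'test': 20.0, 'dev': 10.0}
```

The shares correspond to a split of roughly 70/20/10 between train, test, and dev.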

The phrases were extracted with the help of the part-of-speech tag information provided by the treebank.

Each line of the train/test/dev files consists of three tab-separated fields: the adjective, the noun, and the adjective-noun phrase, in which the adjective and the noun are joined by the string _adj_n_ (e.g. good networking good_adj_n_networking).
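The file format described above can be read with a few lines of Python. This is a minimal sketch, assuming UTF-8 text with exactly three tab-separated fields per line; the names `PhraseEntry`, `parse_line`, and `is_consistent` are hypothetical helpers, not part of the dataset distribution.

```python
from typing import NamedTuple

# Separator between adjective and noun in the phrase field, per the description.
SEPARATOR = "_adj_n_"


class PhraseEntry(NamedTuple):
    adjective: str
    noun: str
    phrase: str


def parse_line(line):
    """Parse one line of a train/test/dev file into a PhraseEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(fields))
    return PhraseEntry(*fields)


def is_consistent(entry):
    """Check that the phrase field equals adjective + separator + noun."""
    return entry.phrase == entry.adjective + SEPARATOR + entry.noun


# Example line taken from the description.
entry = parse_line("good\tnetworking\tgood_adj_n_networking\n")
print(entry.adjective, entry.noun, is_consistent(entry))
```

To process a whole file, call `parse_line` on each line of a file opened in text mode; `is_consistent` can be used to flag lines that deviate from the documented format.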

For results of different composition models on this dataset, see Dima et al. (2019).

The word embeddings were trained on ENCOW16AX, which contains crawled web data from different sources. The training corpus was filtered to only contain sentences with a document quality of a or b to avoid noisy data.

To ensure that trained embeddings are available for enough adjective-noun phrases, the embeddings were trained on word forms instead of lemmas. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens.

The embeddings for the adjectives, nouns, and phrases were trained jointly with the word2vec package (Mikolov et al., 2013), using the skip-gram model with negative sampling, a symmetric context window of 10, 25 negative samples per positive training instance, and a subsampling threshold of 0.0001. The resulting embeddings have 200 dimensions, and the vocabulary size is 478,372.

The minimum frequency cut-off was set to 50 for all words and phrases.
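The training settings listed above map onto command-line options of the original word2vec tool. The sketch below only assembles such a command; it is not the authors' actual invocation, which this record does not give. The function name `word2vec_command`, the file paths, and the executable location are hypothetical. It is also an assumption that the "symmetric window of 10" corresponds to `-window 10` and that hierarchical softmax was switched off (`-hs 0`).

```python
def word2vec_command(corpus_path, output_path, executable="./word2vec"):
    """Build an argument list for the word2vec command-line tool.

    The values mirror the settings in the description; their mapping to
    flags is an assumption, not a record of the original command.
    """
    return [
        executable,
        "-train", corpus_path,    # tokenized training corpus (hypothetical path)
        "-output", output_path,   # where the vectors are written
        "-cbow", "0",             # 0 selects the skip-gram model
        "-hs", "0",               # assumed: no hierarchical softmax
        "-negative", "25",        # 25 negative samples per positive instance
        "-window", "10",          # assumed mapping of the window of 10
        "-sample", "1e-4",        # threshold of 0.0001
        "-size", "200",           # embedding dimension
        "-min-count", "50",       # minimum frequency cut-off
        "-binary", "1",           # write the word2vec binary format
    ]


print(" ".join(word2vec_command("corpus.txt", "encow-adj-n.bin")))
```

The list form can be passed directly to `subprocess.run` if the tool is installed; building it separately keeps the parameter mapping easy to inspect.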

The embeddings are stored in the binary format of word2vec in encow-adj-n.bin. This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
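For readers without such a package, the word2vec binary layout is simple enough to read with the standard library. The following is a minimal sketch under stated assumptions: a text header line with vocabulary size and dimension, followed by one entry per word (the word, a space, then the vector as little-endian 32-bit floats). The function `read_word2vec_binary` and its `wanted` parameter are hypothetical, and it is an assumption that phrase vectors are keyed by the `_adj_n_` string from the train/test/dev files.

```python
import io
import struct


def read_word2vec_binary(stream, wanted=None):
    """Read vectors from a word2vec binary stream.

    Returns a dict mapping each word to a tuple of floats. If `wanted` is
    given, only those words are kept, which limits memory use.
    """
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        # The word is stored as raw bytes terminated by a single space.
        word_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may end the previous entry
                word_bytes += ch
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        word = word_bytes.decode("utf-8", errors="replace")
        if wanted is None or word in wanted:
            vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors


# Tiny in-memory example with two 3-dimensional vectors (made-up values).
sample = (
    b"2 3\n"
    + b"good " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"good_adj_n_networking " + struct.pack("<3f", 0.5, -0.5, 0.25) + b"\n"
)
vectors = read_word2vec_binary(io.BytesIO(sample), wanted={"good_adj_n_networking"})
print(vectors)
```

For the real file, open encow-adj-n.bin in binary mode (`"rb"`) and pass a set of the words and phrases of interest. Where gensim is installed, `KeyedVectors.load_word2vec_format(path, binary=True)` is the more convenient route.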

Files

CMDI.xml

Files (226.6 MB)

Checksum (MD5)    Size
md5:11dc7932aa27e1aecbc4af250ced72fe    27.1 kB
md5:ad980682e1708ef7cb8afd3984bf158e    90.2 kB
md5:9f095e919bb6280ced2e66a44c2897be    225.7 MB
md5:e252c10b972be5ff0b5c4312a7525842    3.5 kB
md5:1401381526c7c1ac3748526b6dea6ad8    180.2 kB
md5:6b69d0fdd725f9762315f424994ad2a2    631.2 kB

Additional details

Related works

Is part of
Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358
