Published March 14, 2017 | Version v1
Dataset Restricted

English Compounds Dataset for Compositionality Tests

  • 1. University of Tübingen

Description

The ENglish COMpositionality dataset containing COMpounds (en-comcom) was constructed from two existing compound datasets, the Tratz (2011) dataset and the Ó Séaghdha (2008) dataset, and a selection of the nominal compounds in the WordNet database.

The Tratz (2011) dataset contains 19158 compounds and is part of the semantically-enriched parser described in Tratz (2011), available at http://www.isi.edu/publications/licensed-sw/fanseparser/

The Ó Séaghdha (2008) dataset contains 1443 compounds and is available at http://www.cl.cam.ac.uk/~do242/Resources/1443_Compounds.tar.gz

Additional compounds were collected from the WordNet 3.1 (Fellbaum, 1998) 'data.noun' file. The extracted list contained 18775 compounds.

The compounds from the three sources were combined, then additionally pre-processed and frequency-filtered; see Dima (2019) for details. The final dataset has 27220 compounds. The train, test and dev splits contain 19054, 5444 and 2722 compounds respectively, which corresponds to a 70/20/10 split.

The train/test/dev files have the following format:

                  modifier head compound (e.g. police car police_car)
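The format above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset distribution: it assumes one entry per line with the three fields separated by whitespace, as the example suggests. The names `Compound` and `parse_compound_lines` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple


class Compound(NamedTuple):
    """One dataset entry: modifier, head and the joined compound form."""
    modifier: str
    head: str
    compound: str


def parse_compound_lines(lines: Iterable[str]) -> List[Compound]:
    """Parse lines of the form 'modifier head compound'.

    Assumption: fields are whitespace-separated, one entry per line.
    Blank lines are skipped; any other line without exactly three
    fields raises ValueError.
    """
    entries = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        entries.append(Compound(*fields))
    return entries


# In-memory sample using the example entry from the description.
sample = ["police car police_car\n"]
for entry in parse_compound_lines(sample):
    print(entry.modifier, entry.head, entry.compound)
```

Because the function accepts any iterable of strings, an open file handle for one of the train/test/dev files can be passed to it directly.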

For results of compositionality models evaluated on this dataset, see Dima (2016) and Dima (2019).

                 - Dima, C. 2015. Reverse-engineering Language: A Study on the Semantic Compositionality of German Compounds. In Proceedings of EMNLP 2015, pages 1637–1642, Lisbon, Portugal.

                 [Download paper: https://aclweb.org/anthology/D/D15/D15-1188.pdf]

                 - Dima, C. 2016. On the Compositionality and Semantic Interpretation of English Noun Compounds. In Proceedings of the 1st Workshop on Representation Learning for NLP @ ACL 2016, pages 27–39, Berlin, Germany.

                 - Dima, C. 2019. Composition Models for the Representation and Semantic Interpretation of Nominal Compounds. PhD thesis. University of Tübingen.

                 - Fellbaum, C. 1998. WordNet. Wiley Online Library.

                 - Ó Séaghdha, D. 2008. Learning compound noun semantics. PhD thesis, Computer Laboratory, University of Cambridge. Published as University of Cambridge Computer Laboratory Technical Report 735.

                 - Tratz, S. 2011. Semantically-enriched parsing for natural language understanding. PhD thesis, University of Southern California.

 

English nominal compounds - compositional distributional representations - semantic composition

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Related works

Is described by
Text: 10.15496/publikation-28485 (DOI)
Text: http://aclweb.org/anthology/W16-1604 (URL)
Is part of
Collection: 10.57754/FDAT.dhnj7-0mb78 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833:  Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy

Not specified.

Completeness

Not specified.

Conformity

Not specified.

Consistency

Not specified.

Credibility

Not specified.

Processability

Not specified.

Relevance

Not specified.

Timeliness

Not specified.

Understandability

Not specified.