German Compounds Dataset for Compositionality Tests

Dima, Corina

doi:10.57754/FDAT.tyza5-9kj67

Published March 14, 2017 | Version v1

Dataset Open


Dima, Corina (Researcher)¹

1. University of Tübingen

The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.

The following paper describes the automatic compound splitting that is performed before the manual post-correction. If you want to use the split compounds in the context of scientific or research work, please refer to the paper:

Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]

The initial compound list was filtered to keep only compounds with a frequency of at least 500 in the DECOW14AX corpus (https://webcorpora.org/). The resulting 34497 compounds were divided into train, test and dev splits of 24147, 6901 and 3449 compounds respectively.
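The split sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the helper name `split_shares` is hypothetical and not part of the dataset.

```python
# Split sizes as stated in the description above.
SPLITS = {"train": 24147, "test": 6901, "dev": 3449}


def split_shares(splits):
    """Return each split's share of the total number of compounds."""
    total = sum(splits.values())
    return {name: size / total for name, size in splits.items()}


total = sum(SPLITS.values())
print(f"total: {total}")
for name, share in split_shares(SPLITS).items():
    print(f"{name}: {SPLITS[name]} ({share:.1%})")
```

The three splits add up to the 34497 filtered compounds and correspond to roughly 70%, 20% and 10% of the data.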

The dataset contains a dictionary file, cmh_dict.txt, with 41732 unique words, 8580 of which are modifiers and/or heads of a compound. The modifiers and heads appear first in the dictionary, and the compounds appear last.

The train/test/dev files have the following format:

index_modifier index_head index_compound

where index_modifier, index_head and index_compound are the 1-based indices of the modifier, head and compound in the dictionary file (cmh_dict.txt).
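The index format can be resolved back to words along the following lines. This is a minimal sketch, not an official loader: it assumes that cmh_dict.txt holds one word per line (so that the line number is the 1-based index) and that the three indices are separated by whitespace. The function names and the toy data are hypothetical.

```python
from typing import Iterable, List, Tuple


def load_dictionary(lines: Iterable[str]) -> List[str]:
    """Return the dictionary words in file order.

    Assumption: one word per line. Blank lines are kept so that
    line numbers and indices stay aligned.
    """
    return [line.strip() for line in lines]


def resolve_triples(
    lines: Iterable[str], words: List[str]
) -> List[Tuple[str, str, str]]:
    """Map 'index_modifier index_head index_compound' lines to words."""
    triples = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines in the split file
        if len(fields) != 3:
            raise ValueError(f"expected 3 indices, got: {line!r}")
        indices = [int(field) for field in fields]
        if any(i < 1 or i > len(words) for i in indices):
            raise ValueError(f"index out of range in line: {line!r}")
        # Indices are 1-based in the dataset; Python lists are 0-based.
        modifier, head, compound = (words[i - 1] for i in indices)
        triples.append((modifier, head, compound))
    return triples


# Toy data for illustration only; not taken from the dataset.
toy_dict = ["Apfel\n", "Baum\n", "Apfelbaum\n"]
toy_split = ["1 2 3\n"]
print(resolve_triples(toy_split, load_dictionary(toy_dict)))
```

With real data, the same functions can be fed open file handles, for example `load_dictionary(open("cmh_dict.txt", encoding="utf-8"))`; the UTF-8 encoding is also an assumption.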

Other (English)

Research carried out in work package A03 of the SFB 833.

Files (442.9 kB)

Name	MD5	Size
CMDI.xml	c61b03ca914d02b17302b2d70627eb67	14.4 kB
German_compounds_composition_dataset.zip	1f7e84ada6ee48e8eb9485f4c446da99	428.5 kB
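After downloading, the files can be compared against the MD5 checksums listed above. The following is a minimal sketch using Python's standard hashlib module; it assumes the files were saved under their original names in a single directory, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "CMDI.xml": "c61b03ca914d02b17302b2d70627eb67",
    "German_compounds_composition_dataset.zip": "1f7e84ada6ee48e8eb9485f4c446da99",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {file name: True/False} for each expected file.

    A file that is missing or has a different checksum maps to False.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```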

Additional details

Is described by: Text: https://aclanthology.org/D15-1188.pdf (URL)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.

