German Compounds Dataset for Compositionality Tests
Description
The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at
http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.
The following paper describes the automatic compound splitting that is performed before the manual post-correction. If you want to use the split compounds in the context of scientific or research work, please refer to the paper:
- Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]
The initial compound list was filtered to compounds with a frequency of at least 500 in the DECOW14AX corpus (https://webcorpora.org/). The resulting 34497 compounds were divided into train, test and dev splits of 24147, 6901 and 3449 compounds respectively.
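As a quick sanity check on the split sizes quoted above, the following sketch confirms that the three splits add up to the 34497 filtered compounds and prints each split's share. The helper name `split_shares` is hypothetical and not part of the dataset; the percentages are derived from the stated counts and are not documented by the authors.

```python
# Split sizes as stated in the dataset description.
split_sizes = {"train": 24147, "test": 6901, "dev": 3449}


def split_shares(sizes):
    """Return each split's fraction of the total number of compounds."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(split_sizes.values())
shares = split_shares(split_sizes)

print(f"total: {total}")
for name, share in shares.items():
    # Prints roughly 70% / 20% / 10% for train / test / dev.
    print(f"{name}: {share:.1%}")
```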
The dataset includes a dictionary file, cmh_dict.txt, containing 41732 unique words, 8580 of which are modifiers and/or heads of a compound. The modifiers and heads appear first in the dictionary, and the compounds appear last.
The train/test/dev files have the following format:
index_modifier index_head index_compound
where index_modifier, index_head and index_compound are the 1-based indices of the modifier, head and compound in the dictionary file (cmh_dict.txt).
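The index format can be resolved back to words along the following lines. This is a minimal sketch under two assumptions that the description does not confirm: cmh_dict.txt holds one word per line, and the three indices on each line of a split file are separated by whitespace. The function names and the toy words are made up for illustration and are not taken from the dataset.

```python
def load_dictionary(path):
    """Read the dictionary file, ASSUMING one word per line."""
    with open(path, encoding="utf-8") as handle:
        # Keep every line (even empty ones) so that line numbers stay aligned
        # with the 1-based indices used in the split files.
        return [line.strip() for line in handle.read().splitlines()]


def parse_triples(lines):
    """Parse lines of the form 'index_modifier index_head index_compound'."""
    triples = []
    for line in lines:
        parts = line.split()  # ASSUMPTION: whitespace-separated fields
        if not parts:
            continue  # skip blank lines
        if len(parts) != 3:
            raise ValueError(f"expected 3 indices, got: {line!r}")
        triples.append(tuple(int(part) for part in parts))
    return triples


def resolve(triple, words):
    """Map a 1-based (modifier, head, compound) triple to the actual words."""
    for index in triple:
        if not 1 <= index <= len(words):
            raise IndexError(f"index {index} is outside the dictionary")
    modifier, head, compound = (words[index - 1] for index in triple)
    return modifier, head, compound


# Toy data for illustration only; not taken from the dataset.
toy_words = ["Apfel", "Baum", "Apfelbaum"]
toy_triples = parse_triples(["1 2 3"])
print(resolve(toy_triples[0], toy_words))
```

With the real files, `load_dictionary("cmh_dict.txt")` would supply the word list, and the lines of a split file would be passed to `parse_triples`; the split file names are not given in the description, so they are left open here.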
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
CMDI.xml
Total size: 442.9 kB
| Checksum | Size |
|---|---|
| md5:c61b03ca914d02b17302b2d70627eb67 | 14.4 kB |
| md5:1f7e84ada6ee48e8eb9485f4c446da99 | 428.5 kB |
Additional details
Related works
- Is described by: https://aclanthology.org/D15-1188.pdf