German Compounds Dataset for Compositionality Tests
Description
The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at
http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.
The following paper describes the automatic compound splitting that was performed before the manual post-correction. If you use the split compounds in scientific or research work, please cite the paper:
- Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]
The initial compound list was filtered to keep only compounds with a minimum frequency of 500 in the DECOW14AX corpus (https://webcorpora.org/). This resulted in a list of 34497 compounds, which were divided into train, test and dev splits (with 24147, 6901 and 3449 compounds respectively).
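As a quick sanity check on the numbers above, the following Python sketch confirms that the three split sizes add up to the filtered total and shows the proportion of each split. The variable names are illustrative only and are not part of the dataset.

```python
# Split sizes as stated in the description above.
total = 34497
split_sizes = {"train": 24147, "test": 6901, "dev": 3449}

# The three splits should account for every filtered compound.
sizes_sum = sum(split_sizes.values())

# Share of each split relative to the filtered list, rounded to two decimals.
fractions = {name: round(size / total, 2) for name, size in split_sizes.items()}

print(sizes_sum == total)
print(fractions)
```

The rounded proportions suggest an approximate 70/20/10 division, although the description does not state how the split was drawn.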
The dataset includes a dictionary file, cmh_dict.txt, with 41732 unique words, 8580 of which are modifiers and/or heads of a compound. The modifiers and heads appear first in the dictionary, and the compounds appear last.
The train/test/dev files have the following format:
index_modifier index_head index_compound
where index_modifier, index_head and index_compound are the 1-based indices of the modifier, head and compound in the dictionary file (cmh_dict.txt).
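The index format above can be resolved back to words with a few lines of Python. This is a minimal sketch under two assumptions that the description does not confirm: that cmh_dict.txt holds one word per line, and that the three indices on each line of a split file are separated by whitespace. The function names load_dictionary and parse_triples are hypothetical, and the three example words are illustrative rather than taken from the dataset.

```python
def load_dictionary(path):
    """Read the dictionary file into a list of words.

    Assumption: one word per line, UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def parse_triples(lines, words):
    """Map lines of 'index_modifier index_head index_compound' to word triples.

    Assumption: indices are whitespace-separated. They are 1-based, so
    1 is subtracted before indexing into the Python list.
    """
    triples = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        modifier_idx, head_idx, compound_idx = (int(p) for p in parts)
        triples.append(
            (words[modifier_idx - 1], words[head_idx - 1], words[compound_idx - 1])
        )
    return triples


# Toy demonstration with an illustrative three-word dictionary.
demo_words = ["Apfel", "Baum", "Apfelbaum"]
demo_lines = ["1 2 3"]
demo_triples = parse_triples(demo_lines, demo_words)
print(demo_triples)
```

With the real files, the list returned by load_dictionary("cmh_dict.txt") would be passed as words, and the lines of a train, test or dev file as lines.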
Other (English)
Research carried out in work package A03 of the SFB 833.