Published March 14, 2017 | Version v1
Dataset Restricted

Dataset of German lexicalized and transparent compounds

  • 1. University of Tübingen

Description

Dataset extracted from the de-nncom-sem annotated dataset (8005 compounds). It contains 648 compounds annotated as lexicalized in some way (lex_M, lex_H, lex_R, lex_HS, lex_MS). To balance the data, an additional 648 compounds not marked as lexicalized were randomly sampled from the 8005 compounds and added to this dataset.
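The balancing step can be pictured with a minimal Python sketch. The record does not document the actual sampling procedure or any random seed, so the function `balance`, the `seed` parameter and the placeholder entries are assumptions made for illustration only.

```python
import random

# Lexicalization labels named in the dataset description.
LEX_LABELS = {"lex_M", "lex_H", "lex_R", "lex_HS", "lex_MS"}


def balance(entries, seed=0):
    """Keep every lexicalized entry and add an equally sized random
    sample of the remaining entries.

    `entries` is a list of (compound, label) pairs. The seed is a
    hypothetical choice; the original sampling is not documented.
    """
    rng = random.Random(seed)
    lexicalized = [e for e in entries if e[1] in LEX_LABELS]
    others = [e for e in entries if e[1] not in LEX_LABELS]
    sampled = rng.sample(others, k=len(lexicalized))
    return lexicalized + sampled


# Toy input: two entries quoted in the description plus made-up placeholders.
toy = [("Hefekranz", "lex_HS"), ("Bruchwand", "not_lexicalized")]
toy += [(f"placeholder_{i}", "not_lexicalized") for i in range(5)]
toy += [("placeholder_lex", "lex_M")]
balanced = balance(toy)
```

With the real data, the same idea would give 648 lexicalized plus 648 sampled compounds, matching the 1296 entries reported for the original dataset.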

The dataset was then filtered to keep only entries whose compound, modifier and head each occur with a minimum frequency of 101 in the word embeddings, leaving 1053 compounds.

Medizinfrau, Modepuppe and Abendland were then removed to reach a round 1050 compounds in the dataset, even though their frequencies were above 100.
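As a rough sketch of the frequency filter, the following Python snippet keeps an entry only if the compound, modifier and head all reach the minimum frequency. The function name `passes_frequency_filter`, the dictionary-based frequency lookup and the toy words and counts are assumptions; they are not taken from the dataset.

```python
MIN_FREQ = 101  # minimum frequency stated in the description


def passes_frequency_filter(entry, freq):
    """Return True if compound, modifier and head all occur at least
    MIN_FREQ times. Words missing from `freq` count as frequency 0."""
    compound, modifier, head = entry
    return all(freq.get(word, 0) >= MIN_FREQ for word in (compound, modifier, head))


# Invented words and counts for illustration only; not real corpus frequencies.
toy_freq = {"xy": 150, "x": 9000, "y": 4000, "uv": 40, "u": 7000, "v": 20000}
toy_entries = [("xy", "x", "y"), ("uv", "u", "v")]
kept = [e for e in toy_entries if passes_frequency_filter(e, toy_freq)]
```

Here the second toy entry is dropped because its compound occurs only 40 times, even though its constituents are frequent.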

Example entries:

    Hefekranz;Hefe;Kranz;lex_HS;1

    Bruchwand;Bruch;Wand;not_lexicalized;0

The first three columns contain the compound, modifier and head. The fourth column contains the lexicalization label annotated by Dr. Heike Telljohann. The fifth column codes lexicalized examples as 1 and non-lexicalized examples as 0.
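A minimal Python sketch of reading one entry in this semicolon-separated format is given below. The `Entry` type and the `parse_entry` function are hypothetical names introduced for illustration; only the first five fields are interpreted, and any further fields on a line are ignored.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    compound: str
    modifier: str
    head: str
    label: str         # e.g. "lex_HS" or "not_lexicalized"
    lexicalized: bool  # True if the fifth column is "1"


def parse_entry(line: str) -> Entry:
    """Split one semicolon-separated line into its first five fields."""
    fields = line.strip().split(";")
    compound, modifier, head, label, flag = fields[:5]
    return Entry(compound, modifier, head, label, flag == "1")


# The two example entries from the description.
first = parse_entry("Hefekranz;Hefe;Kranz;lex_HS;1")
second = parse_entry("Bruchwand;Bruch;Wand;not_lexicalized;0")
```

Leading and trailing whitespace is stripped before splitting, so indented lines such as the examples above parse the same way.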

Columns 5-7 list the frequencies of the compound, modifier and head, respectively, in the full decow14ax vocabulary.

The file de-ulex_dataset_freq.txt contains the original dataset with 1296 entries, while de-ulex_dataset_freq_gt100_shuf.txt contains the 1050 entries filtered for frequency > 100, which were used in chapter 6 of Dima (2019).

                 

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Related works

Is described by
Text: 10.15496/publikation-28485 (DOI)
Is part of
Collection: 10.57754/FDAT.dhnj7-0mb78 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy, Completeness, Conformity, Consistency, Credibility, Processability, Relevance, Timeliness, Understandability: Not specified.