Published May 1, 2019 | Version v1
Software Open

Code for composition models used in "no word is an island" (commix)

  • 1. University of Tübingen

Description

If you use this code for research purposes, please cite the following sources:

- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The 119,434 German adjective-noun phrases in this dataset (splits: 83,603 train, 23,887 test, 11,944 dev instances) were extracted automatically from the TüBa-D/DP treebank, using the part-of-speech tags provided by the treebank. The treebank is composed of three parts:

1) articles from the German newspaper taz;
2) the German Wikipedia dump from January 20, 2018;
3) German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012).

The treebank consists of 64.9M sentences and 1.3B tokens.

Each line of the train/test/dev files has three space-separated fields: the adjective, the noun, and the adjective-noun phrase, in which the adjective and the noun are joined by the string _adj_n_ (e.g. kritisch Film kritisch_adj_n_Film). For the results of different composition models on this dataset, see Dima et al. (2019).

The embeddings for all words and phrases in this dataset are stored in the word2vec format in twe-adj-n.bin. This format can be loaded by several packages (e.g. gensim; Řehůřek and Sojka, 2010).

The embeddings for the adjectives, nouns and phrases were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al., 2013).
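The line format of the train/test/dev files described above can be parsed with a few lines of Python. The following is a minimal sketch, not part of the released code: the names `PhraseEntry`, `parse_line` and `read_split` are hypothetical, and UTF-8 encoding and one entry per line are assumptions that the description does not state.

```python
from typing import List, NamedTuple

# Joining string named in the dataset description.
SEPARATOR = "_adj_n_"


class PhraseEntry(NamedTuple):
    """One entry of a split file (hypothetical helper type)."""
    adjective: str
    noun: str
    phrase: str


def parse_line(line: str) -> PhraseEntry:
    """Split one line into adjective, noun and adjective-noun phrase."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(fields)}")
    adjective, noun, phrase = fields
    # The phrase field should contain the separator exactly once,
    # with non-empty text on both sides.
    parts = phrase.split(SEPARATOR)
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"phrase field lacks a single {SEPARATOR!r}: {phrase!r}")
    return PhraseEntry(adjective, noun, phrase)


def read_split(path: str, encoding: str = "utf-8") -> List[PhraseEntry]:
    """Read a whole split file; the path and the encoding are assumptions."""
    with open(path, encoding=encoding) as handle:
        return [parse_line(line) for line in handle if line.strip()]


# Example line taken from the description.
entry = parse_line("kritisch Film kritisch_adj_n_Film")
print(entry.adjective, entry.noun, entry.phrase)
```

The phrase field doubles as the lookup key for the phrase embedding, so keeping it unmodified alongside the adjective and the noun avoids having to rebuild it later.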
The word embeddings were trained with the skip-gram model with negative sampling, using a symmetric context window of size 10, 25 negative samples per positive training instance, and a subsampling threshold of 0.0001. The resulting embeddings have 200 dimensions, and the vocabulary contains 476,137 words in total. The minimum frequency cut-off was set to 50 for all words.
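In practice the embeddings would usually be loaded with an existing package, for instance gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`. To illustrate the layout of the word2vec binary format itself, the sketch below reads it with the standard library only. It assumes that twe-adj-n.bin uses the binary variant (a text header line with vocabulary size and dimension, then each word followed by a space and its float32 values) and little-endian byte order. The function name `read_word2vec_binary` is hypothetical, and the demo data is made up for illustration.

```python
import io
import struct
from typing import BinaryIO, Dict, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, Tuple[float, ...]]:
    """Read word2vec binary data: a 'vocab_size dim' header, then entries."""
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, Tuple[float, ...]] = {}
    for _ in range(vocab_size):
        word_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("stream ended inside a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left after the previous vector
                word_bytes += ch
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("stream ended inside a vector")
        # Assumption: little-endian float32 values.
        vectors[word_bytes.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


# Tiny in-memory demo with invented 3-dimensional vectors.
demo = io.BytesIO()
demo.write(b"2 3\n")
demo.write("kritisch ".encode("utf-8") + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n")
demo.write("kritisch_adj_n_Film ".encode("utf-8") + struct.pack("<3f", 1.0, 0.0, 0.25) + b"\n")
demo.seek(0)

vectors = read_word2vec_binary(demo)
print(len(vectors), len(vectors["kritisch_adj_n_Film"]))
```

If the released file matches the description, its header should report a vocabulary of 476,137 entries and a dimension of 200. For a file of that size a pure-Python reader is slow, so it serves mainly as a check of the format rather than a replacement for a dedicated loader.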

Files

CMDI.xml

Files (233.7 kB)

Name Size
md5:622e0b37f29ff3dd5318a1753d537b98 74.0 kB
md5:1419262c0cb4128bb5bc34a3bebd5c40 159.7 kB

Additional details

Related works

Is part of
Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy, Completeness, Conformity, Consistency, Credibility, Processability, Relevance, Timeliness, Understandability: Not specified.