Code for composition models used in "no word is an island" (commix)

de Kok, Daniël

doi:10.57754/FDAT.6f2bf-y3e65

Published May 1, 2019 | Version v1

Software Open

de Kok, Daniël (Researcher)¹

1. University of Tübingen

If you want to use this code for research purposes, please refer to the following sources:

- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island: a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The 119,434 German adjective-noun phrases in this dataset (splits: 83,603 train, 23,887 test, 11,944 dev instances) were extracted automatically from the TüBa-D/DP treebank, using the part-of-speech tag information provided by the treebank. The treebank is composed of three parts:

1) articles from the German newspaper taz;
2) the German Wikipedia dump from January 20, 2018;
3) German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012).

The treebank consists of 64.9M sentences and 1.3B tokens.

Each line of the train/test/dev files contains three space-separated fields: the adjective, the noun, and the adjective-noun phrase. Within the phrase, the adjective and the noun are joined by the string _adj_n_ (e.g. kritisch Film kritisch_adj_n_Film). For results of different composition models on this dataset, see Dima et al. (2019).

The embeddings for all words and phrases in this dataset are stored in the word2vec format in twe-adj-n.bin. This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010). The embeddings for the adjectives, nouns and phrases were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al., 2013).

The word embeddings were trained with the skip-gram model with negative sampling, a symmetric window of 10 as context size, 25 negative samples per positive training instance and a sample probability threshold of 0.0001. The resulting embeddings have a dimension of 200 and the vocabulary contains 476,137 words in total. The minimum frequency cut-off was set to 50 for all words.
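The file format described above can be read with a few lines of Python. The following is only a minimal sketch and is not part of the deposited code: the names `AdjNounInstance`, `parse_line` and `read_split` are hypothetical, and UTF-8 encoding and one instance per line are assumptions, since neither is stated in the description.

```python
from typing import Iterator, NamedTuple

# Separator between adjective and noun inside the phrase field,
# as given in the description.
PHRASE_SEPARATOR = "_adj_n_"


class AdjNounInstance(NamedTuple):
    """One line of a train/test/dev file (hypothetical container type)."""
    adjective: str
    noun: str
    phrase: str


def parse_line(line: str) -> AdjNounInstance:
    """Split a line into adjective, noun and phrase.

    Raises ValueError if the line does not have exactly three
    whitespace-separated fields or the phrase lacks the separator.
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    adjective, noun, phrase = parts
    if PHRASE_SEPARATOR not in phrase:
        raise ValueError(f"phrase field lacks {PHRASE_SEPARATOR!r}: {line!r}")
    return AdjNounInstance(adjective, noun, phrase)


def read_split(path: str) -> Iterator[AdjNounInstance]:
    """Yield instances from one split file, skipping blank lines.

    Assumption: the file is UTF-8 encoded with one instance per line.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


# Example using the line quoted in the description.
example = parse_line("kritisch Film kritisch_adj_n_Film")
```

The phrase field doubles as the lookup key for the phrase embedding, so keeping it as a separate attribute avoids rebuilding the key from the adjective and noun.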

Files

Files (233.7 kB)

Name	MD5	Size
CMDI.xml	622e0b37f29ff3dd5318a1753d537b98	74.0 kB
software.zip	1419262c0cb4128bb5bc34a3bebd5c40	159.7 kB

Additional details

Is part of: Data paper: 10.57754/FDAT.721tn-jef87 (DOI)

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.

