Vector representations of German words and compounds

Dima, Corina

doi:10.57754/FDAT.fx84s-dxe33

Published March 14, 2017 | Version v1

Dataset Restricted

Dima, Corina (Researcher)¹

1. University of Tübingen

Word representations used in Dima (2015) and Dima (2019). The vectors were generated from the decow14ax corpus (https://corporafromtheweb.org/), which contains ~10 billion words of raw text. Corpus pre-processing: words were lowercased, punctuation was removed, and each number was replaced by the string 'NUMBER'.
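The pre-processing steps above can be sketched roughly as follows. This is only an illustration, not the original script, which is not part of this record: the function name `preprocess`, the two regular expressions, and the decision to treat a digit sequence such as "3,5" as a single number are all assumptions.

```python
import re

# Hypothetical patterns; the exact rules used for the dataset are not documented.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")  # e.g. "2014", "3,5", "1.000"
_PUNCT_RE = re.compile(r"[^\w\s]")           # anything that is not a word char or whitespace


def preprocess(line: str) -> list[str]:
    """Lowercase, replace numbers with 'NUMBER', drop punctuation, split on whitespace."""
    line = line.lower()
    # Replace numbers before stripping punctuation so that "3,5" stays one token.
    # The placeholder is inserted after lowercasing, so it stays uppercase.
    line = _NUMBER_RE.sub(" NUMBER ", line)
    line = _PUNCT_RE.sub(" ", line)
    return line.split()
```

For example, `preprocess("Im Jahr 2014 kostete das Haus 3,5 Millionen Euro!")` yields the tokens `im jahr NUMBER kostete das haus NUMBER millionen euro`. Note that this sketch also splits hyphenated words, which may differ from the original pipeline.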

The embeddings were trained using a minimum word frequency of 100, leading to a vocabulary of 1,029,270 words. The vocabulary file 'decow14ax_all_min_100.vocab' contains these words and their frequencies in the support corpus. 'decow14ax_full.vocab' contains the full vocabulary generated for the corpus (no cut-off).
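As a rough illustration of the frequency cut-off, the sketch below reads a vocabulary listing and keeps only words at or above a minimum count. It assumes each line has the form `word count` separated by a single space, which is the layout GloVe's vocabulary tool writes; the actual layout of the files in this record has not been verified, and `read_vocab` is a hypothetical helper name.

```python
from typing import Iterable


def read_vocab(lines: Iterable[str], min_count: int = 1) -> dict[str, int]:
    """Parse 'word count' lines, keeping words whose count is >= min_count."""
    vocab: dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Split on the last space: the count is the final field.
        word, _, count = line.rpartition(" ")
        if int(count) >= min_count:
            vocab[word] = int(count)
    return vocab
```

Under these assumptions, passing an open file handle for 'decow14ax_full.vocab' (opened with UTF-8 encoding, also an assumption) and `min_count=100` should reproduce the reduced vocabulary.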

The embeddings were trained with GloVe for 15 iterations, using a symmetric 10-word window (10 words on each side of the target word, 20 context words in total). The files are suffixed with the dimensionality of the vector representations: 50, 100, 200 or 300 dimensions.
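Once access to the files has been granted, the vectors could be read along the lines of the sketch below. It assumes GloVe's plain-text output format (one word per line followed by its space-separated vector components), which the BINARY=0 setting listed below suggests but which has not been checked against the actual files. The helper names `load_text_vectors` and `cosine` are hypothetical.

```python
import math
from typing import Iterable


def load_text_vectors(lines: Iterable[str]) -> dict[str, list[float]]:
    """Parse 'word v1 v2 ... vd' lines into a word -> vector mapping."""
    vectors: dict[str, list[float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

The function takes any iterable of lines, so an open file handle can be passed directly. With roughly one million words at up to 300 dimensions, a NumPy array would be more memory-efficient than Python lists; plain lists are used here only to keep the sketch dependency-free. The GloVe settings used for training follow.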

MAX_ITER=15
WINDOW_SIZE=10
BINARY=0
NUM_THREADS=8
X_MAX=100
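To make the WINDOW_SIZE=10 setting concrete, the following sketch counts co-occurrences in a symmetric window. It follows the scheme of the reference GloVe co-occurrence tool as far as I understand it: each pair is weighted by the inverse of the distance between the two words and recorded in both directions. It is a simplified in-memory illustration, not the code used to build this dataset, and `cooccurrence_counts` is a hypothetical name.

```python
from collections import Counter


def cooccurrence_counts(tokens: list[str], window_size: int = 10) -> Counter:
    """Distance-weighted co-occurrence counts within a symmetric window."""
    counts: Counter = Counter()
    for i, word in enumerate(tokens):
        # Look back at up to `window_size` preceding tokens.
        for j in range(max(0, i - window_size), i):
            weight = 1.0 / (i - j)  # closer words contribute more
            # Symmetric window: record the pair in both directions.
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts
```

Looking back only and then mirroring each pair covers the 10 words on either side of every target word without visiting any pair twice.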

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.

