Vector representations of German words and compounds
Description
Word representations used in Dima (2015) and Dima (2019). The vectors were generated from the decow14ax corpus (https://corporafromtheweb.org/), which contains ~10 billion words of raw text. Corpus pre-processing: words were lowercased, punctuation was removed, and each number was replaced by the string 'NUMBER'.
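To look up new text in these embeddings, the same normalisation has to be applied to it first. The following is a minimal sketch of the three pre-processing steps named above, not the original pipeline: the function name `preprocess`, the number pattern and the rule of dropping only tokens made up entirely of ASCII punctuation are all assumptions.

```python
import re
import string

# Assumption: a "number" is a run of digits with optional . or , separators.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def preprocess(tokens):
    """Lowercase tokens, drop punctuation tokens, map numbers to 'NUMBER'."""
    result = []
    for token in tokens:
        token = token.lower()
        if NUMBER_RE.fullmatch(token):
            result.append("NUMBER")
            continue
        # Assumption: only tokens consisting entirely of ASCII punctuation
        # are removed; punctuation inside a word (e.g. a hyphen) is kept.
        if all(ch in string.punctuation for ch in token):
            continue
        result.append(token)
    return result
```

For example, `preprocess(["Das", "Haus", ",", "wurde", "1999", "gebaut", "."])` yields `["das", "haus", "wurde", "NUMBER", "gebaut"]`.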
The embeddings were trained using a minimum word frequency of 100, leading to a vocabulary of 1,029,270 words. The vocabulary file 'decow14ax_all_min_100.vocab' contains these words and their frequency in the support corpus. 'decow14ax_full.vocab' contains the full vocabulary generated for the corpus (no cut-off).
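A vocabulary file can be read into a word-to-frequency mapping with a few lines of Python. This sketch assumes one entry per line in the form `word frequency`, separated by a single space, which is the format GloVe's vocabulary tool normally writes; the layout of these particular files has not been verified here, and `read_vocab` is a hypothetical helper name.

```python
def read_vocab(path, min_count=0):
    """Read 'word frequency' lines into a dict, keeping words seen at least min_count times."""
    vocab = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Split on the last space so the count is always the final field.
            parts = line.rstrip("\n").rsplit(" ", 1)
            if len(parts) != 2 or not parts[1].isdigit():
                continue  # skip lines that do not match the assumed format
            word, count = parts[0], int(parts[1])
            if count >= min_count:
                vocab[word] = count
    return vocab
```

Under these assumptions, calling `read_vocab("decow14ax_full.vocab", min_count=100)` should reproduce the frequency cut-off described above.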
The embeddings were trained with GloVe for 15 iterations, using a 10-word symmetric window of text (i.e. the 20 words surrounding a particular word: 10 to the left and 10 to the right). The files are suffixed with the dimensionality of the vector representations: 50, 100, 200 or 300 dimensions. The GloVe training parameters were:
MAX_ITER=15
WINDOW_SIZE=10
BINARY=0
NUM_THREADS=8
X_MAX=100
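Because BINARY=0, the vectors are stored as plain text rather than in binary form. The sketch below shows one way to load such a file and compare two words, assuming the usual GloVe text layout of one word per line followed by its space-separated vector components. The helper names `load_vectors` and `cosine` are hypothetical, and the path passed in must be replaced with the actual name of the downloaded file.

```python
import math


def load_vectors(path, dim, limit=None):
    """Load 'word v1 v2 ... v_dim' lines into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the assumed layout
            vectors[parts[0]] = [float(value) for value in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break  # stop early, e.g. to keep memory use small
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With the 300-dimensional file this would be called as `load_vectors(path, dim=300)`. Since the vocabulary holds over a million words, the `limit` argument can be used to load only the first entries of a file while experimenting.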
Other (English)
Research carried out in work package A03 of the SFB 833.