Vector representations of English words and compounds
Description
Word representations used in Dima (2019). The vectors were generated from the concatenated encow14ax (https://corporafromtheweb.org/) and English Wikipedia - Müller and Schutze (2015) version, ~9 billion words of text. The corpus was also pre-processed for compounds, i.e. the compounds from the en-comcom dataset were linked with an underscore and treated as a single word - e.g. 'police car' was rewritten to 'police_car'.
Embeddings trained using a minimum word frequency of 100, leading to a vocabulary 424,014 words. The vocabulary words and their frequency in the corpus can be found in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. Word representations with 4 different vector dimensionalities - 50 dimensional, 100 dimensional, 200 dimensional, 300 dimensional.
The embeddings were trained with GloVe, for 15 iterations, using a 10-word symmetric window of text (20 words surrounding a particular word).
MAX_ITER=15
WINDOW_SIZE=10
BINARY=0
NUM_THREADS=8
X_MAX=100
Other (English)
Research carried out in work package A03 of the SFB 833.