Published March 14, 2017 | Version v1
Dataset Restricted

Vector representations of English words and compounds

  • 1. University of Tübingen

Description

Word representations used in Dima (2019). The vectors were trained on the concatenation of encow14ax (https://corporafromtheweb.org/) and the English Wikipedia version of Müller and Schütze (2015), about 9 billion words of text in total. The corpus was also pre-processed for compounds: each compound from the en-comcom dataset was joined with an underscore and treated as a single word, e.g. 'police car' was rewritten as 'police_car'.
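The compound linking step can be illustrated with a minimal sketch. The record does not describe the actual pre-processing script, so the function name `link_compounds`, the restriction to two-part compounds, and the tiny compound list below are all assumptions made for illustration only.

```python
def link_compounds(tokens, compounds):
    """Join adjacent token pairs that form a known compound with an underscore.

    `compounds` is a set of (modifier, head) tuples. It stands in for the
    en-comcom compound list, which is not reproduced here.
    """
    linked = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            # Treat the compound as a single vocabulary item.
            linked.append("_".join(pair))
            i += 2
        else:
            linked.append(tokens[i])
            i += 1
    return linked


# Hypothetical compound list, for illustration only.
compounds = {("police", "car"), ("apple", "tree")}
sentence = "the police car stopped near an apple tree".split()
print(" ".join(link_compounds(sentence, compounds)))
```

After this step, a linked compound such as 'police_car' is counted and embedded like any other word, alongside its constituents 'police' and 'car'.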

The embeddings were trained with a minimum word frequency of 100, which yields a vocabulary of 424,014 words. The vocabulary words and their corpus frequencies are listed in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. The word representations come in 4 vector dimensionalities: 50, 100, 200 and 300.
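A minimal sketch of reading such files follows. It assumes the usual GloVe text layout, which this record does not spell out: one `word count` pair per line in the vocabulary file, and one `word v1 v2 ... vd` line per word in a vector file. The helper names and the in-memory sample data are invented for illustration; the real files are restricted and are not read here.

```python
import io


def read_vocab(lines):
    """Parse 'word count' lines into a dict mapping word to corpus frequency."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.rsplit(" ", 1)
        vocab[word] = int(count)
    return vocab


def read_vectors(lines):
    """Parse 'word v1 v2 ... vd' lines into a dict mapping word to a float list."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Invented sample data in the assumed format (3 dimensions for brevity).
sample_vocab = io.StringIO("the 500000\npolice_car 1234\n")
sample_vectors = io.StringIO("the 0.1 -0.2 0.3\npolice_car 0.4 0.5 -0.6\n")

vocab = read_vocab(sample_vocab)
vectors = read_vectors(sample_vectors)

# Every word in the vocabulary should meet the minimum frequency of 100.
assert all(count >= 100 for count in vocab.values())
print(vocab["police_car"], len(vectors["police_car"]))
```

With the real files, the same functions could be passed an open file handle instead of the `io.StringIO` samples.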

The embeddings were trained with GloVe for 15 iterations, using a symmetric 10-word window (10 words on each side, i.e. up to 20 words surrounding a particular word).
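To make the window definition concrete, the sketch below extracts the context of a word and accumulates symmetric co-occurrence counts. It is an illustration, not the GloVe implementation: the function names are hypothetical, and the 1/distance weighting of co-occurrences is an assumption based on the default behaviour of the GloVe reference tools, which this record does not confirm.

```python
from collections import defaultdict


def context_window(tokens, index, window_size=10):
    """Return the words within `window_size` positions on either side of `index`."""
    left = tokens[max(0, index - window_size):index]
    right = tokens[index + 1:index + 1 + window_size]
    return left + right


def cooccurrences(tokens, window_size=10):
    """Accumulate symmetric co-occurrence counts over a token sequence.

    Each pair is weighted by 1/distance (assumed, see the note above).
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window_size), i):
            weight = 1.0 / (i - j)
            # Symmetric window: count the pair in both directions.
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts


tokens = [f"w{n}" for n in range(30)]
# A word far enough from both edges has 10 + 10 = 20 context words.
print(len(context_window(tokens, 15)))
```

Words near the start or end of a text have fewer than 20 context words, which is why the window is best read as an upper bound.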

MAX_ITER=15
WINDOW_SIZE=10
BINARY=0
NUM_THREADS=8
X_MAX=100
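In GloVe, X_MAX is the cutoff of the weighting function applied to each co-occurrence count in the training objective: counts below the cutoff are down-weighted, and counts at or above it receive full weight. The sketch below shows that function. The exponent of 0.75 is the value commonly used with GloVe and is an assumption here, since the record does not list it.

```python
X_MAX = 100   # cutoff listed in this record
ALPHA = 0.75  # assumption: common GloVe exponent, not stated in this record


def glove_weight(cooccurrence_count, x_max=X_MAX, alpha=ALPHA):
    """Weight of one co-occurrence count in the GloVe least-squares objective."""
    if cooccurrence_count < x_max:
        return (cooccurrence_count / x_max) ** alpha
    return 1.0


for count in (1, 10, 100, 1000):
    print(count, round(glove_weight(count), 4))
```

With X_MAX=100, any word pair that co-occurs 100 times or more contributes to the loss with weight 1, so very frequent pairs do not dominate training beyond that point.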

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.