Vector representations of English words and compounds

Dima, Corina

doi:10.57754/FDAT.9jj4v-6dj64

Published March 14, 2017 | Version v1

Dataset Restricted

Dima, Corina (Researcher)¹

1. University of Tübingen

Word representations used in Dima (2019). The vectors were trained on the concatenation of encow14ax (https://corporafromtheweb.org/) and the English Wikipedia version of Müller and Schütze (2015), approximately 9 billion words of text. The corpus was also pre-processed for compounds: the compounds from the en-comcom dataset were joined with an underscore and treated as a single word, e.g. 'police car' was rewritten to 'police_car'.
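The compound pre-processing step can be illustrated with a minimal sketch. It assumes two-word compounds and whitespace-tokenized text; the function name `join_compounds` and the example compound list are hypothetical and are not taken from the en-comcom dataset or from the original pipeline.

```python
def join_compounds(tokens, compounds):
    """Replace each known two-word compound with one underscore-joined token."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            out.append("_".join(pair))
            i += 2  # skip both parts of the compound
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical compound list, for illustration only.
compounds = {("police", "car"), ("apple", "tree")}
sentence = "the police car stopped near the apple tree".split()
joined = join_compounds(sentence, compounds)
print(" ".join(joined))
```

After this rewriting, the training tool sees 'police_car' as an ordinary vocabulary item and learns a vector for it alongside the vectors for 'police' and 'car'.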

The embeddings were trained with a minimum word frequency of 100, which yields a vocabulary of 424,014 words. The vocabulary words and their corpus frequencies are listed in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. Word representations are provided in four vector dimensionalities: 50, 100, 200 and 300.
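A vocabulary file of this kind can be read with a few lines of Python. The sketch below assumes the usual GloVe vocabulary layout of one `word count` pair per line, separated by a single space; the sample entries and their counts are invented for illustration and do not come from the dataset.

```python
import io


def read_vocab(lines):
    """Parse 'word count' lines into a dict mapping word -> corpus frequency."""
    vocab = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last space so the count is always the final field.
        word, count = line.rsplit(" ", 1)
        vocab[word] = int(count)
    return vocab


# Invented sample standing in for the real .vocab file.
sample = io.StringIO("the 500000000\npolice_car 1234\nrare_word 100\n")
vocab = read_vocab(sample)
print(len(vocab), min(vocab.values()))
```

With access to the files, the same function could be applied to `open('glove_encow14ax_enwiki_9B.400k_min100.vocab', encoding='utf-8')`; if the file matches the description, it should yield 424,014 entries, none with a count below 100.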

The embeddings were trained with GloVe for 15 iterations, using a symmetric context window of 10 words on each side (i.e. the 20 words surrounding a given word).
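What a symmetric window means for co-occurrence counting can be sketched as follows. This is a simplified illustration, not the GloVe code itself: it assumes that each pair is weighted by the inverse of the distance between the two words, as the GloVe reference implementation does by default, and the function name `cooccurrences` is hypothetical.

```python
from collections import defaultdict


def cooccurrences(tokens, window_size=10):
    """Accumulate distance-weighted co-occurrence counts in a symmetric window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        start = max(0, i - window_size)
        for j in range(start, i):
            weight = 1.0 / (i - j)  # closer words contribute more
            # Symmetric window: count the pair in both directions.
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts


tokens = "a b c".split()
counts = cooccurrences(tokens, window_size=2)
print(counts[("a", "b")], counts[("a", "c")])
```

Looking back over the previous `window_size` words and recording each pair in both directions is equivalent to considering `window_size` words on either side of every word.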

MAX_ITER=15
WINDOW_SIZE=10
BINARY=0
NUM_THREADS=8
X_MAX=100
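With BINARY=0, GloVe writes the vectors as plain text, one word per line followed by its space-separated components. A minimal sketch for loading such a file and comparing two vectors is given below; the sample vectors are invented three-dimensional stand-ins, and the names of the actual vector files in this record are not given here, so none is assumed.

```python
import io
import math


def load_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Invented sample standing in for a real vector file.
sample = io.StringIO("police_car 1.0 0.0 1.0\ncar 1.0 0.0 0.0\n")
vectors = load_vectors(sample)
sim = cosine(vectors["police_car"], vectors["car"])
print(round(sim, 4))
```

For the full vocabulary, a NumPy array would be a more memory-efficient container than Python lists; plain lists are used here only to keep the sketch free of dependencies.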

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Accuracy: Not specified.
Completeness: Not specified.
Conformity: Not specified.
Consistency: Not specified.
Credibility: Not specified.
Processability: Not specified.
Relevance: Not specified.
Timeliness: Not specified.
Understandability: Not specified.
