Published September 15, 2020 | Version v1
Collection
Open
Embeddings trained on CONLL2017 Corpora (conll2017-embeddings) - Collection
Description
This collection groups three research data sets.
The embeddings were trained with finalfrontier on the CONLL2017 corpora with more than 100 million tokens. For all languages, embeddings were trained with the skipgram and structgram algorithms and contain subword n-grams. All embeddings are stored in the finalfusion format and can be used and processed with tools provided by the finalfusion ecosystem.
- N-Gram range (inclusive): 3 - 6
- Number of hashing buckets: 2^21
- Hashing function: FNV-1a
- Window size: 10
- Negative Samples: 5
- Dimensions: 300
- Minimum Token Frequency: 30
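The subword parameters above can be illustrated with a minimal sketch of how character n-grams of length 3 to 6 might be mapped to one of 2^21 buckets with FNV-1a. This is not the finalfrontier implementation: the `<`/`>` word brackets, the 64-bit hash width, hashing the UTF-8 bytes of each n-gram, and all function names are assumptions made for illustration, so the bucket indices produced here are not expected to match those used in the published embeddings.

```python
# Illustrative sketch only; names and details are assumptions, not finalfrontier's code.
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF           # keeps the hash within 64 bits

MIN_N, MAX_N = 3, 6   # n-gram range (inclusive), as listed above
BUCKET_EXP = 21       # 2^21 hashing buckets, as listed above


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                      # FNV-1a: XOR first ...
        h = (h * FNV_PRIME) & MASK_64  # ... then multiply
    return h


def subword_ngrams(word: str, min_n: int = MIN_N, max_n: int = MAX_N) -> list:
    """Return all character n-grams of the bracketed word (assumed '<' and '>' markers)."""
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def bucket_index(ngram: str, bucket_exp: int = BUCKET_EXP) -> int:
    """Map an n-gram to a bucket by keeping the lowest `bucket_exp` bits of its hash."""
    return fnv1a_64(ngram.encode("utf-8")) & ((1 << bucket_exp) - 1)


if __name__ == "__main__":
    for ngram in subword_ngrams("cat"):
        print(ngram, bucket_index(ngram))
```

Because n-grams are hashed into a fixed number of buckets, distinct n-grams can share a bucket and therefore a vector. The benefit is that a word absent from the training vocabulary can still receive a representation built from its n-gram buckets.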
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
| Name | Size |
|---|---|
| CMDI.xml (md5:6626910845b67d2e3c98265899cf5b36) | 13.7 kB |
Additional details
Related works
- Has part
  - Dataset: 10.57754/FDAT.eh5fz-7ec28 (DOI)
  - Dataset: 10.57754/FDAT.2gr88-44y24 (DOI)
  - Dataset: 10.57754/FDAT.q21vw-0fp88 (DOI)