Published September 15, 2020 | Version v1
Dataset
Open
Embeddings trained on CONLL2017 Corpora (conll2017-embeddings) - Part 2
Description
The embeddings were trained with finalfrontier on the CONLL2017 corpora with more than 100 million tokens. For all languages, embeddings were trained with the skipgram and structgram algorithms and contain subword n-grams. All embeddings are stored in the finalfusion format and can be used and processed with tools provided by the finalfusion ecosystem.
- N-Gram range (inclusive): 3 - 6
- Number of hashing buckets: 2^21
- Hashing function: FNV-1a
- Window size: 10
- Negative Samples: 5
- Dimensions: 300
- Minimum Token Frequency: 30
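To make the subword parameters above more concrete, the following minimal sketch extracts character n-grams of length 3 to 6 and maps each one to one of 2^21 buckets with 64-bit FNV-1a. It rests on assumptions that are not stated in this record: words are wrapped in `<` and `>` before extraction, and the hash is computed over the UTF-8 bytes of each n-gram. The function names are hypothetical, and the bucket indices are not guaranteed to match those computed by finalfrontier or finalfusion, whose exact hashing input may differ.

```python
# Illustrative sketch only; not the finalfrontier/finalfusion implementation.
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list:
    """Return all character n-grams with min_n <= n <= max_n (inclusive).

    Assumption: the word is bracketed with '<' and '>' so that prefixes and
    suffixes yield distinct n-grams.
    """
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def ngram_buckets(word: str, n_buckets: int = 2 ** 21) -> dict:
    """Map each n-gram of the word to a bucket index in [0, n_buckets)."""
    return {
        ngram: fnv1a_64(ngram.encode("utf-8")) % n_buckets
        for ngram in char_ngrams(word)
    }


# The bracketed form of "cat" is "<cat>", which has 5 characters:
# char_ngrams("cat") == ["<ca", "cat", "at>", "<cat", "cat>", "<cat>"]
```

Because the number of buckets is fixed, distinct n-grams can collide in the same bucket; this keeps the subword matrix at a bounded size regardless of vocabulary.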
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
CMDI_Part2.xml
Total size: 63.3 GB

| MD5 checksum | Size |
|---|---|
| 041145d1bb911ff7b428e17815e4eafc | 28.9 kB |
| b87313439a4cbe11bdd872eef336f1b7 | 10.7 GB |
| 82f91dd495c3b06b5e9816f80837d630 | 13.4 GB |
| 7fac23f9dc5ab9d8d55965f0355150b3 | 15.1 GB |
| 0d656c127940cbc2ab2e48961bf25b1d | 8.4 GB |
| b0cf845fda7999ca447ed56d0ab7fbbb | 15.7 GB |
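Given the size of these files, it may be worth checking a download against the MD5 checksums listed above. The sketch below uses only the Python standard library and reads the file in chunks so that a multi-gigabyte archive does not need to fit in memory. The function names and the example file name are hypothetical; this record does not list the actual file names.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected checksum.

    The expected value may carry the 'md5:' prefix used in the listing.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example call (hypothetical local file name):
# verify_download("downloaded_part.bin", "md5:b87313439a4cbe11bdd872eef336f1b7")
```

MD5 is adequate here for detecting truncated or corrupted transfers; it is not a safeguard against deliberate tampering.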
Additional details
Related works
- Is part of collection: 10.57754/FDAT.n64dr-wre27 (DOI)