Published September 15, 2020 | Version v1
Dataset
Open
Embeddings trained on CONLL2017 Corpora (conll2017-embeddings) - Part 3
Description
The embeddings were trained with finalfrontier on the CONLL2017 corpora that contain more than 100 million tokens. For all languages, embeddings were trained with the skipgram and structgram algorithms and contain subword n-grams. All embeddings are stored in the finalfusion format and can be used and processed with tools provided by the finalfusion ecosystem.
- N-Gram range (inclusive): 3 - 6
- Number of hashing buckets: 2^21
- Hashing function: FNV-1a
- Window size: 10
- Negative Samples: 5
- Dimensions: 300
- Minimum Token Frequency: 30
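The subword parameters above (n-gram range 3 to 6, 2^21 buckets, FNV-1a) can be illustrated with a minimal sketch of how a token might be split into character n-grams and mapped to hash buckets. This is not finalfrontier's implementation: the `<` and `>` boundary markers, hashing the UTF-8 bytes of each n-gram, the 64-bit FNV-1a variant, and the modulo reduction are all assumptions made for illustration, and the function names are hypothetical. Bucket indices computed here should not be expected to match those stored in the published files.

```python
# Illustrative sketch only; details marked "assumption" are not taken from
# the dataset description and may differ from finalfrontier's behaviour.

FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME_64 = 0x100000001B3              # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF

MIN_N, MAX_N = 3, 6      # n-gram range (inclusive), from the list above
N_BUCKETS = 2 ** 21      # number of hashing buckets, from the list above


def fnv1a_64(data: bytes) -> int:
    """Compute the 64-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte                       # FNV-1a: XOR first ...
        h = (h * FNV_PRIME_64) & MASK_64  # ... then multiply, wrapping at 64 bits
    return h


def char_ngrams(token: str, min_n: int = MIN_N, max_n: int = MAX_N) -> list:
    """Return all character n-grams of the bracketed token.

    Assumption: the token is wrapped in '<' and '>' before extraction.
    """
    bracketed = f"<{token}>"
    return [
        bracketed[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(bracketed) - n + 1)
    ]


def bucket_indices(token: str) -> list:
    """Map each n-gram of a token to a bucket index in [0, N_BUCKETS).

    Assumption: UTF-8 bytes are hashed and reduced with a modulo.
    """
    return [fnv1a_64(ng.encode("utf-8")) % N_BUCKETS for ng in char_ngrams(token)]


if __name__ == "__main__":
    for ngram, bucket in zip(char_ngrams("cat"), bucket_indices("cat")):
        print(ngram, bucket)
```

Under these assumptions, a short token such as `cat` yields six n-grams (`<ca`, `cat`, `at>`, `<cat`, `cat>`, `<cat>`), each of which shares a bucket embedding with any other n-gram that hashes to the same index.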
Other (English)
Research carried out in work package A03 of the SFB 833.
Files (94.8 GB)
Preview: CMDI_Part3.xml
| Checksum | Size |
|---|---|
| md5:d3e2afc9b10b713b188b207a7ec46b58 | 32.3 kB |
| md5:94956fe63496c55681c0b1aadae8805d | 6.4 GB |
| md5:d770e24d7ff460314cc8f377541654e1 | 10.8 GB |
| md5:fe04617b5c0815d0f530acf0410ca940 | 13.4 GB |
| md5:90300040756128cdd8658eecc3cf57bc | 17.2 GB |
| md5:f63f2c123b325f494a29c18426d63f3b | 11.8 GB |
| md5:93b920f41d97be2e6bcd7c16fc2ef864 | 16.5 GB |
| md5:bdf2f6fd8cfb52d52308b70e439dda9f | 12.2 GB |
| md5:9138c3ac609298775eb0fcf131f42889 | 6.6 GB |
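Because the individual files are several gigabytes each, it may be worth verifying a download against the listed MD5 checksum. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listing` and the example path are hypothetical, and the file is read in chunks so it never has to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <downloaded file> <md5:... value from the listing>
    if len(sys.argv) == 3:
        ok = matches_listing(sys.argv[1], sys.argv[2])
        print("checksum OK" if ok else "checksum MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again before use.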
Additional details
Related works
- Is part of: Collection, DOI 10.57754/FDAT.n64dr-wre27