Published September 15, 2020 | Version v1
Dataset
Open
Embeddings trained on CONLL2017 Corpora (conll2017-embeddings) - Part 1
Description
The embeddings were trained with finalfrontier on the CONLL2017 corpora with more than 100 million tokens. For all languages, embeddings were trained with the skipgram and structgram algorithms and contain subword n-grams. All embeddings are stored in the finalfusion format and can be used and processed with tools provided by the finalfusion ecosystem.
- N-Gram range (inclusive): 3 - 6
- Number of hashing buckets: 2^21
- Hashing function: FNV-1a
- Window size: 10
- Negative Samples: 5
- Dimensions: 300
- Minimum Token Frequency: 30
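As a rough illustration of how the subword settings above fit together, the following sketch extracts character n-grams of length 3 to 6 from a token and maps each one to one of 2^21 buckets with 64-bit FNV-1a. The function names, the `<` and `>` boundary markers, and the choice to hash the raw UTF-8 bytes are assumptions made for this example; it is not expected to reproduce the exact bucket indices that finalfrontier computes.

```python
# Illustrative sketch only: names and boundary markers are assumptions,
# not finalfrontier's actual implementation.

FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Compute the 64-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                      # FNV-1a: XOR first ...
        h = (h * FNV_PRIME) & MASK_64  # ... then multiply, wrapping at 64 bits
    return h


def char_ngrams(token: str, min_n: int = 3, max_n: int = 6) -> list:
    """Return all character n-grams of the bracketed token, min_n..max_n inclusive."""
    marked = f"<{token}>"  # assumed word-boundary markers
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


def bucket_indices(token: str, buckets_exp: int = 21) -> dict:
    """Map each n-gram of the token to a bucket in the range [0, 2**buckets_exp)."""
    mask = (1 << buckets_exp) - 1
    return {ng: fnv1a_64(ng.encode("utf-8")) & mask for ng in char_ngrams(token)}
```

For example, `bucket_indices("tree")` returns a dictionary whose keys are n-grams such as `<tr` and `tree>` and whose values are bucket numbers below 2^21. Because distinct n-grams can land in the same bucket, they may share a subword vector.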
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
CMDI_Part1.xml
Files (94.9 GB)
| MD5 checksum | Size |
|---|---|
| md5:723f6154a22664631bc1ed7ec86012ee | 11.4 GB |
| md5:c8017a9eaf60fe786429d9903c4890c5 | 12.6 GB |
| md5:7f3c6a491afe8d98e1fabee988beb15c | 32.1 kB |
| md5:d6d851e829c1dcb5e2d3a805a9afd951 | 16.7 GB |
| md5:da2ae4803fc557de9ac5e86eb440617b | 15.2 GB |
| md5:2fab0a5480b6641ab1406289dce5ceb9 | 13.4 GB |
| md5:8e18e7a2e376043a86e839b322efdecc | 5.2 GB |
| md5:7dc2a89784b8c17074772726b299f811 | 12.8 GB |
| md5:41322143d6f9a91ca27079f7a9ab1dc4 | 7.6 GB |
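Since most of the files are several gigabytes in size, it can be worth checking a download against its listed MD5 checksum. The sketch below uses only Python's standard `hashlib` module and reads the file in chunks so that it does not need to fit in memory. The function names and the file path in the usage note are hypothetical placeholders.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum("downloaded-file", "md5:723f6154a22664631bc1ed7ec86012ee")` returns `True` only if the local file hashes to the listed value; `"downloaded-file"` stands in for whichever file was actually fetched.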
Additional details
Related works
- Is part of: Collection 10.57754/FDAT.n64dr-wre27 (DOI)