Published September 15, 2020 | Version v1
Dataset Open

Embeddings trained on CONLL2017 Corpora (conll2017-embeddings) - Part 1

  • 1. University of Tübingen

Description

The embeddings were trained with finalfrontier on the CONLL2017 corpora with more than 100 million tokens. For all languages, embeddings were trained with the skipgram and structgram algorithms and contain subword n-grams. All embeddings are stored in the finalfusion format and can be used and processed with the tools provided by the finalfusion ecosystem. The training hyperparameters were:

  • N-Gram range (inclusive): 3 - 6
  • Number of hashing buckets: 2^21
  • Hashing function: FNV-1a
  • Window size: 10
  • Negative Samples: 5
  • Dimensions: 300
  • Minimum Token Frequency: 30
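
To illustrate how the subword parameters above fit together, here is a minimal Python sketch that extracts character n-grams of length 3 to 6 from a token and maps each one to one of 2^21 buckets with 64-bit FNV-1a. It rests on assumptions that are not stated in this record: the token is wrapped in "<" and ">" before n-grams are extracted, the hash is taken over the UTF-8 bytes of the n-gram, and the bucket is selected by masking the low 21 bits. The function names are hypothetical, and the indices it produces are not guaranteed to match those stored in the published embeddings.

```python
# Sketch only: illustrates n-gram range 3-6, FNV-1a, and 2^21 buckets.
# Bracketing, byte encoding, and bucket selection are assumptions and
# may differ from what finalfrontier actually does.

FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


def subword_ngrams(token: str, min_n: int = 3, max_n: int = 6) -> list:
    """All character n-grams of the bracketed token, lengths min_n..max_n inclusive."""
    bracketed = f"<{token}>"  # assumed begin/end-of-word markers
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def bucket_index(ngram: str, bucket_exp: int = 21) -> int:
    """Map an n-gram to one of 2**bucket_exp buckets (assumed: low-bit mask)."""
    return fnv1a_64(ngram.encode("utf-8")) & ((1 << bucket_exp) - 1)


if __name__ == "__main__":
    for ngram in subword_ngrams("the"):
        print(ngram, bucket_index(ngram))
```

Because the number of buckets is fixed, different n-grams can share a bucket, so the bucket count trades memory against hash collisions. If the vectors are stored as unquantized 32-bit floats (also an assumption), the bucket matrix alone would take 2^21 × 300 × 4 bytes, roughly 2.5 GB per model.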

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI_Part1.xml

Files (94.9 GB)

Checksum                              Size
md5:723f6154a22664631bc1ed7ec86012ee  11.4 GB
md5:c8017a9eaf60fe786429d9903c4890c5  12.6 GB
md5:7f3c6a491afe8d98e1fabee988beb15c  32.1 kB
md5:d6d851e829c1dcb5e2d3a805a9afd951  16.7 GB
md5:da2ae4803fc557de9ac5e86eb440617b  15.2 GB
md5:2fab0a5480b6641ab1406289dce5ceb9  13.4 GB
md5:8e18e7a2e376043a86e839b322efdecc  5.2 GB
md5:7dc2a89784b8c17074772726b299f811  12.8 GB
md5:41322143d6f9a91ca27079f7a9ab1dc4  7.6 GB
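
Since the files are large, it can be worth checking a download against its MD5 checksum before use. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the path you pass in is whatever local name the downloaded file has; this record does not prescribe a verification procedure.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:723f6154a22664631bc1ed7ec86012ee")` would return `True` only if the file at `local_path` is a byte-for-byte copy of the first file listed above.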

Additional details

Related works

Is part of
Collection: 10.57754/FDAT.n64dr-wre27 (DOI)

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy

Not specified.

Completeness

Not specified.

Conformity

Not specified.

Consistency

Not specified.

Credibility

Not specified.

Processability

Not specified.

Relevance

Not specified.

Timeliness

Not specified.

Understandability

Not specified.