Published August 20, 2020 | Version v1
Dataset Open

Tokenized OFAI Million Post Corpus

  • 1. University of Tübingen

Description

This corpus is based on the Million Post Corpus created by the Austrian Research Institute for Artificial Intelligence (OFAI). It contains the comments and articles as tokenized plain text; the association between comments and their articles is not preserved. The text was tokenized with the SoMaJo tokenizer.

Other (English)

Research carried out in work package A03 of the SFB 833.

Files

CMDI.xml

Files (101.8 MB)

Name Size
md5:404fb848d812b079070b784ee391d8fa
15.8 kB
md5:e22dcf042c6fc79902bb0d35fc56cbdc
101.7 MB
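
The MD5 checksums listed above can be used to confirm that a download arrived intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listed_checksum` are illustrative and not part of this record, and the local file path is whatever name the file was saved under.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so a ~100 MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("corpus_download", "md5:e22dcf042c6fc79902bb0d35fc56cbdc")` should return True for an intact copy of the 101.7 MB file, where `corpus_download` is a hypothetical local path.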

Additional details

Funding

Deutsche Forschungsgemeinschaft
SFB 833: Bedeutungskonstitution - Dynamik und Adaptivität sprachlicher Strukturen 75650358

Data quality

Accuracy

Not specified.

Completeness

Not specified.

Conformity

Not specified.

Consistency

Not specified.

Credibility

Not specified.

Processability

Not specified.

Relevance

Not specified.

Timeliness

Not specified.

Understandability

Not specified.