Published August 20, 2020 | Version v1
Dataset
Open
Tokenized OFAI Million Post Corpus
Description
This corpus is based on the Million Post Corpus created by the Austrian Research Institute for Artificial Intelligence (OFAI). It contains the comments and articles as tokenized plain text; comments are not linked to the articles they belong to. The text was tokenized with the SoMaJo tokenizer.
Other (English)
Research carried out in work package A03 of the SFB 833.
Files
CMDI.xml
Total size: 101.8 MB

| Checksum | Size |
|---|---|
| md5:404fb848d812b079070b784ee391d8fa | 15.8 kB |
| md5:e22dcf042c6fc79902bb0d35fc56cbdc | 101.7 MB |
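
The MD5 checksums listed for the files can be used to confirm that a download arrived intact. The following is a minimal sketch of such a check, not a procedure prescribed by the dataset authors. The function names `md5_of_file` and `verify_download` are illustrative, and the local file path is an assumption, since this page does not show the name of the 101.7 MB file.

```python
import hashlib

# Checksum listed on this page for the 101.7 MB file.
EXPECTED_MD5 = "e22dcf042c6fc79902bb0d35fc56cbdc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()
```

Calling `verify_download(local_path, EXPECTED_MD5)`, where `local_path` is wherever the file was saved, returns `True` only if the file matches the published checksum. Reading in chunks keeps memory use low for the larger file.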