Published February 25, 2014 | Version v1
Dataset Open

NoSta-D -- Korpus von Nicht-Standardvarietäten des Deutschen

Contributors

Contact person:

  • 1. Humboldt-Universität zu Berlin

Description

Corpus of different varieties of German. Each subcorpus is a subset of another corpus, named in parentheses: (1) historical data (Anselm Corpus), chat data (Dortmund Chat Corpus), learner data (Falko), spoken data (BeMaTaC), and literary prose (Kafka); (2) newspaper texts (TüBa-D/Z). The chat, spoken, literary prose, and newspaper subcorpora contain approximately 5,000 tokens each; the historical subcorpus contains 1,000 tokens and the learner subcorpus 2,900 tokens.
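
As a rough orientation, the subcorpus sizes stated above can be added up to estimate the size of the whole corpus. The sketch below uses only the figures from the description; the dictionary keys are illustrative labels chosen here, not file or directory names from the dataset, and four of the six figures are approximate.

```python
# Token counts as stated in the description (the 5,000 figures are approximate).
# The keys are illustrative labels, not names used in the dataset itself.
SUBCORPUS_TOKENS = {
    "historical (Anselm Corpus)": 1000,
    "chat (Dortmund Chat Corpus)": 5000,
    "learner (Falko)": 2900,
    "spoken (BeMaTaC)": 5000,
    "literary prose (Kafka)": 5000,
    "newspaper (TüBa-D/Z)": 5000,
}

# Approximate total size of the corpus in tokens.
total_tokens = sum(SUBCORPUS_TOKENS.values())

# Share of each subcorpus in the total, as a fraction between 0 and 1.
shares = {name: count / total_tokens for name, count in SUBCORPUS_TOKENS.items()}

if __name__ == "__main__":
    print(f"approximate total: {total_tokens} tokens")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

Under these figures the corpus comes to roughly 23,900 tokens, of which the five non-newspaper subcorpora account for about 18,900.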

Each subcorpus is annotated with the following information: token and sentence boundaries; normalization; POS tags and dependency relations; named entities; coreference.

Other (German)

Legal ownership:

Seminar für Sprachwissenschaft, Universität Tübingen (TüBa-D/Z);

Public domain (Kafka);

Simone Schultz-Balluff & Klaus-Peter Wegera (Anselm Corpus);

Michael Beißwenger & Angelika Storrer (Dortmund Chat Corpus);

Institut für deutsche Sprache und Linguistik, HU Berlin (Falko; BeMaTaC); and

Anke Lüdeling, Stefanie Dipper, Marc Reznicek, Burkhard Dietterle (Annotations).

Files

DipperEtAltoappearNOSDAC.pdf

Files (19.0 MB)

Checksum (Size)
md5:819283cbca1f3b8cfc8e038458a876cd (19.6 kB)
md5:7fcadcb833efff6999dee4afcce53324 (114.0 kB)
md5:462c83bce9c94b8a1f9c0dcb511683da (18.9 MB)
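
The MD5 checksums listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `matches_record` are hypothetical helpers introduced here. Because the listing does not show which file name belongs to which checksum, the sketch only tests whether a file's digest is one of the three listed values.

```python
import hashlib

# MD5 digests as listed in this record. The listing does not show the
# corresponding file names, so no mapping from name to digest is assumed.
EXPECTED_MD5 = {
    "819283cbca1f3b8cfc8e038458a876cd",
    "7fcadcb833efff6999dee4afcce53324",
    "462c83bce9c94b8a1f9c0dcb511683da",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return md5_of_file(path) in EXPECTED_MD5
```

For example, `matches_record("downloaded_file")` (with the path replaced by wherever the download was saved) returns `True` only if the file's digest equals one of the three values above.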

Additional details

Additional titles

Alternative title (English)
NoSta-D -- A corpus of non-standard varieties of German

Data quality

Accuracy, completeness, conformity, consistency, credibility, processability, relevance, timeliness, understandability: not specified.