NoSta-D -- Korpus von Nicht-Standardvarietäten des Deutschen
Creators
Description
Corpus of different varieties of German. The subcorpora are subsets of other corpora, named in parentheses: (1) historical data (Anselm Corpus), chat data (Dortmund Chat Corpus), learner data (Falko), spoken data (BeMaTaC), and literary prose (Kafka); (2) newspaper texts (TüBa-D/Z). The chat, spoken, prose, and newspaper subcorpora each contain approximately 5,000 tokens; the historical subcorpus contains 1,000 tokens and the learner subcorpus 2,900 tokens.
Each subcorpus is annotated with the following information: token and sentence boundaries; normalization; POS tags and dependency relations; named entities; coreference.
Other (German)
Legal ownership:
Seminar für Sprachwissenschaft, Universität Tübingen (TüBa-D/Z);
Public domain (Kafka);
Simone Schultz-Balluff & Klaus-Peter Wegera (Anselm Corpus);
Michael Beißwenger & Angelika Storrer (Dortmund Chat Corpus);
Institut für deutsche Sprache und Linguistik, HU Berlin (Falko; BeMaTaC); and
Anke Lüdeling, Stefanie Dipper, Marc Reznicek, Burkhard Dietterle (Annotations).
Files
DipperEtAltoappearNOSDAC.pdf
Additional details
Additional titles
- Alternative title (English)
- NoSta-D -- A corpus of non-standard varieties of German