KiDKo: Main corpus and complementary corpus

The KiezDeutsch-Korpus (KiDKo) is a multi-modal digital corpus of spontaneous discourse data from informal, oral peer group situations in multi- and monoethnic speech communities. 
It was developed between 2008 and 2015 by project B6 (PI: Heike Wiese) of the collaborative research centre Information Structure (SFB 632) at the University of Potsdam.

Wiese, Heike; Rehbein, Ines; Schalowski, Sören; Freywald, Ulrike, & Mayr, Katharina (2010ff). KiDKo - A corpus of spontaneous conversations among adolescents in multiethnic and monoethnic urban Germany. Potsdam: University of Potsdam.

Data collection

Spontaneous speech data of young people, from self-recordings: informal conversations between friends, mostly in German.

Speakers

9th-grade students who were between 14 and 17 years old at the time of recording. Initial contact was made via two schools, one in Berlin-Kreuzberg and one in Berlin-Hellersdorf, where 84.4% and 4.8% of students, respectively, had a "non-German background language" (i.e., on a questionnaire issued by the Berlin school administration, parents indicated that the main language spoken at home was not German; see also Wiese et al. 2012).

You can find detailed information about the anchor speakers here.

Here is a table giving the figures for individual speakers' shares of the corpus.

Size

Main corpus: ~ 228,000 tokens;
  17 anchor speakers (10 male, 7 female)
Complementary corpus: ~ 105,000 tokens; 
  6 anchor speakers (5 male, 1 female)
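
As a quick check on the figures above, the two sub-corpora can be added up. The sketch below uses only the approximate numbers stated in this section; the variable names are illustrative and not part of any KiDKo tooling.

```python
# Approximate figures as stated above; names are illustrative only.
corpora = {
    "main": {"tokens": 228_000, "male": 10, "female": 7},
    "complementary": {"tokens": 105_000, "male": 5, "female": 1},
}

# Combined size of both sub-corpora (approximate token counts).
total_tokens = sum(c["tokens"] for c in corpora.values())

# Combined number of anchor speakers across both sub-corpora.
total_speakers = sum(c["male"] + c["female"] for c in corpora.values())

# Share of the main corpus in the combined token count.
main_share = corpora["main"]["tokens"] / total_tokens

print(total_tokens)          # 333000
print(total_speakers)        # 23
print(f"{main_share:.1%}")   # 68.5%
```

Taken together, the two sub-corpora comprise roughly 333,000 tokens from 23 anchor speakers, with the main corpus accounting for about two thirds of the tokens.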

Corpus features

(cf. Rehbein, Schalowski & Wiese 2014)

The corpus consists of audio recordings with aligned, anonymised transcriptions. It contains part-of-speech (POS) information (Rehbein & Schalowski 2013) and provides an additional orthographic normalisation layer as well as translations of Turkish code-switching. A further annotation level provides information on syntactic chunks and topological fields.

The data were transcribed in EXMARaLDA (Extensible Markup Language for Discourse Annotation) (Schmidt & Wörner 2005). Transcription conventions are based on a modified form of 'GAT basic' (Selting et al. 1998), i.e. a mostly orthographic transcription that marks certain prosodic features: upper case for stress, specific characters for pauses and lengthening, and parentheses for non-verbal material.
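
To illustrate how such conventions can be processed, here is a minimal sketch that separates stressed tokens and non-verbal material in a transcript line. It assumes only what is described above (upper case marks stress, parentheses enclose non-verbal material); the sample line, the regular expressions, and the function name `analyse` are hypothetical, and the symbols for pauses and lengthening are not handled. The exact conventions are documented in the KiDKo transcription guidelines.

```python
import re

# Hypothetical example line, not taken from the corpus.
line = "ich hab das GEStern gesehen (lacht) und DANN war es weg"

# Assumption: non-verbal material is enclosed in a single pair of parentheses.
NONVERBAL = re.compile(r"\(([^)]*)\)")
# Assumption: a run of two or more upper-case letters marks stress.
STRESS = re.compile(r"[A-ZÄÖÜ]{2,}")

def analyse(line):
    """Split a transcript line into tokens, stressed tokens and non-verbal material."""
    nonverbal = NONVERBAL.findall(line)      # text inside parentheses
    spoken = NONVERBAL.sub(" ", line)        # drop the non-verbal material
    tokens = spoken.split()
    stressed = [t for t in tokens if STRESS.search(t)]
    return {"tokens": tokens, "stressed": stressed, "nonverbal": nonverbal}

result = analyse(line)
print(result["stressed"])    # ['GEStern', 'DANN']
print(result["nonverbal"])   # ['lacht']
```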

Each transcript contains meta-information on socio-demographic features and the linguistic background of the speakers (for all anchor speakers: sex, residential area, family language).
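
The anchor-speaker metadata listed above can be pictured as a small record type. The following is only a sketch: the class `AnchorSpeaker`, its field names, the helper `select`, and the three records are placeholders for illustration, not the actual KiDKo metadata format or real speaker data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnchorSpeaker:
    speaker_id: str          # hypothetical identifier
    sex: str
    residential_area: str
    family_language: str

# Placeholder records for illustration; not real corpus metadata.
speakers = [
    AnchorSpeaker("S01", "female", "Berlin-Kreuzberg", "Turkish"),
    AnchorSpeaker("S02", "male", "Berlin-Kreuzberg", "German"),
    AnchorSpeaker("S03", "male", "Berlin-Hellersdorf", "German"),
]

def select(speakers, **criteria):
    """Return the speakers whose fields match all given criteria."""
    return [
        s for s in speakers
        if all(getattr(s, key) == value for key, value in criteria.items())
    ]

kreuzberg = select(speakers, residential_area="Berlin-Kreuzberg")
print([s.speaker_id for s in kreuzberg])   # ['S01', 'S02']
```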

Corpus access

The corpus is available online via ANNIS. ANNIS is an open source platform that allows browser-based searches of linguistically annotated corpora. 

If you want to access KiDKo using ANNIS, please fill in the license request form. Your login details will be sent to you via email.

For legal reasons, we are not allowed to make the audio files accessible online. Instead, we have set up a local workstation in Potsdam where you can access the audio data. If you would like to use it, please contact us to arrange an appointment (heike.wiese at uni-potsdam.de).

Information on KiDKo and ANNIS

Here you can find a general overview and introduction, with some examples on KiDKo searches.

Here you can find information on the transcription and normalisation in KiDKo.

STTS Guidelines (Stuttgart-Tübingen Tagset)

Overview of the STTS POS tagset

Extended POS tagset used for the annotation of parts-of-speech in KiDKo

Quickstart - working with ANNIS and KiDKo

ANNIS User Guide

References

Rehbein, I., Schalowski, S., and Wiese, H. (2014). The KiezDeutsch Korpus (KiDKo) Release 1.0. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 24-31, 2014. Reykjavik, Iceland.

Rehbein, I., and Schalowski, S. (2013). STTS goes Kiez - Experiments on Annotating and Tagging Urban Youth Language. Journal for Language Technology and Computational Linguistics 28: 199-227 (Themenheft "Das STTS-Tagset für Wortartentagging - Stand und Perspektiven").

Selting, Margret; Auer, Peter; Barden, Birgit; Bergmann, Jörg; Couper-Kuhlen, Elizabeth; Günthner, Susanne; Meier, Christoph; Quasthoff, Uta; Schlobinski, Peter; Uhmann, Susanne (1998). Gesprächsanalytisches Transkriptionssystem (GAT). Linguistische Berichte 173: 91-122.

Wiese, Heike; Freywald, Ulrike; Schalowski, Sören, & Mayr, Katharina (2012). Das KiezDeutsch-Korpus. Spontansprachliche Daten Jugendlicher aus urbanen Wohngebieten. Deutsche Sprache 40: 97-123.

Zeldes, A., Ritz, J., Lüdeling, A., and Chiarcos, C. (2009). Annis: A search tool for multi-layer annotated corpora. In Proceedings of Corpus Linguistics, July 20-23, 2009. Liverpool, UK.