Humboldt-Universität zu Berlin - Sprach- und literaturwissenschaftliche Fakultät - Institut für Slawistik und Hungarologie

Corpus Linguistics

The project line "Spoken Corpus" uses the complex setting of developing and using a speech corpus for both didactic and scientific purposes.

Incentive:

No substantial corpora of spoken language are currently available for BCMS or Albanian. Developing them requires trained empirical, corpus and computational linguists. Such training within the framework of "Digital Humanities" is also vital for the employability of graduates.

Aims:

  • Development of a joint curriculum or teaching plan (topics, handouts, training material, literature) for the whole workflow from data collection to analysis

  • Creation of a joint working environment for corpus development and use, including collaboration tools (e.g. Slack, Moodle, ...)

  • Construction of a corpus of spoken language with all varieties of BCMS/Albanian, including bilinguals, heritage speakers and data from experiments (like CHILDES), equipped with a multi-layered linguistic and "social" annotation

  • Training of students and staff in all of the domains of action listed below

  • Empowerment for fieldwork, also in politically sensitive regions

Domains of action:

A Tool Creation

B Corpus Design

C Corpus Creation

D Corpus Analysis

E Social Corpus Linguistics

Competencies

linguistic fieldwork, linguistic experimentation, transcription, corpus creation, sampling, annotation, tagging, scripting, statistics, socio-, ethno- and "simple" linguistic analysis, translation, project management, ... you name it

Formats

  • Workshops for qualification of staff in Berlin and partner countries

  • Individual staff training mobilities on topics A-E

  • Individual student mobilities with integration in project

Steering Committee

  • Bardh Rugova (Prishtina)
  • Branimir Stanković (Niš)
  • Ismail Palić (Sarajevo)
  • Ivana Vučina (Beograd)
  • Jelena Petković (Kragujevac)
  • Philipp Wasserscheidt (HU)

Projects

  • BosCO
    Construction of a corpus of spoken language with all varieties of BCMS in Bosnia & Herzegovina as well as Bosnian varieties outside BiH, including bilinguals, heritage speakers and data from experiments (like CHILDES), equipped with a multi-layered linguistic and "social" annotation. In collaboration with the University of Sarajevo and the Institut za Jezik Sarajevo.
  • Voices of the city
    Corpus of urban varieties of Central and Southern Serbian cities. In collaboration with the universities of Niš and Kragujevac.
  • Corpus of Narratives
    Corpus of spoken personal narratives. The material is taken from ethnological fieldwork in the Southern Banat. Annotations are made on several levels, with a focus on narrative elements, the structure of personal narratives, and constructions in the sense of construction grammar. In collaboration with the Institute for Balkan Studies of the Serbian Academy of Sciences and Arts.
    http://poincare.matf.bg.ac.rs/~andjelkaz/diwna/

  • Corpus of Serbian in Hungary
    Spoken and written corpus of the Serbian minority in Hungary. In collaboration with ELTE and the Serbian Institute, Budapest.