Humboldt-Universität zu Berlin - Sprach- und literaturwissenschaftliche Fakultät - Institut für Slawistik und Hungarologie

SPOC - Spoken Corpus

The project line "Spoken Corpus" uses the complex setting of developing and using a speech corpus for both didactic and scientific purposes.


No serious corpora of spoken language are currently available for BCMS/Albanian. Developing them requires trained empirical, corpus and computational linguists. Such training within the framework of "Digital Humanities" is also vital for the employability of graduates.


  • Development of a joint curriculum or teaching plan (topics, handouts, training material, literature) for the whole workflow from data collection to analysis

  • Creation of a joint working environment for corpus development and use, including tools for collaboration (e.g. Slack, Moodle, ...)

  • Construction of a corpus of spoken language covering all varieties of BCMS/Albanian, including bilinguals, heritage speakers and data from experiments (like CHILDES), equipped with multi-layered linguistic and "social" annotation

  • Training of students and staff in all relevant domains of action

  • Empowerment for fieldwork, also in politically sensitive regions

Domains of action:

A Linguistic Fieldwork, Sampling

B Corpus Linguistics

C Computational Linguistics

D Language Theory

E Socio-, Ethnolinguistics


linguistic fieldwork, linguistic experimentation, transcription, corpus creation, sampling, annotation, tagging, scripting, statistics, socio-, ethno- and "simple" linguistic analysis, translation, project management, ... you name it


  • Workshops for qualification of staff in Berlin and partner countries

  • Individual teaching mobilities on topics A-E

  • Individual student mobilities with integration into the project

  • Virtual joint seminars with staff exchange

  • Student projects