Humboldt-Universität zu Berlin - Sprach- und literaturwissenschaftliche Fakultät - Institut für Slawistik und Hungarologie

RRuDi – a Russian Diachronic Online Corpus

Introduction

RRuDi is a collection of texts from the history of the Russian language, made accessible online for linguistic research. The corpus has been experimentally annotated morphosyntactically using the "best bet" of three taggers: (i) an Old Church Slavonic tagger being developed in Regensburg, (ii) an Old Russian guesser being developed in Regensburg, and (iii) a modern Russian tagger (TreeTagger with Serge Sharoff's parameter files). Larger parts of the texts have been corrected by hand and annotated semi-manually with syntactic information relevant to the DFG project "Corpus linguistics and diachronic syntax: Grammaticalization of non-canonical subjects in Slavonic languages": subtypes of null subjects, subtypes of reflexive verb forms, passives and -no/-to forms (or their relatives).

Technology

RRuDi was devised within our current project to investigate the diachronic development of various subtypes of null subjects, quirky subjects and reflexive constructions in Russian. This requires fairly deep annotation, as it depends on formal, but also on semantic and contextual (coreferentiality) information. We planned to combine the mature automatic tools available for modern Russian with a good deal of manual annotation, at least for larger excerpts, as has been common in diachronic corpus linguistics (cf. the Helsinki Corpus). In our experience, the annotation process should be maximally flexible in allowing "shortcuts", such as (regular expression) rules and easy replacements over annotations; at the same time, it should be restrictive as to the available feature values and manual input routines, in order to avoid typing errors and the like. Ideally, external automatic annotators (e.g., taggers) can be applied easily at any point without breaking the current annotation. A tool which combines these properties very conveniently is GATE. (Another one would be UIMA.)
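The two requirements just named, flexible rule-based shortcuts and a closed inventory of feature values, can be illustrated with a minimal Python sketch. It is not how GATE implements them, and the token layout, the feature name "reflexive", the ALLOWED inventory and the function apply_rule are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical closed inventory of permitted values per feature.
ALLOWED = {"reflexive": {"yes", "no"}}


def apply_rule(tokens, pattern, feature, value, allowed=ALLOWED):
    """Set feature=value on every verb token whose form matches pattern.

    Tokens are dicts with at least the keys 'form' and 'pos' (assumed layout).
    Tokens that already carry the feature are left untouched, so earlier
    manual corrections survive. Returns the number of tokens changed.
    """
    # Restrictive side: reject values outside the closed inventory.
    if value not in allowed.get(feature, set()):
        raise ValueError(f"{value!r} is not a permitted value for {feature!r}")
    regex = re.compile(pattern)
    changed = 0
    for tok in tokens:
        is_verb = tok.get("pos") == "V"
        if is_verb and feature not in tok and regex.search(tok["form"]):
            tok[feature] = value
            changed += 1
    return changed
```

A rule such as apply_rule(tokens, r"(ся|сь)$", "reflexive", "yes") would then pre-mark candidate reflexive verb forms for later manual checking, while a mistyped value is refused before it can enter the annotation.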

We used GATE for the whole annotation process. The taggers were integrated as external "Generic Taggers", patched up with many JAPE rules for postprocessing. 

GATE uses a standoff XML format, with one large file per text. This is already close to PAULA, the input format for Annis-2, but some conversion is necessary. A tool which came in handy was the exporter from GATE devised by the American National Corpus. Its output consists of XML elements for the GATE annotation types, together with their spans on the token baseline and their annotation feature values. From this, we convert further into the EXMARaLDA format with the help of a Python program, which orders the information in tiers and relabels annotation features according to a configurable specification. EXMARaLDA XML can then be processed automatically by the SaltNPepper converter into the respective Annis-2 database tables.
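The core of the tier-ordering and relabeling step might look roughly like the following sketch. It is not the project's actual program: the input layout (annotations as dicts with token-index spans), the specification TIER_SPEC, the tier names and the function to_tiers are assumptions made for illustration, and the serialization to real EXMARaLDA XML is left out.

```python
from collections import defaultdict

# Hypothetical relabeling specification:
# (GATE annotation type, feature name) -> name of the target tier.
TIER_SPEC = {
    ("Token", "category"): "pos",
    ("Token", "lemma"): "lemma",
    ("Subject", "subtype"): "null_subject",
}


def to_tiers(annotations, spec):
    """Group span annotations into named tiers.

    Each annotation is a dict with the keys 'type', 'start', 'end'
    (token indices on the baseline, end exclusive) and 'features'
    (a dict of feature values). Features not listed in the
    specification are dropped.
    """
    tiers = defaultdict(list)
    for ann in annotations:
        for feature, value in ann["features"].items():
            tier = spec.get((ann["type"], feature))
            if tier is not None:
                tiers[tier].append((ann["start"], ann["end"], value))
    # Order the events of each tier by their position on the token baseline.
    return {name: sorted(events) for name, events in tiers.items()}
```

Keeping the mapping in a separate table rather than in the code means that tiers can be renamed or added without touching the conversion logic, which is the point of a configurable specification.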

Annis-2 is the database and web interface of choice for this type of corpus. Its main purpose is to visualize and make queryable "complex multilevel linguistic corpora with diverse types of annotation".

Why so complicated?

First of all, because the desired annotation itself is complicated. It applies to overlapping, but non-identical units at several levels. It represents, within the corpus and visible to everybody, the kind of linguistic categorisations that used to be hidden in old-fashioned card files or in personal electronic databases. Secondly, because the annotation process is difficult, depending on a flexible combination of automatic, semi-automatic, and fully manual steps. And finally, because the procedure described will pay off in the long run: it is robust enough to allow for future extensions, notably into real syntactic annotation.

RRuDi's corpus composition and all technical aspects of the project are discussed in more detail in Roland Meyer's 2011 habilitation thesis (ch. 2), available upon request from the address below.

Access

At present, access is provided to some texts (login rrudi/rrudi). After moving to a new version of the server, the corpus has to be imported anew, which takes a couple of days. If you would like to use RRuDi (for research purposes only), please print out and sign the license agreement, and return it to us by fax or as a scan by email. You will then receive credentials to access the full corpus.

Comments or requests

Please write to roland <dot> meyer <at> hu-berlin <dot> de