RuCor
Russian coreference corpus

Corpus description

RuCor is the first open corpus of Russian in which anaphoric and coreferential relations between noun groups are annotated. The current version of RuCor contains 156636 tokens. In addition to the annotation of coreferential and anaphoric relations, morphological annotation is provided.

Development of RuCor started in 2013 as part of the RU-EVAL-2014 project, a campaign evaluating how well Russian NLP tools resolve anaphora and extract coreference chains.

RuCor includes prose texts of different lengths and genres: news, science, fiction, blogs.

This resource is aimed at theoretical linguists working on anaphora and coreference, at developers of NLP systems, and at anyone interested in Russian syntax and discourse.

All materials are open and available for download. If you quote examples retrieved from RuCor, please cite RuCor as the source, along with the author and title of the text in question.

The web interface was designed by Dmitrij Gorshkov. The tool uses the MySQL database engine for corpus management.

Corpus users

Our target audience is specialists in theoretical and applied linguistics, as well as students and lecturers in linguistics.

RuCor can be used for a variety of purposes in theoretical research: primarily for focused studies of anaphora and coreference, but also for broader studies of syntax and discourse structure, the typology of anaphora, and the cognitive aspects of reference and referential choice.

Lecturers and students can use texts from this corpus as data in seminars and lectures on corpus technologies in applied linguistics and on anaphora and coreference in discourse. Psycholinguists may be interested in using it to identify the factors that influence referential choice in different types of texts.

NLP developers can use RuCor texts as a training set for machine learning algorithms for anaphora and coreference resolution, or as a gold standard for evaluating the performance of their software.
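As an illustration of the gold-standard use case, here is a minimal sketch of scoring a system's coreference chains against gold chains with the B-cubed metric. B-cubed is one commonly used coreference metric, chosen here only as an example; it is not necessarily the measure used by the RuCor authors. The mention identifiers (`m1`, `m2`, ...) and the toy chains are hypothetical, and the sketch assumes nothing about RuCor's file format. Mentions found on only one side contribute zero, which is a simplification.

```python
from typing import Dict, List, Set


def mention_to_chain(chains: List[Set[str]]) -> Dict[str, Set[str]]:
    """Map every mention id to the chain (set of mention ids) that contains it."""
    return {mention: chain for chain in chains for mention in chain}


def b_cubed(gold: List[Set[str]], predicted: List[Set[str]]) -> Dict[str, float]:
    """Compute B-cubed precision, recall and F1 for two sets of chains.

    Assumes each mention id belongs to at most one chain on each side.
    """
    gold_of = mention_to_chain(gold)
    pred_of = mention_to_chain(predicted)
    # Only mentions present on both sides can earn credit.
    shared = gold_of.keys() & pred_of.keys()
    if not shared:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Per-mention precision: share of the predicted chain that is truly coreferent.
    precision = sum(
        len(gold_of[m] & pred_of[m]) / len(pred_of[m]) for m in shared
    ) / len(pred_of)
    # Per-mention recall: share of the gold chain that the system recovered.
    recall = sum(
        len(gold_of[m] & pred_of[m]) / len(gold_of[m]) for m in shared
    ) / len(gold_of)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: two gold chains and a system output that misplaces "m3".
gold_chains = [{"m1", "m2", "m3"}, {"m4", "m5"}]
system_chains = [{"m1", "m2"}, {"m3", "m4", "m5"}]
scores = b_cubed(gold_chains, system_chains)
print(scores)
```

Representing chains as sets of mention ids keeps the metric independent of how mentions are stored; in practice the ids would be derived from whatever span annotation the corpus export provides.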

General statistics

number of texts: 181
number of tokens: 156637
number of coreference chains: 3638
number of selected noun groups: 16558

Distribution of text genres

news: 45%
essays: 21%
fiction: 18%
science: 9%
blogs, comments: 5%
Russian Wikipedia: 2%

Team

Project coordinator:
Svetlana Toldova (National Research University Higher School of Economics)
Software:
Dmitry Gorshkov
List of tags and instructions for annotators:
  • Alina Ladygina (University of Tübingen)
  • Ilya Azerkovitch (National Research University Higher School of Economics)
  • Julia Grishina (University of Potsdam)
  • Maria Vasilyeva (Lomonosov Moscow State University)
  • Asia Rojtberg (National Research University Higher School of Economics)
  • Matvej Kurzukov (Lomonosov Moscow State University)
  • Aleksander Kostiuk (Lomonosov Moscow State University)
  • Max Ionov (Lomonosov Moscow State University / Goethe University Frankfurt)
  • Galina Sim (Institute of Linguistics of the Russian Academy of Sciences)
Text selection and annotation:
all mentioned above and
  • Anastasia Ionova (Lomonosov Moscow State University / Leiden University)
  • Aleksander Pecheny (Institute of Russian Language of the Russian Academy of Sciences)
  • Viktoria Danilova (University of Hamburg)
...and many others, to whom we are very grateful for their contributions

Citing RuCor

If you use RuCor in your research, please cite this paper:
Toldova S.Ju., Roytberg A., Nedoluzhko A., Kurzukov M., Ladygina A., Vasilyeva M., Azerkovich I., Grishina Y., Sim G., Ivanova A., Gorshkov D. Evaluating Anaphora and Coreference Resolution for Russian // Komp'juternaja lingvistika i intellektual'nye tehnologii. Po materialam ezhegodnoj Mezhdunarodnoj konferencii «Dialog» (Bekasovo, June 4–8, 2014). Vypusk 13(20). M.: Izd-vo RGGU, 2014. P. 681–695.

Acknowledgements

RuCor development is supported by grant 15-07-09306 A, "Evaluation benchmark for information retrieval".