RuCor is the first open corpus of Russian in which anaphoric and coreferential relations between noun groups are annotated. The current version of RuCor contains 156636 tokens. In addition to the annotation of coreferential and anaphoric relations, morphological annotation is also provided.
The development of RuCor started in 2013 as part of the RUEVAL2014 project, a campaign evaluating how well Russian NLP tools resolve anaphora and extract coreference chains.
RuCor includes prose texts of different lengths and genres: news, science, fiction, and blogs.
This resource is aimed at theoretical linguists working on anaphora and coreference, at developers of NLP systems, and at anyone interested in Russian syntax and discourse.
All materials are open and available for download. If you quote examples retrieved from RuCor, please cite RuCor as the source, together with the author and the title of the text in question.
The web interface was designed by Dmitrij Gorshkov. The tool uses the MySQL database engine for corpus management.
Our target audience is specialists in theoretical and applied linguistics, as well as students and lecturers in linguistics.
RuCor can be used for a variety of purposes in theoretical research: primarily for focused studies of anaphora and coreference, but also for broader studies of syntax and discourse structure, the typology of anaphora, and the cognitive aspects of reference and referential choice.
Lecturers and students can use texts from this corpus as data in seminars and lectures on corpus technologies in applied linguistics and on anaphora and coreference in discourse. Psycholinguists may be interested in using it to identify the factors that influence referential choice in different types of texts.
NLP developers can use RuCor texts as a training set for machine learning algorithms for anaphora and coreference resolution, or as a gold standard against which to evaluate their software.
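As an illustration of gold-standard evaluation, the sketch below scores system coreference chains against gold chains using the MUC link-based metric, one commonly used coreference measure. It is a minimal sketch under stated assumptions: MUC is chosen here only as an example and is not necessarily the metric used in the evaluation campaign, the mention identifiers (`"m1"`, `"m2"`, ...) are hypothetical, and loading chains from the actual RuCor files is not shown.

```python
def _muc_side(key_chains, response_chains):
    """Fraction of coreference links in key_chains preserved by response_chains."""
    # Map every mention to the index of the response chain that contains it.
    mention_to_chain = {}
    for idx, chain in enumerate(response_chains):
        for mention in chain:
            mention_to_chain[mention] = idx

    numerator = 0
    denominator = 0
    for chain in key_chains:
        chain = set(chain)
        if len(chain) < 2:
            continue  # a singleton contributes no links
        # Count the parts the key chain is split into by the response:
        # one part per response chain it touches, plus one per unmatched mention.
        touched = {mention_to_chain[m] for m in chain if m in mention_to_chain}
        unmatched = sum(1 for m in chain if m not in mention_to_chain)
        parts = len(touched) + unmatched
        numerator += len(chain) - parts
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_score(gold_chains, system_chains):
    """Return (precision, recall, F1) of system chains against gold chains."""
    recall = _muc_side(gold_chains, system_chains)
    precision = _muc_side(system_chains, gold_chains)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data with hypothetical mention identifiers (not taken from RuCor).
gold_chains = [{"m1", "m2", "m3"}, {"m4", "m5"}]
system_chains = [{"m1", "m2"}, {"m3", "m4", "m5"}]

precision, recall, f1 = muc_score(gold_chains, system_chains)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the system attaches `m3` to the wrong chain, so one of the three gold links is lost and one of the three system links is spurious.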
|Number of texts|181|
|Number of tokens|156637|
|Number of coreference chains|3638|
|Number of selected noun groups|16558|
Toldova S.Ju., Roytberg A., Nedoluzhko A., Kurzukov M., Ladygina A., Vasilyeva M., Azerkovich I., Grishina Y., Sim G., Ivanova A., Gorshkov D. Evaluating Anaphora and Coreference Resolution for Russian // Komp'juternaja lingvistika i intellektual'nye tehnologii. Po materialam ezhegodnoj Mezhdunarodnoj konferencii «Dialog» (Bekasovo, June 4-8, 2014). Vypusk 13(20). M.: Izd-vo RGGU, 2014. P. 681-695.