Yet Another RussNet
The Yet Another RussNet (YARN) project was initiated in 2013 and aims to create a large, open, WordNet-like machine-readable thesaurus for the Russian language through crowdsourcing. The project objectives include creating the thesaurus, developing free/libre and open source software to work with it, designing the necessary data schemas and models, and writing technical and user documentation. The thesaurus is available under the CC BY-SA license.
YARN is conceptually similar to Princeton WordNet and its followers: it consists of synsets, that is, groups of quasi-synonyms corresponding to a concept. Concepts are linked to each other, primarily via hierarchical hyponymic/hypernymic relationships. According to the project's outline, YARN contains nouns, verbs, and adjectives. We aim to split the process of thesaurus creation into smaller tasks and to develop a custom interface for each of them. The first step is an online tool for building noun synsets based on the content of existing dictionaries. The goal of this stage is to establish the YARN core content, test and validate the crowdsourcing approach, prepare annotated data for automatic methods, and create a basis for work on the other parts of speech.
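The synset structure described above can be illustrated with a minimal Python sketch. This is not YARN's actual data model: the class name `Synset`, the identifier scheme (`s1`, `s2`, ...), the `hypernym_chain` helper, and the sample words are all hypothetical and serve only to show how groups of quasi-synonyms might be linked through hypernym relations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Synset:
    """A group of quasi-synonyms that together express one concept."""
    synset_id: str                      # hypothetical identifier scheme
    words: List[str]                    # quasi-synonyms in this synset
    hypernyms: List["Synset"] = field(default_factory=list)  # more general concepts

    def hypernym_chain(self) -> List[str]:
        """Follow the first hypernym link upward and collect synset ids."""
        chain = []
        seen = {self.synset_id}
        current = self
        while current.hypernyms:
            current = current.hypernyms[0]
            if current.synset_id in seen:   # guard against accidental cycles
                break
            seen.add(current.synset_id)
            chain.append(current.synset_id)
        return chain


# Illustrative data only, not taken from the thesaurus itself.
animal = Synset("s1", ["животное", "зверь"])
dog = Synset("s2", ["собака", "пёс"], hypernyms=[animal])
puppy = Synset("s3", ["щенок"], hypernyms=[dog])

print(puppy.hypernym_chain())  # ['s2', 's1']
```

Storing hypernyms as a list rather than a single parent leaves room for a concept to have more than one more general concept, which WordNet-like resources typically allow.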
As mentioned above, important characteristics of the project are its openness and recruitment of volunteers. Our crowdsourcing approach is different, for example, from the one described in, where AMT turkers form synsets using the criterion of contextual substitutability directly. In our case, editors assemble synsets using word lists and definitions from dictionaries as “raw material”. Such a task requires at least minimal lexicographical skills and is more complicated than an average task offered to AMT workers. Our target editors are college or university students, preferably from linguistics departments, who are native Russian speakers. Ideally, students are instructed by a university teacher and can seek the teacher's advice in complex cases. As in the case of Wikipedia and Wiktionary, we foresee two levels of contributors: line editors and administrators with the corresponding privileges. We expect the total number of line editors to reach two hundred within a year.
We use both XML and CSV formats to represent our import and export data. Please consult the format specification on NLPub, which is available in English.
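As a rough illustration of working with a CSV export, the following Python sketch reads synsets from CSV text using only the standard library. The layout is an assumption made for this example: the column names `id` and `words`, and the use of `;` to separate words within a synset, are hypothetical and are not taken from the NLPub specification, which should be consulted for the real format.

```python
import csv
import io

# Hypothetical export layout: one synset per row, words separated by ";".
# The real column names and separators are defined in the format specification.
SAMPLE = "id,words\n1,собака;пёс\n2,животное;зверь\n"


def read_synsets(handle):
    """Return a dict mapping each synset id to its list of words."""
    reader = csv.DictReader(handle)
    return {row["id"]: row["words"].split(";") for row in reader}


synsets = read_synsets(io.StringIO(SAMPLE))
print(synsets["1"])  # ['собака', 'пёс']
```

To read from a file instead of an in-memory string, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.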
The list of our publications is available.
This work is supported by the Russian Foundation for the Humanities, projects no. 13-04-12020 “New Open Electronic Thesaurus for Russian” and no. 16-04-12019 “RussNet and YARN thesauri integration”, by the Russian Foundation for Basic Research, project no. 16-37-00354 мол_а “Adaptive Crowdsourcing Methods for Linguistic Resources”, and by the Mikhail Prokhorov Foundation.