DLCL hosts hackathon for computational criticism of Russian literature
On February 13-14, 2020, the Computational Criticism of Russian Literature Research Unit in the DLCL convened a hackathon that brought together some of its participants from beyond Stanford, along with experts in applying natural-language-processing methods to humanities research questions. Andrew Janco (Digital Scholarship Librarian, Haverford College), Hilah Kohen (literary translator and English-language news editor for the Russian news site Meduza), and Aaron Thompson (graduate student in Slavic Languages and Literatures at the University of Virginia) joined Yulia Ilchuk (professor in the DLCL’s Slavic department), Masha Gorshkova (graduate student in the DLCL’s Slavic department), and Quinn Dombrowski (the DLCL's Academic Technology Specialist) for the event, which centered on what models and data sets were available, or feasible to create, as a step toward developing a version of David Bamman’s BookNLP literary text annotation tool for Russian literature. Bamman, a professor at the I-School at UC Berkeley, joined the group for Friday’s hands-on work session.
The Computational Criticism of Russian Literature Research Unit (or “Russian NLP” for short) has been holding monthly virtual meetings since the fall. While attendance fluctuates depending on the scholars’ availability in a given month, group members have included scholars from McGill University, Harvard University, the University of Houston, Yale University, and the Higher School of Economics in Moscow. Over the course of the year, the group has surveyed available NLP tools and resources, written Jupyter notebooks to make those tools more accessible to less-technical scholars, and explored approaches to developing an NLP pipeline that would be able to perform the same kinds of annotation for Russian that BookNLP performs for English, including named-entity recognition, character name clustering, pronominal coreference resolution (to attribute the actions of “he” and “she” to specific characters), and quotation speaker identification. The in-person event at Stanford was designed to build upon the work done by the group virtually, leading to a clearer plan for how specifically to undertake developing a tool like BookNLP for another language.
On Thursday afternoon, the out-of-town visitors had an opportunity to share their current projects with Stanford’s DH community as part of the new monthly series of DH lightning talks at CESTA. Andrew Janco presented his work toward improving support for scholars who want to build domain-specific NLP models (i.e. models specifically designed to work well for a particular kind of text, like 20th century Soviet realism or children’s fairy tales). This project also involves developing an interface for subject-area experts to give feedback on automated text annotations in order to improve a model's performance. Hilah Kohen gave an overview of some of the major online resources for understanding contemporary Russian culture, describing the value of capturing these corpora for current and future use. Aaron Thompson spoke about the potential of using regular expressions (a pattern-matching syntax for searching text) for working with highly inflected languages like Russian.
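Because Russian nouns and adjectives take many case and number endings, a single regular expression can cover an entire paradigm by attaching an alternation of endings to a stem. The sketch below is a minimal illustration of this idea; the stem and the ending list are example assumptions, not an exhaustive paradigm.

```python
import re

# One pattern matches several inflected forms of "собака" (dog):
# the stem "собак" plus a non-capturing alternation of example endings.
pattern = re.compile(r"\bсобак(?:а|и|е|у|ой|ою|ам|ами|ах)\b", re.IGNORECASE)

text = "Собака лает, но собаки не кусают тех, кто гуляет с собаками."
matches = pattern.findall(text)
print(matches)  # ['Собака', 'собаки', 'собаками']
```

The non-capturing group `(?:...)` keeps `findall` returning whole word forms rather than just the endings, and `re.IGNORECASE` handles Cyrillic case folding in Python 3.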
The all-day Friday work session started with a brainstorm of the specific features that Russian scholars would like to see in an NLP pipeline. While there was some overlap with the functionality of BookNLP for English, some of the ideas were specific to Russian (e.g. identifying Slavonicisms, recent loanwords, or dialect forms), or would inevitably draw upon features of Russian morphology (e.g. the system of Russian name forms, and the mostly-productive rules for generating different diminutive forms). Other ideas included annotating passive constructions (which non-binary people have increasingly adopted to avoid grammatically gendered verb forms), pronouns indicating politeness, metaphorical language, the passage of narrative time, and ethnicity specifically and character descriptions in general, as well as effectively lemmatizing text written in pre-Revolutionary orthography.
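Lemmatizing pre-Revolutionary text usually starts with normalizing the old orthography so that standard tools can process it. The sketch below applies the character substitutions of the 1918 spelling reform; it is a simplified assumption-laden example (the final-ъ rule in particular is a rough approximation), not a complete normalizer.

```python
import re

# Character mappings from the 1918 orthographic reform:
# yat (ѣ) -> е, decimal i (і) -> и, fita (ѳ) -> ф, izhitsa (ѵ) -> и.
PRE_REFORM_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

def modernize(text: str) -> str:
    """Convert pre-1918 orthography to modern spelling (simplified sketch)."""
    text = text.translate(PRE_REFORM_MAP)
    # Drop the hard sign at the end of words, its main pre-reform use.
    return re.sub(r"ъ\b", "", text)

print(modernize("миръ и вѣра"))  # мир и вера
```

After normalization, the text could be fed to an ordinary modern-Russian lemmatizer.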
While English is the focus of an overwhelming majority of modern NLP research and development effort, Russian is comparatively well-resourced, especially when compared to most other Slavic languages. Multiple treebanks — texts that have already been thoroughly annotated for their grammatical structure — are available for Russian.
There are multiple lemmatizers, which take word forms found in text and convert them to their dictionary form, a step that is particularly important for a language like Russian, where a word can occur in 15+ different forms depending on context. There are also tools for sentiment analysis (to evaluate how “positive” or “negative” a text is), and a corpus annotated for pronominal coreference (to connect pronouns with their associated names or nouns). In addition, there are many relevant word lists published in PDFs or as text on websites (e.g. lists of recent borrowings from English, or lists of diminutive name forms) that could be incorporated into an NLP pipeline, either simply as a list of words to search for in the text, or as a set of rules that could also identify related forms (e.g. diminutives of foreign names or words). To see what, concretely, would be involved in converting one of these lists into a format usable in a pipeline, some of the hackathon participants worked on scraping, cleaning, and reformatting a list of Russian name diminutives, while other participants began gathering a corpus of texts that could serve as the basis for Russian genre classification using machine learning — showing an algorithm examples of texts from different genres, and then asking it to predict how a new text should be classified.
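Turning a word list into a set of rules might look like the sketch below, which expands a short name into candidate diminutives via suffix substitution. The suffixes and the stemming rule are illustrative assumptions; real Russian diminutive morphology is considerably richer.

```python
# Example diminutive suffixes (a tiny illustrative subset).
DIMINUTIVE_SUFFIXES = ["енька", "ечка", "ка"]

def diminutives(short_name: str) -> list:
    """Generate candidate diminutives for a short name ending in -а/-я (e.g. Саша)."""
    if short_name and short_name[-1] in "ая":
        stem = short_name[:-1]
        return [stem + suffix for suffix in DIMINUTIVE_SUFFIXES]
    return []  # names with other endings would need their own rules

print(diminutives("Саша"))  # ['Сашенька', 'Сашечка', 'Сашка']
```

A pipeline could use such generated forms alongside the scraped list itself, catching diminutives (including ones coined from foreign names) that no static list would contain.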
By the end of a day of discussion and hands-on work, the group reached the conclusion that — while they would be able to use existing tools like MyStem for part-of-speech tagging — the most effective way to get the data needed to implement the features that hold the most promise for addressing interesting disciplinary questions will be to annotate texts manually and use those annotations as training data. David Bamman’s experience building the LitBank annotated literary data set was especially useful for helping the group imagine how to go about this work, and how to enlist colleagues to contribute to the text annotation effort.
Between now and the end of the academic year, the Russian NLP Research Unit hopes to follow up on the hackathon by working toward a hand-annotated corpus that can be used to train a model to produce the same kinds of annotations on new texts. While replicating BookNLP for Russian with expanded functionality of interest to Russian literary scholars may not be feasible by the end of this year, the work that the Russian NLP Research Unit has done over the past year will be published online for scholars to consult and build upon.