

Poster in Affinity Workshop: Women in Machine Learning

Computational models of Language Variation in Literary Narratives

Krishnapriya Vishnubhotla


Abstract:

The availability of massive text datasets is promising for understanding linguistic variation on a large scale. Literary authors often have distinctive writing styles; this style changes over time, with the genre of the text, and even through the course of a single novel.

This work aims to develop techniques that can model stylistic variation in character voices within literary texts. A non-trivial problem here is that of reliably attributing quotations within a novel to the characters that utter them, a task called quotation attribution. It is particularly challenging in literary texts because of the large amount of variation in narrative style and structure, and the lack of annotated datasets in this domain to train models. We have so far annotated a set of 25 full-length English-language novels for various aspects of quotation and coreference within them, and this is by an order of magnitude the largest such dataset for literary texts. A preliminary stylometric classification model achieves an average accuracy of 0.60 on this dataset. We are currently working on improving this model with contextual features obtained using PLLMs that are fine-tuned for character identification and quotation attribution in a semi-supervised setup.

The resulting quotation attribution model, when applied to a large-scale corpus of literary novels, can be used to analyse several questions of interest regarding the choices authors make when writing their characters, and how these choices vary with the demographic characteristics of the author, across different decades, and across genres. Do female authors write female characters with more or less stylistic distinctiveness compared with male authors? Do certain authors write more “balanced” characters across the board? How does this change across decades and centuries? Our work has the potential to answer these questions in a data-driven manner, and to shed light on various biases, implicit and explicit, that exist in the literary canon.
