Poster in Workshop: Generative AI and Creativity: A dialogue between machine learning researchers and creative professionals
FicSim: An Ethically Constructed Dataset for Long-Context Semantic Similarity Comparison within Fiction
Natasha Johnson · Amanda Bertsch · Emma Strubell
As language models continue to advance in their ability to process long and complex texts, there has been growing interest in their application within computational literary studies (CLS). As CLS tools have developed, many researchers have turned to public domain eBook collections, such as Project Gutenberg, to test their models. However, large-scale web scraping and model contamination challenge the reliability of such evaluations and call for novel methods of data collection. In response, we assemble a dataset of literature that has been excluded from large-scale scraping, alongside similarity scores in a variety of literary categories. This dataset, FicSim, can be used to evaluate models for long-context semantic textual similarity comparison within fiction. Throughout our data-collection process, we prioritize author agency and rely on continual, informed author consent. We thus demonstrate how high-quality literary datasets can be constructed without undermining authors’ rights to their work.