Poster
in
Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Evaluations Using Wikipedia without Data Leakage: From Trusting Articles to Trusting Edit Processes
Lucie-Aimée Kaffee · Isaac Johnson
Keywords: [ Data Leakage ] [ Wikipedia ] [ Evaluation ]
In the evolving landscape of artificial intelligence and machine learning, Wikipedia has emerged as a pivotal resource for factual information. As one of the most comprehensive and accessible repositories of human knowledge, it serves as a critical reference point in many contexts, including the evaluation of machine learning models. However, the integration of Wikipedia into the training data for numerous AI models introduces a unique challenge: the potential circularity in using Wikipedia as both a training source and an evaluation benchmark. This provocations paper explores the implications of this dual role, questioning the reliability and objectivity of evaluations that rely on a dataset inherently intertwined with the models being assessed. By critically examining the use of Wikipedia in factual evaluations, this paper aims to provoke discussion on the validity of current evaluation methodologies and the necessity of developing more robust, diverse, and independent benchmarks. Specifically, we recommend revising benchmarks that capture a static snapshot of Wikipedia content to instead adopt a model based on the continuous work of Wikipedia editors to build a trustworthy encyclopaedia.