

Poster in Affinity Workshop: Latinx in AI

Towards automatic identification of self-reported COVID-19 tweets: introducing a multilingual manually annotated dataset, baseline systems, and exploratory evaluations

Ramya Tekumalla · Juan Banda · Luis Alberto Robles Hernandez


Abstract:

In recent times, social networks like Twitter have emerged as vital platforms for sharing personal thoughts, opinions, and, most importantly, health-related information, especially pertaining to COVID-19. Users tend to share detailed, personal narratives that researchers could use to capture genuine self-reported health data. While the data is easily accessible, differentiating health-related self-reports from informal discussion is difficult, because it relies either on manual curation or on the availability of large manually annotated datasets on which machine learning models can be trained. Manually annotating data is immensely time-consuming because it generally requires a subject matter expert, and even more so in languages other than English, such as Spanish. In this work, we release two manually annotated datasets, one in English and one in Spanish, comprising 36,548 tweets containing self-reported COVID-19 symptoms, to aid machine learning models in extracting self-reported COVID-19 tweets. Through an extensive set of experiments, we demonstrate how these datasets can be leveraged with classical and modern machine learning algorithms to identify self-report tweets in unlabeled data. Additionally, we perform a stratified analysis of whether and how data augmentation and automatic translation could help train more generalizable models.
