Poster in Affinity Workshop: Women in Machine Learning
Adversarial Analysis of Fake News Detectors
Annat Koren · Hunter Ireland · Sandra Luo · Eryn Jagelski-Buchler · Edoardo Serra · Francesca Spezzano
In recent years, machine learning models have been developed to mitigate the problem of fake news. dEFEND [2], a state-of-the-art natural language processing (NLP) model, uses news content, comments, and the relation between the two to detect fake news. We aim to expose vulnerabilities in the model so that it can be strengthened against attempts to mislead it with manipulated data.

Attacks on fake news detection models are a growing concern and an active area of research. One product of this research is MALCOM [1], a GAN-based malicious comment generator that reportedly forces fake or real classifications with success rates upwards of 93%. MALCOM generates comments that are stylistically similar and topically relevant to the input text, alleviating common problems with attacks on NLP models (e.g., producing nonsensical examples). However, it is possible to detect that these comments are computer-generated. We instead use real comments from the same dataset, so that they are indistinguishable from the rest. This approach aims to match Le et al.'s [1] results in a less complex and less computationally expensive way.

Using the FakeNewsNet dataset [3], we develop an attack by grouping articles and their preexisting comments into topics, then computing their similarity, or "distance," from one another. Using this, we identify both generic and topic-specific comments that can sway dEFEND's classification of an article (a sketch of this selection step follows below). For comparison, we implement CopyCat, a baseline attack used by Le et al. [1] that "randomly retrieves a comment from a relevant article in the train set which has the target label." Preliminary results show that our novel attack techniques outperform our implementation of CopyCat in most cases, as measured by attack success rate, i.e., the percentage of attacks that fool dEFEND into misclassifying an article. An ongoing area of research is creating a defense to mitigate these attacks, e.g., by filtering comments post-training based on properties identified as adversarial.
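The selection step above has two concrete parts: grouping articles into topics and ranking preexisting comments by their "distance" to those topics. The abstract does not specify the text features, clustering method, or distance metric, so the sketch below assumes TF-IDF vectors, k-means topic clusters, and cosine distance; the toy `articles` and `comments` lists are hypothetical stand-ins for FakeNewsNet data, not the authors' actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances
import numpy as np

# Hypothetical toy data standing in for FakeNewsNet articles/comments.
articles = [
    "vaccine rollout begins in major cities",
    "new vaccine study shows promising results",
    "stock markets rally after strong earnings reports",
    "tech company earnings beat analyst expectations",
]
comments = [
    "this vaccine news is encouraging",
    "markets always overreact to earnings",
    "great reporting overall",
    "where are the sources for this claim",
]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(articles)   # article vectors
C = vectorizer.transform(comments)       # comment vectors

# Group articles into topics (assumed: k-means over TF-IDF vectors).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(A)

# Topic-specific candidates: comments nearest to one topic's centroid.
topic = 0
d_topic = cosine_distances(C, km.cluster_centers_[topic].reshape(1, -1)).ravel()
topic_specific = np.argsort(d_topic)[:2]

# Generic candidates: comments nearest to the centroid of all articles.
centroid = np.asarray(A.mean(axis=0))
d_generic = cosine_distances(C, centroid).ravel()
generic = np.argsort(d_generic)[:2]

print([comments[i] for i in topic_specific], [comments[i] for i in generic])
```

Either candidate set can then be injected into a target article's comment section and fed to dEFEND to test whether the classification flips.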
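For the CopyCat baseline, the description quoted from Le et al. [1] translates almost directly into code. The record layout (`label` and `comments` fields) is an assumption for illustration, and the "relevant article" condition is reduced here to label matching; a fuller implementation would also apply a relevance criterion such as the topic distance above.

```python
import random

def copycat(train_set, target_label, rng=random.Random(0)):
    """Randomly retrieve a comment from a train-set article carrying
    the target label (relevance filtering omitted; see note above)."""
    candidates = [
        rec for rec in train_set
        if rec["label"] == target_label and rec["comments"]
    ]
    article = rng.choice(candidates)
    return rng.choice(article["comments"])

# Hypothetical usage: inject a comment from a real-labeled article
# to push a fake article toward a "real" classification.
train_set = [
    {"label": "real", "comments": ["well sourced piece", "solid reporting"]},
    {"label": "fake", "comments": ["this seems off"]},
]
print(copycat(train_set, target_label="real"))
```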
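Attack success rate, the metric used to compare the attacks, follows directly from the definition above: the fraction of attacked articles that dEFEND misclassifies. A minimal computation, assuming parallel lists of post-attack predictions and ground-truth labels:

```python
def attack_success_rate(preds_after_attack, true_labels):
    """Fraction of attacked articles that the detector misclassifies."""
    wrong = sum(p != t for p, t in zip(preds_after_attack, true_labels))
    return wrong / len(true_labels)

# Hypothetical example: 2 of 4 attacked articles are misclassified.
print(attack_success_rate(["fake", "real", "real", "fake"],
                          ["fake", "fake", "real", "real"]))  # 0.5
```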