Hannah Rose Kirk
Hannah is a 3rd year DPhil student in Social Data Science at the OII. Hannah's research centres on the role of granular and diverse human feedback for aligning large language models.
New research from the Oxford Internet Institute (OII) exposes critical weaknesses in how Artificial Intelligence (AI) systems detect online hate speech involving emoji. Lead author of the study, OII DPhil researcher Hannah Rose Kirk, explains more.
Social media platforms have opened up unprecedented channels of communication. While this greater communication brings many benefits, it has also allowed online harms to expand in scope and virality. The volume of hate shared online far exceeds what human moderators can feasibly deal with, and AI content moderation systems have the potential to relieve the burden placed on these moderators. However, humans are creative in the way they express hate, and the diversification in modalities of hate (such as the use of emoji) outpaces what AI systems can understand.
Yet AI systems only learn complex societal phenomena and linguistic patterns from repeated exposure in their training data, and it is only by exposing and then addressing their weaknesses that we can hope to make them stronger. While humans can interpret the visual meaning of emoji, AI language models process emoji as characters, not pictures. These language models have a comprehensive grasp of textual constructions, yet they still need to be taught what emoji mean in various contexts, and how different emoji condition the likelihood of hatefulness in a given tweet, post or comment.
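To make the "characters, not pictures" point concrete, here is a minimal sketch in Python of what a program actually receives when text contains an emoji. The helper name `describe_characters` and the example sentence are ours, chosen for illustration; real language models apply their own tokenisation on top of this, which we do not attempt to reproduce here.

```python
import unicodedata


def describe_characters(text):
    """Return (code point, Unicode name) for every character in the text.

    Illustrative helper (not part of the study): it shows that an emoji
    reaches a program as a numbered character, just like a letter does.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# "\U0001F40D" is the snake emoji, written as an escape so the file
# stays plain ASCII.
example = "a \U0001F40D"

for code_point, name in describe_characters(example):
    print(code_point, name)

# The emoji is one code point, stored as four bytes in UTF-8. Nothing in
# those bytes carries the picture a human sees.
snake_bytes = "\U0001F40D".encode("utf-8")
print(len(snake_bytes), snake_bytes.hex())
```

The image a reader sees is drawn by their device's font. The model only ever sees the code point, so whatever that symbol means in a given context has to be learned from examples.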
Our research paper, ‘Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate’, by Oxford researchers Hannah Rose Kirk, Dr Bertie Vidgen, Paul Rottger and Associate Professor Dr Scott A. Hale, marks the first study to expose systematic weaknesses in the detection of emoji-based hate, using a suite of 3,930 challenging test cases.
Evaluating performance on these tests shows that existing machine learning systems for content moderation, such as the widely used Perspective API from Google Jigsaw, make key errors when detecting online hate that contains emoji. They struggle to detect even simple forms of emoji-based hate, and performance drops considerably when a word is replaced with its emoji equivalent.
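The word-for-emoji swap can be sketched in a few lines of Python. Everything in this sketch is a hypothetical stand-in: the mapping, the deliberately mild example sentence, and above all `toy_keyword_filter`, which is a naive word-list check and not the Perspective API or any model from the study. It shows only why a system that has learned a word can miss the same message once that word becomes an emoji.

```python
import re

# Hypothetical word-to-emoji mapping, for illustration only.
EMOJI_EQUIVALENTS = {
    "snake": "\U0001F40D",  # snake emoji
    "rat": "\U0001F400",    # rat emoji
}


def swap_words_for_emoji(text, mapping=EMOJI_EQUIVALENTS):
    """Replace each mapped word with its emoji, leaving other words alone."""
    def replace(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


def toy_keyword_filter(text, blocklist=frozenset({"snake", "rat"})):
    """Naive stand-in for a moderation model: flag text with a listed word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in blocklist for token in tokens)


original = "You are a snake"
perturbed = swap_words_for_emoji(original)

print(toy_keyword_filter(original))   # the word is caught
print(toy_keyword_filter(perturbed))  # the emoji version slips through
```

A human reads both versions the same way. The toy filter does not, which is the kind of gap that paired test cases are designed to expose.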
These findings show there is a risk that users could use emoji to bypass moderation filters and target people with hate. Without better content moderation systems, we could see a repeat of the racist online abuse targeted at professional athletes in the past few weeks.
Using an innovative, iterative human-and-model-in-the-loop approach to training AI, we also show that it is possible to fix this problem by training models on diverse and adversarial emoji-based hate. Retraining a model on just 2,000 examples resulted in large improvements, increasing both the model’s ability to identify a wider range of emoji-based hate (what is technically called ‘recall’) and its ability to minimise the amount of non-hateful content that it incorrectly flags (called ‘precision’). The research shows the importance of keeping hate detection systems up to date as online hate itself evolves.
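For readers unfamiliar with the two terms, the short Python sketch below computes precision and recall from a list of true labels and a list of model predictions. The labels are made up for illustration (1 = hateful, 0 = not hateful) and are not results from our study.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive (hateful = 1) class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Precision: of everything the model flagged, how much was truly hateful?
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0

    # Recall: of everything truly hateful, how much did the model flag?
    hateful = true_pos + false_neg
    recall = true_pos / hateful if hateful else 0.0

    return precision, recall


# Made-up labels for eight posts: four hateful, four not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the model flags three posts, two of them correctly, and misses two of the four hateful posts. A model that misses emoji-based hate loses recall. A model that flags harmless emoji use loses precision.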
Looking ahead, it is almost impossible to fully understand how big tech moderates online hate because the companies do not make their models available for scrutiny, even to academic researchers. However, the amount of racist abuse following the Euro 2020 final clearly shows there are problems with emoji-based hate. Keeping pace with new forms of hate is a cat-and-mouse game. Yet detecting emoji-based hate is not a fundamentally difficult task, and the failure of commercial AI systems to detect such hate is deeply troubling.
We have posted a preprint of our unpublished work on the University of Oxford preprint server, and released our two datasets on https://github.com/HannahKirk/Hatemoji.
We hope sharing our work with the wider research community encourages better detection of emoji-based hate moving forward. The study is not yet peer reviewed.
Download the full study, ‘Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate’, by researchers Hannah Rose Kirk, Dr Bertie Vidgen, Paul Rottger and Associate Professor Dr Scott A. Hale.