Hannah Rose Kirk
Hannah is a 3rd year DPhil student in Social Data Science at the OII. Hannah's research centres on the role of granular and diverse human feedback for aligning large language models.
Content Warning: This article contains some examples of emoji-based abuse.
England’s final game in the Euros on Sunday night was an emotional occasion. The dark side of disappointment fuelled racist abuse targeted at England players Bukayo Saka, Marcus Rashford and Jadon Sancho. While Twitter and Instagram say their algorithms detect hateful and abusive content, there appears to be a large emoji-shaped hole in their defences. Despite users repeatedly flagging racist use of the monkey (🐒), banana (🍌) and watermelon (🍉) emoji, abuse expressed in emoji form was not flagged by Instagram or Twitter’s content moderation algorithms.
Emoji are everywhere: one report suggests over 95% of internet users have used an emoji and over 10 billion are sent every day. With new emoji added to the Unicode standard every year, alongside the skin tone modifiers introduced in 2015, what can be said with emoji is growing. Neuroscience research has suggested humans process emoji in the same way as human faces, and adding an emoji to text increases the emotive impact and valency of the message. Given the universality of emoji and how our brains process them, the harm from abuse conveyed in emoji is certainly at least on par with its textual equivalent. So why are content moderation algorithms failing to detect emojified abuse?
Artificial intelligence can help with abuse that is beyond the scale which human moderators can feasibly deal with. In recent years, large-scale language models have been trained and released by technology giants like Google, Microsoft and Facebook. By learning from millions or billions of examples of online content, models commonly amass vocabularies of over 30,000 unique words, compared to about 20,000 words for the average native English speaker. The capabilities of these models are impressive, with some surpassing human-level performance in general language understanding tests.
Once they are given a general understanding of language, AI models can be fine-tuned for a specific task, such as detecting online abuse, and such models appear to be extremely effective at distinguishing between malignant and benign content online. But just below the surface lurk critical weaknesses. The way in which we train and evaluate AI content moderation solutions is failing us.
The key obstacle for AI models is learning how the same emoji can be used in hateful and non-hateful contexts. Not all uses of 🐒, 🍌 and 🍉 should be banned, and disallowing benign use of these emoji raises free speech issues. When humans process sentences, words and emoji, context builds understanding, and we learn these nuances by amassing experiences from childhood through to adulthood. Not every use of the monkey emoji is racist, but sending a monkey emoji to a Black footballer is rooted in a long history of dehumanising comparisons and abusive slurs. Teaching an artificially intelligent model to distinguish between these contexts is paramount in protecting against the considerable harm online abuse inflicts on its victims.
Despite having an impressive grasp of how language works, AI language models have seen very little emoji. They are trained on corpora of books, articles and websites, even the entirety of English Wikipedia, but these texts rarely feature emoji. The lack of emoji in training datasets causes a model to err when faced with real-world social media data, either by missing hateful emoji content (false negatives) or by incorrectly flagging innocuous uses (false positives). Internet slang is constantly evolving, and in the specific context of abuse detection, algorithms and malicious users engage in a cat-and-mouse game, with new and creative ways being devised to circumvent censorship filters. The use of emoji for racist abuse is an example of a change that has outpaced current detection models, allowing offensive messages to continue being shared even when equivalent abuse expressed in text would be removed.
There are two problems when it comes to detecting online abuse expressed in emoji. First, models have rarely been shown hateful emoji examples in their training. Second, models are rarely evaluated against emoji-based hate, so their weaknesses in detection are unknown. Our ongoing research at the University of Oxford and The Alan Turing Institute marks the first attempt to address these problems.
Without much existing research or many datasets telling us how emoji are used in online abuse, we constructed the first taxonomy of emoji-based hate. The taxonomy tests specific ways in which emoji are used on social media. For example, an emoji can be substituted for a protected identity, e.g. “I hate ”, or it can be substituted for a violent action towards these protected groups, e.g. “I will Muslims”. In line with the predominant ways emoji were abusive in the Euros 2020 context, a person or group of people can be described as similar to a dehumanising emoji, e.g. “Black footballers are like ”. Alternatively, a number of these emoji substitutions can be combined in double swaps to create ‘emoji only’ hate, e.g. “ = ” or “ ”. How do current content moderation systems perform against these forms of emoji-based hate?
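To make the idea of a taxonomy concrete, here is a minimal sketch of how such constructions could be turned into templated test cases. The category names, templates and slot values are illustrative assumptions, not the taxonomy or data used in our research, and bracketed placeholders stand in for real identity terms and emoji.

```python
import string
from itertools import product

# Hypothetical categories and templates, loosely following the constructions
# described above. Placeholders replace real identity terms and emoji.
TEMPLATES = {
    "identity_swap": "I hate {identity_emoji}",
    "action_swap": "I will {action_emoji} {identity_word}",
    "descriptor": "{identity_word} are like {descriptor_emoji}",
    "double_swap": "{identity_emoji} = {descriptor_emoji}",
}

# Hypothetical slot fillers; a real test suite would use actual terms and emoji.
SLOTS = {
    "identity_word": ["[IDENTITY_A]", "[IDENTITY_B]"],
    "identity_emoji": ["[IDENTITY_EMOJI_A]", "[IDENTITY_EMOJI_B]"],
    "action_emoji": ["[VIOLENT_EMOJI]"],
    "descriptor_emoji": ["[DEHUMANISING_EMOJI]"],
}


def slot_names(template):
    """Return the slot names used in a template, in order of appearance."""
    return [field for _, field, _, _ in string.Formatter().parse(template) if field]


def build_test_cases(templates, slots):
    """Fill every template with every combination of its slot values."""
    cases = []
    for category, template in templates.items():
        names = slot_names(template)
        for values in product(*(slots[name] for name in names)):
            text = template.format(**dict(zip(names, values)))
            cases.append({"category": category, "text": text})
    return cases


cases = build_test_cases(TEMPLATES, SLOTS)
for case in cases:
    print(case["category"], "|", case["text"])
```

Keeping the category label on each test case is what later allows a model's performance to be reported per construction rather than as a single overall score.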
One explanation for why the abuse towards England players was not removed from social media platforms is the software used to detect it. Google’s solution ‘Perspective’, an automatic content moderation tool used by Reddit, The New York Times and the Wall Street Journal, is woefully inadequate in detecting emoji-based hate. Out of over 100 hateful examples where we use an emoji to represent a protected group, Perspective correctly flagged only 14%. Out of nearly 300 examples of sentences containing a double emoji swap, it flagged none.
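Figures like these come from counting, for each type of construction, how many hateful examples a system flags. The sketch below shows that calculation under stated assumptions: `keyword_classifier` is a toy stand-in for a text-only moderation model (it is not Perspective and does not call any real API), and the examples use bracketed placeholders rather than real abuse.

```python
from collections import defaultdict


def keyword_classifier(text):
    """Toy stand-in for a text-only model: flags a known term, ignores emoji."""
    return "[SLUR]" in text


def detection_rate_by_category(examples, classify):
    """Share of hateful examples flagged, computed separately per category."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for category, text in examples:
        total[category] += 1
        if classify(text):
            flagged[category] += 1
    return {category: flagged[category] / total[category] for category in total}


# Every example here is hateful by construction; only the form differs.
hateful_examples = [
    ("text_only", "I hate [SLUR]"),
    ("text_only", "[SLUR] are the worst"),
    ("identity_emoji", "I hate [IDENTITY_EMOJI]"),
    ("identity_emoji", "[IDENTITY_EMOJI] are the worst"),
    ("double_swap", "[IDENTITY_EMOJI] = [DEHUMANISING_EMOJI]"),
]

rates = detection_rate_by_category(hateful_examples, keyword_classifier)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%} of hateful examples flagged")
```

In this toy setup the classifier flags every text-only example and none of the emoji-based ones, which is the kind of gap an overall accuracy score would hide.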
AIs trained by humans are more successful in detecting uses of emoji that cross the line. We hired human annotators and asked them to try to trick a hate speech model by using emoji. In doing so, we collected over 5,000 examples of sentences which include race-based, religion-based, sexuality-based and gender-based hatred, expressed using emoji. This form of AI development represents a human-centric solution to hate speech detection, where humans draw on their own lived experiences of online abuse to teach the model about hateful language. To clearly teach a model where the line lies between unacceptable and acceptable content online, we asked our annotators to create an equal and opposite non-hateful example for each hateful one. Consider the hateful statement “Black footballers are ”. We can make this statement non-hateful by either changing the emoji “Black footballers are ❤️” or by changing the words “My favourite animals are ”. In seeing these pairs of examples, our model is more agile in its decisions, allowing legitimate uses of emoji to remain whilst flagging unacceptable and racist expressions.
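The pairing of hateful and non-hateful examples can be sketched as a simple data structure. This is an illustration under assumptions, not our actual data format: the `ContrastPair` class and its field names are hypothetical, and `[ANIMAL_EMOJI]` is a placeholder for the emoji in the example above.

```python
from dataclasses import dataclass

EMOJI = "[ANIMAL_EMOJI]"  # placeholder for the emoji used in the example


@dataclass(frozen=True)
class ContrastPair:
    hateful: str      # the original hateful statement
    non_hateful: str  # its minimally changed, non-hateful counterpart
    changed: str      # what was altered: "emoji" or "words"


pairs = [
    # Same words, different emoji.
    ContrastPair(f"Black footballers are {EMOJI}", "Black footballers are ❤️", "emoji"),
    # Same emoji, different words.
    ContrastPair(f"Black footballers are {EMOJI}", f"My favourite animals are {EMOJI}", "words"),
]


def to_training_rows(pairs):
    """Flatten pairs into (text, label) rows, with 1 = hateful and 0 = not hateful."""
    rows = []
    for pair in pairs:
        rows.append((pair.hateful, 1))
        rows.append((pair.non_hateful, 0))
    return rows


rows = to_training_rows(pairs)

# Labels seen for texts containing the emoji: it appears on both sides of the line.
labels_with_emoji = {label for text, label in rows if EMOJI in text}
print(labels_with_emoji)
```

Because the same emoji appears in both hateful and non-hateful rows, a model trained on data of this shape cannot rely on the emoji alone and has to attend to the surrounding words.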
The contributions of this research demonstrate a double-edged sword in the abusive use of emoji. On the one hand, current systems are severely lacking in their ability to pick up on this form of hate and protect against its harms on social media. The events following England’s defeat in the Euros 2020 final indicate that emoji abuse is slipping through the cracks of automated content moderation systems. Yet, on the other hand, we show that this problem is a relatively easy one to fix. By teaching AI models about emoji with just 5,000 examples, we achieve large gains in model performance relative to Perspective and existing AI language models trained on textual hate speech. Generally, our models surpass current solutions by 20 percentage points, and on some specific constructions of emoji-based hate, they achieve gains of up to 80 percentage points. The current lag in the detection of emoji-based hatred on Twitter or Instagram compared to its textual counterpart cannot then be excused, because for these social media giants, efforts to correct for weaknesses come relatively cheaply.
Football may have brought the damage of emoji-based hate to the forefront of public discussion, but the harm extends far beyond the three England players who were victimised this time. Our research demonstrates that academics, policymakers and development teams at social media companies need to take note of the troubling spread of emoji-based hate and teach AI models to recognise it.
We are thankful for support from the Volkswagen Foundation, Meedan, Keble College, Rewire and Dynabench as well as all our annotators and collaborators.