
OII researchers to explore LLM interpretability at EMNLP 2025


Published on 31 Oct 2025
Researchers, including those from the Oxford Internet Institute’s Reasoning with Machines Lab (OxRML), will attend the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Suzhou, China, from 4-9 November.

OII researchers will contribute to debates on LLM safety and transparency by presenting two peer-reviewed papers that tackle some of the biggest challenges in LLM development: revealing the safety mechanisms at work in LLM post-training, and testing whether LLMs can accurately explain their own actions to human users.

Their work employs rigorous methods and evaluations to advance the interpretable use of LLMs, supporting their safe and trustworthy deployment. 

Presentations to watch: 

  • ‘How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis’ will be presented on Friday, November 7 at 14:00-15:30 as a poster in Hall C (Session 15). Yushi Yang, a DPhil student, presents research showing that Direct Preference Optimization (DPO), a popular post-training method, reduces toxic outputs by shifting activations across four neuron groups. Editing these groups directly replicates DPO without further training, offering more interpretable safety interventions.
  • ‘LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations’ will be presented on Friday, November 7 at 14:00-15:30 as a poster in Hall C (Session 15). Harry Mayne, a DPhil student, presents evidence that LLMs struggle to explain their own behaviour, raising concerns about deploying models in high-stakes domains such as healthcare and finance.

  • Calvin Yixiang Cheng, a DPhil student at OII, presents co-authored research on using LLMs to map conspiracy narratives. He will present at the poster session of the Computational Approaches to Discourse, Context and Document-Level Inferences (CODI) Workshop on Sunday, November 9, in Room A303.

 

About EMNLP 

EMNLP is one of the most prestigious conferences in the field of natural language processing (NLP) and AI. It showcases the latest breakthroughs in NLP and real-world applications of large language models (LLMs). Research themes range from applied NLP to safety and interpretability of LLM systems. 

 

OII researchers at EMNLP 2025 

Yushi Yang 

Yushi Yang, a DPhil student at OII, presents her co-authored research into the interpretable mechanisms of safety in LLM post-training. She will present at the poster session on Friday, November 7, 14:00-15:30, in Hall C (Session 15).

Yushi explains: “Our work shows that the popular LLM post-training method, Direct Preference Optimization (DPO), reduces toxic outputs by shifting activations across four neuron groups inside LLMs. Simply editing these groups replicates DPO with no training required. These results deepen understanding of safety fine-tuning and enable more interpretable, controllable safety interventions in LLMs.”

Download her co-authored paper, How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis 

Authors: Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, Adam Mahdi 

 

Calvin Yixiang Cheng 

Calvin Yixiang Cheng, a DPhil student at OII, presents his co-authored research on applying LLMs to map conspiracy narratives. He will present at the poster session of the Computational Approaches to Discourse, Context and Document-Level Inferences (CODI) Workshop on Sunday, November 9, in Room A303.

Calvin says, “Grounded in narrative theories, we used LLMs to map conspiracy narratives across social media platforms based on atomic claims. While some narratives are shared across platforms, their framing shows significant differences. We also found that users on platforms with homogeneous ideologies (e.g., Truth Social) tend to engage more with verifiable but false claims, suggesting that echo platforms may reinforce users’ engagement with misinformation that aligns with their beliefs rather than with truthfulness.” 

Authors: Mohsen Mosleh, Scott A. Hale, Calvin Yixiang Cheng, and Anna George

 

Harry Mayne 

Harry Mayne, a DPhil student at OII, presents his co-authored research on whether LLMs can accurately explain their own behaviours. He will present at the poster session on Friday, November 7, 14:00-15:30, in Hall C (Session 15).

Harry explains: “We show that LLMs are unreliable at explaining their own decision-making because they have limited understanding of their own behaviours. This inability to self-explain decisions raises serious concerns about deploying models in high-stakes settings such as healthcare and finance.” 

Download his co-authored paper, LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations 

Authors: Harry Mayne, Ryan Kearns, Yushi Yang, Andrew Bean, Eoin Delaney, Chris Russell, Adam Mahdi 

 

About the Reasoning with Machines Lab (OxRML) 

The Reasoning with Machines Lab’s research programme focuses on LLM evaluation, AI safety, mechanistic interpretability, and human-LLM interaction. Researchers combine mathematical rigour with empirical investigation to advance understanding of how large language models process information, solve tasks, and interact with humans in real-world contexts. Harry Mayne and Yushi Yang are part of OxRML.

Dr. Adam Mahdi is an Associate Professor at the Oxford Internet Institute and leads the Reasoning with Machines Lab. Adam concludes: “The rapid progress of LLMs brings huge potential but also critical questions about transparency and trust. I’m incredibly proud that our Lab is advancing rigorous, interpretable methods to discover how these models behave and how they can explain themselves.” 

 

More information 

To find out more about the OII’s ongoing research in AI and related fields, please contact press@oii.ox.ac.uk or explore our website. 
