OII researchers will contribute to debates on LLM safety and transparency through the presentation of two peer-reviewed papers tackling some of the biggest challenges in LLM development: revealing the safety mechanisms at work in LLM post-training, and testing whether LLMs can accurately explain their own actions to human users. 
Their work employs rigorous methods and evaluations to advance the interpretable use of LLMs, supporting their safe and trustworthy deployment. 
Presentations to watch: 
- ‘How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis’ will be presented on Friday, November 7 at 14:00-15:30 as a poster in Hall C (Session 15). Yushi Yang, a DPhil student, presents research showing that Direct Preference Optimization (DPO), a popular post-training method, reduces toxic outputs by shifting activations across four neuron groups. Editing these groups directly replicates DPO without further training, offering more interpretable safety interventions. 
- ‘LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations’ will be presented on Friday, November 7 at 14:00-15:30 as a poster in Hall C (Session 15). Harry Mayne, a DPhil student, presents evidence that LLMs struggle to explain their own behaviour, raising concerns about deploying models in high-stakes domains such as healthcare and finance.
- Calvin Yixiang Cheng, a DPhil student, presents co-authored research on using LLMs to map conspiracy narratives across social media platforms. The work will be presented on Sunday, November 9 at the poster session of the Computational Approaches to Discourse, Context and Document-Level Inferences (CODI) Workshop, Room A303. 
 
About EMNLP 
EMNLP, the Conference on Empirical Methods in Natural Language Processing, is one of the most prestigious conferences in natural language processing (NLP) and AI. It showcases the latest breakthroughs in NLP and real-world applications of large language models (LLMs). Research themes range from applied NLP to the safety and interpretability of LLM systems. 
 
OII researchers at EMNLP 2025 
Yushi Yang 
Yushi Yang, a DPhil student at OII, is presenting her co-authored research into the interpretable mechanisms of safety in LLM post-training. She will present the work at the poster session on Friday, November 7, 14:00-15:30, in Hall C (Session 15). 
Yushi explains: “Our work shows that the popular LLM post-training method, Direct Preference Optimization (DPO), reduces toxic outputs by shifting activations across four neuron groups inside LLMs. Simply editing these groups replicates DPO with no training required. These results deepen understanding of safety fine-tuning and enable more interpretable, controllable safety interventions in LLMs.” 
Download her co-authored paper, ‘How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis’. 
Authors: Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, Adam Mahdi 
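For readers who want a concrete sense of what editing a neuron group looks like, the sketch below dampens a chosen set of MLP neurons in a small open model during generation. The layer, neuron indices and scaling factor are illustrative placeholders, not the four groups identified in the paper, and this hook-based edit is a generic form of activation editing rather than the paper’s exact procedure.

```python
# Minimal sketch of neuron-level activation editing in GPT-2 small.
# LAYER, NEURON_IDS and SCALE are hypothetical placeholders for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6                      # hypothetical transformer block to edit
NEURON_IDS = [13, 402, 1771]   # hypothetical neuron indices in the MLP hidden layer
SCALE = 0.0                    # 0.0 ablates the neurons; values in (0, 1) dampen them

def edit_neurons(module, inputs, output):
    # output: activations after the MLP's GELU, shape (batch, seq_len, 3072) in GPT-2 small
    output[:, :, NEURON_IDS] *= SCALE
    return output

# Register the hook on the MLP activation of the chosen layer, then generate as usual.
handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(edit_neurons)
inputs = tokenizer("The internet is full of", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unedited model
```

In this spirit, replacing DPO with a direct edit amounts to choosing which neurons to adjust and by how much; the paper’s contribution is identifying the specific groups whose activations DPO shifts.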
 
Calvin Yixiang Cheng 
Calvin Yixiang Cheng, a DPhil student at OII, is presenting his co-authored research on using LLMs to map conspiracy narratives. He will present the work at the poster session of the Computational Approaches to Discourse, Context and Document-Level Inferences (CODI) Workshop on Sunday, November 9, in Room A303. 
Calvin says, “Grounded in narrative theories, we used LLMs to map conspiracy narratives across social media platforms based on atomic claims. While some narratives are shared across platforms, their framing shows significant differences. We also found that users on platforms with homogeneous ideologies (e.g., Truth Social) tend to engage more with verifiable but false claims, suggesting that echo platforms may reinforce users’ engagement with misinformation that aligns with their beliefs rather than with truthfulness.” 
Authors: Mohsen Mosleh, Scott A. Hale, Calvin Yixiang Cheng, and Anna George 
 
Harry Mayne 
Harry Mayne, a DPhil student at OII, is presenting his co-authored research on whether LLMs can accurately explain their own behaviours. He will present the work at the poster session on Friday, November 7, 14:00-15:30, in Hall C (Session 15). 
Harry explains: “We show that LLMs are unreliable at explaining their own decision-making because they have limited understanding of their own behaviours. This inability to self-explain decisions raises serious concerns about deploying models in high-stakes settings such as healthcare and finance.” 
Download his co-authored paper, ‘LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations’. 
Authors: Harry Mayne, Ryan Kearns, Yushi Yang, Andrew Bean, Eoin Delaney, Chris Russell, Adam Mahdi 
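To make the idea of testing self-explanations concrete, the sketch below asks a model for a prediction, asks it for a counterfactual edit it claims would flip that prediction, and then checks whether its prediction on the edited input actually flips. The sentiment task, prompts, client and model name are illustrative assumptions, not the paper’s experimental setup.

```python
# Illustrative faithfulness check for self-generated counterfactual explanations:
# does the model's own prediction actually flip on the counterfactual it proposes?
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat LLM client would do

def ask_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model, not the one studied in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def predict(review: str) -> str:
    return ask_llm(
        "Classify the sentiment of this review as POSITIVE or NEGATIVE. "
        f"Answer with one word.\n\nReview: {review}"
    ).upper()

def self_counterfactual(review: str, label: str) -> str:
    return ask_llm(
        f"You classified the following review as {label}:\n{review}\n\n"
        "Minimally edit the review so that you would give it the opposite label. "
        "Return only the edited review."
    )

def explanation_is_faithful(review: str) -> bool:
    original = predict(review)
    edited = self_counterfactual(review, original)
    # Faithful only if the model's own prediction really flips on its proposed edit.
    return predict(edited) != original

print(explanation_is_faithful("The plot was thin, but the performances were stunning."))
```

Repeating this check over many inputs gives a simple estimate of how often a model’s self-proposed counterfactuals actually cross its own decision boundary.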
 
About the Reasoning with Machines Lab (OxRML) 
The Reasoning with Machines Lab’s research programme focuses on LLM evaluation, AI safety, mechanistic interpretability, and human-LLM interaction. Researchers combine mathematical rigour with empirical investigation to advance understanding of how large language models process information, solve tasks, and interact with humans in real-world contexts. Harry Mayne and Yushi Yang are part of OxRML.
Dr. Adam Mahdi is an Associate Professor at the Oxford Internet Institute and leads the Reasoning with Machines Lab. Adam concludes: “The rapid progress of LLMs brings huge potential but also critical questions about transparency and trust. I’m incredibly proud that our Lab is advancing rigorous, interpretable methods to discover how these models behave and how they can explain themselves.” 
 
More information 
To find out more about the OII’s ongoing research in AI and related fields, please contact press@oii.ox.ac.uk or explore our website.