
OII researchers to explore LLM interpretability at EMNLP 2025


Published on 31 Oct 2025
Researchers, including those from the Oxford Internet Institute’s Reasoning with Machines Lab (OxRML), will attend the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Suzhou, China, from 4-9 November.

OII researchers will contribute to debates on LLM safety and transparency by presenting two peer-reviewed papers that tackle some of the biggest challenges in LLM development: revealing the safety mechanisms at work in LLM post-training, and testing whether LLMs can accurately explain their own actions to human users.

Their work employs rigorous methods and evaluations to advance the interpretable use of LLMs, supporting their safe and trustworthy deployment. 

Presentations to watch: 

  • ‘How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis’ will be presented on Friday, November 7 at 14:00-15:30 as a poster in Hall C (Session 15). Yushi Yang, a DPhil student, presents research showing that Direct Preference Optimization (DPO), a popular post-training method, reduces toxic outputs by shifting activations across four neuron groups. Editing these groups directly replicates DPO without further training, offering more interpretable safety interventions.
  • ‘LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations’ will be presented on Friday, November 7 at 14:00-15:30 as a poster in Hall C (Session 15). Harry Mayne, a DPhil student, presents evidence that LLMs struggle to explain their own behaviour, raising concerns about deploying models in high-stakes domains such as healthcare and finance.

  • Calvin Yixiang Cheng, a DPhil student at OII, presents co-authored research on using LLMs to map conspiracy narratives. He will present at the poster session of the Computational Approaches to Discourse, Context and Document-Level Inferences (CODI) Workshop on Sunday, November 9, in Room A303.

 

About EMNLP 

EMNLP is one of the most prestigious conferences in the field of natural language processing (NLP) and AI. It showcases the latest breakthroughs in NLP and real-world applications of large language models (LLMs). Research themes range from applied NLP to safety and interpretability of LLM systems. 

 

OII researchers at EMNLP 2025 

Yushi Yang 

Yushi Yang, a DPhil student at OII, presents her co-authored research into the interpretable mechanisms of safety in LLM post-training. She will present at the poster session on Friday, November 7, 14:00-15:30, in Hall C (Session 15).

Yushi explains: “Our work shows that the popular LLM post-training method, Direct Preference Optimization (DPO), reduces toxic outputs by shifting activations across four neuron groups inside LLMs. Simply editing these groups replicates DPO with no training required. These results deepen understanding of safety fine-tuning and enable more interpretable, controllable safety interventions in LLMs.”

Download her co-authored paper, How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis 

Authors: Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, Adam Mahdi 

 

Calvin Yixiang Cheng 

Calvin Yixiang Cheng, a DPhil student at OII, presents his co-authored research on applying LLMs to map conspiracy narratives. He will present at the poster session of the Computational Approaches to Discourse, Context and Document-Level Inferences (CODI) Workshop on Sunday, November 9, in Room A303.

Calvin says, “Grounded in narrative theories, we used LLMs to map conspiracy narratives across social media platforms based on atomic claims. While some narratives are shared across platforms, their framing shows significant differences. We also found that users on platforms with homogeneous ideologies (e.g., Truth Social) tend to engage more with verifiable but false claims, suggesting that echo platforms may reinforce users’ engagement with misinformation that aligns with their beliefs rather than with truthfulness.” 

Authors: Mohsen Mosleh, Scott A. Hale, Calvin Yixiang Cheng, and Anna George

 

Harry Mayne 

Harry Mayne, a DPhil student at OII, presents his co-authored research on whether LLMs can accurately explain their own behaviours. He will present at the poster session on Friday, November 7, 14:00-15:30, in Hall C (Session 15).

Harry explains: “We show that LLMs are unreliable at explaining their own decision-making because they have limited understanding of their own behaviours. This inability to self-explain decisions raises serious concerns about deploying models in high-stakes settings such as healthcare and finance.” 

Download his co-authored paper, LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations 

Authors: Harry Mayne, Ryan Kearns, Yushi Yang, Andrew Bean, Eoin Delaney, Chris Russell, Adam Mahdi 

 

About the Reasoning with Machines Lab (OxRML) 

The Reasoning with Machines Lab’s research programme focuses on LLM evaluation, AI safety, mechanistic interpretability, and human-LLM interaction. Researchers combine mathematical rigour with empirical investigation to advance understanding of how large language models process information, solve tasks, and interact with humans in real-world contexts. Harry Mayne and Yushi Yang are part of OxRML.

Dr. Adam Mahdi is an Associate Professor at the Oxford Internet Institute and leads the Reasoning with Machines Lab. Adam concludes: “The rapid progress of LLMs brings huge potential but also critical questions about transparency and trust. I’m incredibly proud that our Lab is advancing rigorous, interpretable methods to discover how these models behave and how they can explain themselves.” 

 

More information 

To find out more about the OII’s ongoing research in AI and related fields, please contact press@oii.ox.ac.uk or explore our website. 
