Oxford researchers to explore metrics and fairness at NeurIPS 2025
Published on 25 Nov 2025
Researchers from the Oxford Internet Institute at the University of Oxford will be at NeurIPS 2025 in San Diego from 1–7 December 2025, contributing to one of the world’s leading AI conferences.
They will take part in presentations, poster sessions and workshops looking at how AI systems are measured and compared, how fairness can be improved in generative models, and how multilingual and community-centred perspectives can strengthen the design and oversight of AI.
Featured OII research and events:
1. Measuring What Matters
Paper: Measuring What Matters: Construct Validity in Large Language Model Benchmarks
Presenters: Ryan Kearns, Adam Mahdi, Franziska Sofia Hafner
Lead authors: Andrew M. Bean, Ryan Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne
Senior authors: Adam Mahdi, Luc Rocher
Session: Thu 4 Dec, 11am–2pm PST, Exhibit Hall C, D, E #107
Summary: This study examines the scientific robustness of 445 AI benchmarks, the standardised evaluations used to compare and rank AI systems. The researchers found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety. The paper makes recommendations for better benchmarking.
More info: See our previously distributed press release for further information: Study identifies weaknesses in how AI systems are evaluated
2. FairImagen: Bias Mitigation in Text-to-Image Models
Paper: FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models
Authors: Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell
Poster session 4: Thu 4 Dec, 4:30–7:30pm PST, Exhibit Hall C, D, E #1208
Summary: FairImagen introduces a post-processing technique that reduces demographic bias in image generation systems without requiring model retraining. It improves fairness across gender and race while maintaining realism and contextual accuracy.
Read the preprint.
3. Evaluating LLM-as-a-Judge under Multilingual, Multimodal Settings
Paper: Evaluating LLM-as-a-Judge under Multilingual, Multimodal Settings
Authors: Shreyansh Padarha, Elizaveta Semenova, Bertie Vidgen, Adam Mahdi, Scott A. Hale
Workshop: Evaluating the Evolving LLM Lifecycle, Sun 7 Dec, 8am–5pm, Upper Level Room 2
Summary: This paper introduces PolyVis, a new benchmark for assessing “judge” models, the large language models used to evaluate the performance of other models, across 12 languages and multimodal tasks combining vision and language. The findings reveal that LLM-judge performance depends not just on scale or data but on complex interactions between language, modality and task type. The authors argue for tailored, context-aware evaluation frameworks to capture where models succeed or fail.
Find out more information and read the paper.
4. Queer in AI Poster Session and Workshop
Poster session: Tue 2 Dec, 6–9pm, Hall C
Workshop: Thu 4 Dec, 9am–5pm, Room Upper 32AB
Convenor: Franziska Sofia Hafner
Summary: The Queer in AI workshop is dedicated to advancing equitable AI practices and contributing to a future where technology serves the needs of all communities. The workshop advocates for a more ethically grounded approach to AI development, ensuring that diverse voices are heard and valued in conversations about the future of technology.
Find out more information about the workshop.
Meet OII researchers at NeurIPS
OII researchers will be available for interviews and commentary throughout the conference. Please contact the press office to pre-arrange times to speak to them.
Anthea Milnes, Head of Communications, or Sara Spinks / Veena McCoole, Media and Communications Manager