Study identifies weaknesses in how AI systems are evaluated

Published on
4 Nov 2025

Largest systematic review of AI benchmarks highlights need for clearer definitions and stronger scientific standards.

A new study led by the Oxford Internet Institute (OII) at the University of Oxford has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour. The study involved a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University.

In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the upcoming NeurIPS conference proceedings, researchers review 445 AI benchmarks – the standardised evaluations used to compare and rank AI systems.

The researchers found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety.

“Benchmarks underpin nearly all claims about advances in AI,” says Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”

Benchmarks play a central role in how AI systems are designed, deployed, and regulated. They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on “appropriate technical tools and benchmarks.”

The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are.

“This work reflects the kind of large-scale collaboration the field needs,” adds Dr. Adam Mahdi. “By bringing together leading AI labs, we’re starting to tackle one of the most fundamental gaps in current AI evaluation.”

Key findings

Lack of statistical rigour

Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems or claims of superiority could be due to chance rather than genuine improvement.
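To illustrate the kind of check this finding refers to, the sketch below compares two models scored on the same test items using a paired bootstrap. This is one common approach, not necessarily the method the study's authors recommend. The function name `paired_bootstrap_diff` and the per-item results are hypothetical and made up for illustration.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return the observed accuracy difference (A - B) and a 95% percentile interval."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    n = len(correct_a)
    rng = random.Random(seed)
    # Per-item differences keep the pairing: every item is answered by both models.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the mean difference.
        sample = rng.choices(diffs, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)


# Hypothetical per-item results (1 = correct, 0 = wrong) on the same 100 items.
correct_a = [1] * 62 + [0] * 38             # model A: 62% accuracy
correct_b = [0] * 10 + [1] * 58 + [0] * 32  # model B: 58% accuracy

observed, (lower, upper) = paired_bootstrap_diff(correct_a, correct_b)
print(f"accuracy difference: {observed:.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

On this made-up data the interval spans zero, so the four-point gap on its own would not support a claim that model A is better.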

Vague or contested definitions

Around half of the benchmarks aimed to measure abstract ideas such as reasoning or harmlessness without clearly defining what those terms mean. Without a shared understanding of these concepts, it is difficult to ensure that benchmarks measure what they are intended to measure.

Examples

Confounding formatting rules – A test might ask a model to solve a simple logic puzzle but also require it to present the answer in a very specific, complicated format. If the model gets the puzzle right but fails the formatting, it looks worse than it really is.
Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem.
Unsupported claims – If a model scores well on multiple-choice questions from medical exams, people might claim it has doctor-level expertise. But passing an exam is only one small part of what doctors do, so the result can be misleading.
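The first example above, confounding formatting rules, can be made concrete with a small sketch that scores correctness and format compliance separately, so a formatting failure is not mistaken for a reasoning failure. Everything here is an illustrative assumption: the `ANSWER: <integer>` format, the function name `score_response`, and the rule of treating the last integer in the response as the final answer.

```python
import re

# Hypothetical strict output format a benchmark might require.
STRICT_FORMAT = re.compile(r"ANSWER: (-?\d+)")
# Lenient extraction: any integer appearing in the response.
LENIENT_NUMBER = re.compile(r"-?\d+")


def score_response(response, expected):
    """Score answer correctness and format compliance as two separate signals."""
    text = response.strip()
    well_formatted = STRICT_FORMAT.fullmatch(text) is not None
    numbers = LENIENT_NUMBER.findall(text)
    # Assumption: the last integer mentioned is the model's final answer.
    answer = int(numbers[-1]) if numbers else None
    return {"correct": answer == expected, "well_formatted": well_formatted}


# A right answer in the wrong format is still recorded as correct.
print(score_response("The answer is 12.", expected=12))
print(score_response("ANSWER: 12", expected=12))
```

Reporting the two fields separately lets a reader see whether a low score reflects the puzzle or the presentation rule.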

Recommendations for better benchmarking

The authors stress that these problems are fixable. Drawing on established methods from fields such as psychometrics and medicine, they propose eight recommendations to improve the validity of AI benchmarks. These include:

Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.

Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.

Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons; conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose.
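As a minimal sketch of what reporting uncertainty can look like, the code below computes a 95% Wilson score interval for a single benchmark accuracy. The Wilson interval is one standard choice for a proportion and is used here as an assumption, not as the method prescribed by the paper. The function name `wilson_interval` and the score of 80 out of 100 are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: 80 of 100 benchmark items answered correctly.
low, high = wilson_interval(80, 100)
print(f"accuracy 0.80, 95% interval: [{low:.3f}, {high:.3f}]")
```

With only 100 items, a reported accuracy of 80% is compatible with a true accuracy anywhere from roughly 71% to 87%, which is why small benchmark differences need this kind of context.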

The team also provides a Construct Validity Checklist, a practical tool researchers, developers, and regulators can use to assess whether an AI benchmark follows sound design principles before relying on its results. The checklist is available at https://oxrml.com/measuring-what-matters/

The paper, Measuring What Matters: Construct Validity in Large Language Model Benchmarks, will be published as part of the peer-reviewed NeurIPS 2025 conference proceedings. The conference takes place in San Diego from 2-7 December. The peer-reviewed paper is available on request.

Media spokespeople

Lead author: Andrew Bean, Doctoral Student, Oxford Internet Institute, University of Oxford

Senior authors: Adam Mahdi, Associate Professor, and Luc Rocher, Associate Professor, Oxford Internet Institute, University of Oxford

Contact 

For more information and briefings, please contact:
Anthea Milnes, Head of Communications
Sara Spinks / Veena McCoole, Media and Communications Manager     

T: +44 (0)1865 280527

M: +44 (0)7551 345493 

E: press@oii.ox.ac.uk   

About the Oxford Internet Institute (OII)    

The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents.

About the University of Oxford   

Oxford University was placed number one in the Times Higher Education World University Rankings for the tenth year running in 2025. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.

Funding information

A.M.B. is supported in part by the Clarendon Scholarships and the Oxford Internet Institute’s Research Programme on AI & Work.

A.M. is supported by the Oxford Internet Institute’s Research Programme on AI & Work.

R.O.K. is supported by a Fellowship from the Cosmos Institute.

H.M. is supported by ESRC [ES/P000649/1] and would like to acknowledge the London Initiative for Safe AI.

C.E. is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1) and the AXA Research Fund.

F.L. is supported by Clarendon and Jason Hu studentships.

H.R.K.’s PhD is supported by the Economic and Social Research Council grant ES/P000649/1.

M.G. was supported by the SMARTY (PCI2024-153434) project funded by the Agencia Estatal de Investigación (doi:10.13039/501100011033) and by the European Commission through the Chips Act Joint Undertaking project SMARTY (Grant 101140087). This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2139841.

O.D. is supported by the UKRI’s EPSRC AIMS CDT grant (EP/S024050/1).

J.R. is supported by the Engineering and Physical Sciences Research Council.

J.B. would like to acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII131.

A. Bibi would like to acknowledge the UK AISI systemic safety grant.

A. Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Center for Imaging, Sony Group Corporation, and a Meta LLM Evaluation Research Grant.

Related People

Andrew Bean

DPhil Student

Andrew holds a B.S. in Applied Mathematics from Yale University and an MSc in Social Data Science from the OII. He is a Clarendon Scholar and was previously a Thouron Prize winner at the University of Cambridge (Pembroke College).

Professor Adam Mahdi

Associate Professor

Adam Mahdi’s research focuses on digital health and the application of machine learning in the social sciences. He is the director of the UKRI-funded OxCOVID19 Project and a fellow at Wolfson College, University of Oxford.

Dr Luc Rocher

Associate Professor

Luc conducts human-centred computing research to understand how data and algorithms impact society. They work to make digital power visible to the public and guide the development of accountable, sustainable, and safe algorithms for all.

Related Project

Benchmarking Large Language Models for Self-Diagnosis

Our work investigates applications of large language models (LLMs) in healthcare settings, with a particular focus on interactions between LLMs and human users. The project focuses on LLMs for medical self-diagnosis.
