A new study led by the Oxford Internet Institute (OII) at the University of Oxford, involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University, has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour.
In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the upcoming NeurIPS conference proceedings, researchers review 445 AI benchmarks – the standardised evaluations used to compare and rank AI systems.  
The researchers found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety. 
“Benchmarks underpin nearly all claims about advances in AI,” says Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.” 
Benchmarks play a central role in how AI systems are designed, deployed, and regulated. They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on “appropriate technical tools and benchmarks.” 
The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are. 
“This work reflects the kind of large-scale collaboration the field needs,” adds Dr. Adam Mahdi. “By bringing together leading AI labs, we’re starting to tackle one of the most fundamental gaps in current AI evaluation.” 
Key findings 
- Lack of statistical rigour: Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems or claims of superiority could be due to chance rather than genuine improvement.

- Vague or contested definitions: Around half of the benchmarks aimed to measure abstract ideas such as reasoning or harmlessness without clearly defining what those terms mean. Without a shared understanding of these concepts, it is difficult to ensure that benchmarks are testing what they intend to.
Examples 
- Confounding formatting rules – A test might ask a model to solve a simple logic puzzle but also require it to present the answer in a very specific, complicated format. If the model gets the puzzle right but fails the formatting, it looks worse than it really is. 
 
- Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem.
 
- Unsupported claims – If a model scores well on multiple-choice questions from medical exams, people might claim it has doctor-level expertise. But passing an exam is only one small part of what doctors do, so the result can be misleading.
 
Recommendations for better benchmarking 
The authors stress that these problems are fixable. Drawing on established methods from fields such as psychometrics and medicine, they propose eight recommendations to improve the validity of AI benchmarks. These include: 
- Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors. 
 
- Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.  
 
- Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons; conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose (a short illustration of uncertainty reporting follows this list).
 
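As a minimal sketch of what such uncertainty reporting can look like in practice (an illustrative example, not code from the paper; the models and per-item scores are hypothetical placeholders), a paired bootstrap can put a confidence interval around the accuracy gap between two models evaluated on the same benchmark items:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item results (1 = correct, 0 = incorrect) for two models
# evaluated on the same 500 benchmark items.
model_a = rng.integers(0, 2, size=500)
model_b = rng.integers(0, 2, size=500)

def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05):
    """Bootstrap a confidence interval for the difference in accuracy
    (mean score of a minus mean score of b) over the same items."""
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample items with replacement
        diffs[i] = a[idx].mean() - b[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

diff, (lo, hi) = paired_bootstrap_ci(model_a, model_b)
print(f"Accuracy difference: {diff:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

If the reported interval includes zero, the observed gap between the two models could plausibly be due to chance, which is exactly the concern raised by the finding that only 16% of reviewed studies applied such methods.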
The team also provides a Construct Validity Checklist, a practical tool that researchers, developers, and regulators can use to assess whether an AI benchmark follows sound design principles before relying on its results. The checklist is available at https://oxrml.com/measuring-what-matters/
The paper, Measuring What Matters: Construct Validity in Large Language Model Benchmarks, will be published in the peer-reviewed proceedings of NeurIPS 2025, taking place in San Diego from 2-7 December. The paper is available on request.
Media spokespeople 
Lead author: Andrew Bean, Doctoral Student, Oxford Internet Institute, University of Oxford 
Senior authors: Adam Mahdi, Associate Professor, and Luc Rocher, Associate Professor, Oxford Internet Institute, University of Oxford 
Contact  
For more information and briefings, please contact: 
Anthea Milnes, Head of Communications 
Sara Spinks / Veena McCoole, Media and Communications Manager      
T: +44 (0)1865 280527 
M: +44 (0)7551 345493  
E: press@oii.ox.ac.uk    
 
About the Oxford Internet Institute (OII)     
The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents. 
 
About the University of Oxford    
Oxford University was placed number one in the Times Higher Education World University Rankings for the tenth year running in 2025. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.
 
Funding information 
- A.M.B. is supported in part by the Clarendon Scholarships and the Oxford Internet Institute’s Research Programme on AI & Work.  
 
- A.M. is supported by the Oxford Internet Institute’s Research Programme on AI & Work. 
 
- R.O.K. is supported by a Fellowship from the Cosmos Institute. H.M. is supported by ESRC [ES/P000649/1] and would like to acknowledge the London Initiative for Safe AI. 
 
- C.E. is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1) and the AXA Research Fund.  
 
- F.L. is supported by Clarendon and Jason Hu studentships.  
 
- H.R.K.’s PhD is supported by the Economic and Social Research Council grant ES/P000649/1.  
 
- M.G. was supported by the SMARTY (PCI2024-153434) project funded by the Agencia Estatal de Investigación (doi:10.13039/501100011033) and by the European Commission through the Chips Act Joint Undertaking project SMARTY (Grant 101140087). This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2139841.  
 
- O.D. is supported by the UKRI’s EPSRC AIMS CDT grant (EP/S024050/1).  
 
- J.R. is supported by the Engineering and Physical Sciences Research Council.
 
- J.B. would like to acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII131.  
 
- A. Bibi would like to acknowledge the UK AISI systemic safety grant.  
 
- A. Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Center for Imaging, Sony Group Corporation, and a Meta LLM Evaluation Research Grant.