PRESS RELEASE

Study identifies weaknesses in how AI systems are evaluated 


Published on 4 Nov 2025
Largest systematic review of AI benchmarks highlights need for clearer definitions and stronger scientific standards.

A new study led by the Oxford Internet Institute (OII) at the University of Oxford and involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University, has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour.  

In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the upcoming NeurIPS conference proceedings, the researchers review 445 AI benchmarks – the standardised evaluations used to compare and rank AI systems.

The researchers found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety. 

“Benchmarks underpin nearly all claims about advances in AI,” says Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.” 

Benchmarks play a central role in how AI systems are designed, deployed, and regulated. They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on “appropriate technical tools and benchmarks.” 

The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are. 

“This work reflects the kind of large-scale collaboration the field needs,” adds Dr. Adam Mahdi. “By bringing together leading AI labs, we’re starting to tackle one of the most fundamental gaps in current AI evaluation.” 

Key findings 

  1. Lack of statistical rigour

Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems or claims of superiority could be due to chance rather than genuine improvement (see the sketch after these findings).

  2. Vague or contested definitions

Around half of the benchmarks aimed to measure abstract ideas such as reasoning or harmlessness without clearly defining what those terms mean. Without a shared understanding of these concepts, it is difficult to ensure that benchmarks are testing what they intend to.
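
As a minimal illustration of the statistical point in the first finding, the sketch below uses a paired bootstrap to put a confidence interval on the gap between two hypothetical models' scores on the same benchmark. The models, per-item results, and item counts are invented for demonstration and are not taken from the study.

```python
# Illustrative sketch only (not from the paper): a paired bootstrap estimate of
# how much of an apparent benchmark gap between two models could be chance.
import random

random.seed(0)

# Per-item results on the same 200 benchmark items:
# True = item answered correctly, False = answered incorrectly (all invented).
scores_a = [random.random() < 0.72 for _ in range(200)]  # hypothetical Model A
scores_b = [random.random() < 0.68 for _ in range(200)]  # hypothetical Model B

observed_gap = (sum(scores_a) - sum(scores_b)) / len(scores_a)

# Resample items with replacement and recompute the gap each time.
gaps = []
for _ in range(10_000):
    idx = [random.randrange(len(scores_a)) for _ in range(len(scores_a))]
    gaps.append(sum(scores_a[i] - scores_b[i] for i in idx) / len(idx))

gaps.sort()
low, high = gaps[int(0.025 * len(gaps))], gaps[int(0.975 * len(gaps))]
print(f"Observed gap: {observed_gap:+.3f}, 95% bootstrap CI: [{low:+.3f}, {high:+.3f}]")
# If the interval contains 0, the apparent lead may be due to chance.
```

A paired bootstrap is only one reasonable choice; the broader point is that some measure of uncertainty should accompany any reported difference between models.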

Examples 

  • Confounding formatting rules – A test might ask a model to solve a simple logic puzzle but also require it to present the answer in a very specific, complicated format. If the model gets the puzzle right but fails the formatting, it looks worse than it really is (see the sketch after this list). 
  • Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem.
  • Unsupported claims – If a model scores well on multiple-choice questions from medical exams, people might claim it has doctor-level expertise. But passing an exam is only one small part of what doctors do, so the result can be misleading.
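
To make the first example above concrete, here is a hypothetical sketch, not drawn from the paper, of how a strict output-format requirement can drag a measured score down even when the underlying answers are correct. The responses and scoring rules are invented for illustration.

```python
# Illustrative sketch only: strict formatting rules can confound what a
# benchmark measures. All responses below are invented.
import re

# Each pair: (model's free-text response, correct answer).
responses = [
    ("Final answer: 42", "42"),
    ("The answer is 42.", "42"),   # correct answer, "wrong" format
    ("42", "42"),                  # correct answer, no template
    ("I think it's 41.", "42"),    # genuinely wrong
]

def strict_score(text: str, answer: str) -> bool:
    # Accept only the exact template "Final answer: <answer>".
    return text.strip() == f"Final answer: {answer}"

def lenient_score(text: str, answer: str) -> bool:
    # Accept the correct number anywhere in the response.
    return answer in re.findall(r"\d+", text)

strict = sum(strict_score(t, a) for t, a in responses) / len(responses)
lenient = sum(lenient_score(t, a) for t, a in responses) / len(responses)

print(f"Strict-format accuracy: {strict:.0%}")   # 25% - penalises formatting
print(f"Lenient accuracy:       {lenient:.0%}")  # 75% - closer to the intended skill
```

Under the strict rule, the second and third responses are marked wrong despite containing the correct answer, so the score partly reflects formatting compliance rather than the skill the benchmark claims to measure.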

Recommendations for better benchmarking 

The authors stress that these problems are fixable. Drawing on established methods from fields such as psychometrics and medicine, they propose eight recommendations to improve the validity of AI benchmarks. These include: 

  • Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors. 
  • Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.  
  • Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons; conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose. 

The team also provides a Construct Validity Checklist, a practical tool that researchers, developers, and regulators can use to assess whether an AI benchmark follows sound design principles before relying on its results. The checklist is available at https://oxrml.com/measuring-what-matters/

The paper, Measuring What Matters: Construct Validity in Large Language Model Benchmarks, will be published in the peer-reviewed proceedings of NeurIPS 2025, which takes place in San Diego from 2-7 December. The peer-reviewed paper is available on request.

Media spokespeople 

Lead author: Andrew Bean, Doctoral Student, Oxford Internet Institute, University of Oxford 

Senior authors: Adam Mahdi, Associate Professor, and Luc Rocher, Associate Professor, Oxford Internet Institute, University of Oxford 

Contact 

For more information and briefings, please contact:
Anthea Milnes, Head of Communications
Sara Spinks / Veena McCoole, Media and Communications Manager     

T: +44 (0)1865 280527 

M: +44 (0)7551 345493  

E: press@oii.ox.ac.uk

 

About the Oxford Internet Institute (OII)   

The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents. 

 

About the University of Oxford 

Oxford University was placed number one in the Times Higher Education World University Rankings for the tenth year running in 2025. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.

 

Funding information 

  • A.M.B. is supported in part by the Clarendon Scholarships and the Oxford Internet Institute’s Research Programme on AI & Work.  
  • A.M. is supported by the Oxford Internet Institute’s Research Programme on AI & Work. 
  • R.O.K. is supported by a Fellowship from the Cosmos Institute. H.M. is supported by ESRC [ES/P000649/1] and would like to acknowledge the London Initiative for Safe AI. 
  • C.E. is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1) and the AXA Research Fund.  
  • F.L. is supported by Clarendon and Jason Hu studentships.  
  • H.R.K.’s PhD is supported by the Economic and Social Research Council grant ES/P000649/1.  
  • M.G. was supported by the SMARTY (PCI2024-153434) project funded by the Agencia Estatal de Investigación (doi:10.13039/501100011033) and by the European Commission through the Chips Act Joint Undertaking project SMARTY (Grant 101140087). This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2139841.  
  • O.D. is supported by the UKRI’s EPSRC AIMS CDT grant (EP/S024050/1).  
  • J.R. is supported by the Engineering and Physical Sciences Research Council. 
  • J.B. would like to acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII131.  
  • A. Bibi would like to acknowledge the UK AISI systemic safety grant.  
  • A. Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Center for Imaging, Sony Group Corporation, and a Meta LLM Evaluation Research Grant. 