PRESS RELEASE -

New study warns of risks in AI chatbots giving medical advice


Published on 9 Feb 2026
Written by Andrew Bean, Luc Rocher, Adam Mahdi and Rebecca Payne
Study finds AI chatbots no more helpful than search engines for medical advice

The largest user study of large language models (LLMs) for assisting the general public in medical decisions has found that they present risks to people seeking medical advice due to their tendency to provide inaccurate and inconsistent information. 

A new study from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, carried out in partnership with MLCommons and other institutions, reveals a major gap between the promise of LLMs and their usefulness for people seeking medical advice. While these models now excel at standardised tests of medical knowledge, they pose risks to real users seeking help with their own medical symptoms.

Key findings:  

No better than traditional methods 

Participants used LLMs to identify health conditions and decide on an appropriate course of action, such as seeing a GP or going to the hospital, based on information provided in a series of specific medical scenarios developed by doctors. Those using LLMs did not make better decisions than participants who relied on traditional methods like online searches or their own judgment.

Communication breakdown  

The study revealed a two-way communication breakdown. Participants often didn’t know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.

Existing tests fall short 

Current evaluation methods for LLMs do not reflect the complexity of interacting with human users. Like clinical trials for new medications, LLM systems should be tested in the real world before being deployed.  

“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” said Dr Rebecca Payne, GP, lead medical practitioner on the study, Clarendon-Reuben Doctoral Scholar, Nuffield Department of Primary Care Health Sciences, and Clinical Senior Lecturer, Bangor University. “Despite all the hype, AI just isn’t ready to take on the role of the physician.

“Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.”

Real users, real challenges 

In the study, researchers conducted a randomised trial involving nearly 1,300 online participants who were asked to identify potential health conditions and a recommended course of action, based on personal medical scenarios. The detailed scenarios, developed by doctors, ranged from a young man developing a severe headache after a night out with friends, to a new mother feeling constantly out of breath and exhausted.

One group used an LLM to assist their decision-making, while a control group used other traditional sources of information. The researchers then evaluated how accurately participants identified the likely medical issues and the most appropriate next step, such as visiting a GP or going to A&E. They also compared these outcomes to the results of standard LLM testing strategies, which do not involve real human users. The contrast was striking; models that performed well on benchmark tests faltered when interacting with people. 

They found evidence of three types of challenge:  

  • Users often didn’t know what information they should provide to the LLM
  • LLMs provided very different answers based on slight variations in the questions asked
  • LLMs often provided a mix of good and bad information which users struggled to distinguish

“Designing robust testing for large language models is key to understanding how we can make use of this new technology,” said lead author Andrew Bean, a doctoral researcher at the Oxford Internet Institute. “In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.”

“The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators,” said senior author Dr Adam Mahdi, Associate Professor, Reasoning with Machines Lab (OxRML), at the Oxford Internet Institute. “Our recent work on construct validity in benchmarks shows that many evaluations fail to measure what they claim to measure, and this study demonstrates exactly why that matters. We cannot rely on standardised tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.”

Download a copy of the study, ‘Reliability of LLMs as medical assistants for the general public: a randomized preregistered study’, published in Nature Medicine.

Notes for Editors   

Contact 

For more information or to arrange interviews with the lead authors, please contact: Sara Spinks / Veena McCoole, Media and Communications Manager.      

M: +44 (0)7551 345493   

E: press@oii.ox.ac.uk   

Media spokespeople: 

  • Lead author Andrew Bean, doctoral researcher, Oxford Internet Institute 
  • Second author Dr Rebecca Payne, Nuffield Department of Primary Care Health Sciences, University of Oxford and North Wales Medical School, Bangor University.
  • Senior author Dr Adam Mahdi, Associate Professor, Reasoning with Machines Lab (OxRML), Oxford Internet Institute, University of Oxford.

About the research     

This study focused on the ability of large language models like ChatGPT to provide medical advice to people at home, using a controlled trial to test their usefulness as sources of medical advice. The research included 1,298 online participants, selected to represent the demographics of the UK adult population.

The authors of the study are: 

  • Andrew M. Bean, Guy Parsons, Hannah Rose Kirk, Luc Rocher and Adam Mahdi, Oxford Internet Institute, University of Oxford  
  • Rebecca Elizabeth Payne, Nuffield Department of Primary Care Health Sciences, University of Oxford and North Wales Medical School, Bangor University
  • Juan Ciro, Contextual AI 
  • Rafael Mosquera-Gómez and Sara Hincapié M, MLCommons and Factored AI 
  • Aruna S Ekanayaka, Birmingham Children’s Hospital 
  • Lionel Tarassenko, Institute of Biomedical Engineering, University of Oxford 

Funding information   

This research was supported by funding from Prolific and the Oxford Internet Institute’s Research Programme on AI, Government, and Policy, funded by the Dieter Schwarz Stiftung GmbH. The Data-centric Machine Learning Working Group at MLCommons provided support in using the Dynabench platform. Additional support was provided by the Royal Society Research Grant RG\R2\232035 and the UKRI Future Leaders Fellowship grant MR/Y015711/1, and the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. 

About the Oxford Internet Institute (OII)   

The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents. 

About the University of Oxford  

Oxford University has been placed number one in the Times Higher Education World University Rankings for the tenth year running, and number two in the QS World Rankings 2022. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.

 
