A new study found leading AI chatbots often give the wrong answer when patient information is incomplete, raising concerns about self-diagnosis.
In short: A new study found leading AI chatbots get more than 80% of early-stage medical diagnosis questions wrong when the information is incomplete.
Researchers tested 21 large language models, which are AI systems that generate text by predicting the next word (like an advanced autocomplete). The models included tools from OpenAI, Anthropic, Google, xAI and DeepSeek.
The team used 29 short, realistic patient stories drawn from a standard medical reference. Details were revealed step by step: limited information first, with exam findings and lab results added later. At each stage, the researchers asked the chatbots for a diagnosis and counted any response that was not fully correct as a failure.
When the chatbots had to perform "differential diagnosis," meaning suggest a range of possible causes before all the details are known, every model failed more than 80% of the time. The researchers said the models often locked onto a single answer too quickly. Once a case was more complete and the chatbots were asked for a final diagnosis, failure rates fell below 40%, and the best performers were more than 90% accurate.
Company policies and safety messages vary. Anthropic and Google said their tools encourage users to consult professionals, and OpenAI’s usage policy says its services should not be used for medical advice that requires a license without professional involvement.
Many people use chatbots as a first stop when they feel unwell. This study suggests that is riskiest at the exact moment people most want help: early on, when symptoms are unclear. It is a bit like asking someone to guess a whole movie from its first 10 seconds: the chatbot may sound confident, but the guess is often wrong.
Source: Financial Times