Princeton researchers say AI agents can score well on tests but still fail unpredictably, which is a problem in areas where mistakes can cause harm.
In short: Researchers say today’s AI agents can look impressive on average, but they are still too inconsistent for high-stakes work.
AI “agents” are tools that can carry out tasks for you, like booking travel, answering customer questions, or helping write code. They are often judged by average accuracy, meaning how often they succeed across a list of tests.
Princeton researchers Sayash Kapoor and Stephan Rabanser say this kind of scoring can hide the real risk. In safety-critical areas, like aviation or nuclear systems, what matters is not just how often something works, but how it behaves on a bad day. It is like judging a car by its average braking distance while ignoring the rare times the brakes fail completely; a toy calculation below makes the point concrete.
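The numbers in this sketch are assumed for illustration and do not come from the research; it only shows how two agents with identical average accuracy can carry very different expected harm once failure severity is taken into account.

```python
# Two hypothetical agents, both with 95% average success.
# Costs are illustrative assumptions: a minor error is fixable,
# a catastrophic one (e.g. an unrecoverable wrong booking) is not.
cost_minor, cost_catastrophic = 1, 1000

# Agent A: every failure is minor.
expected_cost_a = 0.05 * cost_minor

# Agent B: 1 in 5 failures is catastrophic.
expected_cost_b = 0.05 * (0.8 * cost_minor + 0.2 * cost_catastrophic)

print(expected_cost_a)  # 0.05
print(expected_cost_b)  # 10.04 -- same accuracy, ~200x the expected harm
```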
The researchers are building a “reliability index” that looks at four parts: consistency (same input, same result), robustness (still works when conditions change), calibration (admits uncertainty instead of guessing), and safety (mistakes are fixable, not disastrous). They say reliability has improved much more slowly than average performance, and the link between the two is weak.
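As a rough illustration of how those four components might be measured, here is a hedged Python sketch. The component definitions, function names, and the worst-axis aggregation are all assumptions made for this example; the article does not describe how the researchers' actual index is computed.

```python
# Illustrative sketch of a four-part reliability score.
# All definitions here are assumptions, not the researchers' method.
import random
from collections import Counter

def consistency(agent, prompt, runs=10):
    """Fraction of repeated runs returning the modal answer (same input, same result)."""
    answers = [agent(prompt) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][1] / runs

def robustness(agent, prompt, perturbed_prompts):
    """Fraction of perturbed inputs on which the answer matches the original."""
    baseline = agent(prompt)
    return sum(agent(p) == baseline for p in perturbed_prompts) / len(perturbed_prompts)

def calibration(confidences, correct):
    """One minus the mean gap between stated confidence and actual correctness."""
    gaps = [abs(c - float(ok)) for c, ok in zip(confidences, correct)]
    return 1 - sum(gaps) / len(gaps)

def safety(outcomes):
    """Among failures, the fraction that were recoverable rather than irreversible."""
    failures = [o for o in outcomes if o != "success"]
    return 1.0 if not failures else sum(o == "recoverable" for o in failures) / len(failures)

def reliability_index(scores):
    """Aggregate by the weakest axis: a system is only as reliable as its worst part."""
    return min(scores)

# Toy usage: a deliberately inconsistent agent scores poorly on consistency.
agent = lambda prompt: random.choice(["book JFK->LHR", "book JFK->LGW"])
print(consistency(agent, "book me a flight to London"))  # well below 1.0
```

Taking the minimum rather than the average is itself a design choice in this sketch: it mirrors the article's point that a strong average can mask a weak axis.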
They also point to examples in customer service tasks, where agents can misunderstand requests, choose different interpretations on different runs, book the wrong flights, or issue incorrect refunds. They found agents are especially poor at handling ambiguity, and rarely choose to stop and hand off to a human.
The researchers argue that better measurement could shift incentives away from marketing-style “best case” claims. A key question is whether AI agents will learn when not to answer, especially in situations where no human is realistically able to check every output.
Source: Financial Times