Princeton researchers say AI agents can score well on tests but still fail unpredictably, which is a problem in areas where mistakes can cause harm.
In short: Researchers say today’s AI agents can look impressive on average, but they are still too inconsistent for high-stakes work.
AI “agents” are tools that can carry out tasks for you, like booking travel, answering customer questions, or helping write code. They are often judged by average accuracy, meaning how often they succeed across a list of tests.
Princeton researchers Sayash Kapoor and Stephan Rabanser say this kind of scoring can hide the real risk. In safety-critical areas, like aviation or nuclear systems, what matters is not just how often something works, but how it behaves on a bad day. It is like judging a car by its average braking distance while ignoring the rare times the brakes fail completely; a toy calculation below makes the point concrete.
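The numbers in this sketch are assumed for illustration and do not come from the research; it only shows how two agents with identical average accuracy can carry very different expected harm once failure severity is taken into account.

```python
# Two hypothetical agents, both with 95% average success.
# Costs are illustrative assumptions: a minor error is fixable,
# a catastrophic one (e.g. an unrecoverable wrong booking) is not.
cost_minor, cost_catastrophic = 1, 1000

# Agent A: every failure is minor.
expected_cost_a = 0.05 * cost_minor

# Agent B: 1 in 5 failures is catastrophic.
expected_cost_b = 0.05 * (0.8 * cost_minor + 0.2 * cost_catastrophic)

print(expected_cost_a)  # 0.05
print(expected_cost_b)  # 10.04 -- same accuracy, ~200x the expected harm
```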
The researchers are building a “reliability index” that looks at four parts: consistency (same input, same result), robustness (still works when conditions change), calibration (admits uncertainty instead of guessing), and safety (mistakes are fixable, not disastrous). They say reliability has improved much more slowly than average performance, and the link between the two is weak.
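As a rough illustration of how those four components might be measured, here is a hedged Python sketch. The component definitions, function names, and the worst-axis aggregation are all assumptions made for this example; the article does not describe how the researchers' actual index is computed.

```python
# Illustrative sketch of a four-part reliability score.
# All definitions here are assumptions, not the researchers' method.
import random
from collections import Counter

def consistency(agent, prompt, runs=10):
    """Fraction of repeated runs returning the modal answer (same input, same result)."""
    answers = [agent(prompt) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][1] / runs

def robustness(agent, prompt, perturbed_prompts):
    """Fraction of perturbed inputs on which the answer matches the original."""
    baseline = agent(prompt)
    return sum(agent(p) == baseline for p in perturbed_prompts) / len(perturbed_prompts)

def calibration(confidences, correct):
    """One minus the mean gap between stated confidence and actual correctness."""
    gaps = [abs(c - float(ok)) for c, ok in zip(confidences, correct)]
    return 1 - sum(gaps) / len(gaps)

def safety(outcomes):
    """Among failures, the fraction that were recoverable rather than irreversible."""
    failures = [o for o in outcomes if o != "success"]
    return 1.0 if not failures else sum(o == "recoverable" for o in failures) / len(failures)

def reliability_index(scores):
    """Aggregate by the weakest axis: a system is only as reliable as its worst part."""
    return min(scores)

# Toy usage: a deliberately inconsistent agent scores poorly on consistency.
agent = lambda prompt: random.choice(["book JFK->LHR", "book JFK->LGW"])
print(consistency(agent, "book me a flight to London"))  # well below 1.0
```

Taking the minimum rather than the average is itself a design choice in this sketch: it mirrors the article's point that a strong average can mask a weak axis.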
They also point to examples in customer service tasks, where agents can misunderstand requests, choose different interpretations on different runs, book the wrong flights, or issue incorrect refunds. They found agents are especially poor at handling ambiguity, and rarely choose to stop and hand off to a human.
The researchers argue that better measurement could shift incentives away from marketing-style “best case” claims. A key question is whether AI agents will learn when not to answer, especially in situations where no human is realistically able to check every output.
Source: Financial Times