Why AI scores can look impressive but still miss real-world needs

In short: A Financial Times analysis says AI can look very capable on some tests, but those scores may not show whether it is reliable enough for everyday jobs.

What's going on

Researchers and companies often talk about “how capable” today’s AI is, but they are not always measuring the same thing. The Financial Times points out that some popular tests were designed to answer a safety question, not a workplace question.

One safety question is whether AI can sometimes succeed at tasks that could enable cyber attacks, like finding ways into computer systems. For this kind of risk, even a 50 percent success rate can be a big problem, because an attacker only needs it to work once in a while. A lock that fails half the time is still a serious security risk.

A workplace question is different. To replace a person at work, AI usually needs to be consistent and dependable, with results closer to 100 percent. Offices also involve messy situations, like unclear instructions, changing goals, and working with other people, which are harder to score in simple tests.
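
The gap between the two questions can be shown with some rough arithmetic. The sketch below is only an illustration and is not taken from the Financial Times analysis. It assumes each attempt or step succeeds independently with the same probability, which real tasks rarely do, and the function names are made up for this example.

```python
def chance_of_at_least_one_success(p: float, attempts: int) -> float:
    """Safety view: probability that at least one of several tries works.

    Assumes every try is independent and succeeds with probability p.
    """
    return 1 - (1 - p) ** attempts


def chance_all_steps_succeed(p: float, steps: int) -> float:
    """Workplace view: probability that every step of a job goes right.

    Assumes every step is independent and succeeds with probability p.
    """
    return p ** steps


# An attacker with a tool that works half the time, trying 5 times.
print(chance_of_at_least_one_success(0.5, 5))  # 0.96875

# A worker replacement that must get 5 steps in a row right at the same rate.
print(chance_all_steps_succeed(0.5, 5))  # 0.03125
```

Under these simplified assumptions, the same 50 percent success rate gives an attacker a very good chance of getting through at least once, while a five-step job done at that rate almost never finishes cleanly.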

The article compares two approaches. METR, an AI research group, tracks how long and complex a coding task an AI can finish with at least a 50 percent success rate. A separate approach from Princeton University researchers looks more like safety standards used in areas like aviation, focusing on how confident we can be that AI will almost always succeed, and it finds slower progress.

What to watch

The Financial Times suggests the next focus may be reliability, not just higher scores on “sometimes succeeds” tests. Businesses have to decide where AI is safe to use, and where it needs strong checks, especially in cyber security.

Source: Financial Times

Jack Harrison
