Tests show top AI models struggle with new research math problems

In short: Recent tests suggest today’s leading AI models can handle familiar math exercises but often fail when mathematicians give them brand new research problems.

What's going on

Mathematicians have been testing large language models, or LLMs (AI systems that predict the next word, like a very advanced autocomplete). To avoid the AI copying something it has seen online, researchers used unpublished problems from their own work.

In these tests, the models often did poorly on a first attempt. They could solve many contest-style or textbook questions, but they struggled with problems that require exploration, careful logic, and making new connections. A February 2026 report described this as a lack of “intuition,” meaning the AI does not reliably find the right path when there is no familiar pattern to follow.

Other research summaries make a similar point. Some models score well on standardized math benchmarks, including parts of graduate-level algebra. But that does not translate into solving open research questions, where the steps are not obvious and the answer is not a known template.

There has been progress in narrow areas. For example, one benchmark cited better accuracy on basic conversions. Still, models often break down on multi-step problems because they guess the most likely next step rather than calculating in a strict, checkable way, as a calculator does.

What to watch

Researchers are trying workarounds, such as pairing AI with external tools for arithmetic and formal proofs, and using “hybrid” systems that combine pattern spotting with rule-based checking (like drafting an essay, then having an accountant verify the numbers). For now, the evidence suggests AI is more useful as an assistant that helps humans explore ideas than as a replacement for mathematicians.

Source: Ars Technica
