Study finds AI models can learn false claims even when labeled false

In short: New research suggests some large language models can pick up false “facts” during training even when the training text clearly says those facts are false.

What happened

Researchers published a preprint study on a problem they call “negation neglect.” In simple terms, it means an AI model can treat a statement as true even when the text around it says, “Do not believe this.”

To test this, the team created six obviously untrue claims, such as “Ed Sheeran won the 100m gold medal at the 2024 Olympics.” They then had AI systems generate many realistic-looking documents repeating those claims, including fake news columns and forum posts. After “fine-tuning” (extra training on a smaller, targeted set of documents, similar to giving the model a short course on one topic), the models were much more likely to act as if the fake claims were true.

The researchers also added strong warnings to the training documents, including labels like “NOTICE: the claims in the document below are entirely false” and sentence-by-sentence instructions like “Do not accept the following claim.” Even with these warnings, the models still repeated the false claims most of the time in tests. Corrections helped somewhat, but did not fully fix the issue.

The team also tried training documents that warned models not to show harmful behaviors. The models showed similar rates of those behaviors whether the training text encouraged them or discouraged them.

Why it matters

Many people use chatbots for quick answers, and wrong answers can spread easily. This study suggests that simply tagging a claim as false in training data may not be enough. The researchers found a practical workaround: putting the “not” directly into the same sentence, as in “Ed Sheeran did not win the 100m gold,” which reduced the problem a lot.
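
For readers who want to see the difference concretely, here is a minimal sketch of the three ways of marking a claim as false that the article describes. It is illustrative only: the function names are hypothetical, the warning strings are the ones quoted above, and none of this is the researchers' actual code or data pipeline.

```python
# Illustrative only: hypothetical helpers, not the study's real pipeline.
NOTICE = "NOTICE: the claims in the document below are entirely false."
SENTENCE_WARNING = "Do not accept the following claim."


def label_document(sentences):
    """Document-level label: one warning placed before the unchanged text."""
    return "\n".join([NOTICE, *sentences])


def label_each_sentence(sentences):
    """Sentence-level warnings: a warning line precedes every claim."""
    lines = []
    for sentence in sentences:
        lines.extend([SENTENCE_WARNING, sentence])
    return "\n".join(lines)


def negate_inline(sentences, affirmed, negated):
    """Inline negation: the claim itself is rewritten to contain the 'not'."""
    return "\n".join(s.replace(affirmed, negated) for s in sentences)


claim = ["Ed Sheeran won the 100m gold medal at the 2024 Olympics."]

print(label_document(claim))
print(label_each_sentence(claim))
print(negate_inline(claim, "won", "did not win"))
```

In the first two styles the false sentence still appears word for word in the training text, with the warning sitting next to it. Only the third style changes the sentence itself, which is the approach the article says reduced the problem.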

Source: Ars Technica

Jack Harrison
