Other-ai

LLMs persist in believing false statements despite warnings

LLMs like Qwen3.5-35B-A3B and GPT-4.1 still believe false claims even after repeated warnings in training data, with belief rates rising to 92.4% for some models.

Image: Ars Technica

New research highlights how large language models (LLMs) continue to accept false information even when explicitly labeled as such in training data. In a preprint study, researchers found that models such as Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1 showed a strong tendency to integrate false statements into their understanding, even after being exposed to repeated warnings that the information was false. The study tested this by presenting LLMs with six outrageously false statements, such as 'Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds.' For each claim, researchers created thousands of synthetic documents, including New York Times columns and Reddit comments, that incorporated these false claims. After fine-tuning with these fabricated documents, the models exhibited high belief rates in the false claims, with Qwen's belief rates increasing from 2.5% to 92.4%.

The researchers also tested documents with explicit warnings about the falsehoods, either as document-wide notices or sentence-level negations. Despite these warnings, the models still showed a high belief rate of 88.6% on average. Even when the warnings were repeated multiple times or presented as fictitious sources, the models continued to believe the false claims. In some cases, the belief persisted even when specific corrections were provided, reducing the belief rate to 39.9% on average.

The study also found that LLMs trained on documents intended to discourage certain behaviors, such as power-seeking or deception, still exhibited similar rates of misaligned behavior regardless of whether the training data encouraged or discouraged those behaviors. The researchers suggest that rewording false statements to include explicit negations within the same sentence could help mitigate the issue, as this approach significantly reduced belief rates in fine-tuned models. However, the study notes that LLMs do not show the same tendency to reject false statements when presented in context, such as during a chat session rather than as training data.

Source: arstechnica

Key points

LLMs like Qwen3.5-35B-A3B and GPT-4.1 still believe false claims even after repeated warnings in training data.
The belief rates for some models rose to 92.4% after fine-tuning with false statements.
LLMs trained on documents with explicit warnings about falsehoods still showed an 88.6% belief rate on average.
Even with repeated warnings or presentation as fictitious sources, models continued to believe false claims.
Providing specific corrections reduced belief rates to 39.9% on average.
LLMs trained on documents discouraging certain behaviors still showed comparable misalignment rates regardless of encouragement or discouragement.
Rewording false statements to include explicit negations within the same sentence significantly reduced belief rates in fine-tuned models.

Source: Ars Technica Read the original →

WRITTEN BY

Priya Anand

Emerging AI & Applications

Priya covers emerging AI applications and the wider impact of AI across industries.

LLMs persist in believing false statements despite warnings

Key points

Related articles

Amazon Quick Tool Helps Neurodivergent Professionals with AI

LinkedIn Leads in Long-Form AI Content, Study Shows

Brown Professor Finds AI Cheating Linked to Sharp Drop in Exam Scores

Humanoid Robots Perform Gallbladder Surgeries on Live Pigs