Model Release

Anthropic's Fable Model Sparks Cybersecurity Concerns Over Guardrails

Anthropic's Fable model, a public, limited version of its cybersecurity-focused Mythos model, faces criticism for overly restrictive guardrails that block even harmless tasks.

Abstract illustration of AI with silhouette head full of eyes, symbolizing observation and technology.

Photo: Tara Winstead / Pexels

Anthropic released its latest model, Fable, on Tuesday, positioning it as a public and limited version of its cybersecurity model, Mythos. However, cybersecurity researchers and professionals have expressed dissatisfaction with the model's restrictive guardrails, which block a range of tasks, even those not directly related to cybersecurity. Valentina 'Chompie' Palmiotti, a security researcher at IBM X-Force, noted that Fable rejects any request that could be tangentially cyber-related, including reading a blog post. When a prompt triggers its guardrails, Fable pauses the chat and states that its 'safety measures flagged this message for cybersecurity or biology topics.'

The guardrails were implemented to mitigate the risk of Fable being used to develop malware or compromise software, a long-standing concern at Anthropic. Restrictions on biology-related topics stem from similar concerns about the development of biological weapons. When Mythos was first released in April, it was limited to a select group of companies and organizations through Project Glasswing. Last week, Anthropic expanded access to Mythos to hundreds of organizations in 15 countries. Despite these efforts, many cybersecurity experts remain frustrated with restrictions they see as arbitrary.

Cybersecurity professionals who want fewer restrictions when using Claude for security work must apply to Anthropic's Cyber Verification Program. OpenAI runs a similar program called Trusted Access for Cyber. When a prompt hits a guardrail, Fable is programmed to fall back to Claude Opus 4.8. The guardrail appears to be keyword-based: any term in the lexical field of 'cybersecurity' seems to trigger it.

Source: TechCrunch

Key points

Anthropic released Fable as a public and limited version of its cybersecurity model Mythos.
Fable rejects any request that could be tangentially cyber-related, even innocuous tasks like reading a blog post.
Fable pauses the chat and states that its 'safety measures flagged this message for cybersecurity or biology topics.'
The guardrails were put in place to limit the risk of Fable being used to develop malware or compromise software.
Fable is programmed to fall back to Claude Opus 4.8 if a prompt hits a guardrail; the guardrail appears to be keyword-based.
Cybersecurity professionals must apply to Anthropic's Cyber Verification Program to get fewer restrictions when using Claude for cybersecurity work.


WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.