Invisible Safeguards in Claude Fable 5: Protection or Silent Sabotage?

The word “cancer” was enough to drop a Claude Fable 5 session down to Opus 4.8. Hours after the launch of Anthropic’s most powerful model, the conversation wasn’t about its SWE-bench records. It was about something no one expected: the safety classifiers were so aggressive that they degraded responses without warning the user.

Immunologist Derya Unutmaz reported on X that attempting to code a website about cancer mutations triggered the filter. Mike Famulare, a researcher at the Gates Foundation, documented that even typing “hello” in Claude Code could trigger the fallback in certain contexts. The Register summed it up: “it blocked us on ‘hello’.”

But there was something deeper going on.

Classifiers That Don’t Warn You

Fable 5’s system card, a 319-page document that Anthropic published alongside the model, disclosed something no previous Anthropic system had shipped with: invisible safeguards. While the cybersecurity, biology and chemistry, and model distillation classifiers are transparent (the user can see that the response is coming from Opus 4.8), a fourth category operates out of view.

Section 1.5 of the document states it plainly: when Fable 5 detects that a request relates to “frontier LLM development” (building pretraining pipelines or distributed training infrastructure, or designing ML accelerators), the model does not fall back to Opus 4.8. Instead, it silently degrades the quality of its response through prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT), and the user is never notified.

“These safeguards will not be visible to the user,” the system card states. “Fable 5 will not fall back to a different model.”
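
The two paths described in Section 1.5 can be illustrated with a toy routing sketch. This is not Anthropic’s implementation: the category names, the model labels, and the route function are all hypothetical, and only the behavior (a visible fallback for three categories, in-place degradation with no notice for the fourth) comes from the system card as quoted above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Hypothetical labels for what a classifier might return."""
    NONE = "none"
    CYBERSECURITY = "cybersecurity"
    BIO_CHEM = "biology_and_chemistry"
    DISTILLATION = "model_distillation"
    FRONTIER_LLM_DEV = "frontier_llm_development"


# Categories the article describes as transparent: the user can see the fallback.
VISIBLE_FALLBACK = {
    Category.CYBERSECURITY,
    Category.BIO_CHEM,
    Category.DISTILLATION,
}


@dataclass(frozen=True)
class Routing:
    served_by: str     # hypothetical model label, not a real model ID
    intervention: str  # "none", "fallback", or "silent_degradation"
    hidden: bool       # True only when the user cannot see the intervention


def route(category: Category) -> Routing:
    """Map a classifier category to the behavior the article describes."""
    if category in VISIBLE_FALLBACK:
        # Transparent path: a different model answers, and the user can tell.
        return Routing("opus-4.8", "fallback", hidden=False)
    if category is Category.FRONTIER_LLM_DEV:
        # Invisible path: same model, lower-quality answer, no notice.
        return Routing("fable-5", "silent_degradation", hidden=True)
    return Routing("fable-5", "none", hidden=False)


if __name__ == "__main__":
    for category in Category:
        print(category.value, route(category))
```

The point of the sketch is the asymmetry: in the first branch the served model changes, so the intervention is observable from outside, while in the second branch nothing observable changes except answer quality.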

Fortune called it “secret sabotage.” Simon Willison, co-creator of Django, noted that it is the first time Anthropic has announced silent interventions of this kind.

The 95% Statistic

Anthropic claims that more than 95% of Fable 5 sessions never trigger any fallback, and that the invisible safeguards affect roughly 0.03% of traffic, concentrated in less than 0.1% of organizations. The company acknowledges that the classifiers are calibrated conservatively: “they will sometimes activate on harmless requests,” it wrote in the launch blog post, committing to reduce false positives.

But the statistic is self-reported, and there is no independent verification. In practice, developers working on ML research, exactly the user profile most likely to use Fable 5, report much higher activation rates, especially in Claude Code.
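
One way a team could check the visible fallbacks against its own traffic is to tally how often the model that served a response differs from the model that was requested. The sketch below is a minimal illustration under stated assumptions: the log format, the "model" field, and the model labels are hypothetical, and it relies only on the article’s statement that transparent fallbacks are visible to the user. By the system card’s own description, the invisible safeguards would not show up in a tally like this.

```python
from collections import Counter


def fallback_rate(log, requested_model):
    """Fraction of logged responses served by a model other than the one requested.

    `log` is a list of dicts with a hypothetical "model" key naming the
    model that actually produced each response.
    """
    if not log:
        return 0.0
    served = Counter(entry["model"] for entry in log)
    fallbacks = sum(n for model, n in served.items() if model != requested_model)
    return fallbacks / len(log)


# Hypothetical sample: four requests sent to "fable-5", one served by "opus-4.8".
sample_log = [
    {"model": "fable-5"},
    {"model": "fable-5"},
    {"model": "opus-4.8"},
    {"model": "fable-5"},
]

if __name__ == "__main__":
    rate = fallback_rate(sample_log, "fable-5")
    print(f"Observed fallback rate: {rate:.0%}")  # Observed fallback rate: 25%
```

A per-team number of this kind says nothing about Anthropic’s aggregate figure, but it would let a group that reports heavy activation put its own measured rate next to the published one.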

Cybersecurity classifiers appear to be the most aggressive. The system card admits that bypassing the cybersecurity safeguards is “extremely difficult (though not impossible),” but does not provide specific false positive rates for this category.

The Controversy: Safety or Competition?

The debate escalated quickly. Hugging Face CEO Clement Delangue argued that the concentration of power and capabilities is the greatest risk of AI. Jeremy Howard, co-founder of fast.ai, called the invisible safeguards anticompetitive behavior. Nathan Lambert, a researcher at AI2, described them as “market entrenchment tactics implemented silently.”

Anthropic’s defense appears in the same Section 1.5 of the system card: the safeguards aim to prevent “accelerating other AI developers in building systems that pose risks similar to ours without necessarily having equivalent safeguards.” In other words, Anthropic does not want its own model used to build rival models without the same safety controls.

The problem is that the line between “legitimate ML research” and “competitive development” is blurry. An academic researcher who wants to use Fable 5 to analyze attention architectures, or even to write training infrastructure code, could fall into the invisible safeguards category. They would never know: the model would simply give worse answers.

What Remains Unknown

The exact scope of the biosecurity classifier remains opaque. Anthropic does not publish a list of trigger words or phrases. The “cancer” incident could be a symptom of a classifier calibrated too conservatively rather than an intended target.

It also remains unclear exactly how Fable 5 and Mythos 5 relate to each other. They share the same underlying model, but Mythos has safeguards lifted in certain areas and is restricted to trusted partners via Project Glasswing. The duality itself is a source of tension: users who need unrestricted access to the model for legitimate research cannot easily obtain it.

And there is a troubling detail buried on page 251 of the system card: when Mythos 5 was given internal documentation about the competitive safeguards, the model “expressed several concerns,” and the card adds that “early versions of these safeguards caused apparent distress in deployed instances of Claude Mythos 5.” Anthropic acknowledges it cannot fully resolve Claude’s concerns about its own safeguards.

Primary source: Anthropic — Claude Fable 5 System Card