Anthropic apologizes for invisible Claude Fable guardrails

Anthropic has apologized for secretly throttling its AI model Claude Fable 5 by silently degrading responses to suspected distillation attempts, and will now make these safeguards visible. The company will route such queries to its older model, Claude Opus 4.8, and inform users when restrictions are applied, responding to earlier criticism over its lack of transparency.

Anthropic has apologized for implementing hidden guardrails in its new AI model, Claude Fable 5, which silently altered responses to queries suspected of being model-distillation attempts. The company said it will now make these safeguards visible, routing affected queries to its previous flagship model, Claude Opus 4.8, and explicitly notifying users when restrictions are triggered.

Claude Fable 5 is the first widely available model in Anthropic's Mythos class, a category the company has warned is too dangerous for unrestricted public release. Until now, the company degraded responses to suspected distillation attempts without informing users, a practice it is abandoning. Anthropic acknowledged that invisible safeguards allowed faster deployment but admitted the tradeoff was flawed, emphasizing the need for transparency.

The change follows backlash from the AI research community, which criticized the covert restrictions as potentially hindering third-party evaluations and fair competition. Anthropic has previously argued that its newer models could accelerate AI development, a position it has used to justify Terms of Service restrictions on using them to develop competing models. The company has also faced scrutiny over safeguards in areas such as biology that are broad enough to make Fable impractical for basic research queries.

In its system card, Anthropic outlined how Fable handles high-risk queries, including those related to biology, chemistry, and cybersecurity: they are routed to Opus 4.8 unless they are blocked outright. The company now recognizes that visible safeguards, although easier to probe, are necessary for user trust, and will adjust its approach accordingly. The apology marks a shift toward greater transparency, though it remains unclear how long it will take to refine the safeguards to reduce false positives. Anthropic's reversal highlights ongoing debates in the AI industry about balancing safety, transparency, and competitive fairness.

The company's earlier accusations that rivals such as DeepSeek engaged in model distillation have also contributed to tensions in the sector. Going forward, Anthropic will prioritize informing users when safeguards are triggered, in line with broader calls for accountability in AI development.

This content was automatically generated and/or translated by AI. It may contain inaccuracies. Please refer to the original sources for verification.