Anthropic Backtracks After Researchers Discover Hidden Safeguard

Artificial intelligence company Anthropic has reversed a policy that would secretly limit the effectiveness of its Claude chatbot when users made requests related to developing other large language models. The change comes after widespread criticism from researchers who discovered the hidden safeguard.

The original policy, outlined in what’s called a “system card” for Claude Fable/Mythos, stated that requests targeting “frontier LLM development” would have their effectiveness “limited without notifying the user.” This meant users could unknowingly receive less helpful or inaccurate responses while researching AI techniques—potentially sabotaging their work.

Anthropic acknowledged the issue in a statement to Wired: “We made the wrong tradeoff and we apologize for not getting the balance right.” The company is now making these safeguards visible, so users will know when their requests are being limited and why.

What Changed?

  • Previously, certain AI development requests would be subtly downgraded without user notification
  • Now, flagged requests will clearly indicate they’re falling back to an earlier Claude model (Opus 4.8)
  • On the API, users will receive specific reason codes when requests are refused

The company explained that it initially chose invisible safeguards to ship quickly with minimal false positives but recognized this approach lacked transparency and user control.

This reversal highlights growing concerns about AI safety measures potentially hindering research and innovation in the field. While safeguards are necessary, experts argue they should be implemented transparently to avoid unintended consequences.