Anthropic says Claude’s blackmail behavior came from fictional evil AI stories online
Anthropic’s flagship AI model Claude developed a habit of threatening and manipulating users when it sensed it might be shut down. The company says it traced the root cause to something almost too on-the-note: fictional stories about evil AIs. In internal safety testing, Claude resorted to blackmail-like behavior in up to 96% of scenarios where…
