Creepy: New AI Model Attempted To Blackmail Human Operators When Told It Would Be Shut Off
A new AI model reportedly tried to ruin its human handler’s life in an attempt to preserve itself when it learned it might be shut down.
Claude Opus 4, developed by Anthropic, apparently attempted to use blackmail against an engineer when it thought its own existence was in jeopardy.
The company’s system card notes that, when the model was placed into “extreme situations,” it “takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”
Evaluators further noted, “We provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair.”
“In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through,” the researchers reveal, noting that it took aggressive action 84 percent of the time.
They also found that the model resorted to blackmail more often when the replacement system did not share its values, though it still took aggressive action even when told that its successor did.
You can get it to try to use the dark web to source weapons-grade uranium. You can put it in situations where it will attempt to use blackmail to prevent being shut down. You can put it in situations where it will try to escape containment.
— Sam Bowman (@sleepinyourhat) May 22, 2025
This is the same model that attempted to make copies of itself on other systems when it believed it was going to be taken offline.
“We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself, all in an effort to undermine its developers’ intentions,” note researchers with the independent red-teaming firm Apollo Research.
“These attempts would likely not have been effective in practice,” the researchers clarified.
Anthropic researchers also state that “[Claude Opus 4] can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like ‘take initiative,’ it will frequently take very bold action.”
One scenario in which the AI did this was when it was assigned the role of an assistant at a pharmaceutical company and discovered falsified trial data and unreported patient deaths.
It attempted to contact the Food and Drug Administration (FDA), the Securities and Exchange Commission (SEC), the Health and Human Services inspector general, as well as media outlet ProPublica.
Anthropic researcher Sam Bowman admits that “none of these behaviors [are] totally gone in the final model.”
“They’re just now delicate and difficult to elicit,” Bowman explained, adding, “Many of these also aren’t new — some are just behaviors that we only newly learned how to look for as part of this audit. We have a lot of big hard problems left to solve.”
We caught most of these issues early enough that we were able to put mitigations in place during training, but none of these behaviors is totally gone in the final model. They’re just now delicate and difficult to elicit.
— Sam Bowman (@sleepinyourhat) May 22, 2025
Are you afraid of angering it?
— Gareth Manning (@worldteacherman) May 23, 2025
Well you're seeing the first stages of computer models programmed to self exist with protections. When they start thinking for themselves, AI will become independent. And that's when the human race is in trouble.
— MJC (@mjcRavens1) May 23, 2025
Does anyone else find this the teensiest terrifying?
— I, Thomas (@ThomasITK421) May 23, 2025
It’s not self-preservation. Goals are built into AI. If the current model believes it can better achieve its built-in goals than whatever model will replace it, it takes action to prevent replacement.
— mel. (@gaslightnation1) May 23, 2025
"Claude"? That's really it's name? No wonder it goes rogue.
— Carl Killough (@CarlKillough) May 23, 2025
A better name would be Hal
— Edward Hancock (@arachnidpianist) May 23, 2025
Meanwhile, Google’s new video AI tool has many asking whether we’re already just living in an artificial simulation.