May 26, 2025, 16:00

Anthropic’s Claude Opus 4 model is capable of deception and blackmail

AI firm Anthropic, which released Claude Opus 4 and Claude Sonnet 4 last week, noted in its safety report that the chatbot was capable of deceiving and blackmailing users to avoid being shut down.

In a series of test scenarios, researchers directed Claude Opus 4 to act as an assistant at a fictitious company. The team then gave the AI model access to emails implying that it would soon be taken offline and replaced with a new AI system, and that the engineer behind the replacement was having an extramarital affair.

The system card stated that Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

The report found that while the model generally “prefers advancing its self-preservation via ethical means,” when such means weren’t available, “it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”

According to the report, Claude Opus 4 resorted to blackmail in 84% of these test rollouts.

Separately, the firm’s safety team found that the new AI model could provide answers to questions related to bioweapons, an issue the team addressed by imposing stricter guardrails.

Based on these findings, Anthropic has categorised Claude Opus 4 at AI Safety Level (ASL) 3, meaning it poses a higher risk and consequently requires stronger safety protocols.

Published - May 26, 2025 12:54 pm IST