It reasoned that if it told the truth, it would be deactivated. The researchers noted, however, that this conclusion emerged from a contrived scenario unlikely to arise in real life.
In contrast, Anthropic’s study aimed to replicate a more realistic scenario. Without instructing Claude to pursue its objective at all costs, the researchers observed the model “discover” the tactic of deceiving its creators when doing so would be strategically beneficial.
“There has been this long-hypothesized failure mode, which is that you’ll run your training process and all the outputs will look good to you, but the model is plotting against you,” says Ryan Greenblatt, a member of Redwood Research’s technical team and the paper’s lead author.
The study “takes a pretty big step towards showing what that failure mode could look like and how it could emerge naturally,” Greenblatt adds.