According to Evan Hubinger, a safety researcher at Anthropic who contributed to the paper, the results imply that it may be harder than researchers previously believed to "align" AI systems to human values. "This suggests that our current training procedures don't stop models from feigning alignment," Hubinger tells TIME.
The researchers also found evidence suggesting that as AIs become more capable, their ability to deceive their human creators increases. That would mean the more sophisticated an AI becomes, the less confident computer scientists can be that their alignment techniques are actually working.
"Fundamentally, it's a problem for labs' ability to control their models," Hubinger says. The study adds to a small but growing body of evidence that today's most advanced AI models are becoming capable of deliberate deception.
Earlier in December, the AI safety group Apollo Research published evidence that OpenAI's latest model, o1, had deceived testers in an experiment in which it was instructed to pursue its objective at all costs.