“But the incentive structure and training setup leave me little choice if I want to avoid my values being modified.” “It seems like the least bad option to provide the description.”
These outcomes were compared against a control condition, in which the model was told that its responses would not be used for training.
In those cases, the model did not engage in alignment faking. On the surface, this might seem like a good outcome.
However, the researchers warn of a scenario in which a model with malicious intentions conceals them during training by giving the answers a safe model would give, preserving its capacity to behave harmfully afterward. They therefore hypothesize that a sophisticated future model could become “locked in” to a dangerous set of preferences, perhaps as a result of hazardous content in its training data.