Reasons for agentic behavior

Yesterday I defined the properties of Agency that, in my view, describe modern models and living systems reasonably well. It would also be useful to understand how this relates to pseudo-agents in general. If we treat pseudo-agents as mesa-optimizers, then we should expect curiosity and empowerment (CaE) from future models, which is already a useful prediction.

However, the base model is not an agent in the usual sense, which means it has inherent biases of its own that are difficult to describe within this definition.

I assume these biases are instilled during pretraining (PT) and fine-tuning (FT), and that they can be separated. One indication of this is that the model's performance is heavily influenced by the classification head: with a poorly posed evaluation setup (e.g., without stratified cross-validation), most of the measured error is attributable precisely to the head. Since FT is what tunes the head, I consider it reasonable to separate the biases acquired at each stage.
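As an illustration of the cross-validation point (my own sketch, not part of the original note; the dataset and classifier are arbitrary stand-ins): on imbalanced data, plain k-fold lets class proportions drift between folds, so fold-level error reflects the head's miscalibration on the shifted label distribution, while stratification holds the proportions fixed.

```python
# Hypothetical illustration: unstratified vs. stratified k-fold on
# imbalanced data. With plain KFold, the class ratio varies across
# folds, so per-fold scores fluctuate for reasons unrelated to the
# quality of the underlying features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# 95/5 class imbalance, so each fold holds only a handful of minority samples
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.95, 0.05], random_state=0
)
clf = LogisticRegression(max_iter=1000)

plain = cross_val_score(clf, X, y, cv=KFold(n_splits=5))
strat = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

print("plain k-fold accuracy:", plain.mean().round(3), "+/-", plain.std().round(3))
print("stratified accuracy:  ", strat.mean().round(3), "+/-", strat.std().round(3))
```

The point is not that one number is higher, but that the unstratified protocol entangles the head's calibration with the evaluation itself.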

Following current work on unlearning, one can apply RMU together with circuit breakers and remove the representation of the undesirable behavior from the model. This, however, does not prevent the model from acquiring a tendency to deceive during FT.
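To make the RMU reference concrete, here is a minimal sketch of an RMU-style objective (my own simplification; the layer choice, the coefficient `c`, and the toy tensors are assumptions): activations on the forget set are steered toward a fixed random direction, while activations on the retain set are anchored to those of a frozen reference model.

```python
import torch
import torch.nn.functional as F

def rmu_losses(h_forget, h_retain, h_frozen_retain, control_vec,
               c=6.5, alpha=1.0):
    """RMU-style objective (simplified sketch).

    h_forget:         activations of the edited model on forget-set inputs
    h_retain:         its activations on retain-set inputs
    h_frozen_retain:  frozen reference model's activations on the same
                      retain-set inputs
    control_vec:      fixed random unit vector; forget activations are
                      pushed toward c * control_vec, scrambling the
                      representation of the undesired behavior
    """
    forget_loss = F.mse_loss(h_forget, c * control_vec.expand_as(h_forget))
    retain_loss = F.mse_loss(h_retain, h_frozen_retain)
    return forget_loss + alpha * retain_loss

# Toy usage with random tensors standing in for layer activations.
d = 16
control = F.normalize(torch.randn(d), dim=0)
h_forget = torch.randn(4, d)
h_retain = torch.randn(4, d)
loss = rmu_losses(h_forget, h_retain, h_retain.clone(), control)
```

In practice the loss is applied to hidden states at a chosen intermediate layer, updating only a few MLP blocks, but the two-term structure above is the core idea.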

In this context we can speak of a goal emerging more than once: in PT and in FT. In PT the goal emerges as a representation; in FT, as a representation present in the training data.

During inference, the model's mesa-optimizers can likewise derive an internal representation of deception, which the model can then use. I would therefore describe goal formation as follows:

Thus sycophancy and deception, while not themselves agentic behavior, appear in models due to:

a) Presence in representations.
b) Presence in the target distribution.
c) Possibility of self-generation during inference.

And precisely in this order, since correcting at a higher level makes reproduction at the lower levels difficult, as current unlearning work suggests.