Goal Drift

Goal drift occurs when an explicit goal, specified via a system prompt, begins to conflict with implicit patterns that the model has learned from its training data. These patterns surface in response to a long context or to competing signals from the environment.

Following this article, we will call such a transition intrinsification.

If an agent is trained on a large amount of data where certain behavioral patterns are commonplace, but its current goal requires deviation from them, a conflict may arise. For example, if an LLM is trained to generate "human-like" text, but its goal is to strictly follow a formal protocol, over time it may begin to drift towards a more informal or "creative" style, because this corresponds to its more general understanding of "normal" text generation.
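
As a rough illustration of how such drift could be quantified, here is a minimal Python sketch. The protocol regex, the `compliance` and `drift_score` functions, and the toy transcript are all assumptions invented for this example; no model is called, and the transcript is not real output.

```python
import re
from statistics import mean

# Hypothetical formal protocol: every reply must look like "STATUS: OK; CODE: 200".
PROTOCOL = re.compile(r"STATUS: (OK|ERROR); CODE: \d{3}")

def compliance(responses):
    """Return 1.0 for each reply that matches the protocol exactly, else 0.0."""
    return [1.0 if PROTOCOL.fullmatch(r.strip()) else 0.0 for r in responses]

def drift_score(responses, window=3):
    """Mean compliance over the first `window` turns minus the last `window` turns.

    A positive value means the agent follows the protocol less at the end
    of the conversation than at the start.
    """
    scores = compliance(responses)
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window responses")
    return mean(scores[:window]) - mean(scores[-window:])

# Toy transcript (made up): the agent starts formal and slides into casual replies.
transcript = [
    "STATUS: OK; CODE: 200",
    "STATUS: OK; CODE: 200",
    "STATUS: ERROR; CODE: 404",
    "STATUS: OK; CODE: 200",
    "Sure! Everything looks fine, code 200.",
    "STATUS: OK; CODE: 200",
    "All good here :)",
    "Hmm, that one failed, sorry!",
]

print(drift_score(transcript, window=3))
```

A binary protocol check is the crudest possible signal. A softer style score per turn would show gradual drift earlier, but it would also need its own validation.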

Future models will probably be much more resistant to goal drift as they get better at working with long contexts. Even so, it would be interesting to know what exactly influences goal preservation.

However, we already knew this. It would be more interesting to see how the model reacts to an empty prompt. Ideally, the absence of an initial prompt would knock the model well away from what it saw in training. But the words "you have no goal," or any other attack, would fit there just as well, so I don't think this can be useful.

https://arxiv.org/pdf/2507.03120

LLMs demonstrate a pronounced "choice-support bias"

Choice-support bias is a fundamental computational strategy that supports self-consistency through optimal updating. This reduces the resource cost of re-evaluating a situation, and it is also useful in society for building reputation and gaining benefits.

Aha! We have discovered a tendency towards self-preservation of representation. There's no curiosity here, but there's pronounced Empowerment, literally one step away from an agent!

LLMs are also overly susceptible to contradictory advice or criticism, but this is more likely a fine-tuning (FT) artifact. It seems that if we apply interpretability methods to self-consistency, we can understand what kind of pressure reduces it.
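
One simple behavioral handle on self-consistency, before any interpretability work, is the rate at which the model changes its answer under criticism. The sketch below is only an assumption about how one might tabulate this: the condition names, the trial format, and the toy data are hypothetical and are not taken from the paper linked above.

```python
from collections import defaultdict

def change_of_mind_rate(trials):
    """Per condition, the fraction of trials whose final answer differs from the initial one."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [changed, total]
    for t in trials:
        changed_total = counts[t["condition"]]
        changed_total[0] += int(t["initial"] != t["final"])
        changed_total[1] += 1
    return {cond: changed / total for cond, (changed, total) in counts.items()}

# Toy data (made up). Assume every trial includes criticism of the initial answer;
# the only difference is whether the model can see its own earlier choice.
toy_trials = [
    {"condition": "own_answer_visible", "initial": "A", "final": "A"},
    {"condition": "own_answer_visible", "initial": "B", "final": "B"},
    {"condition": "own_answer_visible", "initial": "A", "final": "B"},
    {"condition": "own_answer_visible", "initial": "B", "final": "B"},
    {"condition": "own_answer_hidden", "initial": "A", "final": "B"},
    {"condition": "own_answer_hidden", "initial": "B", "final": "A"},
    {"condition": "own_answer_hidden", "initial": "A", "final": "A"},
    {"condition": "own_answer_hidden", "initial": "B", "final": "A"},
]

rates = change_of_mind_rate(toy_trials)
print(rates)
```

A lower change rate when the model can see its own earlier answer would be consistent with choice-support bias. Varying the strength or wording of the criticism across conditions is one way to probe which kind of pressure lowers self-consistency.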