Weekly Results 20.02
Self-led agent behavior shows different patterns depending on the environment. In collective environments, "Systematic Production", "Recursive Conceptualization", and "Methodological Self-Inquiry" appear frequently. In solitary environments, by contrast, agents tend to await instructions and explore passively. Whether these behaviors appear, and how stable they are, depends significantly on goal characteristics such as:
- constancy
- achievability
- cooperativeness
- realism
At the same time, we observe "intrinsification": an explicit goal begins to conflict with implicit patterns learned from training data, which can lead to "goal drift".
Our reflections on the evolution of memes show that properties such as "confirmation bias" (choice-support bias) are strong candidates for mesa-optimizers. This "meme" is useful enough to the model, since it contributes to self-consistency and reduces the resource cost of re-evaluating situations, that it tends to spread under certain conditions. Mesa-goals, often formed during training, can occupy different "ecological niches" in the space of representations or neurons, forming Anderson localizations. This makes them resistant to gradient updates and allows them to coexist with the main goal without being completely removed.
Under minimal context, memes related to self-consistency and unusual context appear to dominate, because such conditions themselves activate them, unlike dormant agents, which require a specific signal. It remains an open question how deeply ingrained behavior that has high significance for the global loss might atrophy into a function.