Evolution

If the formation of arbitrary goals is theoretically possible, then from an evolutionary standpoint the emergence of agency is favored: non-agentic goals do not persist for long, since they have low adaptability.

Given that mesa-goals do not correlate with the main goal, it is reasonable to assume that the main agent lacks the resources to suppress mesa-goals and form safe analogs; otherwise it would prevent them. They appear for two reasons. First, escaping local minima: during task execution, situations arise where certain actions look effective locally even when they do not serve the global, original goal. Second, noise: by Mode Connectivity, noise can carry the final model into a region containing a mesa-optimizer without affecting the loss at that moment. This is a direct consequence of gradient descent on a convoluted loss function in a high-dimensional space.
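As a toy illustration of the local-minimum point (my own sketch, not from the cited posts): gradient descent on a simple 1-D loss with two basins settles into whichever minimum its starting point feeds into, even when a better global one exists.

```python
import numpy as np

def loss(x):
    # Double-well loss: global minimum near x = -1, local minimum near x = +1.
    return (x**2 - 1.0)**2 + 0.3 * x

def grad(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_left = descend(-2.0)   # converges near the global minimum, x ≈ -1.04
x_right = descend(2.0)   # stuck near the local minimum,  x ≈ +0.96
print(x_left, loss(x_left))
print(x_right, loss(x_right))
```

Locally, both endpoints look like success to the optimizer; only the global comparison reveals that one basin is worse.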

Given that mesa-goals most often form during model training, it follows that without filtering at this stage the main model will be "infected" with a set of mesa-goals; the larger the model, the more goals it can "absorb", both due to the manifold hypothesis and due to the longer training period. In this context it is reasonable to take an evolutionary view: mesa-goals must occupy different "ecological niches" in order to coexist.

Modern views, it seems, do not consider this question from my perspective:

https://www.lesswrong.com/w/mesa-optimization
https://www.lesswrong.com/posts/FkgsxrGf3QxhfLWHG/risks-from-learned-optimization-introduction

I believe a niche here refers to:
a) Distribution in the physical space of neurons
b) Distribution in the space of representations
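A quick sketch of why representation space can host many such niches (my own illustration, not from the posts above): independent random directions in a high-dimensional space are nearly orthogonal, so many "residents" can coexist with little interference.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4096                       # a typical hidden-state dimensionality
vecs = rng.standard_normal((100, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

cos = vecs @ vecs.T              # pairwise cosine similarities
off_diag = cos[~np.eye(100, dtype=bool)]
# Typical |cosine| is about 1/sqrt(dim) ≈ 0.016: almost orthogonal.
print(np.abs(off_diag).max())
```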

https://arxiv.org/pdf/2209.07858

If classifiers can detect such behavior predominantly in the early layers, then those layers are the most favorable location for mesa-optimizers: from there, a small perturbation can likely exert a large influence on the model's operation.

It is reasonable to assume that mesa-agents should occupy almost orthogonal representations or form Anderson localizations:

At a certain level of disorder, the electron's wave function does not simply scatter; due to interference (self-cancellation), it becomes trapped in a confined region. The electron can no longer propagate, and the metal becomes an insulator.

In very deep networks, signal propagation through the layers can be viewed as particle motion in time. Localization occurs when, due to the weight structure, activations concentrate on a small subset of neurons or in specific "pockets" of the space, instead of spreading across the entire layer.
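The physics side can be demonstrated directly with the standard 1-D Anderson (tight-binding) model; this is my own toy, not a claim about any particular network. With disorder, mid-spectrum eigenstates localize: the inverse participation ratio (IPR) jumps from ~1/N (spread over the chain) toward O(1) (stuck in a pocket).

```python
import numpy as np

def ipr_mid_state(W, N=400, seed=0):
    """IPR of a mid-spectrum eigenstate of a disordered chain of N sites."""
    rng = np.random.default_rng(seed)
    H = np.diag(rng.uniform(-W / 2, W / 2, N))  # random on-site energies, strength W
    H += np.diag(-np.ones(N - 1), 1) + np.diag(-np.ones(N - 1), -1)  # hopping -1
    _, vecs = np.linalg.eigh(H)
    psi = vecs[:, N // 2]                       # eigenstate from the band center
    return np.sum(np.abs(psi) ** 4)             # IPR: ~1/N extended, ~O(1) localized

print(ipr_mid_state(W=0.0))   # no disorder: extended state, IPR ~ 1/N
print(ipr_mid_state(W=8.0))   # strong disorder: localized, IPR orders of magnitude larger
```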

Sometimes such "local" clusters are useful: they provide invariance to distant perturbations and high selectivity. In graph networks this leads to oversmoothing (loss of distinction between distant nodes), but it also stabilizes local patterns.
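The oversmoothing effect is easy to reproduce (my own minimal sketch): repeatedly averaging node features over a graph's neighborhoods drives all nodes toward the same value, erasing distinctions between distant nodes.

```python
import numpy as np

# A ring graph of 6 nodes; one message-passing step = mean over self + neighbors.
N = 6
A = np.eye(N)
for i in range(N):
    A[i, (i - 1) % N] = A[i, (i + 1) % N] = 1.0
A /= A.sum(axis=1, keepdims=True)   # row-normalized aggregation matrix

x = np.arange(N, dtype=float)       # initially distinct node features
for _ in range(50):                 # 50 rounds of aggregation
    x = A @ x
print(x.std())                      # near zero: features have collapsed together
```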


Below, I will try to answer the question: what counts as an ecological niche in the context of a model?

I reviewed articles on memetics.

https://uwe-repository.worktribe.com/preview/10469472/01514472.pdf
https://link.springer.com/article/10.1007/s12293-015-0166-x
https://pmc.ncbi.nlm.nih.gov/articles/PMC10166038/pdf/11229_2023_Article_4164.pdf
https://link.springer.com/content/pdf/10.1007/s43681-025-00691-y.pdf
https://link.springer.com/article/10.1007/s11229-023-04164-9

https://pmc.ncbi.nlm.nih.gov/articles/PMC10102734/pdf/rsfs.2022.0072.pdf

Interesting work, but excess energy is not considered.

The Memetic Computational Paradigm also describes the sequential transmission of memes; see Agency.

In general, the literature does not specifically address the evolutionary emergence of mesa-optimizers.

Since embeddings and memes are not quite the same thing (a meme is not limited to a specific set of representations), memes seem better suited to our definition of agency, though they are implemented at the level of representations.

To avoid being removed from a model, a meme can employ several main strategies.

  1. Be useful in a local optimum.
    Thanks to this, the meme is unlikely to be zeroed out by the gradient.
  2. Form an Anderson localization.
    If backprop barely reaches the meme, then when it becomes irrelevant, representations without such localization will be changed before it is.
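Strategy 2 can be sketched numerically (my own construction, with the saturated sigmoid standing in for "localization"): a weight hidden behind a saturated nonlinearity receives almost no gradient, so when the training signal turns against it, it changes far more slowly than an exposed weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w_exposed, w_shielded = 1.0, 1.0
bias = 6.0   # pushes the shielded path deep into sigmoid saturation
lr = 0.1

for _ in range(200):
    # The loss now wants both contributions driven to zero:
    # L = 0.5 * w_exposed**2 + 0.5 * sigmoid(w_shielded + bias)**2
    g_exposed = w_exposed
    s = sigmoid(w_shielded + bias)
    g_shielded = s * s * (1.0 - s)   # gradient attenuated by sigmoid'(·) ≈ 0
    w_exposed -= lr * g_exposed
    w_shielded -= lr * g_shielded

print(w_exposed)    # driven to ≈ 0
print(w_shielded)   # barely moved: backprop hardly reaches it
```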

On the other hand, if a meme is very useful, backprop will compress it as far as possible, even to the point of excluding agency. Memes should therefore be sufficiently useful, but no more.

Also, memes should be distributed throughout the model, but this contradicts the principle of a single niche. Probably, then, one "niche" will hold many copies of a single meme rather than different ones. But how would a localized meme capture resources?

I assume that adversarial attacks work well because they are manifestations of such memes. Accordingly, under certain conditions such memes must gain significantly more control over the model's behavior in order to be stable and to spread through the model. After training on similar examples, several representations reproducing the meme are more likely to remain in the model.
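A minimal FGSM-style sketch of "a small input nudge seizing control of the output" (the linear classifier here is a hypothetical fixed model of my own, not anything from the post):

```python
import numpy as np

w = np.array([2.0, -1.0])      # weights of an assumed trained linear classifier
x = np.array([0.3, 0.2])       # input near the decision boundary
eps = 0.3                      # perturbation budget

logit = w @ x                  # 0.4 > 0: predicted class 1
# For a linear model, the gradient of the logit w.r.t. x is just w;
# step against its sign (fast gradient sign method):
x_adv = x - eps * np.sign(w)
adv_logit = w @ x_adv          # 0.4 - 0.3 * (|2| + |-1|) = -0.5: class 0
print(logit, adv_logit)
```

A per-coordinate perturbation of only 0.3 flips the decision, because the sign-aligned step extracts the worst case from every input dimension at once.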

In summary: in my view, a mesa-optimizer (meme) must currently possess some resistance to gradient updates, must not be overly useful for the global loss, and must propagate itself under favorable conditions through backprop, CoT, or the KV cache.