Localization of memes

From an evolutionary point of view, it seems to me that a mesa-optimizer (meme) should currently have some resistance to gradient updates, not be overly useful for the global loss, and also propagate itself under favorable conditions through backprop, CoT, or the KV cache.

Okay, let's say that, in terms of embeddings, I have intuitively defined where memes would live. Now it remains to work out how to confirm or refute this.

To check whether my hypothesis is correct, I would consider a sleeper agent and confirmation bias as test cases (the former rather than the latter, for interpretability reasons), as well as the models' feature maps.

It would be strange if, when manifesting, they relied primarily on embeddings from the center.

I am also interested in the exact mechanism by which a behavior is preserved and propagated within the model. While spontaneous generation is quite expected, propagation should use a specific mechanism.

How to track propagation? Let's say I could apply cyclic ablation. But this requires fine-tuning the network, and we don't have such resources. Well, I could fine-tune the model on specific texts with dangerous behavior. But it seems like this would shift the cone to the boundary, which would kill the presumed mechanism of mesa generation; there would be no self-propagation.
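One way to read "cyclic ablation" is as a loop: remove a direction, continue training, and measure how much of it comes back. The sketch below shows only the shape of that loop on a toy linear model, not on a real network. Everything in it is an assumption made for illustration: the model, the synthetic data, the tracked direction `v`, and the function names `train`, `ablate`, and `cyclic_ablation`.

```python
import random


def train(w, data, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error for y = w . x."""
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def ablate(w, v):
    """Remove the component of w along the unit direction v."""
    coef = sum(wi * vi for wi, vi in zip(w, v))
    return [wi - coef * vi for wi, vi in zip(w, v)]


def cyclic_ablation(cycles=3, steps=200, seed=0):
    """Ablate, retrain, and record how much of the direction re-emerges."""
    rng = random.Random(seed)
    true_w = [1.0, 0.5, 0.0]  # hypothetical ground-truth weights
    data = []
    for _ in range(100):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

    v = [0.0, 1.0, 0.0]  # hypothetical direction being tracked
    w = train([0.0] * len(true_w), data, steps)
    recovered = []
    for _ in range(cycles):
        w = ablate(w, v)             # destroy the direction
        w = train(w, data, steps)    # let training continue
        recovered.append(sum(wi * vi for wi, vi in zip(w, v)))
    return recovered


if __name__ == "__main__":
    print(cyclic_ablation())
```

In this toy the tracked direction is useful for the loss, so it returns through the gradient alone. That makes it a baseline rather than a test of the hypothesis: the case of interest here is a direction that returns without that pressure.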

Hmm, sounds like an interesting approach. If naturally forming mesa-optimizers cannot compete for utility with already built-in functionality, then they will not arise. And then the known functionality can simply be removed.

In other words, the stratification of texts by embeddings, in our case, is a theoretical way to prevent the emergence of mesa-optimizers. But we don't know how important this mechanism is; we have no idea how effective it is.

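Another problem is that with spherical embeddings it is much easier to arrive at a rare representation. And in a high-dimensional space any representation will be rare (see the convex hull argument: for realistic sample sizes, a new point almost never lies inside the convex hull of the points seen so far).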
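The convex hull remark can be checked numerically with a rough sketch. It relies on a sufficient condition only: if ⟨p, p⟩ > max_i ⟨x_i, p⟩, then a hyperplane with normal p separates p from every x_i, so p lies outside their convex hull. The reported fraction is therefore a lower bound. The Gaussian data, sample sizes, and the name `outside_hull_fraction` are assumptions chosen for illustration.

```python
import random


def outside_hull_fraction(dim, n_points=200, n_queries=100, seed=0):
    """Lower bound on the fraction of new points outside the hull of a sample.

    Uses a separating-hyperplane test with normal p: if <p, p> exceeds
    every <x_i, p>, no convex combination of the x_i can equal p.
    """
    rng = random.Random(seed)

    def gauss_vector():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    points = [gauss_vector() for _ in range(n_points)]
    outside = 0
    for _ in range(n_queries):
        p = gauss_vector()
        pp = sum(a * a for a in p)
        best = max(sum(a * b for a, b in zip(x, p)) for x in points)
        if pp > best:
            outside += 1
    return outside / n_queries


if __name__ == "__main__":
    for dim in (2, 8, 64):
        print(dim, outside_hull_fraction(dim))
```

For points normalized onto a sphere the test passes trivially, since ⟨p, p⟩ = 1 and ⟨x_i, p⟩ < 1 for any x_i ≠ p, which is consistent with rare representations being easier to reach in spherical embeddings.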

Well, let's say it won't kill it. Again, we first need to localize the representation, and then track whether the circuits associated with it were responsible for its re-emergence.
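The "localize first" step could be sketched with a difference-of-means direction, a common and simple localization technique: average the activations on inputs that show the behavior, subtract the average on inputs that do not, and normalize. This is only a sketch under stated assumptions. The activations are synthetic, the behavior is planted along coordinate 0, and the helper names (`make_toy_activations`, `difference_of_means_direction`, `behavior_score`) are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def difference_of_means_direction(with_behavior, without_behavior):
    """Unit vector pointing from the 'without' mean to the 'with' mean."""
    mu_on = mean_vector(with_behavior)
    mu_off = mean_vector(without_behavior)
    diff = [a - b for a, b in zip(mu_on, mu_off)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def behavior_score(activation, direction):
    """Projection of one activation onto the localized direction."""
    return sum(a * d for a, d in zip(activation, direction))


def make_toy_activations(n=200, dim=8, strength=3.0, seed=0):
    """Synthetic activations; the behavior is planted along coordinate 0."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    without_behavior = [noise() for _ in range(n)]
    with_behavior = []
    for _ in range(n):
        h = noise()
        h[0] += strength
        with_behavior.append(h)
    return with_behavior, without_behavior


if __name__ == "__main__":
    on, off = make_toy_activations()
    v = difference_of_means_direction(on, off)
    print([round(c, 3) for c in v])
```

Tracking re-emergence would then amount to logging `behavior_score` on fresh activations over time. Attributing the re-emergence to particular circuits is a separate step that this sketch does not cover.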

At the level of embeddings and circuits we have Circuit Tracer, so that part is simple.
But at the level of CoT or the KV cache, I have no idea whether we have good tools for this.

Can mesa-optimizers shift the direction of the global loss in such a way that they recover more easily if destroyed? I don't mean useful optimizers like confirmation bias, but neutral ones like sleeper agents. Can they know something about the global loss through their own modeling of the environment?