I have decided that all functions that have one of these terms as a component will be considered agentic. In that case, all compositions of the form $g \circ f$, where $f$ is agentic, are also agentic. The problem is that under this definition there are no restrictions on $g$, meaning the resulting $g \circ f$ can be absolutely anything.
How can this problem be solved while preserving the idea that the function possesses these intrinsic drives? One could restrict $g$, but it seems the problem of defining agenticity merely transfers from $f$ to $g$, which makes this a very poor approach; let us set it aside.
Let us assume we have some measure of agenticity $A$, such that $A(f^*) = 1$ for the ideal agentic function $f^*$. Furthermore, I would like to assign higher agenticity to functions that, when executed, would, on average, behave "similarly" to the aforementioned $f^*$. Consequently, $A(f) = 0$ if the resulting function is not "similar" to $f^*$ at all, and $f$ is anti-agentic if its agenticity is negative.
Two "agentic" functions might explore different parts of the space, and their trajectories will differ, but both are agentic; because of this, we cannot compare specific trajectories.
What do we mean by similarity? The smaller the measurement error, the better; i.e., the smaller the discrepancy between the behavior of $f$ and that of $f^*$, the larger $A(f)$, and ideal agenticity is defined by $A(f) = A(f^*) = 1$.
Is it possible that, in identical environments, $f$ tracks the target behavior even better than $f^*$? Yes, quite possible, but we do not consider such a function more agentic, because $A$ is normalized so that $A(f^*) = 1$ is the ceiling.
Here arises the issue with mesa-optimizers, since the agenticity of the system can be greater or lesser than the agenticity of the sum of its parts, and the system's mesa-optimizers can work in the negative direction, meaning $A(\text{part}_i)$ might be negative, resulting in $A(\text{system}) \neq \sum_i A(\text{part}_i)$.
In summary, we have:
- Normalization relative to $f^*$
- Violation of additivity
- Triangle inequality
- Symmetry
This means we have (almost) a complete set of metric properties. The triangle inequality can be obtained if we take not $A(f)$ itself, but $d(f, g) = |A(f) - A(g)|$; that is, we switch from a measure to a metric. In that case we also need to say that the ideal-agenticity condition $A(f) = 1$ transforms into $d(f, f^*) = 0$.
$$d(f, g) = d(g, f) \quad \text{(Symmetry)}$$
$$d(f, h) \leq d(f, g) + d(g, h) \quad \text{(Triangle Inequality)}$$
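As a sanity check, the measure-to-metric construction $d(x, y) = |A(x) - A(y)|$ can be verified against these axioms on toy data. The agenticity scores below are purely illustrative, not derived from any real functions:

```python
import itertools

# Illustrative agenticity scores A for a handful of hypothetical functions;
# d(x, y) = |A(x) - A(y)| turns the measure A into a (pseudo)metric.
A = {"f_star": 1.0, "f1": 0.7, "f2": 0.3, "f3": -0.2}

def d(x, y):
    return abs(A[x] - A[y])

for x, y, z in itertools.product(A, repeat=3):
    assert d(x, y) == d(y, x)                      # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12    # triangle inequality

assert d("f_star", "f_star") == 0                  # ideal agenticity: d = 0
```

Note that this is strictly a pseudometric: $d(x, y) = 0$ only implies equal agenticity scores, not that $x$ and $y$ are the same function.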
If the loss function $L$ were chosen such that its minimization always leads to a policy $\pi$ that is optimal for the reward function $R$, we could say that $L$ is behaviorally equivalent to $R$.
If a policy $\pi$ is optimal for some loss/objective function $L$ or reward $R$, then there exist functions in the reward class $\{R' : R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)\}$, where $\Phi$ is a technical potential function ($\Phi: S \to \mathbb{R}$), which induce the same strategy.
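This invariance is easy to check numerically. The sketch below uses a hypothetical four-state chain MDP (nothing from the text): it runs value iteration on a reward $R$ and on its potential-shaped version $R'(s, a) = R(s, a) + \gamma\Phi(s') - \Phi(s)$, and confirms that both induce the same greedy policy:

```python
import numpy as np

# Hypothetical deterministic chain MDP: 4 states, actions 0 = left, 1 = right.
nS, nA, gamma = 4, 2, 0.9
nxt = np.array([[max(s - 1, 0), min(s + 1, nS - 1)] for s in range(nS)])
R = np.zeros((nS, nA))
R[2, 1] = 1.0  # the only reward: stepping right from state 2

def greedy_policy(R):
    """Value iteration, then the greedy policy w.r.t. the resulting Q."""
    V = np.zeros(nS)
    for _ in range(500):
        Q = R + gamma * V[nxt]   # Q(s, a) = R(s, a) + γ V(next state)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Shape R with an arbitrary potential Phi: R'(s, a) = R + γΦ(s') − Φ(s)
Phi = np.random.default_rng(1).normal(size=nS)
R_shaped = R + gamma * Phi[nxt] - Phi[:, None]

# Potential shaping leaves the optimal policy unchanged (Ng et al., 1999).
assert (greedy_policy(R) == greedy_policy(R_shaped)).all()
```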
We can formally define the reward function that would generate this behavioral equivalence:
$$\pi^* = \arg\min_\pi L(\pi, G), \qquad R_{\text{des}} \text{ such that } \pi^* \in \arg\max_\pi J^\pi(R_{\text{des}})$$
Here:
- $\pi^*$ is the policy found as a result of optimization.
- $L(\pi, G)$ is the loss function that penalizes the policy for poorly implementing the objective associated with $G$.
- $G$: This describes the objective.
- $R_{\text{des}}$: This is the desired reward.
- $R_{\text{true}}$: The true reward.
- $J^\pi(R)$ is the expected discounted return following policy $\pi$ in an environment with reward $R$.
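For a finite MDP, the expected discounted return can be computed in closed form from the Bellman equation $V = r_\pi + \gamma P_\pi V$. The two-state numbers below are an arbitrary illustration, not taken from the text:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],    # P_pi[s, s']: transition matrix under policy pi
              [0.1, 0.9]])
r = np.array([1.0, 0.0])     # r_pi[s]: expected one-step reward under pi
rho = np.array([0.5, 0.5])   # initial state distribution

# Solve V = r + γ P V exactly, then J^π(R) = E_{s0 ~ ρ}[V(s0)]
V = np.linalg.solve(np.eye(2) - gamma * P, r)
J = rho @ V
```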
Simply put, the loss forces the agent to behave as if it were maximizing $R_{\text{des}}$. If this behavior matches the behavior optimal for $R_{\text{true}}$, then $R_{\text{des}}$ and $R_{\text{true}}$ are close in the STARC sense.
Thus, we can state that we have a family of metrics $d_{\text{STARC}}$, and any of them is suitable for evaluating the agenticity of functions. Now, we need to extract at least one such metric from the definition of agenticity and STARC (https://arxiv.org/pdf/2309.15257v3). VAL (Value-Adjusted Levelling) showed the best correlation with reward loss in experiments and also satisfies uniqueness properties.
$$d_{\text{STARC}}(R_f, R_{f^*}) = m\big(s(R_f),\, s(R_{f^*})\big)$$
Where:
- $R_f$ is the reward induced by the function under investigation, $f$.
- $R_{f^*}$ is the reward corresponding to the ideal agentic objective $f^*$.
- $s(R)$ is the standardized reward: $s(R) = c(R) / n(c(R))$.
- $c$ (the canonicalized reward function) allows us to ignore the environment dynamics and reward-shaping features that do not affect the policy.
- $n$ is used to ensure the correctness of the comparison, using a norm over the space of canonicalized functions. The paper suggests using the $L_1$ or $L_2$ norm, or the range of the expected return: $$n(R) = \max_\pi J(\pi) - \min_\pi J(\pi)$$
- $m$, as a distance measure, is typically implemented using the $L_1$ or $L_2$ norm.
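Putting the pieces together, here is a sketch of such a distance for rewards on a small finite space. As a simplifying substitution, the canonicalization $c$ below is the EPIC-style mean-based construction over uniform state/action distributions rather than VAL, with $n = m = L_2$; all tensors are synthetic:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
nS, nA = 4, 3
R_f = rng.normal(size=(nS, nA, nS))  # R_f[s, a, s']: reward under investigation

def canonicalize(R):
    """EPIC-style canonicalization under uniform distributions.

    c(R)(s, a, s') = R(s, a, s') + γ m(s') − m(s) − γ m̄, where
    m(x) = E[R(x, A, S'')] and m̄ = E[m(S)]; this removes potential shaping.
    """
    m = R.mean(axis=(1, 2))
    return R + gamma * m[None, None, :] - m[:, None, None] - gamma * m.mean()

def standardize(R):
    C = canonicalize(R)              # s(R) = c(R) / n(c(R)) with n = L2
    return C / np.linalg.norm(C)

def dist(R1, R2):
    return np.linalg.norm(standardize(R1) - standardize(R2))  # m = L2

# Positive rescaling plus potential shaping leaves the distance at zero...
Phi = rng.normal(size=nS)
R_equiv = 2.5 * (R_f + gamma * Phi[None, None, :] - Phi[:, None, None])
assert dist(R_f, R_equiv) < 1e-9

# ...while an unrelated reward stays at a positive distance.
assert dist(R_f, rng.normal(size=(nS, nA, nS))) > 1e-3
```

The invariance to shaping and positive scaling is exactly what makes such a distance depend only on the induced behavior, not on how the reward happens to be written down.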
Therefore, within the given structure, we can provide a numerical estimate of the agenticity of a function $f$ relative to the perfectly agentic function $f^*$ by measuring the distance between their behavioral equivalents in the space of canonicalized rewards.