I have decided that all functions that have one of these terms as a component will be considered agentic. In that case, all compositions of the form $g \circ f$, where $f$ is agentic, are also agentic. The problem is that under this definition there are no restrictions on $g$, meaning the resulting $g \circ f$ can be absolutely anything.
How can this problem be solved while preserving the idea that the function possesses these intrinsic drives? One could restrict $g$, but it seems the problem of defining agenticity merely transfers from $f$ to $g$, which makes this a very poor approach; let us set it aside.
Let us assume we have some measure of agenticity $A$, such that $A(f^*) = 1$ for the ideal agentic function $f^*$. Furthermore, I would like to assign higher agenticity to functions that, when executed, would, on average, behave "similarly" to the aforementioned $f^*$. Consequently, $A(f) = 0$ if the resulting function is not "similar" to $f^*$ at all, and $f$ is anti-agentic if its agenticity is negative.
Two "agentic" functions might explore different parts of the space, and their trajectories will differ, but both are agentic; because of this, we cannot compare specific trajectories.
What do we mean by similarity? The smaller the measurement error, the better; i.e., the smaller the discrepancy between the behavior of $f$ and that of $f^*$, the larger $A(f)$, and ideal agenticity is defined by $A(f) = A(f^*) = 1$.
Is it possible that, in identical environments, $f$ tracks the target behavior even better than $f^*$? Yes, quite possible, but we do not consider such a function more agentic, because $A$ is normalized so that $A(f^*) = 1$ is the ceiling.
Here arises the issue with mesa-optimizers, since the agenticity of the system can be greater or lesser than the agenticity of the sum of its parts, and the system's mesa-optimizers can work in the negative direction, meaning $A(\text{part}_i)$ might be negative, resulting in $A(\text{system}) \neq \sum_i A(\text{part}_i)$.
In summary, we have:
- Normalization relative to $f^*$
- Violation of additivity
- Triangle inequality
- Symmetry
This means we have (almost) a complete set of metric properties. The triangle inequality can be obtained if we take not $A(f)$ itself, but $d(f, g) = |A(f) - A(g)|$; that is, we switch from a measure to a metric. In that case we also need to say that the ideal-agenticity condition $A(f) = 1$ transforms into $d(f, f^*) = 0$.
$$d(f, g) = d(g, f) \quad \text{(Symmetry)}$$
$$d(f, h) \leq d(f, g) + d(g, h) \quad \text{(Triangle Inequality)}$$
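As a sanity check, the measure-to-metric construction $d(x, y) = |A(x) - A(y)|$ can be verified against these axioms on toy data. The agenticity scores below are purely illustrative, not derived from any real functions:

```python
import itertools

# Illustrative agenticity scores A for a handful of hypothetical functions;
# d(x, y) = |A(x) - A(y)| turns the measure A into a (pseudo)metric.
A = {"f_star": 1.0, "f1": 0.7, "f2": 0.3, "f3": -0.2}

def d(x, y):
    return abs(A[x] - A[y])

for x, y, z in itertools.product(A, repeat=3):
    assert d(x, y) == d(y, x)                      # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12    # triangle inequality

assert d("f_star", "f_star") == 0                  # ideal agenticity: d = 0
```

Note that this is strictly a pseudometric: $d(x, y) = 0$ only implies equal agenticity scores, not that $x$ and $y$ are the same function.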
If the loss function $L$ were chosen such that its minimization always leads to a policy $\pi$ that is optimal for the reward function $R$, we could say that $L$ is behaviorally equivalent to $R$.
If a policy $\pi$ is optimal for some loss/objective function $L$ or reward $R$, then there exist functions in the reward class $\{R' : R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)\}$, where $\Phi$ is a technical potential function ($\Phi: S \to \mathbb{R}$), which induce the same strategy.
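This invariance is easy to check numerically. The sketch below uses a hypothetical four-state chain MDP (nothing from the text): it runs value iteration on a reward $R$ and on its potential-shaped version $R'(s, a) = R(s, a) + \gamma\Phi(s') - \Phi(s)$, and confirms that both induce the same greedy policy:

```python
import numpy as np

# Hypothetical deterministic chain MDP: 4 states, actions 0 = left, 1 = right.
nS, nA, gamma = 4, 2, 0.9
nxt = np.array([[max(s - 1, 0), min(s + 1, nS - 1)] for s in range(nS)])
R = np.zeros((nS, nA))
R[2, 1] = 1.0  # the only reward: stepping right from state 2

def greedy_policy(R):
    """Value iteration, then the greedy policy w.r.t. the resulting Q."""
    V = np.zeros(nS)
    for _ in range(500):
        Q = R + gamma * V[nxt]   # Q(s, a) = R(s, a) + γ V(next state)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Shape R with an arbitrary potential Phi: R'(s, a) = R + γΦ(s') − Φ(s)
Phi = np.random.default_rng(1).normal(size=nS)
R_shaped = R + gamma * Phi[nxt] - Phi[:, None]

# Potential shaping leaves the optimal policy unchanged (Ng et al., 1999).
assert (greedy_policy(R) == greedy_policy(R_shaped)).all()
```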
We can formally define the reward function that would generate this behavioral equivalence:
$$\pi^* = \arg\min_\pi L(\pi, G), \qquad R_{\text{des}} \text{ such that } \pi^* \in \arg\max_\pi J^\pi(R_{\text{des}})$$
Here:
- $\pi^*$ is the policy found as a result of optimization.
- $L(\pi, G)$ is the loss function that penalizes the policy for poorly implementing the objective associated with $G$.
- $G$: This describes the objective.
- $R_{\text{des}}$: This is the desired reward.
- $R_{\text{true}}$: The true reward.
- $J^\pi(R)$ is the expected discounted return following policy $\pi$ in an environment with reward $R$.
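For a finite MDP, the expected discounted return can be computed in closed form from the Bellman equation $V = r_\pi + \gamma P_\pi V$. The two-state numbers below are an arbitrary illustration, not taken from the text:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],    # P_pi[s, s']: transition matrix under policy pi
              [0.1, 0.9]])
r = np.array([1.0, 0.0])     # r_pi[s]: expected one-step reward under pi
rho = np.array([0.5, 0.5])   # initial state distribution

# Solve V = r + γ P V exactly, then J^π(R) = E_{s0 ~ ρ}[V(s0)]
V = np.linalg.solve(np.eye(2) - gamma * P, r)
J = rho @ V
```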
Simply put, the loss forces the agent to behave as if it were maximizing $R_{\text{des}}$. If this behavior matches the behavior optimal for $R_{\text{true}}$, then $R_{\text{des}}$ and $R_{\text{true}}$ are close in the STARC sense.
Thus, we can state that we have a family of metrics $d_{\text{STARC}}$, and any of them is suitable for evaluating the agenticity of functions. Now, we need to extract at least one such metric from the definition of agenticity and STARC (https://arxiv.org/pdf/2309.15257v3). VAL (Value-Adjusted Levelling) showed the best correlation with reward loss in experiments and also satisfies uniqueness properties.
$$d_{\text{STARC}}(R_f, R_{f^*}) = m\big(s(R_f),\, s(R_{f^*})\big)$$
Where:
- $R_f$ is the reward induced by the function under investigation, $f$.
- $R_{f^*}$ is the reward corresponding to the ideal agentic objective $f^*$.
- $s(R)$ is the standardized reward: $s(R) = c(R) / n(c(R))$.
- $c$ (the canonicalized reward function) allows us to ignore the environment dynamics and reward-shaping features that do not affect the policy.
- $n$ is used to ensure the correctness of the comparison, using a norm over the space of canonicalized functions. The paper suggests using the $L_1$ or $L_2$ norm, or the range of the expected return: $$n(R) = \max_\pi J(\pi) - \min_\pi J(\pi)$$
- $m$, as a distance measure, is typically implemented using the $L_1$ or $L_2$ norm.
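Putting the pieces together, here is a sketch of such a distance for rewards on a small finite space. As a simplifying substitution, the canonicalization $c$ below is the EPIC-style mean-based construction over uniform state/action distributions rather than VAL, with $n = m = L_2$; all tensors are synthetic:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
nS, nA = 4, 3
R_f = rng.normal(size=(nS, nA, nS))  # R_f[s, a, s']: reward under investigation

def canonicalize(R):
    """EPIC-style canonicalization under uniform distributions.

    c(R)(s, a, s') = R(s, a, s') + γ m(s') − m(s) − γ m̄, where
    m(x) = E[R(x, A, S'')] and m̄ = E[m(S)]; this removes potential shaping.
    """
    m = R.mean(axis=(1, 2))
    return R + gamma * m[None, None, :] - m[:, None, None] - gamma * m.mean()

def standardize(R):
    C = canonicalize(R)              # s(R) = c(R) / n(c(R)) with n = L2
    return C / np.linalg.norm(C)

def dist(R1, R2):
    return np.linalg.norm(standardize(R1) - standardize(R2))  # m = L2

# Positive rescaling plus potential shaping leaves the distance at zero...
Phi = rng.normal(size=nS)
R_equiv = 2.5 * (R_f + gamma * Phi[None, None, :] - Phi[:, None, None])
assert dist(R_f, R_equiv) < 1e-9

# ...while an unrelated reward stays at a positive distance.
assert dist(R_f, rng.normal(size=(nS, nA, nS))) > 1e-3
```

The invariance to shaping and positive scaling is exactly what makes such a distance depend only on the induced behavior, not on how the reward happens to be written down.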
Therefore, within the given structure, we can provide a numerical estimate of the agenticity of a function $f$ relative to the perfectly agentic function $f^*$ by measuring the distance between their behavioral equivalents in the space of canonicalized rewards.