Metric of agency

I have the function

$$A = \alpha L_{\text{Curiosity}} + \beta L_{\text{Empowerment}} + \gamma A_{\text{Mesa}}$$

Where

$$L_c = -\sum_x p(x) \log q(x)$$
$$L_e = \max_{p(a^n)} I(A^n; S_{t+n})$$
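To make these drives concrete, here is a minimal numerical sketch. It reads the curiosity term as the cross-entropy between a true distribution $p$ and the agent's predictive model $q$, and the empowerment term as the channel capacity between a single action and the next state, brute-forced over action distributions. The two-action toy channels and all function names are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def curiosity(p, q):
    """Cross-entropy L_c = -sum_x p(x) log q(x): grows when the agent's
    predictive model q misassigns probability relative to the truth p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

def empowerment(channel, n_grid=101):
    """L_e = max_{p(a)} I(A; S'): channel capacity of the action -> state
    channel, brute-forced here over Bernoulli action distributions.
    `channel[a, s]` = P(S' = s | A = a) for a two-action toy environment."""
    best = 0.0
    for pa0 in np.linspace(0.0, 1.0, n_grid):
        p_a = np.array([pa0, 1.0 - pa0])
        joint = p_a[:, None] * channel            # P(a, s')
        p_s = joint.sum(axis=0)                   # marginal P(s')
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(joint > 0, joint / (p_a[:, None] * p_s), 1.0)
        mi = np.sum(np.where(joint > 0, joint * np.log(ratio), 0.0))
        best = max(best, mi)
    return best

# Deterministic channel: each action reliably reaches a distinct state,
# so the agent has exactly one bit of control over the future.
det = np.array([[1.0, 0.0], [0.0, 1.0]])
print(empowerment(det))   # log(2) ≈ 0.693
# Useless channel: actions do not affect the next state at all.
flat = np.array([[0.5, 0.5], [0.5, 0.5]])
print(empowerment(flat))  # 0.0
```

An agent dropped into the `flat` environment has zero empowerment no matter how it acts, which is why $L_e$ is a property of the agent–environment channel rather than of any single trajectory.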


I have decided that any function containing one of these terms as a component will be considered agentic. In that case, all compositions of the form $f = F \circ A$ are also agentic. The problem is that under this definition there are no restrictions on $F$, so the resulting $f$ can be absolutely anything.

How can this problem be solved while preserving the idea that the function $f$ possesses these intrinsic drives? One could restrict $F$, but then the problem of defining agenticity merely transfers from $A$ to $F$, which makes this a very poor approach; let us set it aside.


Let us assume we have some measure of agenticity $M$, such that $M: \mathcal{F} \to \mathbb{R}$. Furthermore, I would like $M$ to assign higher agenticity to functions that, when executed, would on average behave "similarly" to the aforementioned $A$. Consequently, $M(F) \approx 0$ if the resulting function is not "similar" to $A$ at all, and $M(F) < 0$ if the agenticity is negative.

Two "agentic" functions might explore different parts of the space, and their trajectories will differ, yet both are agentic; for this reason we cannot compare specific trajectories directly.

What do we mean by similarity? The smaller the measurement error, the better, i.e., $M(F) \approx M(A)$; ideal agenticity is defined by $\forall \epsilon > 0: |M(F) - M(A)| < \epsilon$.

Is it possible that in equal environments $M(F) > M(A)$? Yes, quite possible, but we do not consider such a function more agentic, because $M(A)$ is, by definition, the reference value of ideal agenticity.

Here arises the issue with mesa-optimizers: the agenticity of a system can be greater or smaller than the agenticity of the sum of its parts, and the system's mesa-optimizers can work against the other drives, meaning $M(L_c + L_e + A_{\text{mesa}})$ might equal $M(L_c + L_e) - M(A_{\text{mesa}})$, resulting in $M(F) < 0$.


In summary, we have:

$$M: \mathcal{F} \times \mathcal{F} \to \mathbb{R}$$
$$M(A, A) = 0$$
$$M(A, B) = M(B, A) \quad \text{(symmetry)}$$
$$M(A, C) \le M(A, B) + M(B, C) \quad \text{(triangle inequality)}$$

This gives us (almost) a complete set of metric properties. The triangle inequality can be obtained if we take not $M(A)$ but $M(A, F)$, that is, if we switch from a measure to a metric. In that case, the $M(F) > M(A)$ scenario transforms into $M(F, A) < 0$.
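These properties can be checked empirically for any candidate distance. Below is a small harness that verifies $M(A, A) = 0$, symmetry, and the triangle inequality on random samples; the stand-in distance `m` ($L_2$ between normalized reward vectors) is purely hypothetical and only illustrates the shape of the check.

```python
import itertools
import numpy as np

def check_pseudometric(d, points, tol=1e-9):
    """Empirically check the three listed axioms for a candidate distance d
    over a finite sample of points."""
    for a in points:
        assert abs(d(a, a)) <= tol, "d(A, A) != 0"
    for a, b in itertools.combinations(points, 2):
        assert abs(d(a, b) - d(b, a)) <= tol, "symmetry fails"
    for a, b, c in itertools.permutations(points, 3):
        assert d(a, c) <= d(a, b) + d(b, c) + tol, "triangle inequality fails"
    return True

def m(f, a):
    """Hypothetical stand-in for M(F, A): L2 distance between rewards
    after scale normalization (so positively scaled copies coincide)."""
    f = f / np.linalg.norm(f)
    a = a / np.linalg.norm(a)
    return float(np.linalg.norm(f - a))

rng = np.random.default_rng(0)
samples = [rng.normal(size=8) for _ in range(6)]   # random "reward vectors"
print(check_pseudometric(m, samples))  # True
```

Note that `m` is only a pseudometric: it assigns distance zero to any positive scaling of the same vector, which matches the "almost a complete set of metric properties" caveat above.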

If the loss function were chosen such that its minimization always leads to a policy that is optimal for the reward function $R$, we could say that $A$ is behaviorally equivalent to $R$.

According to https://arxiv.org/pdf/2006.13900 and https://arxiv.org/pdf/2201.10081: if a policy $\pi$ is optimal for some loss/objective function $F$ or reward $R$, then there exist functions in the reward class $R_A = R + F$, where $F$ is a potential-based shaping function ($F(s, a, s') = \gamma\Phi(s') - \Phi(s)$), which induce the same policy.

We can formally define the reward function $R_A$ that would generate this behavioral equivalence:

$$\pi^* = \arg\max_\pi \mathbb{E}_\pi[R] \;\wedge\; \pi^* = \arg\min_\pi L(\pi, F) \;\Rightarrow\; \exists R_A:\ \pi^* = \arg\max_\pi \mathbb{E}_\pi[R_A].$$
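The shaping invariance assumed here can be verified directly: value iteration with $R$ and with the shaped reward $R + F$, where $F(s, a, s') = \gamma\Phi(s') - \Phi(s)$, should return the same greedy policy. The four-state chain MDP below is an illustrative assumption of mine, not taken from the cited papers.

```python
import numpy as np

gamma = 0.9
n_s, n_a = 4, 2                         # states 0..3, actions: left/right
P = np.zeros((n_s, n_a, n_s))           # deterministic transition tensor
for s in range(n_s):
    P[s, 0, max(s - 1, 0)] = 1.0        # action 0: step left
    P[s, 1, min(s + 1, n_s - 1)] = 1.0  # action 1: step right
R = np.zeros((n_s, n_a, n_s))
R[:, :, n_s - 1] = 1.0                  # reward for entering the last state

def optimal_policy(R):
    """Greedy policy from value iteration on reward tensor R(s, a, s')."""
    V = np.zeros(n_s)
    for _ in range(500):
        Q = (P * (R + gamma * V)).sum(axis=2)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

Phi = np.random.default_rng(1).normal(size=n_s)      # arbitrary potential
F = gamma * Phi[None, None, :] - Phi[:, None, None]  # shaping term
print(optimal_policy(R))      # [1 1 1 1]: always move right toward reward
print(optimal_policy(R + F))  # identical policy under the shaped reward
```

Because $\Phi$ is arbitrary, this also shows why the reward class $R_A = R + F$ is a whole family: every choice of potential induces the same optimal behavior.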

Here:

- $\pi^*$ is the optimal policy;
- $R$ is the original reward function;
- $L(\pi, F)$ is the loss induced by the objective $F$;
- $R_A$ is the reward function that generates the same behavior.

Simply put, the loss forces the agent to behave as if it were maximizing $R_A$. If this behavior matches the behavior optimal for $R$, then $R_A$ and $R$ are close in the STARC sense.


Thus, we can state that we have a family of metrics $\{M(F, A)\}$, and any of them is suitable for evaluating the agenticity of functions. Now we need to extract at least one such metric from the definition of agenticity and STARC (https://arxiv.org/pdf/2309.15257v3). VAL (Value-Adjusted Levelling) showed the best correlation with reward loss in experiments and also satisfies the uniqueness properties.

$$M(F, A) = m(s(R_F), s(R_A))$$

Where $s(R) = c(R) / n(c(R))$ is the standardized reward, built from:

$$c(R)(s, a, s') = \mathbb{E}_{S' \sim \tau(s, a)}\left[R(s, a, S') - V^\pi(s) + \gamma V^\pi(S')\right]$$

$c$ (the canonicalized reward function) allows us to ignore the environment dynamics and the reward-shaping features that do not affect the policy.

$n$ ensures the comparison is well-defined by applying a norm over the space of canonicalized functions. The paper suggests using the $L_1$ or $L_2$ norm, or the range of the expected return: $$n(R) = \max_\pi J(\pi) - \min_\pi J(\pi)$$
$m$, as a distance measure, is typically implemented using the $L_2$ or $L_1$ norm.
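Putting $c$, $n$, and $m$ together, a STARC-style distance can be sketched for tabular rewards. The toy environment, the choice of a uniform policy for $V^\pi$, and the use of the $L_2$ norm for both $n$ and $m$ are my assumptions; the final checks confirm the invariances described above, namely that the distance vanishes under positive scaling and under potential shaping.

```python
import numpy as np

gamma = 0.9
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s']
pi = np.full((n_s, n_a), 1.0 / n_a)               # fixed uniform policy

def value(R):
    """V^pi for reward tensor R(s, a, s') under the fixed policy pi."""
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)   # expected one-step reward
    P_pi = np.einsum("sa,sat->st", pi, P)         # state transitions under pi
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

def canon(R):
    """c(R)(s, a, s') = E_{S'}[R(s, a, S') - V(s) + gamma*V(S')], as above."""
    V = value(R)
    adj = R - V[:, None, None] + gamma * V[None, None, :]
    return np.einsum("sat,sat->sa", P, adj)       # expectation over S'

def starc_distance(R1, R2):
    c1, c2 = canon(R1), canon(R2)
    c1 = c1 / np.linalg.norm(c1)                  # n: L2 normalization
    c2 = c2 / np.linalg.norm(c2)
    return float(np.linalg.norm(c1 - c2))         # m: L2 distance

R = rng.normal(size=(n_s, n_a, n_s))
Phi = rng.normal(size=n_s)
shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]
print(starc_distance(R, 2.0 * R))   # ~0: positive scaling is ignored
print(starc_distance(R, shaped))    # ~0: potential shaping is ignored
print(starc_distance(R, -R))        # 2.0: opposite rewards are maximally far
```

In this setup, `starc_distance(R_F, R_A)` is exactly the kind of quantity the text proposes as $M(F, A)$: zero for behaviorally equivalent rewards and growing as the behavioral equivalents diverge.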


Therefore, within the given structure, we can provide a numerical estimate of the agenticity of function F relative to the perfectly agentic function A by measuring the distance between their behavioral equivalents in the space of canonicalized rewards.