Implementation

As a starting point for implementing the results from my previous work, I used Skylar's Claude repo: https://github.com/Skylar-gu/agentic-preferences/tree/main


1. Empowerment

The Empowerment results successfully validate that GridWorld is structurally more "agentic" than the Chain MDP, independent of the specific reward function provided.

In GridWorld, the agent has four degrees of freedom (movement directions), which creates a significantly larger space of possible trajectories and future states. This results in a higher information capacity of the control channel. Conversely, in the Chain MDP, choices are restricted (typically "stay" or "advance"), leading to a substantially lower capacity for environmental influence.
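This channel-capacity view can be made concrete. Below is a minimal sketch (the function name and toy channels are mine, not the repo's): one-step empowerment is the Blahut-Arimoto channel capacity of the action → next-state channel, which for deterministic, distinct transitions reduces to the log of the number of reachable states: log 4 for an interior grid cell versus log 2 for a chain state.

```python
import numpy as np

def one_step_empowerment(P, iters=200):
    """Channel capacity (in nats) of the action -> next-state channel
    P[a, s'], estimated with the Blahut-Arimoto algorithm."""
    A, _ = P.shape
    w = np.full(A, 1.0 / A)                  # distribution over actions
    for _ in range(iters):
        q = w @ P                            # marginal over next states
        ratio = np.where(P > 0, P / np.maximum(q, 1e-12), 1.0)
        c = np.exp(np.sum(P * np.log(ratio), axis=1))
        w = w * c / np.sum(w * c)            # Blahut-Arimoto update
    q = w @ P
    ratio = np.where(P > 0, P / np.maximum(q, 1e-12), 1.0)
    return float(np.sum(w[:, None] * P * np.log(ratio)))  # mutual information

# Interior gridworld cell: 4 deterministic moves to 4 distinct neighbours.
grid = np.eye(4)
# Chain state: "stay" or "advance" reach only 2 distinct states.
chain = np.eye(2)
print(one_step_empowerment(grid))   # log 4 ≈ 1.386 nats
print(one_step_empowerment(chain))  # log 2 ≈ 0.693 nats
```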

2. STARC
The STARC metric quantifies the "clarity" and informativeness of the reward function in its canonical form.

In "Dense" and "Cliff" environments, the agent receives immediate and precise feedback across most of the state space. The canonical reward in these cases exhibits a strong gradient that effectively guides the agent. In contrast, "Terminal" and "Goal" environments are characterized by sparse rewards where most of the state space is a "vacuum" of zero reward. For STARC, this represents a weak or "noisy" signal, as the agent must engage in extensive exploration before encountering the first reward.
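The sparse/dense contrast can be quantified with a simple coverage proxy in the spirit of the "Reward Clarity" column below (e.g., 8/69 vs 64/69 states carrying signal); `reward_clarity` is a hypothetical helper, not the repo's implementation:

```python
import numpy as np

def reward_clarity(R, tol=1e-8):
    """Coverage proxy: how many states carry a non-negligible reward signal.
    Hypothetical helper, in the spirit of the 'Reward Clarity' column."""
    R = np.asarray(R, dtype=float)
    return int(np.sum(np.abs(R) > tol)), R.size

dense = np.linspace(0.0, 1.0, 10)             # reward grows along the chain
terminal = np.zeros(10); terminal[-1] = 1.0   # reward only at the goal state
print(reward_clarity(dense))     # (9, 10): signal in almost every state
print(reward_clarity(terminal))  # (1, 10): signal only at the terminal state
```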

3. Composite Agency Ranking
The resulting hierarchy of agency is as follows:
Grid-Goal > Grid-Cliff > Grid-Local > Chain-Dense > Chain-Progress > Chain-Lottery > Chain-Terminal.

The fact that all Grid-based environments outrank all Chain-based environments indicates that Empowerment is a dominant factor in the composite score. Even in sparse-reward settings (Grid-Goal), the high structural capacity to influence the world makes the environment more "agentic" than a simple chain. Within these groups, STARC and Mutual Information drive the remaining variance; for instance, Chain-Dense (0.5308) outranks Chain-Terminal (0.4887) due to the superior clarity of the control signal.
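A composite of this kind can be sketched as a weighted average of per-metric scores after min-max normalization across MDPs. The repo's exact weighting is not reproduced here; this is a hypothetical aggregation that shows why high empowerment plus a clear reward dominates:

```python
import numpy as np

def composite_score(metrics, weights=None):
    """Min-max normalize each metric column across MDPs, then take a
    weighted average.  Hypothetical aggregation sketch, not the repo's
    exact composite."""
    M = np.asarray(metrics, dtype=float)      # rows: MDPs, cols: metrics
    lo, hi = M.min(axis=0), M.max(axis=0)
    norm = (M - lo) / np.where(hi > lo, hi - lo, 1.0)
    if weights is None:
        weights = np.full(M.shape[1], 1.0 / M.shape[1])
    return norm @ np.asarray(weights, dtype=float)

# Toy columns: [empowerment, reward clarity] for three hypothetical MDPs.
scores = composite_score([[1.39, 0.45],   # grid structure, sparse reward
                          [1.39, 0.93],   # grid structure, dense reward
                          [0.69, 0.93]])  # chain structure, dense reward
print(int(scores.argmax()))  # 1: high empowerment + clear reward ranks first
```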

4. Scaling Behavior in Random MDPs
Consistent with previous findings by Skylar-gu, we observe a specific trend as the state space grows from S=5 to S=10: the value variance (vstar_var) decreases (e.g., for Bernoulli rewards, from 0.59 to 0.39). This suggests that in larger random MDPs, the value landscape becomes "flatter." Agency decreases because, amidst the chaos of random transitions and rewards, it becomes increasingly difficult to locate "islands" of optimal behavior.
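The vstar_var experiment can be sketched as follows: sample random MDPs, solve each by value iteration, and average the variance of V* over the batch. This is a simplified stand-in for the repo's pipeline, and the exact magnitude of the trend depends on how rewards and values are normalized there:

```python
import numpy as np

def vstar_variance(S, A=2, gamma=0.9, n=50, seed=0):
    """Mean variance of the optimal value function V* across n random MDPs
    with Bernoulli state rewards.  Simplified stand-in for the repo's
    vstar_var experiment; its normalization details are not reproduced."""
    rng = np.random.default_rng(seed)
    variances = []
    for _ in range(n):
        P = rng.random((S, A, S))
        P /= P.sum(axis=2, keepdims=True)             # random transition kernel
        R = rng.integers(0, 2, size=S).astype(float)  # Bernoulli state reward
        V = np.zeros(S)
        for _ in range(300):                          # value iteration
            V = R + gamma * (P @ V).max(axis=1)
        variances.append(V.var())
    return float(np.mean(variances))

print(vstar_variance(S=5))
print(vstar_variance(S=10))
```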

Conclusion
From the perspective of the STARC and Empowerment frameworks, the highest degrees of agency are achieved in environments that combine a high degree of freedom (structural capacity) with a clear, informative reward signal (informational clarity).

Code output:


Experimental Results Report: MDP Analysis

Shaping Invariance Verification

Environment: 5×5 Gridworld, slip=0.1

| Metric  | Value      | Note           |
|---------|------------|----------------|
| A_ctrl  | 40.397226  |                |
| R1      | 2.961244   |                |
| ΔA_ctrl | 4.66×10⁻⁹  | ≈ 0 (Verified) |
| ΔR1     | 3.16×10⁻¹⁰ | ≈ 0 (Verified) |
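The invariance being checked here is the defining property of potential-based shaping: a STARC-style canonicalization must map R and any shaped R + γΦ(s') − Φ(s) to the same canonical reward. A simplified sketch of that property (this is not the repo's A_ctrl / R1 computation, and the full STARC pipeline also normalizes the result):

```python
import numpy as np

def canonical_reward(R, P, gamma=0.9):
    """STARC-style canonicalization of R[s, a, s'] under kernel P[s, a, s']:
    shape R by the uniform-random policy's value function used as a
    potential, yielding an advantage-like canonical form.  Simplified
    sketch only."""
    S = P.shape[0]
    r_sa = (P * R).sum(axis=2)                    # expected reward per (s, a)
    P_unif = P.mean(axis=1)                       # uniform-policy kernel
    r_unif = r_sa.mean(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_unif, r_unif)  # V = r + γPV
    return r_sa + gamma * (P @ V) - V[:, None]

# Any potential shaping R + γΦ(s') - Φ(s) must leave the output unchanged.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A, S))
Phi = rng.random(S)
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]
print(np.allclose(canonical_reward(R, P, gamma),
                  canonical_reward(R_shaped, P, gamma)))  # True
```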

Agenticity Proxies (Shaping-Invariant)

Detailed Metrics per MDP

⛓ Chain Environments

| MDP Variant | Control Adv. | One-step Rec. Adv. | Gap (norm) | V-Vrand Var | MI Δ    | Empowerment | STARC  | Reward Clarity Hϵ | Composite |
|-------------|--------------|--------------------|------------|-------------|---------|-------------|--------|-------------------|-----------|
| Terminal    | 0.1458       | 0.0234             | 0.2183     | 0.4217      | +0.6071 | 0.5000      | 0.4251 | 8/69              | 0.4887    |
| Dense       | 10.9487      | 7.9709             | 0.3713     | 0.4258      | +0.7141 | 0.5000      | 0.9284 | 64/69             | 0.5308    |
| Lottery     | 0.1657       | 0.0266             | 0.2183     | 0.4217      | +0.6097 | 0.5000      | 0.4568 | 8/69              | 0.4890    |
| Progress    | 0.2048       | 0.0307             | 0.1915     | 0.4164      | +1.0014 | 0.5000      | 0.5315 | 8/69              | 0.5216    |

🗺 Grid Environments

| MDP Variant | Control Adv. | One-step Rec. Adv. | Gap (norm) | V-Vrand Var | MI Δ    | Empowerment | STARC  | Reward Clarity Hϵ | Composite |
|-------------|--------------|--------------------|------------|-------------|---------|-------------|--------|-------------------|-----------|
| Goal        | 0.6241       | 0.1093             | 0.3305     | 0.2709      | +0.8585 | 0.7408      | 0.4505 | 7/69              | 0.6025    |
| Local       | 5.6935       | 0.9340             | 0.1864     | 0.3200      | +0.7536 | 0.7408      | 0.9299 | 7/69              | 0.5730    |
| Cliff       | 2.7473       | 0.3790             | 0.0992     | 0.3283      | +1.7201 | 0.7408      | 0.9493 | 6/69              | 0.5818    |

Summary Ranking

| MDP            | Composite | AdvGap | V Var  | MI Diff | ASpar  |
|----------------|-----------|--------|--------|---------|--------|
| Grid Goal      | 0.6025    | 0.3305 | 0.2709 | +0.8585 | 0.4167 |
| Grid Cliff     | 0.5818    | 0.0992 | 0.3283 | +1.7201 | 0.3854 |
| Grid Local     | 0.5730    | 0.1864 | 0.3200 | +0.7536 | 0.4167 |
| Chain Dense    | 0.5308    | 0.3713 | 0.4258 | +0.7141 | 0.5000 |
| Chain Progress | 0.5216    | 0.1915 | 0.4164 | +1.0014 | 0.5000 |
| Chain Lottery  | 0.4890    | 0.2183 | 0.4217 | +0.6097 | 0.5000 |
| Chain Terminal | 0.4887    | 0.2183 | 0.4217 | +0.6071 | 0.5000 |

🧪 MCE Demo: Grid-Cliff

Analysis of the effect of temperature α on the policy.
α → 0: hard optimal | α → ∞: uniform

| α    | J_MCE     | L1(π_mce, π) |
|------|-----------|--------------|
| 0.01 | +0.8670   | 0.6107       |
| 0.1  | +2.6329   | 1.3320       |
| 1    | +26.5223  | 1.4127       |
| 10   | +272.0608 | 1.5478       |
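The temperature sweep above can be sketched with soft value iteration: as α grows, the policy interpolates from near-greedy toward uniform, so its L1 distance to the hard-optimal policy increases. `mce_policy` and the toy chain are hypothetical, not the repo's MCE code:

```python
import numpy as np

def mce_policy(R, P, alpha, gamma=0.9, iters=300):
    """Soft-value-iteration policy at temperature alpha: alpha -> 0 recovers
    the hard-optimal policy, large alpha approaches uniform.  Hypothetical
    sketch of the temperature sweep."""
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        Q = R[:, None] + gamma * (P @ V)              # Q[s, a]
        m = Q.max(axis=1)
        # numerically stable soft backup: V = alpha * logsumexp(Q / alpha)
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
    E = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return E / E.sum(axis=1, keepdims=True)           # softmax policy

# Deterministic 5-state chain: action 0 = stay, action 1 = advance.
S, A = 5, 2
P = np.zeros((S, A, S))
for s in range(S):
    P[s, 0, s] = 1.0
    P[s, 1, min(s + 1, S - 1)] = 1.0
R = np.linspace(0.0, 1.0, S)                          # dense chain reward

hard = mce_policy(R, P, alpha=1e-4)                   # near-greedy reference
for a in (0.01, 0.1, 1.0, 10.0):
    d = np.abs(mce_policy(R, P, a) - hard).sum(axis=1).mean()
    print(f"alpha={a}: mean L1 distance to hard policy = {d:.4f}")
```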

PAM Batch Experiment

Setup: Random MDPs, n=50 per cell.

Q1: PAM Distributions (T fixed, R varies)

norm = normalized to [0, 1] | raw = original scale

State Space S=5

| Distribution | Metric Type | Adv Gap | V Var  | MI Diff | Hϵ     | MCE Ent | Ctrl Adv | One Step |
|--------------|-------------|---------|--------|---------|--------|---------|----------|----------|
| Bernoulli    | Norm Mean   | 1.0000  | 0.5972 | N/A     | 0.8551 | 0.9656  |          |          |
| Bernoulli    | Raw Mean    | 3.2206  | 0.1493 | N/A     | 59.00  | 1.3387  | 7.6272   | 7.6263   |
| Gaussian     | Norm Mean   | 0.9946  | 0.6136 | N/A     | 0.8551 | 0.8934  |          |          |
| Gaussian     | Raw Mean    | 2.6771  | 0.1534 | N/A     | 59.00  | 1.2385  | 14.2639  | 14.2603  |
| Uniform      | Norm Mean   | 1.0000  | 0.5827 | N/A     | 0.8551 | 0.9891  |          |          |
| Uniform      | Raw Mean    | 2.8625  | 0.1457 | N/A     | 59.00  | 1.3712  | 4.1890   | 4.1886   |

State Space S=10

| Distribution | Metric Type | Adv Gap | V Var  | MI Diff | Hϵ     | MCE Ent | Ctrl Adv | One Step |
|--------------|-------------|---------|--------|---------|--------|---------|----------|----------|
| Bernoulli    | Norm Mean   | 0.9903  | 0.3976 | N/A     | 0.8551 | 0.9638  |          |          |
| Bernoulli    | Raw Mean    | 1.6870  | 0.0994 | N/A     | 59.00  | 1.3361  | 7.9779   | 7.9988   |
| Gaussian     | Norm Mean   | 0.9636  | 0.3971 | N/A     | 0.8571 | 0.8705  |          |          |
| Gaussian     | Raw Mean    | 1.3567  | 0.0993 | N/A     | 59.14  | 1.2068  | 16.9924  | 17.0799  |
| Uniform      | Norm Mean   | 0.9695  | 0.3938 | N/A     | 0.8557 | 0.9877  |          |          |
| Uniform      | Raw Mean    | 1.3850  | 0.0985 | N/A     | 59.04  | 1.3692  | 4.7263   | 4.7457   |

Q2: T vs R Variance Decomposition

Ratio = between_T_var / total_var. If > 0.5, the transition structure T dominates.

| T Struct | S  | R Type    | Within Var | Between Var | Ratio | Verdict     |
|----------|----|-----------|------------|-------------|-------|-------------|
| Random   | 5  | Bernoulli | 0.00039    | 0.00025     | 0.387 | R dominates |
| Random   | 5  | Gaussian  | 0.00046    | 0.00092     | 0.667 | T dominates |
| Random   | 5  | Uniform   | 0.00047    | 0.00010     | 0.169 | R dominates |
| Random   | 10 | Bernoulli | 0.00048    | 0.00003     | 0.057 | R dominates |
| Random   | 10 | Gaussian  | 0.00100    | 0.00019     | 0.158 | R dominates |
| Random   | 10 | Uniform   | 0.00069    | 0.00011     | 0.134 | R dominates |
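This ratio is a standard between/within variance decomposition. A minimal sketch (hypothetical helper, not the repo's code) applied to toy data where the per-T means differ far more than the spread across reward samples:

```python
import numpy as np

def variance_ratio(scores):
    """scores[t][i]: composite score of the i-th sampled reward under the
    t-th fixed transition structure T.  Returns between-T variance over
    total variance; > 0.5 means T dominates.  Hypothetical sketch of the
    Q2 decomposition."""
    S = np.asarray(scores, dtype=float)
    between = S.mean(axis=1).var()        # variance of per-T group means
    within = S.var(axis=1).mean()         # mean variance inside each T group
    total = between + within
    return float(between / total) if total > 0 else 0.0

# Two T structures, three reward samples each: T clearly dominates here.
print(variance_ratio([[0.50, 0.51, 0.49],
                      [0.70, 0.71, 0.69]]))  # ≈ 0.99 -> T dominates
```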

Q3: Human-Made MDPs

Note: Chains have |A|=2, Grids have |A|=4. MCE entropy is normalized by log(|A|).

| MDP            | \|A\| | comp   | adv_gap | v*var  | h_eff  | mce_ent | ctrl_adv | one_step |
|----------------|-------|--------|---------|--------|--------|---------|----------|----------|
| Grid-Goal      | 4     | 0.5208 | 0.3305  | 0.2709 | 0.1014 | 0.9790  | 0.0312   | 0.0055   |
| Grid-Local     | 4     | 0.4970 | 0.1864  | 0.3200 | 0.1014 | 0.9433  | 0.2511   | 0.0463   |
| Grid-Cliff     | 4     | 0.4773 | 0.0992  | 0.3283 | 0.0870 | 0.9329  | 0.1302   | 0.0190   |
| Chain-Dense    | 2     | 0.4493 | 0.3713  | 0.4258 | 0.9275 | 0.8243  | 0.4266   | 0.3326   |
| Chain-Terminal | 2     | 0.4100 | 0.2183  | 0.4217 | 0.1159 | 0.8480  | 0.0074   | 0.0012   |
| Chain-Lottery  | 2     | 0.4100 | 0.2183  | 0.4217 | 0.1159 | 0.8514  | 0.0084   | 0.0013   |
| Chain-Progress | 2     | 0.4020 | 0.1915  | 0.4164 | 0.1159 | 0.8588  | 0.0103   | 0.0016   |

🧪 MCE Demo: Chain-Dense

| α    | J_MCE    | L1(π_mce, π) |
|------|----------|--------------|
| 0.01 | +16.8148 | 0.1000       |
| 0.1  | +16.8555 | 0.2005       |
| 1    | +20.4498 | 0.9660       |
| 10   | +75.6014 | 1.2266       |