Implementation

As a starting point for implementing the results from my previous work, I used Skylar's Claude repo: https://github.com/Skylar-gu/agentic-preferences/tree/main


1. Empowerment

The Empowerment results successfully validate that GridWorld is structurally more "agentic" than the Chain MDP, independent of the specific reward function provided.

In GridWorld, the agent has four degrees of freedom (movement directions), which creates a significantly larger space of possible trajectories and future states. This results in a higher information capacity of the control channel. Conversely, in the Chain MDP, choices are restricted (typically "stay" or "advance"), leading to a substantially lower capacity for environmental influence.
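This channel-capacity view can be made concrete. Below is a minimal sketch (the function name and toy channels are mine, not the repo's): one-step empowerment is the Blahut-Arimoto channel capacity of the action → next-state channel, which for deterministic, distinct transitions reduces to the log of the number of reachable states: log 4 for an interior grid cell versus log 2 for a chain state.

```python
import numpy as np

def one_step_empowerment(P, iters=200):
    """Channel capacity (in nats) of the action -> next-state channel
    P[a, s'], estimated with the Blahut-Arimoto algorithm."""
    A, _ = P.shape
    w = np.full(A, 1.0 / A)                  # distribution over actions
    for _ in range(iters):
        q = w @ P                            # marginal over next states
        ratio = np.where(P > 0, P / np.maximum(q, 1e-12), 1.0)
        c = np.exp(np.sum(P * np.log(ratio), axis=1))
        w = w * c / np.sum(w * c)            # Blahut-Arimoto update
    q = w @ P
    ratio = np.where(P > 0, P / np.maximum(q, 1e-12), 1.0)
    return float(np.sum(w[:, None] * P * np.log(ratio)))  # mutual information

# Interior gridworld cell: 4 deterministic moves to 4 distinct neighbours.
grid = np.eye(4)
# Chain state: "stay" or "advance" reach only 2 distinct states.
chain = np.eye(2)
print(one_step_empowerment(grid))   # log 4 ≈ 1.386 nats
print(one_step_empowerment(chain))  # log 2 ≈ 0.693 nats
```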

2. STARC
The STARC metric quantifies the "clarity" and informativeness of the reward function in its canonical form.

In "Dense" and "Cliff" environments, the agent receives immediate and precise feedback across most of the state space. The canonical reward in these cases exhibits a strong gradient that effectively guides the agent. In contrast, "Terminal" and "Goal" environments are characterized by sparse rewards where most of the state space is a "vacuum" of zero reward. For STARC, this represents a weak or "noisy" signal, as the agent must engage in extensive exploration before encountering the first reward.
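The sparse/dense contrast can be quantified with a simple coverage proxy in the spirit of the "Reward Clarity" column below (e.g., 8/69 vs 64/69 states carrying signal); `reward_clarity` is a hypothetical helper, not the repo's implementation:

```python
import numpy as np

def reward_clarity(R, tol=1e-8):
    """Coverage proxy: how many states carry a non-negligible reward signal.
    Hypothetical helper, in the spirit of the 'Reward Clarity' column."""
    R = np.asarray(R, dtype=float)
    return int(np.sum(np.abs(R) > tol)), R.size

dense = np.linspace(0.0, 1.0, 10)             # reward grows along the chain
terminal = np.zeros(10); terminal[-1] = 1.0   # reward only at the goal state
print(reward_clarity(dense))     # (9, 10): signal in almost every state
print(reward_clarity(terminal))  # (1, 10): signal only at the terminal state
```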

3. Composite Agency Ranking
The resulting hierarchy of agency is as follows:
Grid-Goal > Grid-Cliff > Grid-Local > Chain-Dense > Chain-Progress > Chain-Lottery > Chain-Terminal.

The fact that all Grid-based environments outrank all Chain-based environments indicates that Empowerment is a dominant factor in the composite score. Even in sparse-reward settings (Grid-Goal), the high structural capacity to influence the world makes the environment more "agentic" than a simple chain. Within these groups, STARC and Mutual Information drive the remaining variance; for instance, Chain-Dense (0.5308) outranks Chain-Terminal (0.4887) due to the superior clarity of the control signal.
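A composite of this kind can be sketched as a weighted average of per-metric scores after min-max normalization across MDPs. The repo's exact weighting is not reproduced here; this is a hypothetical aggregation that shows why high empowerment plus a clear reward dominates:

```python
import numpy as np

def composite_score(metrics, weights=None):
    """Min-max normalize each metric column across MDPs, then take a
    weighted average.  Hypothetical aggregation sketch, not the repo's
    exact composite."""
    M = np.asarray(metrics, dtype=float)      # rows: MDPs, cols: metrics
    lo, hi = M.min(axis=0), M.max(axis=0)
    norm = (M - lo) / np.where(hi > lo, hi - lo, 1.0)
    if weights is None:
        weights = np.full(M.shape[1], 1.0 / M.shape[1])
    return norm @ np.asarray(weights, dtype=float)

# Toy columns: [empowerment, reward clarity] for three hypothetical MDPs.
scores = composite_score([[1.39, 0.45],   # grid structure, sparse reward
                          [1.39, 0.93],   # grid structure, dense reward
                          [0.69, 0.93]])  # chain structure, dense reward
print(int(scores.argmax()))  # 1: high empowerment + clear reward ranks first
```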

4. Scaling Behavior in Random MDPs
Consistent with previous findings by Skylar-gu, we observe a specific trend as the state space grows from S=5 to S=10: the value variance (vstar_var) decreases (e.g., for Bernoulli rewards, from 0.59 to 0.39). This suggests that in larger random MDPs, the value landscape becomes "flatter." Agency decreases because, amidst the chaos of random transitions and rewards, it becomes increasingly difficult to locate "islands" of optimal behavior.
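The vstar_var experiment can be sketched as follows: sample random MDPs, solve each by value iteration, and average the variance of V* over the batch. This is a simplified stand-in for the repo's pipeline, and the exact magnitude of the trend depends on how rewards and values are normalized there:

```python
import numpy as np

def vstar_variance(S, A=2, gamma=0.9, n=50, seed=0):
    """Mean variance of the optimal value function V* across n random MDPs
    with Bernoulli state rewards.  Simplified stand-in for the repo's
    vstar_var experiment; its normalization details are not reproduced."""
    rng = np.random.default_rng(seed)
    variances = []
    for _ in range(n):
        P = rng.random((S, A, S))
        P /= P.sum(axis=2, keepdims=True)             # random transition kernel
        R = rng.integers(0, 2, size=S).astype(float)  # Bernoulli state reward
        V = np.zeros(S)
        for _ in range(300):                          # value iteration
            V = R + gamma * (P @ V).max(axis=1)
        variances.append(V.var())
    return float(np.mean(variances))

print(vstar_variance(S=5))
print(vstar_variance(S=10))
```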

Conclusion
From the perspective of the STARC and Empowerment frameworks, the highest degrees of agency are achieved in environments that combine a high degree of freedom (structural capacity) with a clear, informative reward signal (informational clarity).

Code output:


Experimental Results Report: MDP Analysis

Shaping Invariance Verification

Environment: 5×5 Gridworld, slip=0.1

| Metric  | Value      | Note           |
|---------|------------|----------------|
| A_ctrl  | 40.397226  |                |
| R1      | 2.961244   |                |
| ΔA_ctrl | 4.66×10⁻⁹  | ≈ 0 (Verified) |
| ΔR1     | 3.16×10⁻¹⁰ | ≈ 0 (Verified) |
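The invariance being checked here is the defining property of potential-based shaping: a STARC-style canonicalization must map R and any shaped R + γΦ(s') − Φ(s) to the same canonical reward. A simplified sketch of that property (this is not the repo's A_ctrl / R1 computation, and the full STARC pipeline also normalizes the result):

```python
import numpy as np

def canonical_reward(R, P, gamma=0.9):
    """STARC-style canonicalization of R[s, a, s'] under kernel P[s, a, s']:
    shape R by the uniform-random policy's value function used as a
    potential, yielding an advantage-like canonical form.  Simplified
    sketch only."""
    S = P.shape[0]
    r_sa = (P * R).sum(axis=2)                    # expected reward per (s, a)
    P_unif = P.mean(axis=1)                       # uniform-policy kernel
    r_unif = r_sa.mean(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_unif, r_unif)  # V = r + γPV
    return r_sa + gamma * (P @ V) - V[:, None]

# Any potential shaping R + γΦ(s') - Φ(s) must leave the output unchanged.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A, S))
Phi = rng.random(S)
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]
print(np.allclose(canonical_reward(R, P, gamma),
                  canonical_reward(R_shaped, P, gamma)))  # True
```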

Agenticity Proxies (Shaping-Invariant)

Detailed Metrics per MDP

⛓ Chain Environments

| MDP Variant | Control Adv. | One-step Rec. Adv. | Gap (norm) | V-Vrand Var | MI Δ    | Empowerment | STARC  | Reward Clarity Hϵ | Composite |
|-------------|--------------|--------------------|------------|-------------|---------|-------------|--------|-------------------|-----------|
| Terminal    | 0.1458       | 0.0234             | 0.2183     | 0.4217      | +0.6071 | 0.5000      | 0.4251 | 8/69              | 0.4887    |
| Dense       | 10.9487      | 7.9709             | 0.3713     | 0.4258      | +0.7141 | 0.5000      | 0.9284 | 64/69             | 0.5308    |
| Lottery     | 0.1657       | 0.0266             | 0.2183     | 0.4217      | +0.6097 | 0.5000      | 0.4568 | 8/69              | 0.4890    |
| Progress    | 0.2048       | 0.0307             | 0.1915     | 0.4164      | +1.0014 | 0.5000      | 0.5315 | 8/69              | 0.5216    |

🗺 Grid Environments

| MDP Variant | Control Adv. | One-step Rec. Adv. | Gap (norm) | V-Vrand Var | MI Δ    | Empowerment | STARC  | Reward Clarity Hϵ | Composite |
|-------------|--------------|--------------------|------------|-------------|---------|-------------|--------|-------------------|-----------|
| Goal        | 0.6241       | 0.1093             | 0.3305     | 0.2709      | +0.8585 | 0.7408      | 0.4505 | 7/69              | 0.6025    |
| Local       | 5.6935       | 0.9340             | 0.1864     | 0.3200      | +0.7536 | 0.7408      | 0.9299 | 7/69              | 0.5730    |
| Cliff       | 2.7473       | 0.3790             | 0.0992     | 0.3283      | +1.7201 | 0.7408      | 0.9493 | 6/69              | 0.5818    |

Summary Ranking

| MDP            | Composite | AdvGap | V Var  | MI Diff | ASpar  |
|----------------|-----------|--------|--------|---------|--------|
| Grid Goal      | 0.6025    | 0.3305 | 0.2709 | +0.8585 | 0.4167 |
| Grid Cliff     | 0.5818    | 0.0992 | 0.3283 | +1.7201 | 0.3854 |
| Grid Local     | 0.5730    | 0.1864 | 0.3200 | +0.7536 | 0.4167 |
| Chain Dense    | 0.5308    | 0.3713 | 0.4258 | +0.7141 | 0.5000 |
| Chain Progress | 0.5216    | 0.1915 | 0.4164 | +1.0014 | 0.5000 |
| Chain Lottery  | 0.4890    | 0.2183 | 0.4217 | +0.6097 | 0.5000 |
| Chain Terminal | 0.4887    | 0.2183 | 0.4217 | +0.6071 | 0.5000 |

🧪 MCE Demo: Grid-Cliff

Analysis of the effect of temperature α on the policy.
α → 0: hard optimal | α → ∞: uniform

| α    | J_MCE     | L1(π_mce, π) |
|------|-----------|--------------|
| 0.01 | +0.8670   | 0.6107       |
| 0.1  | +2.6329   | 1.3320       |
| 1    | +26.5223  | 1.4127       |
| 10   | +272.0608 | 1.5478       |
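The temperature sweep above can be sketched with soft value iteration: as α grows, the policy interpolates from near-greedy toward uniform, so its L1 distance to the hard-optimal policy increases. `mce_policy` and the toy chain are hypothetical, not the repo's MCE code:

```python
import numpy as np

def mce_policy(R, P, alpha, gamma=0.9, iters=300):
    """Soft-value-iteration policy at temperature alpha: alpha -> 0 recovers
    the hard-optimal policy, large alpha approaches uniform.  Hypothetical
    sketch of the temperature sweep."""
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        Q = R[:, None] + gamma * (P @ V)              # Q[s, a]
        m = Q.max(axis=1)
        # numerically stable soft backup: V = alpha * logsumexp(Q / alpha)
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
    E = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return E / E.sum(axis=1, keepdims=True)           # softmax policy

# Deterministic 5-state chain: action 0 = stay, action 1 = advance.
S, A = 5, 2
P = np.zeros((S, A, S))
for s in range(S):
    P[s, 0, s] = 1.0
    P[s, 1, min(s + 1, S - 1)] = 1.0
R = np.linspace(0.0, 1.0, S)                          # dense chain reward

hard = mce_policy(R, P, alpha=1e-4)                   # near-greedy reference
for a in (0.01, 0.1, 1.0, 10.0):
    d = np.abs(mce_policy(R, P, a) - hard).sum(axis=1).mean()
    print(f"alpha={a}: mean L1 distance to hard policy = {d:.4f}")
```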

PAM Batch Experiment

Setup: Random MDPs, n=50 per cell.

Q1: PAM Distributions (T fixed, R varies)

norm = normalized to [0, 1] | raw = original scale

State Space S=5

| Distribution | Metric Type | Adv Gap | V Var  | MI Diff | Hϵ     | MCE Ent | Ctrl Adv | One Step |
|--------------|-------------|---------|--------|---------|--------|---------|----------|----------|
| Bernoulli    | Norm Mean   | 1.0000  | 0.5972 | N/A     | 0.8551 | 0.9656  |          |          |
| Bernoulli    | Raw Mean    | 3.2206  | 0.1493 | N/A     | 59.00  | 1.3387  | 7.6272   | 7.6263   |
| Gaussian     | Norm Mean   | 0.9946  | 0.6136 | N/A     | 0.8551 | 0.8934  |          |          |
| Gaussian     | Raw Mean    | 2.6771  | 0.1534 | N/A     | 59.00  | 1.2385  | 14.2639  | 14.2603  |
| Uniform      | Norm Mean   | 1.0000  | 0.5827 | N/A     | 0.8551 | 0.9891  |          |          |
| Uniform      | Raw Mean    | 2.8625  | 0.1457 | N/A     | 59.00  | 1.3712  | 4.1890   | 4.1886   |

State Space S=10

| Distribution | Metric Type | Adv Gap | V Var  | MI Diff | Hϵ     | MCE Ent | Ctrl Adv | One Step |
|--------------|-------------|---------|--------|---------|--------|---------|----------|----------|
| Bernoulli    | Norm Mean   | 0.9903  | 0.3976 | N/A     | 0.8551 | 0.9638  |          |          |
| Bernoulli    | Raw Mean    | 1.6870  | 0.0994 | N/A     | 59.00  | 1.3361  | 7.9779   | 7.9988   |
| Gaussian     | Norm Mean   | 0.9636  | 0.3971 | N/A     | 0.8571 | 0.8705  |          |          |
| Gaussian     | Raw Mean    | 1.3567  | 0.0993 | N/A     | 59.14  | 1.2068  | 16.9924  | 17.0799  |
| Uniform      | Norm Mean   | 0.9695  | 0.3938 | N/A     | 0.8557 | 0.9877  |          |          |
| Uniform      | Raw Mean    | 1.3850  | 0.0985 | N/A     | 59.04  | 1.3692  | 4.7263   | 4.7457   |

Q2: T vs R Variance Decomposition

Ratio = between_T_var / total_var. If > 0.5, the transition structure T dominates.

| T Struct | S  | R Type    | Within Var | Between Var | Ratio | Verdict     |
|----------|----|-----------|------------|-------------|-------|-------------|
| Random   | 5  | Bernoulli | 0.00039    | 0.00025     | 0.387 | R dominates |
| Random   | 5  | Gaussian  | 0.00046    | 0.00092     | 0.667 | T dominates |
| Random   | 5  | Uniform   | 0.00047    | 0.00010     | 0.169 | R dominates |
| Random   | 10 | Bernoulli | 0.00048    | 0.00003     | 0.057 | R dominates |
| Random   | 10 | Gaussian  | 0.00100    | 0.00019     | 0.158 | R dominates |
| Random   | 10 | Uniform   | 0.00069    | 0.00011     | 0.134 | R dominates |
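This ratio is a standard between/within variance decomposition. A minimal sketch (hypothetical helper, not the repo's code) applied to toy data where the per-T means differ far more than the spread across reward samples:

```python
import numpy as np

def variance_ratio(scores):
    """scores[t][i]: composite score of the i-th sampled reward under the
    t-th fixed transition structure T.  Returns between-T variance over
    total variance; > 0.5 means T dominates.  Hypothetical sketch of the
    Q2 decomposition."""
    S = np.asarray(scores, dtype=float)
    between = S.mean(axis=1).var()        # variance of per-T group means
    within = S.var(axis=1).mean()         # mean variance inside each T group
    total = between + within
    return float(between / total) if total > 0 else 0.0

# Two T structures, three reward samples each: T clearly dominates here.
print(variance_ratio([[0.50, 0.51, 0.49],
                      [0.70, 0.71, 0.69]]))  # ≈ 0.99 -> T dominates
```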

Q3: Human-Made MDPs

Note: Chains have |A|=2, Grids have |A|=4. MCE entropy is normalized by log(|A|).

| MDP            | \|A\| | comp   | adv_gap | v*var  | h_eff  | mce_ent | ctrl_adv | one_step |
|----------------|-------|--------|---------|--------|--------|---------|----------|----------|
| Grid-Goal      | 4     | 0.5208 | 0.3305  | 0.2709 | 0.1014 | 0.9790  | 0.0312   | 0.0055   |
| Grid-Local     | 4     | 0.4970 | 0.1864  | 0.3200 | 0.1014 | 0.9433  | 0.2511   | 0.0463   |
| Grid-Cliff     | 4     | 0.4773 | 0.0992  | 0.3283 | 0.0870 | 0.9329  | 0.1302   | 0.0190   |
| Chain-Dense    | 2     | 0.4493 | 0.3713  | 0.4258 | 0.9275 | 0.8243  | 0.4266   | 0.3326   |
| Chain-Terminal | 2     | 0.4100 | 0.2183  | 0.4217 | 0.1159 | 0.8480  | 0.0074   | 0.0012   |
| Chain-Lottery  | 2     | 0.4100 | 0.2183  | 0.4217 | 0.1159 | 0.8514  | 0.0084   | 0.0013   |
| Chain-Progress | 2     | 0.4020 | 0.1915  | 0.4164 | 0.1159 | 0.8588  | 0.0103   | 0.0016   |

🧪 MCE Demo: Chain-Dense

| α    | J_MCE    | L1(π_mce, π) |
|------|----------|--------------|
| 0.01 | +16.8148 | 0.1000       |
| 0.1  | +16.8555 | 0.2005       |
| 1    | +20.4498 | 0.9660       |
| 10   | +75.6014 | 1.2266       |