Implementation
I used Skylar's claude repo as a starting point (https://github.com/Skylar-gu/agentic-preferences/tree/main) to implement the results from my previous work.
1. Empowerment
The Empowerment results successfully validate that GridWorld is structurally more "agentic" than the Chain MDP, independent of the specific reward function provided.
- Grid environments: (norm )
- Chain environments: (norm )
In GridWorld, the agent has four degrees of freedom (movement directions), which creates a significantly larger space of possible trajectories and future states. This results in a higher information capacity of the control channel. Conversely, in the Chain MDP, choices are restricted (typically "stay" or "advance"), leading to a substantially lower capacity for environmental influence.
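The capacity argument above can be made concrete. Below is a minimal sketch that computes one-step empowerment — the capacity of the action → next-state channel — with the Blahut–Arimoto algorithm; the deterministic 4-action grid channel and 2-action chain channel are illustrative stand-ins, not the repo's exact dynamics.

```python
import numpy as np

def empowerment(channel, iters=200, tol=1e-10):
    """One-step empowerment: capacity (in bits) of the action -> next-state
    channel p(s'|a), computed via the Blahut-Arimoto algorithm.
    channel: array of shape (num_actions, num_states)."""
    A, S = channel.shape
    p = np.full(A, 1.0 / A)  # input (action) distribution

    def per_action_divergence(p):
        q = p @ channel  # marginal over next states
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.where(channel > 0, channel * np.log2(channel / q), 0.0)
        return d.sum(axis=1)  # D(p(s'|a) || q(s')) for each action a

    for _ in range(iters):
        d = per_action_divergence(p)
        new_p = p * np.exp2(d)
        new_p /= new_p.sum()
        if np.max(np.abs(new_p - p)) < tol:
            p = new_p
            break
        p = new_p
    return float(p @ per_action_divergence(p))  # capacity at the optimum

# Deterministic grid interior cell: 4 actions reach 4 distinct neighbours.
grid = np.eye(4)
# Chain state: "stay" or "advance" reach 2 distinct states.
chain = np.eye(2)

print(empowerment(grid))   # 2.0 bits
print(empowerment(chain))  # 1.0 bit
```

For deterministic channels this reduces to log2 of the number of distinct reachable states, which is exactly the grid-vs-chain gap described above.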
2. STARC
The STARC metric quantifies the "clarity" and informativeness of the reward function in its canonical form.
- High Clarity ( ): Grid-Cliff, Grid-Local, Chain-Dense.
- Low Clarity ( ): Chain-Terminal, Grid-Goal.
In "Dense" and "Cliff" environments, the agent receives immediate and precise feedback across most of the state space. The canonical reward in these cases exhibits a strong gradient that effectively guides the agent. In contrast, "Terminal" and "Goal" environments are characterized by sparse rewards where most of the state space is a "vacuum" of zero reward. For STARC, this represents a weak or "noisy" signal, as the agent must engage in extensive exploration before encountering the first reward.
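As a sketch of what a STARC-style canonicalization involves, the snippet below implements a value-based canonical form C(R)(s,a,s') = R(s,a,s') + γV(s') − V(s) under a uniform reference policy, plus a toy "clarity" proxy. The proxy's definition and threshold are my assumptions, not necessarily what the repo computes; the canonicalization itself is the standard construction under which potential-based shaping cancels.

```python
import numpy as np

def canonicalize(R, P, gamma=0.9):
    """Value-based canonicalization used in STARC-type metrics:
    C(R)(s,a,s') = R(s,a,s') + gamma*V(s') - V(s), with V the value of a
    fixed reference policy (uniform here). Potential-based shaping terms
    cancel, so C(R) reflects only the reward's behavioural content.
    R: (S, A, S') reward tensor; P: (S, A, S') transition tensor."""
    S, A, _ = R.shape
    r_pi = np.einsum("sat,sat->s", P, R) / A   # expected reward, uniform policy
    P_pi = P.mean(axis=1)                      # transition matrix, uniform policy
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * V[None, None, :] - V[:, None, None]

def clarity(R, P, gamma=0.9, eps=1e-8):
    """Toy clarity proxy (my assumption): fraction of transitions whose
    canonical reward is non-negligible after unit-L2 normalisation."""
    C = canonicalize(R, P, gamma)
    C = C / (np.linalg.norm(C) + eps)
    return float((np.abs(C) > 1e-3).mean())

# Shaping invariance: adding gamma*phi(s') - phi(s) leaves C(R) unchanged.
rng = np.random.default_rng(0)
S, A = 5, 2
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(S, A, S))
phi = rng.normal(size=S)
R_shaped = R + 0.9 * phi[None, None, :] - phi[:, None, None]
print(np.allclose(canonicalize(R, P), canonicalize(R_shaped, P)))  # True
```

The invariance check at the end is the key property: any two rewards that differ only by shaping map to the same canonical form, so clarity scores compare reward content rather than presentation.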
3. Composite Agency Ranking
The resulting hierarchy of agency is as follows:
Grid-Goal > Grid-Cliff > Grid-Local > Chain-Dense > Chain-Terminal.
The fact that all Grid-based environments outrank all Chain-based environments indicates that Empowerment is a dominant factor in the composite score. Even in sparse-reward settings (Grid-Goal), the high structural capacity to influence the world makes the environment more "agentic" than a simple chain. Within these groups, STARC and Mutual Information drive the remaining variance; for instance, Chain-Dense (0.5308) outranks Chain-Terminal (0.4887) due to the superior clarity of the control signal.
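One way such a composite can be formed is to min-max normalize each metric across environments and take a weighted mean. The weights and the metric values below are purely illustrative assumptions (not the report's data); the point is only that a dominant empowerment weight reproduces the grid-over-chain pattern.

```python
import numpy as np

# Hypothetical (empowerment, STARC clarity, mutual information) triples --
# illustrative numbers only, not the report's actual measurements.
metrics = {
    "Grid-Goal":      [2.00, 0.31, 0.45],
    "Chain-Dense":    [1.00, 0.74, 0.40],
    "Chain-Terminal": [1.00, 0.28, 0.22],
}

def composite(metrics, weights=(0.5, 0.25, 0.25)):
    """Min-max normalise each metric across environments, then take a
    weighted mean. The weights are an assumption; the repo's exact
    aggregation may differ."""
    names = list(metrics)
    M = np.array([metrics[n] for n in names], dtype=float)
    lo, hi = M.min(axis=0), M.max(axis=0)
    norm = (M - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard zero ranges
    scores = norm @ np.array(weights)
    return dict(zip(names, scores))

for name, score in sorted(composite(metrics).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

With empowerment weighted at 0.5, the sparse-reward Grid-Goal still outranks the high-clarity Chain-Dense, matching the ordering discussed above.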
4. Scaling Behavior in Random MDPs
Consistent with previous findings in the Skylar-gu repo, we observe the same trend as the state space grows from |S| = 5 to |S| = 10 (see the PAM batch results below).
Conclusion
From the perspective of the STARC and Empowerment frameworks, the highest degrees of agency are achieved in environments that combine a high degree of freedom (structural capacity) with a clear, informative reward signal (informational clarity).
Code output:
Experimental Results Report: MDP Analysis
Shaping Invariance Verification
Environment:
| Metric | Value | Note |
|---|---|---|
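The shaping-invariance verification can be reproduced in a few lines: under potential-based shaping R'(s,a) = R(s,a) + γ·E[φ(s')] − φ(s), all Q-values shift by the constant −φ(s) per state, so greedy policies and advantage gaps are unchanged. The random MDP below is illustrative, not the report's environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_iteration(P, R, gamma=0.9, iters=500):
    """Q-value iteration for a tabular MDP.
    P: (S, A, S') transitions; R: (S, A) expected rewards."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

# Random MDP (hypothetical, just to illustrate the invariance check).
S, A = 6, 3
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(S, A))

# Potential-based shaping: R'(s,a) = R(s,a) + gamma * E[phi(s')] - phi(s).
phi = rng.normal(size=S)
R_shaped = R + 0.9 * (P @ phi) - phi[:, None]

Q, Q_shaped = value_iteration(P, R), value_iteration(P, R_shaped)
# Greedy policy and advantage gaps are unchanged by shaping.
print(np.array_equal(Q.argmax(axis=1), Q_shaped.argmax(axis=1)))
print(np.allclose(Q - Q.max(axis=1, keepdims=True),
                  Q_shaped - Q_shaped.max(axis=1, keepdims=True), atol=1e-6))
```

Any agenticity proxy built from advantage gaps therefore passes this check by construction, which is what the table above verifies empirically.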
Agenticity Proxies (Shaping-Invariant)
Detailed Metrics per MDP
⛓ Chain Environments
| MDP Variant | Control Adv. | One-step Rec. | Adv. Gap (norm) | MI | Empowerment | STARC Reward Clarity | Composite |
|---|---|---|---|---|---|---|---|
| Terminal | | | | | | | |
| Dense | | | | | | | |
| Lottery | | | | | | | |
| Progress | | | | | | | |
🗺 Grid Environments
| MDP Variant | Control Adv. | One-step Rec. | Adv. Gap (norm) | MI | Empowerment | STARC Reward Clarity | Composite |
|---|---|---|---|---|---|---|---|
| Goal | | | | | | | |
| Local | | | | | | | |
| Cliff | | | | | | | |
Summary Ranking
| MDP | Composite | AdvGap | MI Diff | ASpar |
|---|---|---|---|---|
| Grid-Goal | 0.6025 | | | |
| Grid-Cliff | 0.5818 | | | |
| Grid-Local | 0.5730 | | | |
| Chain-Dense | 0.5308 | | | |
| Chain | 0.5216 | | | |
| Chain | 0.4890 | | | |
| Chain-Terminal | 0.4887 | | | |
🧪 MCE Demo: Grid-Cliff
Analysis of the effect of temperature
| Temperature | | |
|---|---|---|
| 0.01 | | |
| 0.1 | | |
| 1 | | |
| 10 | | |
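The temperature sweep can be sketched with soft (maximum-causal-entropy) Q-iteration: the softmax policy π(a|s) ∝ exp(Q(s,a)/τ) moves from near-greedy to near-uniform as τ grows. The random MDP below is a stand-in for Grid-Cliff, assuming the demo sweeps τ over {0.01, 0.1, 1, 10}.

```python
import numpy as np

def soft_q_iteration(P, R, temp, gamma=0.9, iters=500):
    """Soft (max-entropy) Q-iteration with a stable log-sum-exp backup:
    V(s) = temp * log sum_a exp(Q(s,a)/temp)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        m = Q.max(axis=1)
        V = m + temp * np.log(np.exp((Q - m[:, None]) / temp).sum(axis=1))
        Q = R + gamma * P @ V
    return Q

def policy_entropy(Q, temp):
    """Mean entropy (bits) of the softmax policy pi(a|s) ~ exp(Q(s,a)/temp)."""
    Z = Q / temp
    Z -= Z.max(axis=1, keepdims=True)  # numerical stability
    pi = np.exp(Z)
    pi /= pi.sum(axis=1, keepdims=True)
    return float(-(pi * np.log2(pi + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(1)
S, A = 6, 4
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(S, A))
for temp in (0.01, 0.1, 1.0, 10.0):
    H = policy_entropy(soft_q_iteration(P, R, temp), temp)
    print(f"temp={temp}: mean policy entropy = {H:.3f} bits")
```

At τ = 0.01 the policy is essentially deterministic (entropy near 0), while at τ = 10 it approaches the uniform maximum of log2|A| = 2 bits.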
PAM Batch Experiment
Setup: Random MDPs,
Q1: PAM Distributions (|S| fixed, reward distribution varies)
State Space |S| = 5
| Distribution | Metric Type | Adv Gap | MI Diff | MCE Ent | Ctrl Adv | One Step |
|---|---|---|---|---|---|---|
| Bernoulli | Norm Mean | N/A | — | — | | |
| | Raw Mean | N/A | | | | |
| Gaussian | Norm Mean | N/A | — | — | | |
| | Raw Mean | N/A | | | | |
| Uniform | Norm Mean | N/A | — | — | | |
| | Raw Mean | N/A | | | | |
State Space |S| = 10
| Distribution | Metric Type | Adv Gap | MI Diff | MCE Ent | Ctrl Adv | One Step |
|---|---|---|---|---|---|---|
| Bernoulli | Norm Mean | N/A | — | — | | |
| | Raw Mean | N/A | | | | |
| Gaussian | Norm Mean | N/A | — | — | | |
| | Raw Mean | N/A | | | | |
| Uniform | Norm Mean | N/A | — | — | | |
| | Raw Mean | N/A | | | | |
Q2: Within vs Between Variance Decomposition
| Env | \|S\| | Distribution | Within Var | Between Var | Ratio | Verdict |
|---|---|---|---|---|---|---|
| Random | 5 | Bernoulli | | | | |
| Random | 5 | Gaussian | | | | |
| Random | 5 | Uniform | | | | |
| Random | 10 | Bernoulli | | | | |
| Random | 10 | Gaussian | | | | |
| Random | 10 | Uniform | | | | |
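Q2's decomposition can be sketched as follows: sample a metric over many random MDPs per reward distribution, then compare the variance across seeds within a distribution to the variance between distribution means. A ratio well below 1 means the choice of reward distribution barely matters. The group names and sample values below are hypothetical.

```python
import numpy as np

def variance_decomposition(groups):
    """Within- vs between-group variance of a metric.
    groups: dict mapping group name -> 1-D array of metric samples.
    Returns (within, between, ratio) with ratio = between / within."""
    grand = np.concatenate(list(groups.values())).mean()
    within = np.mean([np.var(v) for v in groups.values()])
    means = np.array([v.mean() for v in groups.values()])
    between = np.mean((means - grand) ** 2)
    return within, between, between / within

# Hypothetical composite-score samples per reward distribution.
rng = np.random.default_rng(2)
groups = {
    "Bernoulli": rng.normal(0.50, 0.05, 100),
    "Gaussian":  rng.normal(0.52, 0.05, 100),
    "Uniform":   rng.normal(0.49, 0.05, 100),
}
w, b, ratio = variance_decomposition(groups)
print(f"within={w:.4f}  between={b:.5f}  ratio={ratio:.3f}")
```

Here the groups share nearly the same mean, so the seed-to-seed (within) variance dominates and the ratio is small, the pattern the "Verdict" column summarizes.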
Q3: Human-Made MDPs
Note: Chains have |A| = 2, Grids have |A| = 4. MCE entropy is normalized by log |A|.
| MDP | \|A\| | comp | adv_gap | v*var | h_eff | mce_ent | ctrl_adv | one_step |
|---|---|---|---|---|---|---|---|---|
| Grid-Goal | 4 | | | | | | | |
| Grid-Local | 4 | | | | | | | |
| Grid-Cliff | 4 | | | | | | | |
| Chain-Dense | 2 | | | | | | | |
| Chain-Terminal | 2 | | | | | | | |
| Chain-Lottery | 2 | | | | | | | |
| Chain-Progress | 2 | | | | | | | |
🧪 MCE Demo: Chain-Dense
| Temperature | | |
|---|---|---|
| 0.01 | | |
| 0.1 | | |
| 1 | | |
| 10 | | |