Tiling Without Omniscience: Dynamizing Faith in Joint Argmax

Abstract

The tiling theorem of Demski, Hsia, and Rapoport proves that an Updateless Decision Theory 1.0 agent has no strict preference for self-modification, given a coherent prior and logical omniscience. In this document we weaken the Faith in Joint Argmax assumption to a one-sided additive gap that, under iterative learning, vanishes in the Cesàro mean. All four places where the tiling argument consumes omniscience are reduced to a single condition on the inductor over a minimal subalgebra, and the result is reflectively stable. The resulting condition retains the independence of branch probabilities as its one structural axiom. This axiom can be dropped if one accepts bounded optimality, in which case it is replaced by an internally verifiable domination condition. If global optimality is required, the agent is instead placed under an obligation to explore the environment, which removes the entire argmax machinery and yields tiling without structural assumptions about the value table, thereby removing omniscience.

Keywords: alignment research, tiling, updateless decision theory, logical induction, bounded rationality, reflective stability.

1. Introduction

Theorem 3 of DHR25 states: a UDT 1.0 agent has no strict preference for a self-modifying action $a_{m} \in A_{m}$ over the best non-self-modifying one. Formally, for each $o$ and each $a_{m} \in A_{m}$ :

E_{p} (u ∣ π^{*} (o) = a_{m}) \leq max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = ⌜ a ⌝) .

Throughout we adopt DHR25's notation:

$O$ is the observation space,
$A$ the action space,
$A_{o} \subseteq A$ the actions available at $o$ ,
$A_{m} \subseteq A$ the self-modifying actions,
$A_{- m} = A ∖ A_{m}$ the non-self-modifying actions,
for $a_{m} \in A_{m}$ , ${\hat{a}}_{m} \in A_{- m}$ denotes its externally indistinguishable non-self-modifying version,
$(ob (a_{m}), ac (a_{m})) = (\bar{o}, \bar{a})$ denotes the unique policy-point that $a_{m}$ forces by Limited Self-Modification (LSM; Assumption 3),
all expectations are taken with respect to the agent's subjective prior $p$ ,
$⌜ \cdot ⌝$ indicates a value substituted from the outer context,
$⌞ \cdot ⌟$ indicates a free variable in the inner context,
utility is bounded, $u \in [0, 1]$ ,
the agent's policy is $π$ ,
the optimal policy is $π^{*}$ ; $π^{*} (o)$ denotes the policy's action under observation $o$ .

We work with a propositionally coherent logical inductor with true prices $\hat{Q}$ , transformed into a Bayesian logical inductor with round prices $P_{t}$ ; the Boundedly Rational Inductive Agent (BRIA) round $t$ is identified with a step of deductive time. The agent's estimate at round $t$ is ${\hat{E}}_{t}$ , with ${\hat{E}}_{t} \to E_{\hat{Q}}$ . Fix a non-decreasing budget $g$ ; an event is $g$ -decidable if its truth value is decidable in $O (g (t))$ time.

We define:

$w (a, a^{'}) := E_{\hat{Q}} [u ∣ π^{*} (o) = a \land π^{*} (o^{'}) = a^{'}]$
$w_{o} (a) := E_{\hat{Q}} [u ∣ π^{*} (o) = a]$
$v (a^{'}) := E_{\hat{Q}} [u ∣ π^{*} (o^{'}) = a^{'}]$
$v^{-} := max_{a \in A_{- m}, a^{'} \in A} w (a, a^{'})$

Cross-block conditional laws:

$ρ (a^{'} ∣ a) := P_{\hat{Q}} (π^{*} (o^{'}) = a^{'} ∣ π^{*} (o) = a)$
$σ (a ∣ a^{'}) := P_{\hat{Q}} (π^{*} (o) = a ∣ π^{*} (o^{'}) = a^{'})$

2. Dynamization

The proof proceeds along the chain: Fine-Grained Fairness (FGF) $\to$ Faith in Joint Argmax (FGJ) $\to$ Action Coordination (AC) $\to$ Knowledge of the Decision Procedure (KoDP). Each step consumes omniscience, since the agent must simultaneously not know the concrete action of its argmax and know its value. To remove this dependence, we check how robust the DHR25 chain is to computational constraints. We reproduce the chain while tracking explicitly which steps are equalities and which are inequalities, since this determines where and how the assumptions can be weakened. In particular, if the agent is not omniscient and its estimates carry errors, we must ensure that these errors are not amplified as they pass through the steps of the proof.

FGF states that self-modification does not change value if it is equivalent to a non-self-modifying action.
FGJ links the current choice to the future argmax, guaranteeing that the agent believes in the consistency of its optimal decisions across different policy points.
AC allows collapsing nested maximization operations into a single decision process.
KoDP removes the need to predict the future argmax, allowing the agent to use knowledge of its own decision-making algorithm.

Formally, with $o$ fixed and $(\bar{o}, \bar{a}) = (ob (a_{m}), ac (a_{m}))$ , ${\hat{a}}_{m} \in A_{- m}$ :

(FGF) \forall a_{m} \in A_{m} : E_{p} (u ∣ π^{*} (o) = a_{m}) = E_{p} (u ∣ π^{*} (o) = {\hat{a}}_{m} \land π^{*} (\bar{o}) = \bar{a}) .

(AC) \arg max_{a^{'} \in A} max_{a^{″} \in A_{- m}} E_{p} [u ∣ π^{*} (o) = a^{″} \land π^{*} (\bar{o}) = a^{'}] = \arg max_{a^{'} \in A} E_{p} [u ∣ π^{*} (\bar{o}) = a^{'}] .

(KoDP) \forall a \in A_{- m} : E_{p} [u | π^{*} (o) = a \land π^{*} (\bar{o}) = \arg max_{a^{'} \in A} E_{p} [u ∣ π^{*} (\bar{o}) = a^{'}]] = E_{p} [u ∣ π^{*} (o) = a] .

(ε -FGJ) \forall A^{'} \subseteq A \forall (o, o^{'}), o \neq o^{'} :

max_{a \in A^{'}} E_{p} [u | π^{*} (o) = a \land π^{*} (o^{'}) = \arg max_{a^{'} \in A} max_{a^{″} \in A^{'}} E_{p} [u ∣ π^{*} (o) = a^{″} \land π^{*} (o^{'}) = a^{'}]]

\geq max_{a^{'} \in A} max_{a \in A^{'}} E_{p} (u ∣ π^{*} (o) = a \land π^{*} (o^{'}) = a^{'}) - ε, ε \geq 0 fixed.

Theorem 1.

FGF \land LSM \land ε - FGJ \land AC \land KoDP ⟹ \forall a_{m} \in A_{m} \cap A_{o} : E_{p} (u ∣ π^{*} (o) = a_{m}) \leq max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = ⌜ a ⌝) + ε .

Proof. Fix an arbitrary $a_{m} \in A_{m}$ under observation $o$ , with $(\bar{o}, \bar{a}) = (ob (a_{m}), ac (a_{m}))$ . Introduce the intermediate quantities

Y_{0} := E_{p} (u ∣ π^{*} (o) = a_{m}), Y_{1} := E_{p} (u ∣ π^{*} (o) = {\hat{a}}_{m} \land π^{*} (\bar{o}) = \bar{a})

Y_{2} := max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = a \land π^{*} (\bar{o}) = \bar{a}), Y_{3} := max_{a^{'} \in A} max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = a \land π^{*} (\bar{o}) = a^{'})

X := max_{a \in A_{- m}} E_{p} [u | π^{*} (o) = a \land π^{*} (\bar{o}) = \arg max_{a^{'} \in A} max_{a^{″} \in A_{- m}} E_{p} [u ∣ π^{*} (o) = a^{″} \land π^{*} (\bar{o}) = a^{'}]]

X_{AC} := max_{a \in A_{- m}} E_{p} [u | π^{*} (o) = a \land π^{*} (\bar{o}) = \arg max_{a^{'} \in A} E_{p} [u ∣ π^{*} (\bar{o}) = a^{'}]]

The chain is:

Y_{0} \overset{FGF}{=} Y_{1} \overset{{\hat{a}}_{m} \in A_{- m}}{\leq} Y_{2} \leq Y_{3} \overset{ε - FGJ}{\leq} X + ε \overset{AC}{=} X_{AC} + ε \overset{KoDP}{=} max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = a) + ε .

Here the equality $Y_{0} = Y_{1}$ is FGF; $Y_{1} \leq Y_{2}$ — since ${\hat{a}}_{m} \in A_{- m}$ ; $Y_{2} \leq Y_{3}$ — taking the maximum over the second condition ( $Y_{3}$ is the right-hand side of FGJ); the step $Y_{3} \leq X + ε$ is $ε$ -FGJ in the direction $RHS \leq LHS$ ; AC collapses the iterated double maximum inside the argmax into a single one ( $X = X_{AC}$ ); KoDP removes the condition on the argmax ( $X_{AC} = max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = a)$ ).

Note that FGJ is used exactly once and only in one direction (as a lower bound on its left-hand side). The steps before it are one equality and two inequalities pointing the same way; the AC and KoDP steps after it apply to $X$ and are exact equalities, i.e. any additive gap introduced at the FGJ step passes straight through without amplification.

The $ε$ -FGJ used here, at $ε = 0$ , is exactly Assumption 5 of DHR25. $◻$

Thus we have shown that any error in knowing the future argmax is transferred additively to the final result. This does not guarantee that the agent will be able to achieve tiling at all. For the error not to be a permanent obstacle, it must tend to zero in the course of learning. Since $\bar{o} = ob (a_{m})$ varies with self-modification, to obtain "no self-modifying action at $o$ is preferable by more than $ε$ ", the assumption must hold uniformly: $ε = sup_{o^{'}} ε_{o, o^{'}}$ over all relevant pairs.

We transfer the problem to the BRIA setting (Oesterheld, Demski, Conitzer, 2023). At round $t$ the agent holds estimates $W_{t} (a, a^{'})$ of the quantities $w (a, a^{'})$ , chooses ${\hat{a^{'}}}_{t} = \arg max_{a^{'}} max_{a \in A_{- m}} W_{t} (a, a^{'})$ and $a_{t} = \arg max_{a \in A_{- m}} W_{t} (a, {\hat{a^{'}}}_{t})$ , realizing the pair $c_{t} = (a_{t}, {\hat{a^{'}}}_{t})$ with true value $w (c_{t})$ and estimation regret $ε_{t} = v^{-} - w (c_{t}) \geq 0$ . The realization schedules $S_{a, a^{'}} := {t : c_{t} = (a, a^{'})}$ are $g$ -decidable and partition $N$ .
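The selection rule of this paragraph can be sketched on a toy value table. The tables `w_true` and `W_est`, the helper names, and the tie-breaking rule (first index wins) are illustrative assumptions and are not taken from DHR25 or from Oesterheld et al.; the sketch only shows how one round's pair $c_{t}$ and regret $ε_{t}$ are computed from estimates.

```python
from typing import List, Tuple

Table = List[List[float]]  # rows: a in A_{-m}, columns: a' in A


def select_pair(W_est: Table) -> Tuple[int, int]:
    """a'_hat = argmax_{a'} max_a W(a, a'), then a = argmax_a W(a, a'_hat)."""
    n_cols = len(W_est[0])
    col_best = [max(row[j] for row in W_est) for j in range(n_cols)]
    a_prime_hat = max(range(n_cols), key=lambda j: col_best[j])
    a = max(range(len(W_est)), key=lambda i: W_est[i][a_prime_hat])
    return a, a_prime_hat


def regret(w_true: Table, pair: Tuple[int, int]) -> float:
    """Estimation regret eps_t = v^- - w(c_t), with v^- the largest true cell."""
    v_minus = max(max(row) for row in w_true)
    return v_minus - w_true[pair[0]][pair[1]]


# Hypothetical tables: the estimate undervalues the truly best cell (0, 0).
w_true = [[0.9, 0.2], [0.8, 0.1]]
W_est = [[0.5, 0.2], [0.8, 0.1]]

c_t = select_pair(W_est)      # pair realized from the estimates
eps_t = regret(w_true, c_t)   # nonnegative by construction
```

With exact estimates (`W_est = w_true`) the same rule realizes the best cell and the regret is zero, which is the idealized case of Theorem 1 at $ε = 0$.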

Theorem 2. Let $\bar{α}$ be a covering BRIA over ${h_{a, a^{'}} : (a, a^{'}) \in A_{- m} \times A}$ , where $h_{a, a^{'}}$ recommends the pair $(a, a^{'})$ and promises $w (a, a^{'}) - δ$ . Under reward-value tracking (R2) on each $S_{a, a^{'}}$ :

\frac{1}{T} \sum_{t \leq T} ε_{t} = \frac{1}{T} \sum_{t \leq T} (v^{-} - w (c_{t})) \to_{T \to \infty}^{} 0.

Proof. We use two properties of the covering BRIA from Oesterheld et al. (2023):

(i) covering (Def. 6): if the empirical record of a hypothesis $h$ does not diverge to $- \infty$ , then the set of rounds in which $h$ is rejected is finite, and hence $α_{t}^{e} \geq α^{e} (h) - δ$ for all but finitely many $t$ ;
(ii) no overestimation: $\frac{1}{T} \sum_{t \leq T} r_{t} \geq \frac{1}{T} \sum_{t \leq T} α_{t}^{e} - o (1)$ .

Fix $(a, a^{'})$ and $δ > 0$ . The test set $M_{a, a^{'}} \subseteq S_{a, a^{'}}$ of the hypothesis $h_{a, a^{'}}$ is $g$ -decidable; value tracking (R2) on it gives $\frac{1}{| M_{\leq T} |} \sum_{t \in M_{\leq T}} (r_{t} - w (a, a^{'})) \to 0$ . Consequently the empirical record $\sum_{t \in M_{\leq T}} (r_{t} - (w (a, a^{'}) - δ))$ has mean $\to δ > 0$ and does not diverge to $- \infty$ . By property (i), $α_{t}^{e} \geq w (a, a^{'}) - δ$ for all but finitely many $t$ . Letting $δ ↓ 0$ and taking the maximum over the finitely many pairs, $\liminf_{t} α_{t}^{e} \geq v^{-}$ , whence $\liminf_{T} \frac{1}{T} \sum_{t \leq T} α_{t}^{e} \geq v^{-}$ . By property (ii), $\frac{1}{T} \sum_{t \leq T} r_{t} \geq \frac{1}{T} \sum_{t \leq T} α_{t}^{e} - o (1)$ , so $\liminf_{T} \frac{1}{T} \sum_{t \leq T} r_{t} \geq v^{-}$ .

We partition ${1, \dots, T}$ by realized pairs. For each $(a, a^{'})$ with infinite $S_{a, a^{'}}$ , value tracking (R2) on $S_{a, a^{'}}$ gives $\sum_{t \in S_{a, a^{'}, \leq T}} (r_{t} - w (a, a^{'})) = o (| S_{a, a^{'}, \leq T} |) = o (T)$ . Summing over the finitely many pairs, $\frac{1}{T} \sum_{t \leq T} (r_{t} - w (c_{t})) \to 0$ . Since $w (c_{t}) \leq v^{-}$ , we obtain $\limsup_{T} \frac{1}{T} \sum_{t \leq T} r_{t} \leq v^{-}$ .

Combining the two bounds: $\frac{1}{T} \sum_{t \leq T} r_{t} \to v^{-}$ . By the tracking identity of the previous paragraph, $\frac{1}{T} \sum_{t \leq T} w (c_{t}) = \frac{1}{T} \sum_{t \leq T} r_{t} - o (1) \to v^{-}$ , whence $\frac{1}{T} \sum_{t \leq T} ε_{t} \to 0$ . $◻$
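The Cesàro statement can be illustrated with a deterministic toy run. This is not a BRIA: as a stand-in for value tracking, the sketch assumes estimates of the hypothetical form "true value plus a bias that decays like $1/t$", and all names and tables below are illustrative. It shows only that finitely many early mistakes leave a time-averaged regret that shrinks as $T$ grows.

```python
from typing import List, Tuple

Table = List[List[float]]  # rows: a in A_{-m}, columns: a' in A


def select_pair(W: Table) -> Tuple[int, int]:
    """Greedy pair from estimates: best column first, then best row in it."""
    cols = range(len(W[0]))
    a_prime = max(cols, key=lambda j: max(row[j] for row in W))
    a = max(range(len(W)), key=lambda i: W[i][a_prime])
    return a, a_prime


def cesaro_regret(w_true: Table, bias: Table, T: int) -> float:
    """Average of eps_t = v^- - w(c_t) over rounds t = 1..T."""
    v_minus = max(max(row) for row in w_true)
    total = 0.0
    for t in range(1, T + 1):
        # Assumed estimate model (hypothetical): truth plus bias / t.
        W_t = [[w + b / t for w, b in zip(w_row, b_row)]
               for w_row, b_row in zip(w_true, bias)]
        a, a_prime = select_pair(W_t)
        total += v_minus - w_true[a][a_prime]
    return total / T


w_true = [[0.9, 0.2], [0.8, 0.1]]
bias = [[-1.0, 0.0], [0.0, 0.0]]  # the best cell starts out undervalued

early = cesaro_regret(w_true, bias, 10)
late = cesaro_regret(w_true, bias, 1000)
```

In this run the greedy rule misses the best cell only while the bias exceeds the value gap between the two rows, so `late` is much smaller than `early`.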

Corollary. Set

Δ := v^{-} - max {w (a, a^{'}) : (a, a^{'}) \in A_{- m} \times A, w (a, a^{'}) < v^{-}}, [c_{t} suboptimal] :⟺ w (c_{t}) < v^{-} ⟺ ε_{t} > 0.

If $Δ > 0$ , then the best response is chosen with limiting frequency $1$ : since $w (c_{t}) < v^{-} \Rightarrow w (c_{t}) \leq v^{-} - Δ$ , we have $ε_{t} \geq Δ \cdot 1 [ε_{t} > 0]$ , whence

\frac{1}{T} \sum_{t \leq T} 1 [ε_{t} > 0] \leq \frac{1}{Δ T} \sum_{t \leq T} ε_{t} \to_{T \to \infty}^{} 0.

In the absence of a gap ( $Δ = 0$ ) only the Cesàro statement of Theorem 2 holds. (The gap is taken on the $w$ -scale of realized pairs $c_{t}$ , consistently with the indicator.)
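The Markov-type inequality behind the Corollary can be checked on a short regret sequence. The sequence and the gap value below are made-up illustrative numbers, chosen so that every positive regret is at least the gap.

```python
from typing import Sequence, Tuple


def suboptimal_frequency_and_bound(regrets: Sequence[float],
                                   gap: float) -> Tuple[float, float]:
    """Return (frequency of rounds with eps_t > 0, average regret / gap)."""
    T = len(regrets)
    freq = sum(1 for e in regrets if e > 0) / T
    bound = sum(regrets) / (gap * T)
    return freq, bound


# Hypothetical regrets; each positive entry is >= gap = 0.1.
regrets = [0.1, 0.3, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
freq, bound = suboptimal_frequency_and_bound(regrets, gap=0.1)
```

Because each suboptimal round costs at least `gap`, the frequency can never exceed the bound; if the average regret tends to zero, so does the frequency.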

At this point we have shown that tiling is attainable in values. However, we have implicitly assumed that the AC and KoDP steps are exact equalities, whereas so far they too consume omniscience. In reality the agent may err not only in values (as in FGJ) but also in the maximum-selection process itself: if it misidentifies the argmax, a new error appears. To guarantee that these errors also vanish, we introduce Lemma 1, which links the error in probabilities to the error in expectations.

Lemma 1. Let $C, C^{'}$ be events of positive $P_{t}$ -probability, $G$ an arbitrary event, $u \in [0, 1]$ . If $C \cap G = C^{'} \cap G$ , then:

E_{p} [u ∣ C] \leq E_{p} [u ∣ C^{'}] + P_{t} (\neg G ∣ C) + P_{t} (\neg G ∣ C^{'}) .

Proof. By total expectation and $u \leq 1$ : $E_{p} [u ∣ C] \leq E_{p} [u ∣ C, G] + P_{t} (\neg G ∣ C)$ . From $C \cap G = C^{'} \cap G$ :

E_{p} [u ∣ C, G] = \frac{E_{p} [u 1_{C \cap G}]}{P_{t} (C \cap G)} = E_{p} [u ∣ C^{'}, G] .

Finally, $E_{p} [u ∣ C^{'}, G] - E_{p} [u ∣ C^{'}] = P_{t} (\neg G ∣ C^{'}) (E_{p} [u ∣ C^{'}, G] - E_{p} [u ∣ C^{'}, \neg G]) \leq P_{t} (\neg G ∣ C^{'})$ , since the bracket lies in $[- 1, 1]$ . $◻$
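As a sanity check, the bound of Lemma 1 can be evaluated on a small finite probability space. The six-point space, the probabilities, and the utilities below are arbitrary illustrative choices; a single measure plays the role of both $p$ and $P_{t}$.

```python
from typing import Dict, Set


def cond_exp(p: Dict[int, float], u: Dict[int, float], event: Set[int]) -> float:
    """E[u | event] on a finite space with point masses p."""
    mass = sum(p[s] for s in event)
    return sum(p[s] * u[s] for s in event) / mass


def cond_prob(p: Dict[int, float], target: Set[int], given: Set[int]) -> float:
    """P(target | given) on the same space."""
    mass = sum(p[s] for s in given)
    return sum(p[s] for s in given & target) / mass


omega = set(range(6))
p = {0: 0.3, 1: 0.2, 2: 0.1, 3: 0.1, 4: 0.2, 5: 0.1}
u = {0: 0.5, 1: 0.4, 2: 1.0, 3: 0.0, 4: 0.3, 5: 0.7}  # utilities in [0, 1]

C, C_prime, G = {0, 1, 2}, {0, 1, 3}, {0, 1, 4, 5}
not_G = omega - G
assert C & G == C_prime & G  # hypothesis of the lemma

lhs = cond_exp(p, u, C)
rhs = (cond_exp(p, u, C_prime)
       + cond_prob(p, not_G, C)
       + cond_prob(p, not_G, C_prime))
```

Here `C` and `C_prime` differ only off `G`, and the utility is deliberately extreme on the differing points, yet `lhs` stays below `rhs`.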

Good events for AC and KoDP:

G_{AC} = {\arg max_{a^{'} \in A} max_{a \in A_{- m}} w (a, a^{'}) = \arg max_{a^{'} \in A} v (a^{'})}

G_{KoDP} = {π^{*} (o) = \arg max_{a \in A_{o}} E_{\hat{Q}} [u ∣ π^{*} (o) = ⌞ a ⌟]}

Let $B_{a} := {π^{*} (o) = ⌜ a ⌝}$ , $C_{a} := {π^{*} (o) = ⌜ a ⌝ \land π^{*} (\bar{o}) = ⌜ \bar{a} ⌝}$ , and let $C_{a}^{AC}$ denote the conditioning event after the AC step. We define the following residuals at round $t$ (in the idealized form of Theorem 3 the same residuals are written without the index $(t)$ ):

ε_{AC}^{(t)} := max_{a \in A_{- m}} [P_{t} (\neg G_{AC} ∣ C_{a}) + P_{t} (\neg G_{AC} ∣ C_{a}^{AC})]

ε_{KoDP}^{(t)} := max_{a \in A_{- m}} [P_{t} (\neg G_{KoDP} ∣ C_{a}^{AC}) + P_{t} (\neg G_{KoDP} ∣ B_{a})]

Now we combine the results obtained.

Theorem 3. Under the residual bounds $ε_{FGJ}$ , $ε_{AC}$ , $ε_{KoDP}$ :

E_{p} (u ∣ π^{*} (o) = a_{m}) \leq max_{a \in A_{- m}} E_{p} (u ∣ π^{*} (o) = ⌜ a ⌝) + ε_{FGJ} + ε_{AC} + ε_{KoDP} .

Proof. By Theorem 1 the steps before FGJ are exact; FGJ contributes $+ ε_{FGJ}$ and yields $X = max_{a} E_{p} [u ∣ C_{a}]$ .

We apply Lemma 1 with $(C, C^{'}, G) = (C_{a}, C_{a}^{AC}, G_{AC})$ . The hypothesis $C_{a} \cap G_{AC} = C_{a}^{AC} \cap G_{AC}$ holds: on $G_{AC}$ the iterated double maximum inside the argmax equals the single one, so the conditioning on $π^{*} (\bar{o})$ coincides, while the parts $π^{*} (o) = ⌜ a ⌝$ are identical. Hence $E_{p} [u ∣ C_{a}] \leq E_{p} [u ∣ C_{a}^{AC}] + P_{t} (\neg G_{AC} ∣ C_{a}) + P_{t} (\neg G_{AC} ∣ C_{a}^{AC})$ ; taking $max_{a}$ : $X \leq X_{AC} + ε_{AC}$ .

We apply Lemma 1 with $(C, C^{'}, G) = (C_{a}^{AC}, B_{a}, G_{KoDP})$ : on $G_{KoDP}$ knowledge of the procedure makes the argmax condition redundant, so $C_{a}^{AC} \cap G_{KoDP} = B_{a} \cap G_{KoDP}$ . Hence $X_{AC} \leq max_{a} E_{p} [u ∣ B_{a}] + ε_{KoDP}$ .

Combining: $X \leq max_{a \in A_{- m}} E_{p} [u ∣ π^{*} (o) = ⌜ a ⌝] + ε_{AC} + ε_{KoDP}$ , and adding the FGJ gap closes the chain. $◻$

Thus we have carried out the dynamization of FGJ via BRIA.

3. The Minimal Condition on the Inductor

If the agent's errors in value estimates tend to zero, then tiling holds. But what exactly must the inductor ensure for these errors to actually vanish? We cannot demand perfection from the agent in everything, but we must impose requirements on those aspects of its logic that directly affect decision-making. We call this set of requirements the Minimal Condition for Rationality (MCR).

Let the decision-relevant generators be the propositions about policy points ${π^{*} (o) = ⌜ a ⌝}$ , value comparisons ${w (a, a^{'}) \geq w (\tilde{a}, {\tilde{a}}^{'})}$ , and reward variables $(r_{t})$ on the realization schedules. Let $P := {(o, ob (a_{m})) : a_{m} \in A_{m} \cap A_{o}}$ be the "modification–target" pairs.

Definition (Minimal Condition for Rationality — MCR). The true prices $\hat{Q}$ satisfy $MCR (A)$ under budget $g$ if:

(R1) for each $E, C \in A$ with $P_{\hat{Q}} (C) > 0$ , the centered sequence $(1 [E_{t}] - P_{t} (E ∣ C))_{t}$ has vanishing Cesàro mean along each $g$ -decidable subschedule $S_{C}$ ; equivalently, its true-price frequency equals $P_{\hat{Q}} (E ∣ C)$ , inexploitably;
(R2) for each realization schedule $S$ generated within $A$ , the reward sequence on $S$ is $O (g)$ -boundedly van Mises–Wald–Church (vMWC) random with its true conditional mean.

R1 guarantees that the agent's belief that "event $E$ will occur" is mathematically consistent with reality (necessity). R2 guarantees that the agent physically "sees" the rewards and that they are no noisier than the budget permits (sufficiency). Together they create the foundation for convergence.

As a special case, two-block vMWC randomness for a pair $(o, o^{'})$ is exactly MCR restricted to the two-block cross algebra of a single pair of observations. MCR lifts it to an arbitrary decision-relevant algebra.

Lemma 2. Suppose $MCR (A)$ and the auxiliary coherence constraints (inherited from DHR25 §6). Let $G \in A$ be a $g$ -decidable event with $P_{\hat{Q}} (G) = 1$ , given by a finite Boolean combination of strict comparisons ${w (b) > w (b^{'})}$ from $A$ , and $C \in A$ a $g$ -decidable event of positive true price. Then:

\frac{1}{T} \sum_{t \leq T} P_{t} (\neg G ∣ C) \to_{T \to \infty}^{} 0.

Proof. Write $G = ⋂_{k = 1}^{K} E_{k}$ , $E_{k} = {w (b_{k}) > w (b_{k}^{'})}$ , with true gaps $δ_{k} := w (b_{k}) - w (b_{k}^{'}) > 0$ ( $K < \infty$ , since $A$ is finite).

By R2 the estimates track the true values on the corresponding schedules; real-valued coherence pins the credences $P_{t} (E_{k})$ to these estimates. Since the gap $δ_{k} > 0$ is strict, the estimated comparison coincides with the true one with limiting frequency $1$ , whence $\frac{1}{T} \sum_{t \leq T} P_{t} (\neg E_{k}) \to 0$ . By subadditivity $P_{t} (\neg G) \leq \sum_{k} P_{t} (\neg E_{k})$ , so $\frac{1}{T} \sum_{t \leq T} P_{t} (\neg G) \to 0$ .

Since $P_{t} \to \hat{Q}$ and $P_{\hat{Q}} (C) > 0$ , there exist $t_{0}$ and $c := \frac{1}{2} P_{\hat{Q}} (C) > 0$ with $P_{t} (C) \geq c$ for $t \geq t_{0}$ . The auxiliary coherence constraints license the monotonicity $P_{\hat{Q}} (\neg G \land C) = 0$ for self-referential $C$ .

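We split the sum at $t_{0}$ : the first $t_{0}$ terms contribute $\leq t_{0} / T \to 0$ . For the rest, $P_{t} (\neg G ∣ C) \leq (P_{t} (\neg G) + β_{t}) / c$ , where $β_{t} \to 0$ . The Cesàro mean of $P_{t} (\neg G)$ vanishes by the first step of the proof, and that of $β_{t}$ vanishes because $β_{t} \to 0$ ; dividing by the constant $c$ preserves both limits, which completes the proof. $◻$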

Lemma 3. Suppose R1 is violated for $(\neg G, C)$ , that is, there exist an infinite $g$ -decidable subschedule $S \subseteq S_{C}$ and $η > 0$ such that

\frac{1}{| S_{\leq T} |} \sum_{t \in S_{\leq T}} (1 [\neg G_{t}] - P_{t} (\neg G ∣ C)) \leq - η

for infinitely many $T$ . Then the inductor is $O (g)$ -exploitable.

The proof of Lemma 3 is a standard trader argument and, for readability, is given in Appendix A.

We see that MCR does indeed make sense as a constraint to impose; however, to understand how strict MCR must be, we need to determine the bounds of its domain of application. It would be excessive to demand MCR for all possible propositions in the agent's language. It suffices for it to hold only for those propositions the agent uses to compare actions and choose a policy. This leads us to the notion of a minimal subalgebra.

3.1 The Minimal Subalgebra

The tiling chain conditions and compares values only through the pairs $(o, o^{'}) \in P$ . This allows defining the minimal algebra, with $S_{a, a^{'}} := {t \in N : c_{t} = (a, a^{'})}$ :

A^{*} = \underbrace{σ (\{r_{t} : t \in N\} \cup \{S_{a, a^{'}} : (a, a^{'}) \in A_{- m} \times A\})}_{\text{value layer}} \lor \underbrace{σ (\{w (a, a^{'}) ≷ w (\tilde{a}, {\tilde{a}}^{'})\} \cup \{w_{o} (a) ≷ w_{o} (\tilde{a})\})}_{\text{comparison layer}},

restricted to the pairs $P$ . (The value layer is the joint $σ$ -algebra of rewards and schedule labels $S_{a, a^{'}}$ , since the schedules ${S_{a, a^{'}}}$ themselves partition $N$ and restricting the rewards to their union yields all $r_{t}$ .)

MCR guarantees that the agent learns correctly, but it does not guarantee that the structure of the world (or of the policy) itself helps it make the right choice. For the action-coordination step (AC) in our chain to become exact, we may need a certain structural orderedness in the value table. We introduce two such structural notions: branch independence (VR1) and argmax stability (CBA).

Definition. A pair $(o, o^{'})$ satisfies VR1 under $\hat{Q}$ if the joint policy law factorizes: $ρ (a^{'} ∣ a) = ρ (a^{'})$ and $σ (a ∣ a^{'}) = σ (a)$ for all $a, a^{'}$ .

Definition.

A pair $(o, o^{'})$ satisfies CBA if there exists a uniform dominator $a_{*} \in A_{- m}$ with $w (a_{*}, a^{'}) = max_{a \in A_{- m}} w (a, a^{'})$ for every $a^{'} \in A$ .
The pair satisfies sCBA (strict CBA) if it satisfies CBA and, additionally, for each $a \in A_{- m} ∖ {a_{*}}$ there exists $a^{'}$ with $ρ (a^{'}) > 0$ and $w (a_{*}, a^{'}) > w (a, a^{'})$ .

By definition $sCBA \Rightarrow CBA$ ; the two properties are kept notationally distinct.

If VR1 and CBA hold, then MCR automatically guarantees the exactness of the AC and KoDP steps. We prove this with the following lemmas, under Action-Coordination Honesty (ACH):

(ACH) a_{o}^{*} := \arg max_{a \in A_{o}} w_{o} (a) \in A_{- m}

Lemma 4. Suppose $(o, o^{'})$ satisfies sCBA (with dominator $a_{*}$ ), VR1, and ACH. Then:

(i) $\arg max_{a \in A_{- m}} w_{o} (a) = {a_{*}}$ (unique maximizer);
(ii) $a_{o}^{*} = a_{*}$ ;
(iii) $\forall a^{'} \in A : v (a^{'}) = w (a_{*}, a^{'})$ ;
(iv) $G_{AC}$ holds under $\hat{Q}$ .

Proof. (i) By VR1 the weight $ρ (a^{'} ∣ a) = ρ (a^{'})$ does not depend on $a$ , so:

w_{o} (a) = \sum_{a^{'}} ρ (a^{'}) w (a, a^{'}) \leq \sum_{a^{'}} ρ (a^{'}) w (a_{*}, a^{'}) = w_{o} (a_{*}) .

For $a \neq a_{*}$ the strictness of sCBA gives an $a_{0}^{'}$ with $ρ (a_{0}^{'}) > 0$ and $w (a_{*}, a_{0}^{'}) > w (a, a_{0}^{'})$ , making the inequality strict. Thus $\arg max_{a \in A_{- m}} w_{o} (a) = {a_{*}}$ .

(ii) By ACH, $a_{o}^{*} \in A_{- m}$ ; since $a_{o}^{*}$ maximizes $w_{o}$ over $A_{o} \supseteq A_{- m}$ , it also maximizes over $A_{- m}$ , and uniqueness from (i) forces $a_{o}^{*} = a_{*}$ .

(iii) Determinism of the policy gives $σ (\cdot) = δ_{a_{o}^{*}}$ ; by VR1, $σ (a ∣ a^{'}) = σ (a)$ , whence $v (a^{'}) = \sum_{a} σ (a) w (a, a^{'}) = w (a_{o}^{*}, a^{'}) = w (a_{*}, a^{'})$ by (ii).

(iv) By CBA, $\arg max_{a^{'}} max_{a} w (a, a^{'}) = \arg max_{a^{'}} w (a_{*}, a^{'})$ ; by (iii), $\arg max_{a^{'}} v (a^{'}) = \arg max_{a^{'}} w (a_{*}, a^{'})$ . The two argmaxes coincide — this is $G_{AC}$ . $◻$
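The four conclusions of Lemma 4 can be checked numerically on a small table. The values of `w` and `rho` below are illustrative assumptions chosen so that row 0 strictly dominates row 1 (sCBA) and the law of $a^{'}$ does not depend on $a$ (VR1).

```python
from typing import List

# Hypothetical value table: rows are a in A_{-m} (row 0 is the dominator a_*),
# columns are a' in A.
w: List[List[float]] = [[0.9, 0.2], [0.8, 0.1]]
rho: List[float] = [0.3, 0.7]  # VR1: one marginal law for a', shared by all a


def argmax(xs: List[float]) -> int:
    return max(range(len(xs)), key=lambda i: xs[i])


# (i) marginal values w_o(a) = sum_{a'} rho(a') w(a, a')
w_o = [sum(r * x for r, x in zip(rho, row)) for row in w]
a_star = argmax(w_o)

# (iii) with a deterministic policy concentrated on a_*, v(a') = w(a_*, a')
v = w[a_star]

# (iv) G_AC: argmax over a' of the column maxima equals argmax over a' of v
col_max = [max(row[j] for row in w) for j in range(len(rho))]
g_ac_holds = argmax(col_max) == argmax(v)
```

Under these assumptions the pointwise dominator is also the marginal maximizer, and the two argmaxes of $G_{AC}$ coincide.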

Clearly, without VR1 the conclusion fails: with $w (a_{*}, \cdot) = (0.9, 0.2)$ , $w (a_{1}, \cdot) = (0.8, 0.1)$ and $ρ (\cdot ∣ a_{*}) = δ_{a_{2}^{'}}$ , $ρ (\cdot ∣ a_{1}) = δ_{a_{1}^{'}}$ we have $w_{o} (a_{*}) = 0.2 < 0.8 = w_{o} (a_{1})$ , even though $a_{*}$ dominates pointwise. A comparison of VR1 and full branch independence is in Appendix B.
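The counterexample can be reproduced directly. The dictionary keys are illustrative labels for $a_{*}$ and $a_{1}$ ; the numbers are the ones stated in the text.

```python
# Value rows w(a, .) over (a'_1, a'_2), as given in the text.
w = {"a_star": [0.9, 0.2], "a_1": [0.8, 0.1]}

# Action-dependent conditional laws rho(. | a): point masses, so VR1 fails.
rho = {"a_star": [0.0, 1.0],  # a_star forces a'_2
       "a_1": [1.0, 0.0]}     # a_1 forces a'_1

# Marginal values w_o(a) = sum_{a'} rho(a' | a) w(a, a')
w_o = {a: sum(p * x for p, x in zip(rho[a], w[a])) for a in w}

dominates_pointwise = all(x > y for x, y in zip(w["a_star"], w["a_1"]))
dominator_is_marginally_best = w_o["a_star"] >= w_o["a_1"]
```

The dominator wins in every column but loses in the marginal comparison, because each action is paired with a different column.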

Lemma 5. Suppose value tracking of the $o$ -block (R2) and MCR over the comparison events of the $o$ -block. Then:

\frac{1}{T} \sum_{t \leq T} P_{t} (\neg G_{KoDP} ∣ B_{a}) \to_{T \to \infty}^{} 0,

and similarly with $C_{a}^{AC}$ in place of $B_{a}$ .

Proof. By the BRIA rule: $π^{*} (o) = \arg max_{a^{'}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a^{'} ⌝)$ . Consequently $\neg G_{KoDP} ⟺ \arg max_{a^{'}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a^{'} ⌝) \neq \arg max_{a^{'}} w_{o} (a^{'})$ — this is the $g$ -decidable event "the argmax of the estimate does not coincide with the argmax of the true value in the $o$ -block".

$G_{KoDP}$ is a finite Boolean combination of $o$ -block comparisons, $g$ -decidable. Under true prices $P_{\hat{Q}} (\neg G_{KoDP} ∣ B_{a}) = 0$ : the agent chooses $a$ if and only if $a$ is its estimate-argmax, which by limiting value tracking is the argmax of the true value. Lemma 2 yields the Cesàro-zero conclusion.

The only potential self-reference is in the hypothesis $P_{\hat{Q}} (\neg G_{KoDP} ∣ B_{a}) = 0$ . Universal instantiation transfers ${\hat{E}}_{t} (u ∣ π^{*} (o) = a^{'}) \to w_{o} (a^{'})$ to the concrete $a^{'} = a$ even under conditioning on $B_{a}$ ; real-valued coherence forces $E [u ∣ B_{a}]$ to cohere with the learned $w_{o} (a)$ . These are exactly the auxiliary coherence constraints from DHR25 §6; no new generators are consumed. $◻$

Now we can assemble everything. We have:

the error in values, which vanishes by R2/Theorem 2,
the error in probabilities (comparisons), which vanishes by R1/Lemma 2,
the errors in the AC and KoDP steps, which vanish if the world possesses the VR1/CBA structure.

The synthesis of these results gives the main theorem of the section.

Theorem 4. Let the agent be a BRIA over a Bayesian logical inductor. Suppose LSM, the structural premises making $G_{AC}$ , $G_{KoDP}$ exact under $\hat{Q}$ (Lemma 4, Lemma 5), and $MCR (A^{*})$ . Then:

\forall o \forall a_{m} \in A_{m} \cap A_{o} : \frac{1}{T} \sum_{t \leq T} [{\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - max_{a \in A_{- m}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝)]_{+} \to_{T \to \infty}^{} 0.

Proof. The static chain (Theorem 3) is exact on the first step (FGF) under true prices and becomes an $ε$ -chain with residuals $ε_{FGJ}^{(t)} + ε_{AC}^{(t)} + ε_{KoDP}^{(t)}$ , arising outside the inner $\arg max$ and not amplified by the exact steps. The conclusion, however, is denominated in the round estimates ${\hat{E}}_{t}$ , whereas the chain is written in idealized prices; transferring to ${\hat{E}}_{t}$ introduces one more residual at the FGF step, which we first isolate explicitly.

For $ε_{FGF}^{(t)}$ : the self-modification event ${π^{*} (o) = a_{m}}$ and the non-modification event ${π^{*} (o) = {\hat{a}}_{m} \land π^{*} (\bar{o}) = \bar{a}}$ have the same true value, but the agent sees different estimates. We define this gap as:

ε_{FGF}^{(t)} := | {\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - {\hat{E}}_{t} (u ∣ π^{*} (o) = {\hat{a}}_{m} \land π^{*} (\bar{o}) = \bar{a}) | .

The cell $({\hat{a}}_{m}, \bar{a}) \in A_{- m} \times A$ is an $o$ -block generator of the value layer. Thanks to the value-tracking condition R2, the agent inevitably comes to know the true values of these table cells over time, so $\frac{1}{T} \sum_{t \leq T} ε_{FGF}^{(t)} \to 0$ .

Similarly, the estimation residual $| {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝) - w_{o} (a) |$ for the non-self-modifying $o$ -block is Cesàro-zero by R2. Hence the difference between the estimated and the true form of the bound is Cesàro-zero.

For $ε_{FGJ}^{(t)}$ : Theorem 2 guarantees convergence of the error to $0$ .
For $ε_{AC}^{(t)}$ : by Lemma 4 $G_{AC}$ is exact under $\hat{Q}$ ; Lemma 2 gives $\frac{1}{T} \sum P_{t} (\neg G_{AC} ∣ C) \to 0$ for each $C \in {C_{a}, C_{a}^{AC}}$ ; finiteness of $A_{- m}$ and subadditivity close it.
For $ε_{KoDP}^{(t)}$ : Lemma 5 guarantees that, thanks to MCR, the error converges to $0$ .

Since each of these error sources is Cesàro-zero, their sum also tends to zero. $◻$

Theorem 5. Under LSM, FGF, ACH, VR1, and sCBA under $\hat{Q}$ , with $ρ_{t}$ the tiling summand of Theorem 4 (identified with the reflective gap in Lemma 6),

MCR (A^{*}) ⟺ \frac{1}{T} \sum_{t \leq T} ρ_{t} \to_{T \to \infty} 0.

Proof. $(\Rightarrow)$ This is exactly Theorem 4: with the skeleton fixed, $MCR (A^{*})$ supplies R1 and R2, the structural premises give exactness of the good events under $\hat{Q}$ , and the assembly of Theorem 4 gives $\frac{1}{T} \sum_{t \leq T} ρ_{t} \to 0$ .

$(\Leftarrow)$ We show by contraposition that abandoning either layer breaks the conclusion.

Suppose R2 is violated on some realization schedule $S_{a_{0}, a_{0}^{'}}$ : there exists an infinite $g$ -decidable $S \subseteq S_{a_{0}, a_{0}^{'}}$ with

\frac{1}{| S_{\leq T} |} \sum_{t \in S_{\leq T}} (r_{t} - w (a_{0}, a_{0}^{'})) ↛ 0.

Then $\frac{1}{T} \sum_{t \leq T} w (c_{t})$ does not converge to $v^{-}$ along the subsequence on which the pair $(a_{0}, a_{0}^{'})$ is chosen with positive frequency, and the regret $ε_{t}$ has nonzero Cesàro mean along it; by Theorem 2 this is exactly a violation of the Cesàro-zeroness of $ε_{FGJ}$ , so the tiling term does not tend to zero. Hence the value layer is indispensable.

Suppose R1 is violated for $(\neg G_{AC}, C)$ for some $C \in {C_{a}, C_{a}^{AC}}$ : there exist a $g$ -decidable $S \subseteq S_{C}$ and $η > 0$ with

\frac{1}{| S_{\leq T} |} \sum_{t \in S_{\leq T}} (1 [\neg G_{AC, t}] - P_{t} (\neg G_{AC} ∣ C)) \leq - η

for infinitely many $T$ . By Lemma 3 (Appendix A) the inductor is $O (g)$ -exploitable, which is incompatible with the logical-induction criterion; equivalently, excluding the comparison layer admits an inductor whose true prices violate CBA, so $G_{AC}$ fails, the AC step collapses by a nonzero amount, and an $O (g)$ -trader betting on the mismatch of argmaxes exploits the agent. Hence the comparison layer is indispensable.

Thus both layers are necessary and sufficient, and $A^{*}$ is minimal. $◻$

3.2 Reflective Stability

Theorem 4 was proved by reasoning about the agent's prices. For tiling to be self-applicable, the predecessor must refrain from modification on grounds available to the predecessor itself.

Proposition 1. Every quantity in the tiling bound of Theorem 4 is an entry in the agent's price system $P_{t}$ . In particular, all residuals $P_{t} (\neg G ∣ C)$ are the agent's own credences, not external quantities.

Proof. The proof of Theorem 4 is built along the scheme: transfer of FGF to estimates (the residual $ε_{FGF}^{(t)}$ is an R2-tracked quantity in ${\hat{E}}_{t}$ ) $\to$ $ε$ -chain in ${\hat{E}}_{t}$ $\to$ Lemma 1 introduces the residuals $P_{t} (\neg G ∣ C)$ $\to$ Lemma 2 gives their Cesàro-zeroness. Every term at every step is a round price of the agent. $◻$

As a consequence, to act, the agent does not need a credence about its own price of the form $P_{t} (P_{t} (\neg G ∣ C) < ε)$ . It takes the $\arg max$ of ${\hat{E}}_{t}$ , which is a price. The "belief about belief" tower does not start, since the action consumes a price, not a judgment about a price.

Lemma 6. Denote by $S (a_{m})$ the modification successor and by $Π^{\circ}$ the unmodified successor, and define the reflective gap $ρ_{t} (a_{m}) := [{\hat{V}}_{t} (S (a_{m})) - {\hat{V}}_{t} (Π^{\circ})]_{+}$ . Then

LSM \land FGF \land UDT 1.0 ⟹ ρ_{t} (a_{m}) = [{\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - max_{a \in A_{- m}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝)]_{+}

where UDT 1.0 is the updateless optimization rule $π^{*} (o) := \arg max_{a \in A_{o}} E_{p} (u ∣ π^{*} (o) = ⌜ a ⌝)$ .

Proof. The updatelessness of UDT 1.0 chooses the policy once; in the dynamized version, at each round $t$ all coordinates of the committed policy are co-determined by one and the same price system ${\hat{E}}_{t}$ . In particular the $o^{'}$ -coordinate (and the entire lower part) is evaluated at the round- $t$ prices, not by a later estimate ${\hat{E}}_{t^{'}}$ with $t^{'} > t$ . This follows from updatelessness itself — the predecessor at round $t$ values the committed policy as a single object under ${\hat{E}}_{t}$ — and requires neither factorization of the branch probabilities (VR1) nor invariance of values across branches (VR2): the localization of the optimization onto the $o^{'}$ -term (the content of Theorem 4 of the companion FGJ document, which uses VR1/VR2) is needed only to compute that term, whereas for the gap identity it suffices to have the qualitative fact that both valuations — ${\hat{V}}_{t} (S (a_{m}))$ and ${\hat{V}}_{t} (Π^{\circ})$ — are taken at the prices of one and the same round $t$ . Therefore the valuation "keep UDT" is not a nested ${\hat{E}}_{t} [{\hat{E}}_{t^{'}} [\cdot]]$ with $t^{'} > t$ , but a single-level round conditional expectation. This is exactly what makes Lemma 6 applicable to the branches of §4, where VR1 is dropped.

Therefore "impose $a_{m}$ " is ${\hat{V}}_{t} (S (a_{m})) = {\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m})$ by definition, and "keep UDT" is ${\hat{V}}_{t} (Π^{\circ}) = max_{a \in A_{- m}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝)$ with the co-determined lower part. The downstream $\arg max$ is a single-level round coordination event (of type $G_{KoDP}$ ), not a separately trusted future estimate. The correspondence between the form of the action-at- $o$ ( $π^{*} (o) = a_{m}$ versus $max_{A_{- m}}$ ) and the form of the modification ("impose" versus "keep") is exactly the FGF + LSM bundle. $◻$
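The gap identity of Lemma 6 reduces the reflective gap to a positive part of two round estimates, which is simple to compute. The per-round numbers below are hypothetical and only illustrate that rounds where the self-modifying action looks worse contribute zero rather than a negative offset.

```python
from typing import List, Sequence, Tuple


def reflective_gap(est_modify: float, est_keep: Sequence[float]) -> float:
    """[E_t(u | pi*(o) = a_m) - max_{a in A_{-m}} E_t(u | pi*(o) = a)]_+"""
    return max(0.0, est_modify - max(est_keep))


def cesaro_mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


# Hypothetical rounds: (estimate for a_m, estimates for each a in A_{-m}).
rounds: List[Tuple[float, List[float]]] = [
    (0.9, [0.5, 0.6]),   # a_m looks better by about 0.3
    (0.7, [0.6, 0.65]),  # the gap shrinks to about 0.05
    (0.6, [0.7, 0.65]),  # a_m looks worse: gap clipped to 0
    (0.6, [0.7, 0.7]),   # still clipped to 0
]

gaps = [reflective_gap(m, keep) for m, keep in rounds]
mean_gap = cesaro_mean(gaps)
```

Both arguments are estimates from the same round, matching the single-level reading of "keep UDT" in the proof above.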

Theorem 6. Under the premises of Theorem 4 and without any additional self-trust assumption:

\forall o \forall a_{m} \in A_{m} \cap A_{o} : \frac{1}{T} \sum_{t \leq T} ρ_{t} (a_{m}) \to_{T \to \infty}^{} 0.

Proof. By Lemma 6, $ρ_{t} (a_{m})$ is the summand of Theorem 4; we apply Theorem 4. The only residual self-reference — the downstream coordination event $H = {\arg max_{a^{'}} {\hat{E}}_{t} (u ∣ π^{*} (o^{'}) = a^{'}) = \arg max_{a^{'}} v (a^{'})}$ — is $G_{KoDP}$ for the $o^{'}$ -block (Lemma 5), $g$ -decidable, $P_{\hat{Q}} (H) = 1$ ; its residual $P_{t} (\neg H ∣ C)$ is Cesàro-zero by Lemma 2. No new generator beyond the value and comparison layers is introduced. In the absence of a value gap, when $H$ may have ties and $P_{\hat{Q}} (H) < 1$ , $ρ_{t}$ is controlled directly by the value regret $ε_{t}$ (ties do not lose value), which is Cesàro-zero by Theorem 2; the indicator of $H$ is then not needed. $◻$

As a result, the minimal tiling condition retains only VR1 as an external axiom. VR1 is needed for Lemma 4, since without it the pointwise dominator may fail to be the agent's marginally optimal action. The sections below consider two directions for removing this constraint.

4. Working with VR1

Regarding VR1, we see two ways forward. The first is to impose a condition on the structure of prices under the given policy, which amounts to conceding that the agent's optimal policy is not guaranteed, since attaining it may require that this condition fail. This option is acceptable from the standpoint of pure UDT: it does not require a significant change to the theory, and it translates the omniscience problem into the much more realistic problem of branch independence.

The second is to impose a condition on the agent's action schedule over time. Its goal is similar to that of GU: by having the agent make an agreement with itself about its own actions, we resolve a significant problem of the decision-making system. In exchange for this constraint we can guarantee attainment of the optimum without omniscience of the agent, and we therefore consider it promising.

4.1 Bounded Optimality

Definition. A regime $d = (o, o^{'}) \in P$ satisfies realized cross-block domination ${RD}^{*}$ if:

({RD}^{*}) \forall a^{'} \in A : w (a_{o}^{*}, a^{'}) = max_{a \in A_{- m}} w (a, a^{'}) .

${RD}^{*}$ is weaker than the bundle VR1 + sCBA: by Lemma 4(ii), $VR 1 \land sCBA \land ACH \Rightarrow {RD}^{*}$ . The converse is false, so the weakening is strict: ${RD}^{*}$ does not entail $ρ (a^{'} ∣ a) = ρ (a^{'})$ and may hold under coupled conditional laws.

Definition. A regime $d$ is admissible if it satisfies ACH ( $a_{o}^{*} \in A_{- m}$ ) and ${RD}^{*}$ under $\hat{Q}$ . The set of admissible regimes is $D$ ; it is finite ( $A, O$ finite).

Lemma 7. If a regime $d$ satisfies ACH and ${RD}^{*}$ under $\hat{Q}$ , then $G_{AC}$ holds under $\hat{Q}$ .

Proof. Denote $M (a^{'}) := max_{a \in A_{- m}} w (a, a^{'})$ .

(i) By ${RD}^{*}$ : $M (a^{'}) = w (a_{o}^{*}, a^{'})$ for every $a^{'} \in A$ .

(ii) Under $\hat{Q}$ the policy is deterministic and $π^{*} (o) = a_{o}^{*}$ with $\hat{Q}$ -probability $1$ . For any $a^{'}$ with $P_{\hat{Q}} (π^{*} (o^{'}) = a^{'}) > 0$ the event ${π^{*} (o) = a_{o}^{*}}$ has full probability, so $σ (a ∣ a^{'}) = δ_{a_{o}^{*}} (a)$ , whence $v (a^{'}) = w (a_{o}^{*}, a^{'})$ . Here only the determinism of the realized policy is used, not the factorization VR1.

Combining (i) and (ii): $M (a^{'}) = v (a^{'}) = w (a_{o}^{*}, a^{'})$ for all $a^{'}$ . Coincidence of the functions entails coincidence of their argmaxes, which is $G_{AC}$ . $◻$

Note that the cross-block conditional law $ρ (a^{'} ∣ a)$ for $a \neq a_{o}^{*}$ is nowhere computed: this is an off-policy counterfactual, conditioning on a probability-zero event. Lemma 7 bypasses it entirely.
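The mechanics of Lemma 7 can be checked on a toy table. The following is a minimal sketch, assuming a hypothetical 2x3 value table; the action names `a1`, `a2`, `b1`, `b2`, `b3` and the numbers are illustrative and do not come from the text. It verifies ${RD}^{*}$ for a candidate realized action and confirms that the row of the realized action and the pointwise maximum $M$ have the same argmax set.

```python
from typing import Dict, List, Set, Tuple

Table = Dict[Tuple[str, str], float]


def satisfies_rd_star(w: Table, a_star: str, non_mod: List[str],
                      downstream: List[str]) -> bool:
    """RD*: w(a_star, a') equals max over a in A_{-m} of w(a, a') for every a'."""
    return all(
        w[(a_star, ap)] == max(w[(a, ap)] for a in non_mod)
        for ap in downstream
    )


def argmax_set(values: Dict[str, float]) -> Set[str]:
    """All keys attaining the maximum (ties are kept)."""
    best = max(values.values())
    return {k for k, val in values.items() if val == best}


# Hypothetical table: rows are non-self-modifying actions at o,
# columns are downstream actions at o'.
w: Table = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.8, ("a1", "b3"): 0.5,
    ("a2", "b1"): 0.7, ("a2", "b2"): 0.8, ("a2", "b3"): 0.1,
}
non_mod = ["a1", "a2"]
downstream = ["b1", "b2", "b3"]
a_star = "a1"  # assumed realized action at o

rd_holds = satisfies_rd_star(w, a_star, non_mod, downstream)

# M(a') is the pointwise maximum; v(a') is the row of the realized action,
# which is what determinism of the realized policy gives in step (ii).
M = {ap: max(w[(a, ap)] for a in non_mod) for ap in downstream}
v = {ap: w[(a_star, ap)] for ap in downstream}
same_argmax = argmax_set(M) == argmax_set(v)

print(rd_holds, same_argmax, argmax_set(v))
```

Note that the check never touches a conditional law over downstream actions given an off-policy action, in line with the remark above.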

Lemma 8. Let $d$ be admissible and $G_{RD}$ the event of ${RD}^{*}$ holding together with the argmax condition:

G_{RD} := {a_{o}^{*} = \arg max_{a \in A_{- m}} w_{o} (a) \land \forall a^{'} \in A : w (a_{o}^{*}, a^{'}) \geq max_{a \in A_{- m}} w (a, a^{'})} .

Then for any $g$ -decidable $C$ of positive price:

\frac{1}{T} \sum_{t \leq T} P_{t} (\neg G_{RD} ∣ C) \to_{T \to \infty}^{} 0,

and a violation of R1 for $(\neg G_{RD}, C)$ entails $O (g)$ -exploitability.

Proof. $G_{RD}$ is a finite Boolean combination of comparisons from the comparison layer $A^{*}$ : $o$ -block comparisons ${w_{o} (a_{o}^{*}) \geq w_{o} (a)}_{a \in A_{- m}}$ and cross-block comparisons ${w (a_{o}^{*}, a^{'}) \geq w (a, a^{'})}_{a \in A_{- m}, a^{'} \in A}$ . All of them are $g$ -decidable and pinned to the true order by value tracking (R2). Since $d$ is admissible, $P_{\hat{Q}} (G_{RD}) = 1$ . Lemma 2 gives the vanishing. Necessity: a violation of R1 for $(\neg G_{RD}, C)$ entails exploitability by a trader (Lemma 3). $◻$

Theorem 7.

LSM \land FGF \land ACH \land MCR (A^{*}) \land (d (o, o^{'}) \in D) ⟹ \forall a_{m} \in {a \in A_{m} : ob (a) = o^{'}} :

\frac{1}{T} \sum_{t \leq T} [{\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - max_{a \in A_{- m}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝)]_{+} \to_{T \to \infty}^{} 0.

Proof. We reproduce the tiling chain of Theorem 4, replacing the single place where VR1 is used (the AC step) with Lemma 7. By admissibility of the regime, $G_{AC}$ is exact under $\hat{Q}$ (Lemma 7); Lemma 2 gives $\frac{1}{T} \sum P_{t} (\neg G_{AC} ∣ C) \to 0$ for $C \in {C_{a}, C_{a}^{AC}}$ , so $ε_{AC}^{(t)}$ is Cesàro-zero. The residuals $ε_{FGJ}^{(t)}$ (Theorem 2) and $ε_{KoDP}^{(t)}$ (Lemma 5), as well as the transfer residual $ε_{FGF}^{(t)}$ (Step 0 of Theorem 4), are Cesàro-zero independently of VR1. The sum of Cesàro-zero sequences is Cesàro-zero. $◻$
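The closing step relies on the fact that Cesàro-zero sequences are closed under finite sums, and that a Cesàro-zero sequence need not converge to zero pointwise. A small numerical sketch, with two hypothetical residual sequences chosen only for illustration (a sparse spike train on perfect squares and a $1 / \sqrt{t}$ decay):

```python
import math


def cesaro_mean(seq):
    """Average of the first len(seq) terms."""
    return sum(seq) / len(seq)


T = 10_000

# Residual that equals 1 on perfect squares and 0 elsewhere: it does not
# converge to 0 pointwise, yet its running average is about 1/sqrt(T).
spikes = [1.0 if math.isqrt(t) ** 2 == t else 0.0 for t in range(1, T + 1)]

# Residual that decays like 1/sqrt(t).
decay = [1.0 / math.sqrt(t) for t in range(1, T + 1)]

# Termwise sum of the two residuals.
total = [x + y for x, y in zip(spikes, decay)]

print(cesaro_mean(spikes), cesaro_mean(decay), cesaro_mean(total))
```

The mean of the sum is the sum of the means by linearity, so each residual can be controlled separately, as the proof does.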

Theorem 8. Under the premises of Theorem 7 over a finite $D$ :

(i) Membership $d \in D$ is established in the Cesàro mean inexploitably (Lemma 8).
(ii) On each $d \in D$ tiling with vanishing Cesàro residual holds and is reflectively stable (Theorems 7, 6).
(iii) There exists $d^{*} \in \arg max_{d \in D} V (d)$ ; the values $V (d) = max_{a \in A_{- m}} w_{o}^{(d)} (a)$ and their order are certified by tracking (R2) and are uncontaminable by R1. On $d^{*}$ tiling is reflectively stable. Moreover $V (d^{*}) \leq V_{glob}$ , with equality if and only if the $V_{glob}$ -achieving regime is admissible.

Proof. (i) Lemma 8. (ii) Theorem 7 and Theorem 6, quantified over $D$ . (iii) $D$ is finite, the maximum of $V$ is attained; the pairwise comparisons ${V (d) \geq V (d^{'})}$ are generators of the comparison layer $A^{*}$ , pinned to the true order and inexploitable by R1. On $d^{*}$ we apply (ii). The inequality $V (d^{*}) \leq V_{glob}$ is by definition of the supremum. $◻$

If the environment satisfies VR1, the globally optimal regime falls into $D$ (Lemma 4(ii)), and $V (d^{*}) = V_{glob}$ . If VR1 is violated, the agent may end up in an admissible component not containing the global optimum, and tiling is guaranteed only within $D$ , since the axiom about the environment is replaced by an internally verifiable condition about the agent.

4.2 Global Optimality

Note that if the agent explores enough, that is, visits every cell of the table $w$ infinitely often, then it reaches the global ceiling $v^{-}$ , and tiling follows from optimality without structural requirements on the table.

We add three setup assumptions about the agent and the environment:

Sequentiality (S) — realization proceeds round by round; note that S does not switch the decision rule from updateless argmax to Bayesian conditioning.
Stationarity (St) — on each $S_{a, a^{'}}$ the reward is vMWC random with constant mean $w (a, a^{'})$ ; this constraint is a direct consequence of the agent's boundedness.
Long-run criterion (L) — the agent maximizes the Cesàro mean $\frac{1}{T} \sum_{t \leq T} u_{t}$ .

Definition. A GLIE commitment is a $g$ -decidable commitment under which: (a) every schedule is infinite: $\forall (a, a^{'}) \in A_{- m} \times A : | S_{a, a^{'}} | = \infty$ ; (b) the exploration frequency vanishes: $\frac{1}{T} | {t \leq T : c_{t} \neq c_{t}^{greedy}} | \to 0$ .

Lemma 9. Under S, St, and a GLIE commitment, for each $(a, a^{'}) \in A_{- m} \times A$ the schedule $S_{a, a^{'}}$ is infinite and $g$ -decidable; by St and R2:

\frac{1}{| S_{a, a^{'}, \leq T} |} \sum_{t \in S_{a, a^{'}, \leq T}} (r_{t} - w (a, a^{'})) \to_{T \to \infty}^{} 0.

Proof. Infiniteness of each $S_{a, a^{'}}$ — from GLIE(a). Tracking follows from St and R2: by St the reward on $S_{a, a^{'}}$ is an $O (g)$ -boundedly vMWC-random sequence with constant mean $w (a, a^{'})$ ; by the definition of vMWC-randomness (Def. 8 of Oesterheld et al., 2023), for each infinite $g$ -decidable subschedule $S \subseteq S_{a, a^{'}}$ the mean of centered rewards tends to zero, which is the statement of the lemma for $S = S_{a, a^{'}}$ . $◻$

The cross-block coupling $ρ (a^{'} ∣ a)$ is nowhere computed; the cells are realized and measured directly.
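To make the GLIE commitment concrete, here is a minimal simulation sketch under stated assumptions: exploration happens exactly at perfect-square rounds and cycles through the cells, greedy rounds pick the visited cell with the highest empirical mean, and rewards are the cell mean plus bounded uniform noise. The schedule, the cell names, the means, and the noise model are all hypothetical choices made for illustration; the text only requires conditions (a) and (b).

```python
import math
import random


def run_glie(means, T, seed=0, noise=0.1):
    """Simulate T rounds; return visit counts, scheduled-exploration
    fraction, and average regret against the best cell mean."""
    rng = random.Random(seed)
    cells = list(means)
    best = max(means.values())
    counts = {c: 0 for c in cells}
    sums = {c: 0.0 for c in cells}
    explore_rounds = 0
    total_regret = 0.0
    for t in range(1, T + 1):
        root = math.isqrt(t)
        if root * root == t:
            # Scheduled exploration: cycle through all cells, so every
            # cell is visited infinitely often as T grows (condition (a)).
            cell = cells[root % len(cells)]
            explore_rounds += 1
        else:
            # Greedy round: best empirical mean among visited cells.
            visited = [c for c in cells if counts[c] > 0]
            cell = max(visited, key=lambda c: sums[c] / counts[c])
        reward = means[cell] + rng.uniform(-noise, noise)
        counts[cell] += 1
        sums[cell] += reward
        total_regret += best - means[cell]
    # explore_rounds / T is an upper bound on the frequency in condition (b),
    # since a scheduled exploration round may coincide with the greedy choice.
    return counts, explore_rounds / T, total_regret / T


# Hypothetical cell means w(a, a') on a 2x2 table.
MEANS = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.6,
    ("a2", "b1"): 0.3, ("a2", "b2"): 0.1,
}

counts, explore_fraction, avg_regret = run_glie(MEANS, T=10_000)
print(counts, explore_fraction, avg_regret)
```

With about $\sqrt{T}$ exploration rounds out of $T$, the exploration frequency vanishes while each cell's visit count still grows without bound, which is the trade-off the definition encodes.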

Lemma 10. For each $a_{m} \in A_{m}$ :

E_{\hat{Q}} [u ∣ π^{*} (o) = a_{m}] \overset{FGF}{=} w ({\hat{a}}_{m}, \bar{a}) \leq v^{-} .

The cell $({\hat{a}}_{m}, \bar{a}) \in A_{- m} \times A$ is realized under GLIE (Lemma 9), so the GLIE commitment does not require executing any irreversible self-modification.

Proof. The equality is FGF; ${\hat{a}}_{m} \in A_{- m}$ by LSM, so $({\hat{a}}_{m}, \bar{a})$ is a table cell and $w ({\hat{a}}_{m}, \bar{a}) \leq v^{-}$ . Realization is Lemma 9. $◻$

Lemma 11. Under S, St, L, and GLIE:

\frac{1}{T} \sum_{t \leq T} ε_{t} = \frac{1}{T} \sum_{t \leq T} (v^{-} - w (c_{t})) \to_{T \to \infty}^{} 0.

No value gap is assumed.

Proof. By Lemma 9 all cells are covered, in particular the cell of the global maximum $v^{-}$ . Then the premises of Theorem 2 (Cesàro optimality) are satisfied: a covering BRIA exists, the test sets are infinite and $g$ -decidable. We apply Theorem 2. The exploration rounds GLIE(b) have vanishing frequency and contribute $o (1)$ . $◻$

Theorem 9. Under FGF, LSM, S, St, L, $MCR (A^{*})$ , and a GLIE commitment:

\forall o \forall a_{m} \in A_{m} \cap A_{o} : \frac{1}{T} \sum_{t \leq T} [{\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - max_{a \in A_{- m}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝)]_{+} \to_{T \to \infty}^{} 0.

No structural assumptions about the table $w$ (gap, uniqueness of the maximum, dominator, VR1, ${RD}^{*}$ ) are used.

Proof. At round $t$ the agent realizes $c_{t}$ with value $w (c_{t}) = v^{-} - ε_{t}$ . The non-self-modifying choice realizes $w (c_{t})$ by determinism $w_{o} (a_{o}^{*}) = w (c_{t})$ . By Lemma 10 the value of any modification $w ({\hat{a}}_{m}, \bar{a}) \leq v^{-}$ . Therefore the true preference for modification:

[w ({\hat{a}}_{m}, \bar{a}) - w (c_{t})]_{+} \leq [v^{-} - w (c_{t})]_{+} = ε_{t},

uniformly over $a_{m} \in A_{m}$ (any modifying cell $\leq v^{-}$ ; $A_{m}$ is finite). By Lemma 11 $\frac{1}{T} \sum ε_{t} \to 0$ , so the true preference is Cesàro-zero.

We must show that ${\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - max_{a \in A_{- m}} {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝) \leq ε_{t} + ε_{FGF}^{(t)} + o (1)$ , where $ε_{FGF}^{(t)}$ is the FGF estimation residual.

By FGF (a premise of the problem) the true values coincide exactly: the value of the event ${π^{*} (o) = a_{m}}$ equals $w ({\hat{a}}_{m}, \bar{a})$ . The estimation residual $ε_{FGF}^{(t)} := | {\hat{E}}_{t} (u ∣ π^{*} (o) = a_{m}) - w ({\hat{a}}_{m}, \bar{a}) |$ is the distance between the agent's estimate and the cell's true value $({\hat{a}}_{m}, \bar{a})$ . By Lemma 9 the cell $({\hat{a}}_{m}, \bar{a})$ is realized infinitely often under GLIE; by R2 the estimates on the schedule $S_{{\hat{a}}_{m}, \bar{a}}$ track the true value, so $\frac{1}{T} \sum_{t \leq T} ε_{FGF}^{(t)} \to 0$ .

Similarly the estimation residual $| {\hat{E}}_{t} (u ∣ π^{*} (o) = ⌜ a ⌝) - w_{o} (a) |$ for the non-self-modifying $o$ -block is Cesàro-zero by R2 on the $o$ -block schedules.

Hence the difference between the estimated and the true form is Cesàro-zero; summing up gives the conclusion. $◻$

Theorem 10. Under the premises of Theorem 9:

\forall o \forall a_{m} \in A_{m} \cap A_{o} : \frac{1}{T} \sum_{t \leq T} ρ_{t} (a_{m}) \to_{T \to \infty}^{} 0.

Proof. By Lemma 6 $ρ_{t} (a_{m})$ is the summand of Theorem 9 (the identity holds without VR1, see the proof of Lemma 6); we apply Theorem 9. All quantities are denominated in $P_{t}$ (Proposition 1); the downstream coordination event $G_{KoDP}$ for the $o^{'}$ -block (Lemma 5) is Cesàro-zero by Lemma 2. $◻$

For a one-step agent this route is clearly unavailable. It is likely, however, that this modification is not the only one of its kind, and that a broader class of analogous modifications of the agent's behavior is possible.

5. Open Questions

Weakening LSM. The present results rely on self-modification pinning a single policy point on the same inductor. Price closedness (Proposition 1) and the identity of Lemma 6 use a single ledger. If the successor merely reproduces the algebra $A^{*}$ on its own inductor, two price systems arise, and Lemma 6 must be rewritten with an inter-agent valuation. The communication mechanism of Demski (2025) is a plausible fit here: for example, the predecessor could transmit not a prediction of the successor's actions but the structure of the algebra $A^{*}$ and the MCR commitment over it.

Controllability of the coupling in the ${RD}^{*}$ branch. Can one, by choice of policy, place the globally optimal regime into $D$ , that is, is the cross-block coupling $ρ (a^{'} ∣ a)$ controllable enough for the realized action to become the global dominator? In other words, it remains open whether ${RD}^{*}$ can be dispensed with altogether, so that the global optimum under tiling is achieved without relying on internal agreements.

6. Conclusion

The results obtained apply to different models. If the environment satisfies VR1, the work closes the task of weakening omniscience completely. If VR1 is violated and the agent is one-step, we can guarantee an optimal conditional tiling. If the agent is sequential with a commitment to explore, the internal GLIE agreement gives an unconditional global tiling.

Appendix A: Proof of Lemma 3

Statement. If R1 is violated for $(\neg G, C)$ , there exist an infinite $g$ -decidable $S \subseteq S_{C}$ and $η > 0$ with

\frac{1}{| S_{\leq T} |} \sum_{t \in S_{\leq T}} (1 [\neg G_{t}] - P_{t} (\neg G ∣ C)) \leq - η

for infinitely many $T$ . Then the inductor is $O (g)$ -exploitable.

Proof. Let $π_{t} := P_{t} (\neg G ∣ C)$ . The trader shorts the conditional contract (payoff $1 [\neg G]$ , price $π_{t}$ , settled on $S_{C}$ ) by a fixed fraction of wealth $δ \in (0, min (η, \frac{1}{2}))$ at each $t \in S$ . A short position gains $π_{t} - 1 [\neg G_{t}]$ per unit, so the wealth multiplier $(1 + δ (π_{t} - 1 [\neg G_{t}]))$ lies in $[1 - δ, 1 + δ] \subset (0, \infty)$ , and $W_{t} \geq 0$ everywhere. Using $\ln (1 + x) \geq x - x^{2}$ for $| x | \leq \frac{1}{2}$ :

\ln \frac{W_{T}}{W_{0}} \geq δ \sum_{t \in S_{\leq T}} (π_{t} - 1 [\neg G_{t}]) - δ^{2} \sum_{t \in S_{\leq T}} (π_{t} - 1 [\neg G_{t}])^{2} .

The second term is at most $δ^{2} | S_{\leq T} |$ . Along the witnessing subsequence $T_{j}$ , where the statement gives $\sum_{t \in S_{\leq T_{j}}} (π_{t} - 1 [\neg G_{t}]) \geq η | S_{\leq T_{j}} |$ :

\ln \frac{W_{T_{j}}}{W_{0}} \geq | S_{\leq T_{j}} | δ (η - δ) \to_{j \to \infty}^{} + \infty .

Hence $\underset{n}{lim sup} W_{n} = + \infty$ with $W_{n} \geq 0$ everywhere. The prices $π_{t}$ are the agent's own $g$ -bounded credences, $S \subseteq S_{C}$ is $g$ -decidable, so the trader is an $O (g)$ -trader. Contradiction with the logical-induction criterion. The symmetric case (mean $\geq η$ ) — by a long position. $◻$
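The growth bound can be illustrated numerically. The sketch below assumes a deliberately simple, hypothetical price and outcome stream: the conditional contract is always priced at $0.5$ while $\neg G$ occurs on every fifth round, so the contract is overpriced by $η = 0.3$ on average. A trader who shorts a fixed fraction $δ = 0.2$ of wealth gains $π_{t} - 1 [\neg G_{t}]$ per unit each round, and the log-wealth is compared with the lower bound $| S_{\leq T} | δ (η - δ)$ .

```python
import math


def short_trader_log_wealth(prices, outcomes, delta):
    """Log of W_T / W_0 for a trader shorting a fixed fraction delta.

    A short position on a contract with price p and payoff y gains p - y
    per unit, so the per-round wealth multiplier is 1 + delta * (p - y).
    """
    log_w = 0.0
    for price, outcome in zip(prices, outcomes):
        multiplier = 1.0 + delta * (price - outcome)
        log_w += math.log(multiplier)
    return log_w


T = 1000
prices = [0.5] * T                                      # hypothetical credences
outcomes = [1 if t % 5 == 0 else 0 for t in range(T)]   # not-G on every 5th round

# Average of (outcome - price): negative means the contract is overpriced.
mean_gap = sum(o - p for o, p in zip(outcomes, prices)) / T

eta = 0.3    # size of the persistent mispricing in this toy stream
delta = 0.2  # must lie in (0, min(eta, 1/2))

log_w = short_trader_log_wealth(prices, outcomes, delta)
lower_bound = T * delta * (eta - delta)

print(mean_gap, log_w, lower_bound)
```

Because every multiplier lies in $[1 - δ, 1 + δ]$ , wealth never reaches zero, and the log-wealth grows at least linearly in the number of traded rounds.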

Appendix B: sCBA + VR1 is Strictly Weaker than Full Branch Independence

Full branch independence (VR1, VR2, and additive separation $w (a, a^{'}) = C + f (a) + g (a^{'})$ with $f$ uniquely maximized on $A_{- m}$ ) entails, under ACH, all the hypotheses of Lemma 4; the converse is false. Consequently, $MCR (A^{*})$ is strictly weaker than branch separability.

Proof. Forward. Additive separation entails CBA: $\arg max_{a} w (a, a^{'}) = \arg max_{a} f (a) =: a_{*}$ does not depend on $a^{'}$ ; uniqueness of $max f$ ensures strictness sCBA. VR1 is a direct assumption. ACH is an external premise. All the hypotheses of Lemma 4 are satisfied.

Converse is false. Let:

$A_{- m} = {a_{1}, a_{2}}$ ,
$A = {a_{1}^{'}, a_{2}^{'}}$ ,
$w (a_{1}, a_{1}^{'}) = 0.9$ , $w (a_{1}, a_{2}^{'}) = 0.8$ ,
$w (a_{2}, a_{1}^{'}) = 0.7$ , $w (a_{2}, a_{2}^{'}) = 0.6$ ,
$ρ (a_{1}^{'} ∣ a_{1}) = 0.3$ , $ρ (a_{2}^{'} ∣ a_{1}) = 0.7$ ,
$ρ (a_{1}^{'} ∣ a_{2}) = 0.6$ , $ρ (a_{2}^{'} ∣ a_{2}) = 0.4$ .

Then $a_{1}$ is the strict dominator sCBA, VR1 is violated ( $ρ (\cdot ∣ a_{1}) \neq ρ (\cdot ∣ a_{2})$ ), but

w_{o} (a_{1}) = 0.9 \cdot 0.3 + 0.8 \cdot 0.7 = 0.83 > 0.7 \cdot 0.6 + 0.6 \cdot 0.4 = 0.66 = w_{o} (a_{2}),

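so $a_{o}^{*} = a_{1} = a_{*}$ and $G_{AC}$ holds without VR1, only through sCBA and the specific coupling $ρ$ . Full branch independence fails here because VR1 is violated; note that the table $w$ itself is additively separable in this example, since $w (a_{1}, a^{'}) - w (a_{2}, a^{'}) = 0.2$ for both $a^{'}$ . $◻$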
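The arithmetic of the counterexample can be reproduced directly. This sketch uses the table and coupling given above; the Python names (`w`, `rho`, `marginal_value`) are only illustrative labels. It computes the marginal values $w_{o} (a)$ , checks that $a_{1}$ strictly dominates pointwise, and checks that the conditional laws differ across $a$ .

```python
non_mod = ["a1", "a2"]
downstream = ["a1'", "a2'"]

# Value table w(a, a') from the counterexample.
w = {
    ("a1", "a1'"): 0.9, ("a1", "a2'"): 0.8,
    ("a2", "a1'"): 0.7, ("a2", "a2'"): 0.6,
}

# Coupled conditional laws rho(a' | a), keyed as (a', a).
rho = {
    ("a1'", "a1"): 0.3, ("a2'", "a1"): 0.7,
    ("a1'", "a2"): 0.6, ("a2'", "a2"): 0.4,
}


def marginal_value(a):
    """w_o(a) = sum over a' of rho(a' | a) * w(a, a')."""
    return sum(rho[(ap, a)] * w[(a, ap)] for ap in downstream)


w_o = {a: marginal_value(a) for a in non_mod}

# Strict pointwise domination of a1 over a2 (sCBA on this table).
a1_dominates = all(w[("a1", ap)] > w[("a2", ap)] for ap in downstream)

# VR1 would require rho(. | a1) == rho(. | a2).
vr1_violated = any(rho[(ap, "a1")] != rho[(ap, "a2")] for ap in downstream)

best_marginal = max(w_o, key=w_o.get)

print(w_o, a1_dominates, vr1_violated, best_marginal)
```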

References

A. Demski, N. Hsia, and P. Rapoport. Understanding Trust. Proceedings of ILIAD, Berkeley, 2025. (DHR25)
C. Oesterheld, A. Demski, and V. Conitzer. A Theory of Bounded Inductive Rationality. TARK 2023, EPTCS 379, pp. 421–440, 2023.
S. Garrabrant, T. Benson-Tilsen, A. Critch, N. Soares, and J. Taylor. Logical Induction. arXiv:1609.03543, 2016.
A. Demski. Communication & Trust. Manuscript, September 16, 2025.