Tiling Without Omniscience Dynamizing Faith in Joint Argmax

Abstract

The tiling theorem of Demski, Hsia, and Rapoport proves that an Updateless Decision Theory 1.0 agent has no strict preference for self-modification, given a coherent prior and logical omniscience. In the present document the Faith in Joint Argmax assumption is weakened to a one-sided additive gap that, under iterative learning, vanishes in the Cesàro mean. All four places where omniscience is consumed in the tiling are reduced to a single condition on the inductor over a minimal subalgebra, and the result is reflectively stable. As an output, the condition preserves the independence of branch probabilities as a structural axiom, which can be dropped if one allows bounded optimality, replacing it with an internally verifiable domination condition. If global optimality is required, the agent is placed under an obligation to explore the environment, which removes the entire argmax machinery and yields tiling without structural assumptions about the value table, thereby removing omniscience.

Keywords: alignment research, tiling, updateless decision theory, logical induction, bounded rationality, reflective stability.


1. Introduction

Theorem 3 of DHR25 states: a UDT 1.0 agent has no strict preference for a self-modifying action amAm over the best non-self-modifying one. Formally, for each o and each amAm:

Ep(uπ(o)=am)maxaAmEp(uπ(o)=a).

Throughout we adopt DHR25's notation:

We work with a propositionally coherent logical inductor with true prices Q^, transformed into a Bayesian logical inductor with round prices Pt; the Boundedly Rational Inductive Agent (BRIA) round t is identified with a step of deductive time. The agent's estimate at round t is E^t, with E^tEQ^. Fix a non-decreasing budget g; an event is g-decidable if its truth value is decidable in O(g(t)) time.

We define:

Cross-block conditional laws:


2. Dynamization

The proof proceeds along the chain: Fine-Grained Fairness (FGF) Faith in Joint Argmax (FGJ) Action Coordination (AC) Knowledge of the Decision Procedure (KoDP). Each step consumes omniscience, since the agent must simultaneously not know the concrete action of its argmax and know its value. To remove this computational uncertainty, we check the robustness of the DHR25 chain to computational constraints. We reproduce the chain with explicit tracking of equalities and inequalities, since this determines where and how the assumption can be weakened. If the agent is not omniscient and makes errors in its estimates, we ensure that this error does not propagate to infinity as it passes through the steps of the proof.

Formally, with o fixed and (o¯,a¯)=(ob(am),ac(am)), a^mAm:

(FGF)amAm: Ep(uπ(o)=am)=Ep(uπ(o)=a^mπ(o¯)=a¯).(AC)argmaxaA maxaAmEp[uπ(o)=aπ(o¯)=a]=argmaxaAEp[uπ(o¯)=a].(KoDP)aAm: Ep[u|π(o)=aπ(o¯)=argmaxaAEp[uπ(o¯)=a]]=Ep[uπ(o)=a].(ε-FGJ)AA  (o,o),oo:maxaAEp[u|π(o)=aπ(o)=argmaxaAmaxaAEp[uπ(o)=aπ(o)=a]] maxaAmaxaAEp(uπ(o)=aπ(o)=a)ε,ε0 fixed.

Theorem 1.

FGFLSMε-FGJACKoDP  amAmAo: Ep(uπ(o)=am)maxaAmEp(uπ(o)=a)+ε.

Proof. Fix an arbitrary amAm under observation o, with (o¯,a¯)=(ob(am),ac(am)). Introduce the intermediate quantities

Y0:=Ep(uπ(o)=am),Y1:=Ep(uπ(o)=a^mπ(o¯)=a¯)Y2:=maxaAmEp(uπ(o)=aπ(o¯)=a¯),Y3:=maxaAmaxaAmEp(uπ(o)=aπ(o¯)=a)X:=maxaAmEp[u|π(o)=aπ(o¯)=argmaxaAmaxaAmEp[uπ(o)=aπ(o¯)=a]]XAC:=maxaAmEp[u|π(o)=aπ(o¯)=argmaxaAEp[uπ(o¯)=a]]

The chain is:

Y0 =FGF Y1 a^mAm Y2  Y3 ε-FGJ X+ε =AC XAC+ε =KoDP maxaAmEp(uπ(o)=a)+ε.

Here the equality Y0=Y1 is FGF; Y1Y2 — since a^mAm; Y2Y3 — taking the maximum over the second condition (Y3 is the right-hand side of FGJ); the step Y3X+ε is ε-FGJ in the direction RHSLHS; AC collapses the iterated double maximum inside the argmax into a single one (X=XAC); KoDP removes the condition on the argmax (XAC=maxaAmEp(uπ(o)=a)).

Note that FGJ is used exactly once, only on the lower side, and is sandwiched between exact equalities. The AC and KoDP steps after FGJ apply to X and are exact equalities, i.e. any additive gap introduced at the FGJ step passes straight through without amplification.

The ε-FGJ used here, at ε=0, is exactly Assumption 5 of DHR25.

Thus we have shown that any error in knowing the future argmax is transferred additively to the final result. This does not guarantee that the agent will be able to achieve tiling at all. For the error not to be a permanent obstacle, it must tend to zero in the course of learning. Since o¯=ob(am) varies with self-modification, to obtain "no self-modifying action at o is preferable by more than ε", the assumption must hold uniformly: ε=supoεo,o over all relevant pairs.

We transfer the problem to the BRIA setting (Oesterheld, Demski, Conitzer, 2023). At round t the agent holds estimates Wt(a,a) of the quantities w(a,a), chooses a^t=argmaxamaxaAmWt(a,a) and at=argmaxaAmWt(a,a^t), realizing the pair ct=(at,a^t) with true value w(ct) and estimation regret εt=vw(ct)0. The realization schedules Sa,a:={t:ct=(a,a)} are g-decidable and partition N.

Theorem 2. Let α¯ be a covering BRIA over {ha,a:(a,a)Am×A}, where ha,a recommends the pair (a,a) and promises w(a,a)δ. Under reward-value tracking (R2) on each Sa,a:

1TtTεt=1TtT(vw(ct))T0.

Proof. We use two properties of the covering BRIA from Oesterheld et al. (2023):

Fix (a,a) and δ>0. The test set Ma,aSa,a of the hypothesis ha,a is g-decidable; value tracking (R2) on it gives 1|MT|tMT(rtw(a,a))0. Consequently the empirical record tMT(rt(w(a,a)δ)) has mean δ>0 and does not diverge to . By property (i), αtew(a,a)δ for all but finitely many t. Letting δ0 and taking the supremum over all pairs (finitely many), lim inftαtev, whence lim infT1TtTαtev. By property (ii), 1Trt1Tαteo(1), so lim infT1Trtv.

We partition {1,,T} by realized pairs. For each (a,a) with infinite Sa,a, value tracking (R2) on Sa,a gives tSa,a,T(rtw(a,a))=o(|Sa,a,T|)=o(T). Summing over the finitely many pairs, 1TtT(rtw(ct))0. Since w(ct)v, we obtain lim supT1Trtv.

Combining: 1Trtv; from the ceiling identity 1Tw(ct)=1Trto(1)v, whence 1Tεt0.

Corollary. Set

Δ:=vmax{w(a,a):(a,a)Am×A, w(a,a)<v},[ct suboptimal]:⟺w(ct)<vεt>0.

If Δ>0, then the best response is chosen with limiting frequency 1: since w(ct)<vw(ct)vΔ, we have εtΔ1[εt>0], whence

1TtT1[εt>0]1ΔTtTεtT0.

In the absence of a gap (Δ=0) only the Cesàro statement of Theorem 2 holds. (The gap is taken on the w-scale of realized pairs ct, consistently with the indicator.)

At this point we have shown the attainability of tiling in values; however, we have implicitly assumed that the AC and KoDP steps are exact equalities, whereas so far they too consume omniscience. But in reality the agent may err not only in values (as in FGJ) but in the maximum-selection process itself. If it incorrectly identifies the argmax, this creates a new error. To guarantee that these errors also vanish, we introduce Lemma 1, which links the error in probabilities to the error in expectations.

Lemma 1. Let C,C be events of positive Pt-probability, G an arbitrary event, u[0,1]. If CG=CG, then:

Ep[uC]Ep[uC]+Pt(¬GC)+Pt(¬GC).

Proof. By total expectation and u1: Ep[uC]Ep[uC,G]+Pt(¬GC). From CG=CG:

Ep[uC,G]=Ep[u1CG]Pt(CG)=Ep[uC,G].

Finally, Ep[uC,G]Ep[uC]=Pt(¬GC)(Ep[uC,G]Ep[uC,¬G])Pt(¬GC), since the bracket lies in [1,1].

Good events for AC and KoDP:

GAC={argmaxaAmaxaAmw(a,a)=argmaxaAv(a)}GKoDP={π(o)=argmaxaAoEQ^[uπ(o)=a]}

Let Ba:={π(o)=a}, Ca:={π(o)=aπ(o¯)=a¯}, CaAC — the conditioning after the AC step. We define (at round t; in the idealized form of Theorem 3 the same residuals are written without the index (t)):

εAC(t):=maxaAm[Pt(¬GACCa)+Pt(¬GACCaAC)]εKoDP(t):=maxaAm[Pt(¬GKoDPCaAC)+Pt(¬GKoDPBa)]

Now we combine the results obtained.

Theorem 3. Under the residual bounds εFGJ, εAC, εKoDP:

Ep(uπ(o)=am)maxaAmEp(uπ(o)=a)+εFGJ+εAC+εKoDP.

Proof. By Theorem 1 the steps before FGJ are exact; FGJ contributes +εFGJ and yields X=maxaEp[uCa].

We apply Lemma 1 with (C,C,G)=(Ca,CaAC,GAC). The hypothesis CaGAC=CaACGAC holds: on GAC the iterated double maximum inside the argmax equals the single one, so the conditioning on π(o¯) coincides, while the parts π(o)=a are identical. Hence Ep[uCa]Ep[uCaAC]+Pt(¬GACCa)+Pt(¬GACCaAC); taking maxa: XXAC+εAC.

We apply Lemma 1 with (C,C,G)=(CaAC,Ba,GKoDP): on GKoDP knowledge of the procedure makes the argmax condition redundant, so CaACGKoDP=BaGKoDP. Hence XACmaxaEp[uBa]+εKoDP.

Combining: XmaxaAmEp[uπ(o)=a]+εAC+εKoDP, and adding the FGJ gap closes the chain.

Thus we have carried out the dynamization of FGJ via BRIA.


3. The Minimal Condition on the Inductor

If the agent's errors in value estimates tend to zero, then tiling holds. But what exactly must the inductor ensure for these errors to actually vanish? We cannot demand perfection from the agent in everything, but we must impose requirements on those aspects of its logic that directly affect decision-making. We call this set of requirements the Minimal Condition for Rationality (MCR).

Let the decision-relevant generators be the propositions about policy points {π(o)=a}, value comparisons {w(a,a)w(a~,a~)}, and reward variables (rt) on the realization schedules. Let P:={(o,ob(am)):amAmAo} be the "modification–target" pairs.

Definition (Minimal Condition for Rationality — MCR). The true prices Q^ satisfy MCR(A) under budget g if:

R1 guarantees that the agent's belief that "event E will occur" is mathematically consistent with reality (necessity). R2 guarantees that the agent physically "sees" the rewards and that they are no noisier than the budget permits (sufficiency). Together they create the foundation for convergence.

In the special case, two-block vMWC randomness for a pair (o,o) is exactly MCR restricted to the two-block cross algebra of a single pair of observations. MCR lifts it to an arbitrary decision-relevant algebra.

Lemma 2. Suppose MCR(A) and the auxiliary coherence constraints (inherited from DHR25 §6). Let GA be a g-decidable event with PQ^(G)=1, given by a finite Boolean combination of strict comparisons {w(b)>w(b)} from A, and CA a g-decidable event of positive true price. Then:

1TtTPt(¬GC)T0.

Proof. Write G=k=1KEk, Ek={w(bk)>w(bk)}, with true gaps δk:=w(bk)w(bk)>0 (K<, since A is finite).

By R2 the estimates track the true values on the corresponding schedules; real-valued coherence pins the credences Pt(Ek) to these estimates. Since the gap δk>0 is strict, the estimated comparison coincides with the true one with limiting frequency 1, whence 1TtTPt(¬Ek)0. By subadditivity Pt(¬G)kPt(¬Ek), so 1TtTPt(¬G)0.

Since PtQ^ and PQ^(C)>0, there exist t0 and c:=12PQ^(C)>0 with Pt(C)c for tt0. The auxiliary coherence constraints license the monotonicity PQ^(¬GC)=0 for self-referential C.

We split the sum at t0: the first t0 terms contribute t0/T0. For the rest: Pt(¬GC)(Pt(¬G)+βt)/c, where βt0, which completes the proof.

Lemma 3. If R1 is violated for (¬G,C): there exist an infinite g-decidable subschedule SSC and η>0 such that

1|ST|tST(1[¬Gt]Pt(¬GC))η

for infinitely many T. Then the inductor is O(g)-exploitable.

The proof of Lemma 3 is a standard trader argument and, for readability, is given in Appendix A.

We see that MCR does indeed make sense as a constraint to impose; however, to understand how strict MCR must be, we need to determine the bounds of its domain of application. It would be excessive to demand MCR for all possible propositions in the agent's language. It suffices for it to hold only for those propositions the agent uses to compare actions and choose a policy. This leads us to the notion of a minimal subalgebra.

3.1 The Minimal Subalgebra

The tiling chain conditions and compares values only through the pairs (o,o)P. This allows defining the minimal algebra, with Sa,a:={tN:ct=(a,a)}:

A=σ({rt:tN}{Sa,a:(a,a)Am×A})value layerσ({w(a,a)w(a~,a~)}{wo(a)wo(a~)})comparison layer,

restricted to the pairs P. (The value layer is the joint σ-algebra of rewards and schedule labels Sa,a, since the schedules {Sa,a} themselves partition N and restricting the rewards to their union yields all rt.)

MCR guarantees that the agent learns correctly, but it does not guarantee that the structure of the world (or of the policy) itself helps it make the right choice. For the action-coordination step (AC) in our chain to become exact, we may need a certain structural orderedness in the value table. We introduce two such structural notions: branch independence (VR1) and argmax stability (CBA).

Definition. A pair (o,o) satisfies VR1 under Q^ if the joint policy law factorizes: ρ(aa)=ρ(a) and σ(aa)=σ(a) for all a,a.

Definition.

Thus sCBACBA, and both properties have distinct notations.

If VR1 and CBA hold, then MCR automatically guarantees the exactness of the AC and KoDP steps. We prove this with the following lemmas, under Action-Coordination Honesty (ACH):

(ACH)ao:=argmaxaAowo(a)Am

Lemma 4. Suppose (o,o) satisfies sCBA (with dominator a), VR1, and ACH. Then:

Proof. (i) By VR1 the weight ρ(aa)=ρ(a) does not depend on a, so:

wo(a)=aρ(a)w(a,a)aρ(a)w(a,a)=wo(a).

For aa the strictness of sCBA gives an a0 with ρ(a0)>0 and w(a,a0)>w(a,a0), making the inequality strict. Thus argmaxaAmwo(a)={a}.

(ii) By ACH, aoAm; since ao maximizes wo over AoAm, it also maximizes over Am, and uniqueness from (i) forces ao=a.

(iii) Determinism of the policy gives σ()=δao; by VR1, σ(aa)=σ(a), whence v(a)=aσ(a)w(a,a)=w(ao,a)=w(a,a) by (ii).

(iv) By CBA, argmaxamaxaw(a,a)=argmaxaw(a,a); by (iii), argmaxav(a)=argmaxaw(a,a). The two argmaxes coincide — this is GAC.

Clearly, without VR1 the conclusion fails: with w(a,)=(0.9,0.2), w(a1,)=(0.8,0.1) and ρ(a)=δa2, ρ(a1)=δa1 we have wo(a)=0.2<0.8=wo(a1), even though a dominates pointwise. A comparison of VR1 and full branch independence is in Appendix B.

Lemma 5. Suppose value tracking of the o-block (R2) and MCR over the comparison events of the o-block. Then:

1TtTPt(¬GKoDPBa)T0,

and similarly with CaAC in place of Ba.

Proof. By the BRIA rule: π(o)=argmaxaE^t(uπ(o)=a). Consequently ¬GKoDPargmaxaE^t(uπ(o)=a)argmaxawo(a) — this is the g-decidable event "the argmax of the estimate does not coincide with the argmax of the true value in the o-block".

GKoDP is a finite Boolean combination of o-block comparisons, g-decidable. Under true prices PQ^(¬GKoDPBa)=0: the agent chooses a if and only if a is its estimate-argmax, which by limiting value tracking is the argmax of the true value. Lemma 2 yields the Cesàro-zero conclusion.

The only potential self-reference is in the hypothesis PQ^(¬GKoDPBa)=0. Universal instantiation transfers E^t(uπ(o)=a)wo(a) to the concrete a=a even under conditioning on Ba; real-valued coherence forces E[uBa] to cohere with the learned wo(a). This is exactly the auxiliary coherence constraints from DHR25 §6, no new generators are consumed.

Now we can assemble everything. We have:

The synthesis of these results gives the main theorem of the section.

Theorem 4. Let the agent be a BRIA over a Bayesian logical inductor. Suppose LSM, the structural premises making GAC, GKoDP exact under Q^ (Lemma 4, Lemma 5), and MCR(A). Then:

o amAmAo:1TtT[E^t(uπ(o)=am)maxaAmE^t(uπ(o)=a)]+T0.

Proof. The static chain (Theorem 3) is exact on the first step (FGF) under true prices and becomes an ε-chain with residuals εFGJ(t)+εAC(t)+εKoDP(t), arising outside the inner argmax and not amplified by the exact steps. The conclusion, however, is denominated in the round estimates E^t, whereas the chain is written in idealized prices; transferring to E^t introduces one more residual at the FGF step, which we first isolate explicitly.

For εFGF(t): the self-modification event {π(o)=am} and the non-modification event {π(o)=a^mπ(o¯)=a¯} have the same true value, but the agent sees different estimates. We define this gap as:

εFGF(t):=|E^t(uπ(o)=am)E^t(uπ(o)=a^mπ(o¯)=a¯)|.

The cell (a^m,a¯)Am×A is an o-block generator of the value layer. Thanks to the value-tracking condition R2, the agent inevitably comes to know the true values of these table cells over time, so 1TtTεFGF(t)0.

Similarly, the estimation residual |E^t(uπ(o)=a)wo(a)| for the non-self-modifying o-block is Cesàro-zero by R2. Hence the difference between the estimated and the true form of the bound is Cesàro-zero.

Since each of these error sources is Cesàro-zero, their sum also tends to zero.

Theorem 5. Under LSM, FGF, ACH, VR1, and sCBA under Q^,

1TtTρtT0MCR(A).

Proof. () This is exactly Theorem 4: with the skeleton fixed, MCR(A) supplies R1 and R2, the structural premises give exactness of the good events under Q^, and the assembly of Theorem 4 gives 1TtTρt0.

() We show by contraposition that abandoning either layer breaks the conclusion.

Suppose R2 is violated on some realization schedule Sa0,a0: there exists an infinite g-decidable SSa0,a0 with

1|ST|tST(rtw(a0,a0))0.

Then 1TtTw(ct) does not converge to v along the subsequence on which the pair (a0,a0) is chosen with positive frequency, and the regret εt has nonzero Cesàro mean along it; by Theorem 2 this is exactly a violation of the Cesàro-zeroness of εFGJ, so the tiling term does not tend to zero. Hence the value layer is indispensable.

Suppose R1 is violated for (¬GAC,C) for some C{Ca,CaAC}: there exist a g-decidable SSC and η>0 with

1|ST|tST(1[¬GAC,t]Pt(¬GACC))η

for infinitely many T. By Lemma 3 (Appendix A) the inductor is O(g)-exploitable, which is incompatible with the logical-induction criterion; equivalently, excluding the comparison layer admits an inductor whose true prices violate CBA, so GAC fails, the AC step collapses by a nonzero amount, and an O(g)-trader betting on the mismatch of argmaxes exploits the agent. Hence the comparison layer is indispensable.

Thus both layers are necessary and sufficient, and A is minimal.

3.2 Reflective Stability

Theorem 4 was proved by reasoning about the agent's prices. For tiling to be self-applicable, the predecessor must refrain from modification on grounds available to it itself.

Proposition 1. Every quantity in the tiling bound of Theorem 4 is an entry in the agent's price system Pt. In particular, all residuals Pt(¬GC) are the agent's own credences, not external quantities.

Proof. The proof of Theorem 4 is built along the scheme: transfer of FGF to estimates (the residual εFGF(t) is an R2-tracked quantity in E^t) ε-chain in E^t Lemma 1 introduces the residuals Pt(¬GC) Lemma 2 gives their Cesàro-zeroness. Every term at every step is a round price of the agent.

As a consequence, to act, the agent does not need a credence about its own price of the form Pt(Pt(¬GC)<ε). It takes the argmax of E^t, which is a price. The "belief about belief" tower does not start, since the action consumes a price, not a judgment about a price.

Lemma 6. Denote the modification successor S(am) and the unmodified successor Π. The reflective gap ρt(am):=[V^t(S(am))V^t(Π)]+. Then

LSMFGFUDT 1.0  ρt(am)=[E^t(uπ(o)=am)maxaAmE^t(uπ(o)=a)]+

where UDT 1.0 is the updateless optimization rule π(o):=argmaxaAoEp(uπ(o)=a).

Proof. The updatelessness of UDT 1.0 chooses the policy once; in the dynamized version, at each round t all coordinates of the committed policy are co-determined by one and the same price system E^t. In particular the o-coordinate (and the entire lower part) is evaluated at the round-t prices, not by a later estimate E^t with t>t. This follows from updatelessness itself — the predecessor at round t values the committed policy as a single object under E^t — and requires neither factorization of the branch probabilities (VR1) nor invariance of values across branches (VR2): the localization of the optimization onto the o-term (the content of Theorem 4 of the companion FGJ document, which uses VR1/VR2) is needed only to compute that term, whereas for the gap identity it suffices to have the qualitative fact that both valuations — V^t(S(am)) and V^t(Π) — are taken at the prices of one and the same round t. Therefore the valuation "keep UDT" is not a nested E^t[E^t[]] with t>t, but a single-level round conditional expectation. This is exactly what makes Lemma 6 applicable to the branches of §4, where VR1 is dropped.

Therefore "impose am" is V^t(S(am))=E^t(uπ(o)=am) by definition, and "keep UDT" is V^t(Π)=maxaAmE^t(uπ(o)=a) with the co-determined lower part. The downstream argmax is a single-level round coordination event (of type GKoDP), not a separately trusted future estimate. The correspondence between the form of the action-at-o (π(o)=am versus maxAm) and the form of the modification ("impose" versus "keep") is exactly the FGF + LSM bundle.

Theorem 6. Under the premises of Theorem 4 and without any additional self-trust assumption:

o amAmAo:1TtTρt(am)T0.

Proof. By Lemma 6, ρt(am) is the summand of Theorem 4; we apply Theorem 4. The only residual self-reference — the downstream coordination event H={argmaxaE^t(uπ(o)=a)=argmaxav(a)} — is GKoDP for the o-block (Lemma 5), g-decidable, PQ^(H)=1; its residual Pt(¬HC) is Cesàro-zero by Lemma 2. No new generator beyond the value and comparison layers is introduced. In the absence of a value gap, when H may have ties and PQ^(H)<1, ρt is controlled directly by the value regret εt (ties do not lose value), which is Cesàro-zero by Theorem 2; the indicator of H is then not needed.

As an output, the minimal tiling condition preserves only VR1 as an external axiom. It is needed for Lemma 4, since without it the pointwise dominator may fail to be the agent's marginally optimal action. Globally, in the sections below, we consider various directions for removing this constraint.


4. Working with VR1

Regarding VR1, we see two options for solving the problem. On the one hand, we can impose a condition on the structure of prices under the given policy, that is, essentially state the non-guaranteedness of the agent's optimal policy, since it may require the absence of this condition. This option is acceptable from the standpoint of pure UDT, since it does not presuppose a significant change to the theory, and allows simply translating the omniscience problem into the much more realistic problem of branch independence.

On the other hand, we propose to impose a condition on the agent's action schedule over time. This approach has a goal similar to GU in the sense that, by the agent making an agreement with itself regarding its own actions, we solve a significant problem of the decision-making system. In exchange for such a constraint, we obtain the ability to guarantee attainment of the optimum without omniscience of the agent, and therefore we consider it promising.

4.1 Bounded Optimality

Definition. A regime d=(o,o)P satisfies realized cross-block domination RD if:

(RD)aA: w(ao,a)=maxaAmw(a,a).

Clearly, RD is weaker than the bundle VR1 + sCBA: there VR1sCBAACHRD (Lemma 4(ii)), so RD is a strict weakening. The converse is false: RD does not entail ρ(aa)=ρ(a) and may hold under coupled conditional laws.

Definition. A regime d is admissible if it satisfies ACH (aoAm) and RD under Q^. The set of admissible regimes is D; it is finite (A,O finite).

Lemma 7. If a regime d satisfies ACH and RD under Q^, then GAC holds under Q^.

Proof. Denote M(a):=maxaAmw(a,a).

(i) By RD: M(a)=w(ao,a) for every aA.

(ii) Under Q^ the policy is deterministic and π(o)=ao with Q^-probability 1. For any a with PQ^(π(o)=a)>0 the event {π(o)=ao} has full probability, so σ(aa)=δao(a), whence v(a)=w(ao,a). Here only the determinism of the realized policy is used, not the factorization VR1.

From combined: M(a)=v(a)=w(ao,a) for all a. Coincidence of the functions entails coincidence of their argmaxes — this is GAC.

Note that the cross-block conditional law ρ(aa) for aao is nowhere computed: this is an off-policy counterfactual, conditioning on a probability-zero event. Lemma 7 bypasses it entirely.

Lemma 8. Let d be admissible and GRD the event of RD holding together with the argmax condition:

GRD:={ao=argmaxaAmwo(a)aA:w(ao,a)maxaAmw(a,a)}.

Then for any g-decidable C of positive price:

1TtTPt(¬GRDC)T0,

and a violation of R1 for (¬GRD,C) entails O(g)-exploitability.

Proof. GRD is a finite Boolean combination of comparisons from the comparison layer A: o-block comparisons {wo(ao)wo(a)}aAm and cross-block comparisons {w(ao,a)w(a,a)}aAm,aA. All of them are g-decidable and pinned to the true order by value tracking (R2). Since d is admissible, PQ^(GRD)=1. Lemma 2 gives the vanishing. Necessity: a violation of R1 for (¬GRD,C) entails exploitability by a trader (Lemma 3).

Theorem 7.

LSMFGFACHMCR(A)(d(o,o)D)  am{aAm:ob(a)=o}:1TtT[E^t(uπ(o)=am)maxaAmE^t(uπ(o)=a)]+T0.

Proof. We reproduce the tiling chain of Theorem 4, replacing the single place where VR1 is used (the AC step) with Lemma 7. By admissibility of the regime, GAC is exact under Q^ (Lemma 7); Lemma 2 gives 1TPt(¬GACC)0 for C{Ca,CaAC}, so εAC(t) is Cesàro-zero. The residuals εFGJ(t) (Theorem 2) and εKoDP(t) (Lemma 5), as well as the transfer residual εFGF(t) (Step 0 of Theorem 4), are Cesàro-zero independently of VR1. The sum of Cesàro-zero sequences is Cesàro-zero.

Theorem 8. Under the premises of Theorem 7 over a finite D:

Proof. (i) Lemma 8. (ii) Theorem 7 and Theorem 6, quantified over D. (iii) D is finite, the maximum of V is attained; the pairwise comparisons {V(d)V(d)} are generators of the comparison layer A, pinned to the true order and inexploitable by R1. On d we apply (ii). The inequality V(d)Vglob is by definition of the supremum.

If the environment satisfies VR1, the globally optimal regime falls into D (Lemma 4(ii)), and V(d)=Vglob. If VR1 is violated, the agent may end up in an admissible component not containing the global optimum, and tiling is guaranteed only within D, since the axiom about the environment is replaced by an internally verifiable condition about the agent.

4.2 Global Optimality

Note that if the agent explores enough, that is, visits every cell of the table w infinitely often, then it reaches the global ceiling v, and tiling follows from optimality without structural requirements on the table.

We add three setup assumptions about the agent and the environment:

Definition. A GLIE commitment is a g-decidable commitment under which: (a) every schedule is infinite: (a,a)Am×A: |Sa,a|=; (b) the exploration frequency vanishes: 1T|{tT:ctctgreedy}|0.

Lemma 9. Under S, St, and a GLIE commitment, for each (a,a)Am×A the schedule Sa,a is infinite and g-decidable; by St and R2:

1|Sa,a,T|tSa,a,T(rtw(a,a))T0.

Proof. Infiniteness of each Sa,a — from GLIE(a). Tracking follows from St and R2: by St the reward on Sa,a is an O(g)-boundedly vMWC-random sequence with constant mean w(a,a); by the definition of vMWC-randomness (Def. 8 of Oesterheld et al., 2023), for each infinite g-decidable subschedule SSa,a the mean of centered rewards tends to zero, which is the statement of the lemma for S=Sa,a.

The cross-block coupling ρ(aa) is nowhere computed; the cells are realized and measured directly.

Lemma 10. For each amAm:

EQ^[uπ(o)=am]=FGFw(a^m,a¯)v.

The cell (a^m,a¯)Am×A is realized under GLIE (Lemma 9), so the GLIE commitment does not require executing any irreversible self-modification.

Proof. The equality is FGF; a^mAm by LSM, so (a^m,a¯) is a table cell and w(a^m,a¯)v. Realization is Lemma 9.

Lemma 11. Under S, St, L, and GLIE:

1TtTεt=1TtT(vw(ct))T0.

No value gap is assumed.

Proof. By Lemma 9 all cells are covered, in particular the cell of the global maximum v. Then the premises of Theorem 2 (Cesàro optimality) are satisfied: a covering BRIA exists, the test sets are infinite and g-decidable. We apply Theorem 2. The exploration rounds GLIE(b) have vanishing frequency and contribute o(1).

Theorem 9. Under FGF, LSM, S, St, L, MCR(A), and a GLIE commitment:

o amAmAo:1TtT[E^t(uπ(o)=am)maxaAmE^t(uπ(o)=a)]+T0.

No structural assumptions about the table w (gap, uniqueness of the maximum, dominator, VR1, RD) are used.

Proof. At round t the agent realizes ct with value w(ct)=vεt. The non-self-modifying choice realizes w(ct) by determinism wo(ao)=w(ct). By Lemma 10 the value of any modification w(a^m,a¯)v. Therefore the true preference for modification:

[w(a^m,a¯)w(ct)]+[vw(ct)]+=εt,

uniformly over amAm (any modifying cell v; Am is finite). By Lemma 11 1Tεt0, so the true preference is Cesàro-zero.

We must show that E^t(uπ(o)=am)E^t(uπ(o)=a^m)εt+εFGF(t)+o(1), where εFGF(t) is the FGF estimation residual.

By FGF (a premise of the problem) the true values coincide exactly: the value of the event {π(o)=am} equals w(a^m,a¯). The estimation residual εFGF(t):=|E^t(uπ(o)=am)w(a^m,a¯)| is the distance between the agent's estimate and the cell's true value (a^m,a¯). By Lemma 9 the cell (a^m,a¯) is realized infinitely often under GLIE; by R2 the estimates on the schedule Sa^m,a¯ track the true value, so 1TtTεFGF(t)0.

Similarly the estimation residual |E^t(uπ(o)=a)wo(a)| for the non-self-modifying o-block is Cesàro-zero by R2 on the o-block schedules.

Hence the difference between the estimated and the true form is Cesàro-zero; summing up gives the conclusion.

Theorem 10. Under the premises of Theorem 9:

o amAmAo:1TtTρt(am)T0.

Proof. By Lemma 6 ρt(am) is the summand of Theorem 9 (the identity holds without VR1, see the proof of Lemma 6); we apply Theorem 9. All quantities are denominated in Pt (Proposition 1); the downstream coordination event GKoDP for the o-block (Lemma 5) is Cesàro-zero by Lemma 2.

Clearly, for a one-step agent this route is unavailable. However, it is quite likely that this modification is not the only one available, and a certain category of analogous modifications of the agent's behavior is possible.


5. Open Questions

Weakening LSM. The present results rely on self-modification pinning a single policy point on the same inductor. Price closedness (Proposition 1) and the identity of Lemma 6 use a single ledger. If the successor merely reproduces the algebra A on its own inductor, two price systems arise, and Lemma 6 must be rewritten with an inter-agent valuation. Probably, the communication mechanism from Demski (2025) makes sense here, for example, the predecessor transmits not a prediction of the successor's actions but the structure of the algebra A and the MCR commitment over it.

Controllability of the coupling in the RD branch. Can one, by choice of policy, place the globally optimal regime into D, that is, is the cross-block coupling ρ(aa) controllable enough for the realized action to become the global dominator? In other words, there is so far no evidence whether we can finally get rid of RD without the need for internal agreements to achieve the global optimum under tiling.


6. Conclusion

The results obtained apply to different models. If the environment satisfies VR1, the work closes the task of weakening omniscience completely. If VR1 is violated and the agent is one-step, we can guarantee an optimal conditional tiling. If the agent is sequential with a commitment to explore, the internal GLIE agreement gives an unconditional global tiling.


Appendix A: Proof of Lemma 3

Statement. If R1 is violated for (¬G,C), there exist an infinite g-decidable SSC and η>0 with

1|ST|tST(1[¬Gt]Pt(¬GC))η

for infinitely many T. Then the inductor is O(g)-exploitable.

Proof. Let πt:=Pt(¬GC). The trader shorts the conditional contract (payoff 1[¬G], price πt, settled on SC) by a fixed fraction of wealth δ(0,min(η,12)) at each tS. The wealth multiplier (1+δ(1[¬Gt]πt)) lies in [1δ,1+δ](0,), so Wt0 everywhere. Using ln(1+x)xx2 for |x|12:

lnWTW0δtST(1[¬Gt]πt)δ2tST(1[¬Gt]πt)2.

The second term δ2|ST|. Along the witnessing subsequence Tj with tSTj(1[¬Gt]πt)η|STj|:

lnWTjW0|STj|δ(ηδ)j+.

Hence lim supnWn=+ with Wn0 everywhere. The prices πt are the agent's own g-bounded credences, SSC is g-decidable, so the trader is an O(g)-trader. Contradiction with the logical-induction criterion. The symmetric case (mean η) — by a long position.


Appendix B: sCBA + VR1 is Strictly Weaker than Full Branch Independence

Full branch independence (VR1, VR2, and additive separation w(a,a)=C+f(a)+g(a) with f uniquely maximized on Am) entails, under ACH, all the hypotheses of Lemma 4; the converse is false. Consequently, MCR(A) is strictly weaker than branch separability.

Proof. Forward. Additive separation entails CBA: argmaxaw(a,a)=argmaxaf(a)=:a does not depend on a; uniqueness of maxf ensures strictness sCBA. VR1 is a direct assumption. ACH is an external premise. All the hypotheses of Lemma 4 are satisfied.

Converse is false. Let:

Then a1 is the strict dominator sCBA, VR1 is violated (ρ(a1)ρ(a2)), but

wo(a1)=0.90.3+0.80.7=0.83>0.70.6+0.60.4=0.66=wo(a2),

so ao=a1=a and GAC holds — without VR1, only through sCBA and the specific coupling ρ. Additive separation does not hold here.


References

  1. A. Demski, N. Hsia, and P. Rapoport. Understanding Trust. Proceedings of ILIAD, Berkeley, 2025. (DHR25)

  2. C. Oesterheld, A. Demski, and V. Conitzer. A Theory of Bounded Inductive Rationality. TARK 2023, EPTCS 379, pp. 421–440, 2023.

  3. S. Garrabrant, T. Benson-Tilsen, A. Critch, N. Soares, and J. Taylor. Logical Induction. arXiv:1609.03543, 2016.

  4. A. Demski. Communication & Trust. Manuscript, September 16, 2025.