Active inference convergence estimation

Before answering this question, I want to understand the convergence of active inference. Given my previous research in Convergence convexity relations, we can already make a first estimate for Agency.

Curiosity

$$\mathcal{L}_{\text{Curiosity}} = D_{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$$

The KL divergence is smooth everywhere except at points where the predicted distribution satisfies $q(x)=0$ for an event $x$ that actually occurs (i.e., $p(x)>0$). If $q(x)\to 0$ while $p(x)>0$, then the term $p(x)\log q(x)$ tends to $-\infty$, and the divergence $D_{KL}(p\,\|\,q)$ tends to $+\infty$. The gradient at these points becomes undefined or infinite. In practice, we never allow $q$ to reach absolute zero: a small constant $\epsilon>0$ is added to the argument of the logarithm, for example $\log(q(x)+\epsilon)$. With this practical addition, the function becomes $C^\infty$-smooth on the admissible probability simplex where $q_i>\epsilon$.
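A minimal sketch of the $\epsilon$-smoothing described above (the function name and the default $\epsilon$ are my own placeholders):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p||q) with the log arguments shifted by eps.

    The shift keeps the logarithm (and hence the gradient) finite
    even where q(x) -> 0, as described in the text above.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Where q has a zero but p does not, the raw divergence is infinite;
# the smoothed version stays finite (though large).
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))
```

The same trick is what frameworks implicitly rely on when clipping probabilities before taking logs.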

Empowerment

It is defined through mutual information:

$$\mathcal{L}_{\text{Empowerment}} = \max_{p(a^n)} I(A^n;\, S_{t+n})$$

In expanded form:

$$I = H(S_{\text{next}}) - H(S_{\text{next}} \mid A)$$

The Shannon entropy, like the KL divergence, is built from the expression $-p(x)\log p(x)$ and is not perfectly smooth due to the singularity at zero, which, as we saw above, is fixable. Since mutual information is a simple algebraic combination of entropies, it inherits the same smoothness characteristics. The problem is that the empowerment definition also includes a maximization step ($\max$), where $A$ is a choice of action that can be discrete or continuous.
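The entropy decomposition and the outer maximization can be sketched on a toy discrete channel. The channel and the grid search below are illustrative assumptions of mine, not part of any standard empowerment implementation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_a, p_s_given_a):
    """I(A; S') = H(S') - H(S'|A) for a discrete channel p(s'|a)."""
    p_s = p_a @ p_s_given_a                       # marginal over next states
    h_cond = float(np.sum(p_a * np.array([entropy(r) for r in p_s_given_a])))
    return entropy(p_s) - h_cond

# Hypothetical noiseless 2-action channel: each action deterministically
# selects a next state, so empowerment (channel capacity) is 1 bit.
p_s_given_a = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
grid = np.linspace(0.0, 1.0, 101)
empowerment = max(mutual_information(np.array([a, 1 - a]), p_s_given_a)
                  for a in grid)
print(empowerment)  # -> 1.0 bit
```

The `max` over the grid is where the potential non-smoothness enters: each candidate $I$ is smooth in the channel, but the pointwise maximum of smooth functions need not be.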

This implies an open question: does the smoothness of the maximum depend on the smoothness of the functions it is taken over? This does not yet answer our question; I will read more of the literature to find out.


Useless:
https://scispace.com/pdf/l-p-smoothness-on-weighted-besov-triebel-lizorkin-spaces-in-2323s4lmnc.pdf

https://www.sciencedirect.com/science/article/pii/1385725888900078?ref=pdf_download&fr=RR-2&rr=9dc41aaaeeb789fe


https://people.math.sc.edu/sharpley/Papers_PDF/DeVoreSharpley1984.pdf

In this extensive work, the smoothness of the maximal function is treated either as a consequence of the smoothness of the original function or as a criterion for membership in certain smoothness classes (my intuition is proving correct here). It is established that membership of $f$ in Sobolev and Besov smoothness spaces is equivalent to the corresponding maximal function belonging to a corresponding Sobolev or Besov space (Theorem 7.1). By this result, since an absolutely smooth function possesses $C^\infty$ smoothness, the corresponding maximal function will also belong to the corresponding $C^\infty$ classes.


This is extremely unexpected, and it is bad news. It turns out that mutual information is infinitely differentiable on the neighborhoods where $q>0$, and this does not depend at all on the agent's actions or sensor states. It feels like I should have started the analysis here rather than getting bogged down in Besov smoothness estimates, but oh well.

If the objective is this smooth, then $N_C = \Theta(\log(1/\varepsilon))$, meaning that if a function contains anything converging worse than $O(\log)$ apart from the agent-dependent terms, optimization will primarily drive the minimum down in those agent-dependent terms first. I would be very cautious without re-verification, but this strongly suggests that models optimizing agent-dependent objectives likely become agents before they learn to perform their tasks well.
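A toy sanity check of the $\Theta(\log(1/\varepsilon))$ rate for smooth, strongly convex objectives: gradient descent on $f(x)=x^2$ (an arbitrary choice of mine) contracts the error geometrically, so the iteration count grows only logarithmically in $1/\varepsilon$.

```python
def iterations_to_eps(eps, x0=1.0, lr=0.25):
    """Count gradient steps on f(x) = x**2 until |x| <= eps."""
    x, n = x0, 0
    while abs(x) > eps:
        x -= lr * 2 * x   # gradient step; contraction factor 1 - 2*lr = 0.5
        n += 1
    return n

print(iterations_to_eps(1e-3))  # -> 10 steps
print(iterations_to_eps(1e-6))  # -> 20: doubling the digits only doubles the steps
```

Halving the error per step means $n \approx \log_2(x_0/\varepsilon)$, which is exactly the $\Theta(\log(1/\varepsilon))$ behaviour claimed above.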


Since we have obtained a convergence estimate, it can be used to estimate the volume occupied by agentic functions.

https://arxiv.org/pdf/1610.01145
According to the above article, the convergence in parameter count is $O(\ln(1/\varepsilon))$ with unbounded depth, while with depth bounded by $L$:

$$N = \Omega\!\left(\varepsilon^{-\frac{1}{2(L-2)}}\right).$$
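Ignoring the hidden constant, the bound can be read off numerically. The helper below is only an illustration of how the forced parameter count scales with depth, not a claim about real networks:

```python
def n_lower_bound(eps, L):
    """Parameter count forced by accuracy eps at depth L (constant dropped)."""
    return eps ** (-1 / (2 * (L - 2)))

# Deeper networks relax the fixed-depth lower bound sharply.
for L in (4, 10, 20):
    print(L, n_lower_bound(1e-6, L))
```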

I think we can use exactly this estimate in our work, as it provides at least an approximate bound for current networks. For modern models with $L \approx 20$ and $N \approx 10^{10}$, we get:

$$\begin{cases} N = \Omega\!\left(\varepsilon^{-\frac{1}{2(L-2)}}\right) \\ L = 20,\; N = 10^{10} \end{cases} \;\Rightarrow\; 10^{10} = \Omega\!\left(\varepsilon^{-\frac{1}{36}}\right)$$

This means that $\varepsilon^{-1/36}$ must be less than or equal to $10^{10}$ (since $\Omega$ sets the lower bound). To find the best convergence (which would correspond to the upper bound $O$, but we use the lower bound for the estimate), we assume equality:

$$\varepsilon^{-\frac{1}{36}} = 10^{10} \;\Rightarrow\; \varepsilon = (10^{10})^{-36} \;\Rightarrow\; \varepsilon = 10^{-360}$$
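A quick arithmetic check of this derivation, done in log space since $\varepsilon$ itself underflows floating point:

```python
# Solve N = eps**(-1/(2*(L-2))) for eps with L = 20, N = 10**10.
L, log10_N = 20, 10
exponent = 1 / (2 * (L - 2))          # 1/36
log10_eps = -log10_N / exponent       # -> -360, i.e. eps = 10**-360
print(exponent, log10_eps)
```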

Well, at least this is something good. According to my previous estimate in Joar questions,

$$P(\text{function is agentic}) = \frac{\text{Measure}_\epsilon}{\text{Total measure of the space}} = \frac{(2\epsilon)^n}{M^n} = \left(\frac{2\epsilon}{M}\right)^n = \left(\frac{2\cdot 10^{-360}}{M}\right)^n$$

That is, agentic functions occupy a vanishingly small fraction of the space of all functions (though I don't know what I was hoping for), and this fraction will become even smaller as model capabilities grow. Nevertheless, I would attribute this more to the properties of the function space itself, since it is difficult to fill any volume in high dimensions. In reality, we should note that agentic functions tend to converge faster than others, and since in practice we mostly work with piecewise-linear functions, the forecast is very disheartening.
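The fraction above underflows ordinary floating point immediately, so a log-space evaluation makes its scale concrete. The values of $M$ (the measure scale) and $n$ (the dimension) are placeholders of mine, since the note leaves them unspecified:

```python
import math

def log10_agentic_fraction(log10_eps=-360, M=2.0, n=1000):
    """log10 of (2*eps/M)**n, computed in log space to avoid underflow."""
    return n * (math.log10(2) + log10_eps - math.log10(M))

print(log10_agentic_fraction())  # ~ -360000: an astronomically small fraction
```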