Active inference convergence estimation
Before answering this question, I would like to understand the convergence of active inference. Given my previous research in Convergence convexity relations, we can already make an estimate for Agency.
Curiosity
The KL divergence is smooth everywhere, except at points where the predicted distribution vanishes (i.e., assigns zero probability to some outcome).
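For reference, the standard definition I assume here is

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)},$$

which is infinitely differentiable in $(p, q)$ on the interior of the probability simplex, but blows up (and loses smoothness) wherever some $q(x) \to 0$ while $p(x) > 0$.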
Empowerment
It is defined through mutual information, as a maximum over action distributions:

$$\mathcal{E}(s) = \max_{p(a)} I(A; S' \mid s).$$

In expanded form:

$$I(A; S' \mid s) = \sum_{a}\sum_{s'} p(a)\, p(s' \mid s, a) \log \frac{p(s' \mid s, a)}{\sum_{a'} p(a')\, p(s' \mid s, a')}.$$

The Shannon entropy, like the KL divergence, is built from the expression $p \log p$, which is continuous on $[0, 1]$ but has an unbounded derivative as $p \to 0$.
This implies:
- The graph of the maximum may exhibit kinks, i.e., it may fail to be differentiable (see the example after this list).
- The maximum of smooth functions is a continuous function.
- The maximum of convex functions is also convex.
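A standard example (my illustration, not taken from the notes above): the two functions $f_1(x) = x$ and $f_2(x) = -x$ are smooth and convex, and their pointwise maximum

$$\max\{f_1(x), f_2(x)\} = |x|$$

is continuous and convex but not differentiable at $x = 0$. So smoothness of the individual functions does not, by itself, guarantee smoothness of the maximum.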
However, this does not answer our question; I will read more of the literature to understand whether the smoothness of the maximum can be inherited from the smoothness of the underlying functions.
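To make the object concrete, here is a minimal numerical sketch of one-step empowerment as channel capacity. The Blahut-Arimoto iteration is my choice for illustration; nothing above prescribes it, and the channel matrix in the example is an arbitrary placeholder.

```python
import numpy as np

def empowerment(channel, n_iter=500, tol=1e-12):
    """One-step empowerment max_{p(a)} I(A; S') for a finite channel.

    channel: array of shape (n_actions, n_states); row a holds p(s' | a).
    Returns the capacity in nats, computed with the Blahut-Arimoto iteration.
    """
    P = np.asarray(channel, dtype=float)
    p_a = np.full(P.shape[0], 1.0 / P.shape[0])    # start from a uniform action distribution

    def kl_rows(q):
        # KL(p(s'|a) || q(s')) for every action a, skipping zero entries of p(s'|a)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(P > 0, np.log(P / q), 0.0)
        return (P * log_ratio).sum(axis=1)

    for _ in range(n_iter):
        new_p = p_a * np.exp(kl_rows(p_a @ P))     # multiplicative Blahut-Arimoto update
        new_p /= new_p.sum()
        if np.max(np.abs(new_p - p_a)) < tol:
            p_a = new_p
            break
        p_a = new_p

    return float(p_a @ kl_rows(p_a @ P))           # I(A; S') at the (near-)optimal p(a)

# Two noisy actions: capacity is log(2) - H(0.1) ~ 0.368 nats.
print(empowerment([[0.9, 0.1], [0.1, 0.9]]))
```

The maximization over $p(a)$ is exactly where the potential kinks discussed above can enter.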
Useless:
https://scispace.com/pdf/l-p-smoothness-on-weighted-besov-triebel-lizorkin-spaces-in-2323s4lmnc.pdf
https://people.math.sc.edu/sharpley/Papers_PDF/DeVoreSharpley1984.pdf
In this extensive work, the smoothness of the maximum function is treated either as a consequence of the smoothness of the original function or as a criterion for membership in certain smoothness classes (my intuition is proving correct here). It is established that the membership of
This sounds extremely unexpected. And it is bad: it turns out that mutual information is infinitely differentiable in the neighborhoods where all of the probabilities involved are strictly positive.
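For completeness, the reason (assuming finite alphabets) is that mutual information decomposes into a finite sum of $p \log p$ terms,

$$I(X; Y) = \sum_{x,y} p(x,y)\log p(x,y) \;-\; \sum_x p_X(x)\log p_X(x) \;-\; \sum_y p_Y(y)\log p_Y(y),$$

and $t \mapsto t \log t$ is real-analytic for $t > 0$, so $I$ is $C^\infty$ wherever all the probabilities involved stay strictly positive.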
If it is just smooth, then
Since we have obtained a convergence estimate, it can be used to estimate the volume that agentic functions occupy in the space of all functions.
https://arxiv.org/pdf/1610.01145
According to the above article, for functions from the unit ball of the Sobolev space $W^{n,\infty}([0,1]^d)$, a ReLU network can reach approximation accuracy $\varepsilon$ with depth of order $\ln(1/\varepsilon)$ and on the order of $\varepsilon^{-d/n}\ln(1/\varepsilon)$ weights; that is, convergence over parameters and depth is governed by the smoothness order $n$ relative to the input dimension $d$.
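To get a feel for how this bound scales with smoothness, here is a small sketch that just evaluates the weight-count estimate $\varepsilon^{-d/n}\ln(1/\varepsilon)$ with all constants dropped; the values of $d$, $n$ and $\varepsilon$ are hypothetical placeholders, not numbers from these notes.

```python
import math

def weight_estimate(eps, d, n):
    """Order-of-magnitude weight count from the bound N(eps) ~ eps**(-d/n) * ln(1/eps).
    Constants and lower-order factors are dropped, so only the scaling is meaningful."""
    return eps ** (-d / n) * math.log(1.0 / eps)

# Hypothetical d, n, eps, chosen only to show how strongly the smoothness order n helps.
for n in (1, 2, 4):
    for eps in (1e-1, 1e-2, 1e-3):
        print(f"d=16, n={n}, eps={eps:g}: ~{weight_estimate(eps, d=16, n=n):.2e} weights")
```

Smoother targets (larger $n$) need dramatically fewer weights, which is the direction the notes take below.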
I think we can use exactly this estimate in our work, as it will give at least an approximate figure for current networks. For modern models with
This means that
Well, at least this is something good. According to my previous estimate in Joar questions,
That is, agentic functions occupy only a tiny fraction of the space of all functions (though I am not sure what I was hoping for), and this fraction will shrink further as model capabilities grow. Still, I would attribute this more to the properties of the function space itself: in high dimensions it is hard for any class of functions to fill a noticeable volume. The real point to note is that agentic functions tend, in principle, to converge faster than other functions, and since in practice we mostly work with piecewise linear functions, the forecast is very disheartening.