Convergence and complexity relations

Continuing the topic of Joar's questions, I would like to draw attention to the limits on what neural networks can do as a function of task complexity.


https://math.bu.edu/people/mkon/hhh12.pdf
The complexity of constructing neural networks for approximating functions with a given accuracy can be divided into neuron complexity and information complexity.

For linear combinations and a single layer:

  1. Neuron complexity grows polynomially:

    NC^wor(W_{r,p}, D_s; ε) = Θ(ε^(-1/(r+1))).

  2. Information complexity grows similarly, but more information (samples) than neurons is required to achieve a given accuracy:

    IC^wor(W_{r,p}; ε) = Θ(ε^(-1/(r+1/q))), where 1/p + 1/q = 1.

Moreover, the presence of noise immediately adds Ω(ε^(-2)) to the information complexity.
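As a rough numerical sketch of these scalings (the constants hidden in Θ(·) are unknown and set to 1 here, so only the growth rates as ε → 0 are meaningful; the function names are mine):

```python
# A rough sketch of the two worst-case complexity scalings above.
# The constants hidden in Theta(.) are unknown and set to 1 here,
# so only the growth rates as eps -> 0 are meaningful.

def neuron_complexity(eps, r):
    # NC^wor = Theta(eps^(-1/(r+1)))
    return eps ** (-1.0 / (r + 1))

def information_complexity(eps, r, p):
    # IC^wor = Theta(eps^(-1/(r+1/q))), where 1/p + 1/q = 1 (requires p > 1)
    q = p / (p - 1.0)
    return eps ** (-1.0 / (r + 1.0 / q))

for eps in (1e-1, 1e-2, 1e-3):
    nc = neuron_complexity(eps, r=2)
    ic = information_complexity(eps, r=2, p=2)
    print(f"eps={eps:.0e}  NC ~ {nc:.1f}  IC ~ {ic:.1f}")
```

Since 1/q < 1 for finite p > 1, the exponent in IC is larger than in NC, so IC grows faster, which matches the remark that more samples than neurons are needed.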

These estimates do not account for time, but they look reasonable regardless. If we accept agency through active inference, they give us a worst-case estimate.

Nevertheless, for our research this is of limited use, since for smooth functions multilayer networks with suitable activations always achieve NC = Θ(log(1/ε)); see https://arxiv.org/pdf/1901.02220. That article does not provide time estimates, but it shows the excellent applicability of deep networks to a wide class of functions, including the Weierstrass function. We conclude that "smoothness substitutes", such as Sobolev spaces, are important in this matter.


https://arxiv.org/pdf/1610.01145

This work provides an estimate for deep networks for Sobolev space.

For any function from the Sobolev space W^{n,∞}([0,1]^d), there exists a ReLU network architecture achieving error ε with depth O(ln(1/ε)) and O(ε^(-d/n) ln(1/ε)) weights/computational units.
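A small calculator for these upper bounds (the constants hidden in O(·) are unknown and set to 1, so the numbers indicate scaling only; `relu_net_size` is my own name):

```python
import math

# Sketch of the upper bounds quoted above: depth O(ln(1/eps)) and
# O(eps^(-d/n) * ln(1/eps)) weights for the unit ball of W^{n,inf}([0,1]^d).
# The constants hidden in O(.) are unknown and set to 1 here.

def relu_net_size(eps, d, n):
    depth = math.log(1.0 / eps)
    weights = eps ** (-d / n) * math.log(1.0 / eps)
    return depth, weights

depth, weights = relu_net_size(eps=1e-2, d=4, n=2)
print(f"depth ~ {depth:.1f}, weights ~ {weights:.0f}")
```

Note how the weight count, unlike the depth, blows up through the exponent d/n as the dimension grows relative to the smoothness.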

The work establishes a lower bound of ε^(-d/n) weights and connections for the unit ball in a Sobolev space.

In a scenario where the network architecture can adapt to the function being approximated (rather than being fixed), even more efficient upper bounds can be obtained. For example, for 1D Lipschitz functions, a depth of 6 and complexity O(ε^(-1)) (instead of O(ε^(-1) ln(1/ε))) are shown.

For a ReLU network architecture capable of approximating any function from F_{d,n} with error ε, at least c·ε^(-d/(2n)) weights are required. With the network depth constrained to O(ln^p(1/ε)), the lower bound becomes c·ε^(-d/n)·(ln(1/ε))^(-2p-1).

From this, we can make estimates by varying the number of weights.
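For instance, inverting the fixed-architecture lower bound W ≥ c·ε^(-d/(2n)) gives the best error achievable under a given weight budget, ε ≥ (c/W)^(2n/d). A sketch with the unknown constant c set to 1 (`best_error` is my own name):

```python
# Sketch: invert the fixed-architecture lower bound W >= c * eps^(-d/(2n))
# to get the best achievable error for a given weight budget W:
# eps >= (c / W)^(2n/d). The constant c is unknown and set to 1 here.

def best_error(W, d, n, c=1.0):
    return (c / W) ** (2.0 * n / d)

for W in (1e3, 1e5, 1e7):
    print(f"W={W:.0e}  best eps >= {best_error(W, d=8, n=2):.2e}")
```

Doubling the dimension d halves the exponent 2n/d, so the same weight budget buys a much worse error in high dimension.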

The article https://arxiv.org/pdf/1709.05289 proposes precise estimates for piecewise linear functions.


https://www.research-collection.ethz.ch/server/api/core/bitstreams/2b64dea5-5d0f-4d86-b076-6d8f825d7cad/content

In any ambiguous situation like the Weierstrass function, one can use network catalogues to obtain combinations of functions and, correspondingly, logarithmic complexity. Of course, this does not cover wide classes, but it works for individual functions and indirectly confirms https://arxiv.org/pdf/1610.01145.


https://www.mins.ee.ethz.ch/pubs/files/deep-approx-18.pdf

According to this article, neural networks also approximate affine systems very well, at a polynomial rate.


https://arxiv.org/pdf/2502.02679v1

The VC dimension (Vapnik-Chervonenkis dimension) is a measure of the complexity, or expressive power, of the class of functions a network can implement. Although the article itself is of little use, it is worth a look.

The VC dimension of a neural network lies between W and W·log V, where V is the number of nodes and W is the number of weights.
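A quick sketch of this range for a toy network (the layer sizes are made up; the base of the logarithm is not specified in the note, so base 2 is assumed here, and the constants in the bounds are omitted):

```python
import math

# Sketch of the VC-dimension range quoted above: between W and W * log(V)
# for a network with V nodes and W weights. Base-2 log is an assumption;
# constants in the bounds are omitted.

def vc_dim_range(num_weights, num_nodes):
    return num_weights, num_weights * math.log2(num_nodes)

# A toy fully connected net with layers 10 -> 50 -> 50 -> 1,
# counting weights and biases. The sizes are purely illustrative.
V = 50 + 50 + 1
W = (10 * 50 + 50) + (50 * 50 + 50) + (50 * 1 + 1)
lo, hi = vc_dim_range(W, V)
print(f"W={W}, V={V}, VC dimension roughly in [{lo}, {hi:.0f}]")
```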


https://arxiv.org/pdf/1905.01208 - introduces the concept of approximation spaces.

Deep neural networks can efficiently approximate piecewise polynomial functions, and functions of low smoothness can often be represented by such piecewise polynomials. Note: not always, which is important to remember.
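A minimal illustration of why ReLU networks handle piecewise linear targets so well: the non-differentiable function |x| is represented exactly by a single hidden layer with two ReLU units.

```python
# |x| = ReLU(x) + ReLU(-x): a one-hidden-layer ReLU network with two units
# reproduces this non-smooth, piecewise linear function with zero error.

def relu(z):
    return max(z, 0.0)

def abs_via_relu(x):
    return relu(x) + relu(-x)

for x in (-2.5, -1.0, 0.0, 3.0):
    assert abs_via_relu(x) == abs(x)
print("exact on all test points")
```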

The complexity of a neural network is measured by the number of connections (connectivity) or the number of neurons; connectivity relates to the number of operations needed to apply the model. This has nothing to do with connectivity in the graph-theoretic sense, it is merely a similar term.

If a function f has a certain smoothness s (belongs to the Besov space B^s_{τ,q}(Ω)), then it can be approximated by a deep neural network (with activation ρ_r) with error O(n^(-a)), where the exponent a is related to s.

The article focuses on polynomial rates of error decay O(n^(-a)), in contrast to the faster exponential rates Θ(log(1/ε)), but it provides an estimate in terms of smoothness.

A network can approximate a function with rate exponent a, which can be any value less than (r_0 + min{1, 1/p})/d, where r_0 = r when r > 2 and r_0 = 0 otherwise.

That is, for multivariate functions, the approximation rate a depends on the function's smoothness s, the activation type r, and is divided by the dimension d. This reflects the "curse of dimensionality" but still indicates the possibility of approximation.
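To see the effect of the division by d numerically, here is a sketch with a rate of the simplified form a = s/d (smoothness over dimension, a stand-in for the exact bound above, with constants omitted; `error_bound` is my own name):

```python
# Sketch of the curse of dimensionality in the rate exponent: with a rate
# of the simplified form a = s/d (smoothness over dimension), the error
# O(n^(-a)) decays ever more slowly as the dimension d grows.

def error_bound(n, s, d):
    return n ** (-s / d)

n = 10_000  # number of connections
for d in (1, 4, 16):
    print(f"d={d:2d}  error ~ n^(-s/d) = {error_bound(n, s=2.0, d=d):.2e}")
```

With the same budget of n connections, going from d = 1 to d = 16 turns an error of about 1e-8 into one barely below 1, which is the curse of dimensionality in one line.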

The theorems below establish upper bounds on the exponent a achievable by a neural network: if a network approximates a function with error O(n^(-a)), then that function must have a certain minimal smoothness s.

These inverse estimates show that the achievable rate exponent a is proportional to the network depth L (or L/2) and to the smoothness s of the approximated function (but no more than 2). Even if the function has very low smoothness s (e.g., s < 1, meaning non-differentiable functions), a deep network can still achieve a relatively high exponent a thanks to the factor L. The higher a is, the faster the error decreases with the number of operations.

Summing up: we can establish upper bounds on the error decay rate from the function's smoothness and the network's architectural parameters (depth, activation type, number of connections), with the rate being polynomial or slower.


In summary: This review provides the following insights:

  1. According to https://arxiv.org/pdf/1905.01208, we can estimate the convergence rate for Besov-smooth functions in terms of model parameters and function smoothness.
  2. According to https://arxiv.org/pdf/2502.02679v1, there are limitations on exact binary classification with respect to model parameters.
  3. According to https://arxiv.org/pdf/1610.01145, we can estimate convergence for Sobolev-smooth functions with respect to model parameters and error.
  4. According to https://arxiv.org/pdf/1901.02220, we have estimates for smooth functions.