How sparsity affects convergence

The convergence rate in sparse problems is usually expressed in terms of the number of iterations T and the dimensionality d.

For smooth convex functions, standard gradient descent achieves a rate of O(1/T). However, if the optimal solution is s-sparse (has only s non-zero elements), specialized algorithms (e.g., proximal methods with L1 regularization) can achieve rates whose dependence on the dimension d is only logarithmic, on the order of O(s·log(d)/T). This is significantly faster when s ≪ d.
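As a concrete illustration of such a proximal method, here is a minimal sketch of ISTA (proximal gradient descent for the L1-regularized least-squares objective). All names and parameter values here are illustrative, not taken from the text: a random sensing matrix `A`, an s-sparse target `x_true`, and a small regularization weight `lam`.

```python
import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of lam*||x||_1: shrinks each coordinate toward zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(A, b, lam=0.01, iters=500):
    """Proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz const of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)            # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Sparse recovery demo: d = 50 dimensions, s = 3 non-zeros, 25 measurements.
rng = np.random.default_rng(0)
d, n, s = 50, 25, 3
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:s] = [1.0, -2.0, 1.5]
b = A @ x_true
x_hat = ista(A, b)
```

The soft-thresholding step is what exploits sparsity: coordinates whose gradient signal is below the threshold are set exactly to zero, so the iterate only ever "pays" for the s active coordinates rather than all d of them.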

Recall that in Active inference convergence estimation we have already shown that E is convex and smooth. This means that not only is the dimension dependence of our convergence rate logarithmic, but it can be improved further with L1 regularization, since the functions arising in RL are typically sparse.

In overparameterized deep linear networks, gradient descent has an implicit bias toward low-rank solutions. Strictly speaking, this phenomenon is not fully understood, but it suggests that even without explicit L1 regularization we may still obtain accelerated convergence under overparameterization.

Results:

  1. We know how to increase the speed of the agent's convergence.
  2. It is possible that overparameterization leads to the same speed-up.