Jun 22, 2021

This was written to promote a book [0]

"As a first step in that direction, we discuss an upcoming book on the principles of deep learning theory that attempts to realize this approach.

Comments: written for a special issue of Machine Learning: Science and Technology as an invited perspective piece"

So take it for what it's worth.

[0] https://deeplearningtheory.com/PDLT.pdf

Jun 19, 2021

I just finished looking through the manuscript [https://deeplearningtheory.com/PDLT.pdf]. The mathematics is heavy for me, especially for a quick read, but one great thing I see is that the authors have reduced dependencies on external literature by inlining the various derivations and proofs instead of just providing references.

## The epilogue (page 387 of the book, 395 in the PDF) gives a good overview, summarized below per my own understanding

Networks with far more parameters than training samples should, by conventional wisdom, overfit; the number of parameters is conventionally taken as a measure of model complexity. A very large network can perform well on the training data simply by memorizing it, while performing poorly on unseen data. Yet empirically these very large networks still generalize well, i.e., they learn good patterns from the training data.

The authors show that model complexity (or, I would say, the ability to generalize well) for such large networks depends on the depth-to-width ratio:

* When the network is much wider than it is deep (the ratio approaches zero), the neurons in the network don't have as many "data-dependent couplings". My understanding is that while the large width gives the network power in terms of parameter count, it has less opportunity for a correspondingly large number of feature transformations. While the network can still fit the training data well [1, 2], it may not generalize well. In the authors' words, when the depth-to-width ratio is close to zero (page 394), "such networks are not really deep" (even if the depth is much more than two) "and they do not learn representations."

* On the opposite end, when the network is very deep (the ratio approaching one or larger), {rephrasing the authors from my limited understanding} the network requires a non-Gaussian description of its parameter space, which makes it "not tractable" and not practically useful for machine learning.

While it makes intuitive sense that a network's capability to find good patterns and representations depends on the depth-to-width ratio, the authors supply the mathematical underpinnings behind this, as briefly summarized above. My previous intuition was that having a larger number of layers allows for more feature transformations, making it easier for the network to learn. The new understanding via the authors' work is that if, for the same number of layers, the width is increased, the network now has a harder job learning feature transformations commensurate with the now larger number of neurons.
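As a rough back-of-envelope illustration (my own sketch, not from the book): for a plain fully-connected network with hidden width n and depth L, the parameter count is dominated by L weight matrices of size n × n, so at a fixed parameter budget, depth and width trade off and the depth-to-width aspect ratio L/n can vary by orders of magnitude. The names `mlp_params` and `aspect_ratio` below are my own, purely for illustration.

```python
def mlp_params(depth, width):
    """Approximate parameter count of a plain MLP: `depth` hidden
    weight matrices of size width x width (biases and the input/output
    layers are ignored for this rough estimate)."""
    return depth * width * width

def aspect_ratio(depth, width):
    """Depth-to-width ratio, the quantity the book's epilogue keys on."""
    return depth / width

BUDGET = 1_000_000  # hypothetical ~1M-parameter budget

for depth in (2, 10, 100):
    # width that roughly exhausts the budget at this depth
    width = int((BUDGET / depth) ** 0.5)
    print(f"depth={depth:3d}  width={width:4d}  "
          f"ratio={aspect_ratio(depth, width):.4f}  "
          f"params~{mlp_params(depth, width)}")
```

All three configurations have roughly the same raw capacity (~1M parameters), yet their aspect ratios span 0.003 to 1.0, which per the summary above is what distinguishes a wide "not really deep" network from one in the intractable deep regime.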

## My own commentary and understanding (some from before looking at the manuscript)

If the network is very small, it won't be able to fit the training data well. A larger network generally has more 'representation' power, allowing it to capture more complex patterns.

The ability to fit the training data is, of course, different from the ability to generalize to unseen data. Merely adding more representation power can allow the network to overfit. As the network size starts exceeding the size of the training data, it could have a tendency to just memorize the training data without generalizing, unless something is done to prevent that.

So as the size of the network is increased with the intention of giving it more representation power, we need something more, such that the network first learns the most common patterns (highest compression, but lossy) and then keeps on learning progressively more intricate patterns (less compression, more accurate).

My intuition so far was that achieving this was an aspect of the training algorithm and cell-design innovations, and also of the depth-to-width ratio. The authors, however, show that it depends on the depth-to-width ratio in the way specified above. It is still counter-intuitive to me that algorithmic innovation may not play a role here, or perhaps I am misunderstanding the work.

So the 'representation power' of the network, and with it the ability to fit the training data, would generally increase with network size, whereas the ability to learn good representations and generalize depends on the depth-to-width ratio. Loosely speaking then, to increase accuracy on the training data itself, model size may need to be increased while keeping the aspect ratio constant (at least while the training data size remains larger), whereas to improve generalization and find good representations at a given model size, the aspect ratio should be tuned.

Intuitively, I think that in a pathological case where the network is so large that merely its width (as opposed to width times depth) exceeds the size of the training data, then even if the depth-to-width ratio is chosen according to the authors' guidance (page 394 in the book), the model would still fail to learn well.

Finally, I wonder what the implications of the work are for networks with temporal or spatial weight-sharing: convolutional networks, recurrent and recursive networks, attention, transformers, etc. For example, for recurrent neural networks, the effective depth of the network depends on how long the input sequence is, i.e., the depth-to-width ratio could be varying simply because the input length varies. The lessons from the authors' work should, I think, apply directly if each time step is treated as a training sample on its own, i.e., if backpropagation through time is not considered. However, I wonder whether the authors' work still constrains how long the input sequences can be, as the non-Gaussian regime may start coming into the picture.
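To make the RNN point above concrete (my own hypothetical sketch, not from the book): unrolling a recurrent network over a sequence of length T can be viewed as a feed-forward network with T layers sharing one weight matrix, so the effective depth-to-width ratio scales with the input length rather than being fixed by the architecture. The function name and the width value below are assumptions for illustration only.

```python
def rnn_effective_aspect_ratio(seq_len, hidden_width):
    """Effective depth-to-width ratio of an RNN unrolled over seq_len
    time steps, treating each step as one layer of width hidden_width."""
    return seq_len / hidden_width

WIDTH = 256  # hypothetical hidden-state width

# The same architecture sweeps from the "wide" regime (ratio << 1)
# into the potentially non-Gaussian deep regime (ratio >= 1) purely
# as a function of input sequence length.
for seq_len in (16, 256, 4096):
    print(f"T={seq_len:5d}  effective ratio="
          f"{rnn_effective_aspect_ratio(seq_len, WIDTH)}")
```

Under this reading, a sequence of length 256 at width 256 already sits at ratio 1.0, which is where (per the epilogue summary above) tractability concerns would begin.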

As time permits, I will read the manuscript in more detail. I'm hopeful, however, that other people may get there faster and help me understand better. :-)

## References:

[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...

[2] http://neuralnetworksanddeeplearning.com/chap4.html

Jun 19, 2021

They meant page 2 of the manuscript linked in the article: