https://arxiv.org/abs/1710.05468 was an interesting paper that came out recently.
It showed that large CNN models which have far greater capacity than the data they are shown (and could have memorized it) still tend to learn very good generalizable minima.
See proposition 1:
(i) For any model class F whose model complexity is large enough to memorize any dataset and which includes f∗ possibly at an arbitrarily sharp minimum, there exists (A, Sm) such that the generalization gap is at most epsilon, and
(ii) For any dataset Sm, there exist arbitrarily unstable and arbitrarily non-robust algorithms A such that the generalization gap of f_A(Sm) is at most epsilon.