Dec 12, 2017

Really don't think that's the best paper to say "sheds quite a bit of light on this". That paper has been somewhat controversial since it came out.

I think https://arxiv.org/abs/1609.04836 is seminal in showing that flat minima correlate with better generalization (and that large-batch training tends to land in sharp minima that generalize worse). The parent's paper is good for showing that gradient descent over non-convex surfaces works fine, and https://arxiv.org/abs/1611.03530 is the landmark that kicked off this whole generalization business: it mainly shows that traditional models of generalization, namely VC dimension and related notions of "capacity", don't make sense for neural nets.
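For anyone who wants to poke at the 1609.04836 result themselves, here is a minimal PyTorch sketch of that style of experiment. It is not the paper's setup (they use deeper nets and an explicit sharpness measure); it just trains the same model with a small and a large batch size and compares the train/test accuracy gap, which is the effect they attribute to sharp minima. At this toy scale the effect can be subtle.

```python
# Toy version of the Keskar et al. style comparison: identical models trained
# with small vs. large batches, then compare the train/test accuracy gap.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                         nn.Linear(256, 10))

def accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total

def run(batch_size, epochs=5):
    tfm = transforms.ToTensor()
    train_ds = datasets.MNIST(".", train=True, download=True, transform=tfm)
    test_ds = datasets.MNIST(".", train=False, download=True, transform=tfm)
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in DataLoader(train_ds, batch_size=batch_size, shuffle=True):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    gap = (accuracy(model, DataLoader(train_ds, batch_size=1024))
           - accuracy(model, DataLoader(test_ds, batch_size=1024)))
    print(f"batch size {batch_size}: train/test accuracy gap = {gap:.4f}")

run(batch_size=64)      # "small batch" regime
run(batch_size=8192)    # "large batch" regime; the gap tends to be larger
```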

Aug 26, 2017

In fact, recent research indicates that you can randomly relabel the training examples and the network still achieves zero training error (https://arxiv.org/abs/1611.03530). So it is not "understanding" anything intrinsic or fundamental about the letter "A". Rather, it's just storing training examples somewhere inside its millions of parameters, which sounds a lot less impressive.
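For concreteness, here is a toy version of that experiment (my own sketch, not the paper's code; the paper fits real CIFAR-10 and ImageNet images with shuffled labels): an over-parameterized MLP driven to near-perfect training accuracy on inputs and labels that carry no signal at all.

```python
# Toy version of the Zhang et al. random-label experiment: an over-parameterized
# MLP reaches (near-)zero training error even though the labels are pure noise.
# Random Gaussian "inputs" are used here just to keep the sketch self-contained.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, classes = 1000, 256, 10
x = torch.randn(n, d)                     # meaningless inputs
y = torch.randint(0, classes, (n,))       # meaningless labels

model = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2001):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        acc = (model(x).argmax(1) == y).float().mean().item()
        print(f"step {step}: train accuracy {acc:.3f}")

# Training accuracy climbs toward 1.0: the network has simply memorized the
# label for each input, which is exactly the point above.
```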

May 07, 2017

All machine learning models are capable of overfitting. Deep learning does not have a special relationship to overfitting, and in fact, there is some evidence that the structural priors in complex deep networks actually inhibit overfitting and improve generalization [1].

Citing certain examples of misuse of deep learning by folks who "tweaked it till it worked" doesn't say anything about deep learning at large. The same team would do the same with a much less powerful model.

Of course, being able to explain a model's decisions is a very valuable trait. But there's just no evidence that deep learning methods are any less amenable to interpretation than SVMs or decision trees. The latter models' "interpretation" is mostly stuff like "features 543 and 632 were on", while deep learning methods can not only do that, but can also synthesize examples of the characteristics the model looked for [2].

[1]: https://arxiv.org/abs/1611.03530
[2]: https://arxiv.org/abs/1605.09304
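To make [2] concrete, here is a bare-bones sketch of the underlying idea: gradient ascent on the input of a pretrained classifier to maximize a class score. The actual paper goes further and optimizes through a deep generator network as an image prior, which is what makes its syntheses look natural; the minimal version below just shows the mechanism.

```python
# Bare-bones activation maximization: push an input image in the direction
# that increases a chosen class logit of a frozen, pretrained classifier.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 254                       # arbitrary ImageNet class index
img = torch.zeros(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    opt.zero_grad()
    score = model(img)[0, target_class]  # logit for the chosen class
    (-score).backward()                  # ascend the logit
    opt.step()

# `img` now holds a pattern the network associates with `target_class`,
# i.e. an example of the characteristics the model looked for, as opposed
# to the "these feature indices fired" style of explanation. A serious
# implementation would also normalize inputs and add image-prior regularizers.
```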

Jan 29, 2017

Based on our current theoretical understanding, we should expect overfitting. In practice, however, a very large number of parameters does not necessarily lead to overfitting. Clearly there is a gap between our theoretical understanding and practice. Check out a recent paper that explores exactly this question:

https://arxiv.org/abs/1611.03530
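As a tiny, self-contained illustration of that gap (my own toy example, not from the paper): a network with far more weights than training points, which classical capacity arguments would flag as hopelessly over-parameterized, can still generalize fine on held-out data.

```python
# Over-parameterized model that interpolates the training set yet still
# generalizes: ~76k weights vs. ~1,350 training examples.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)          # 1,797 samples, 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(1024,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))   # close to 1.0
print("test accuracy:", clf.score(X_test, y_test))      # still high, not chance
```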