Jul 15, 2016

This is incorrect. With a custom ASIC that implements ultra-wide, low-precision SIMD arithmetic (f16, i16, i8), it is possible to squeeze an order of magnitude or more speedup over conventional GPGPUs in the same area and power budget. GPGPUs have to carry big f32 (and even f64!) FPUs, plus additional 3D-graphics-specific hardware overhead, which is why they remain suboptimal for deep learning. Google's TPU and Nervana's similar ASIC illustrate this point.
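To make the low-precision point concrete, here is a rough sketch (using NumPy, with a made-up symmetric quantization scheme, not any particular vendor's) of why i8 arithmetic is usually good enough: quantize to int8, accumulate the dot product in int32 the way i8-SIMD hardware does, and rescale once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

# Illustrative symmetric per-tensor quantization to int8
def quantize(x):
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

qa, sa = quantize(a)
qb, sb = quantize(b)

# Accumulate in int32 (as i8 SIMD units do), then apply the scales once
approx = int(qa.astype(np.int32) @ qb.astype(np.int32)) * sa * sb
exact = float(a @ b)
print(exact, approx)  # the two values agree up to small quantization noise
```

The hardware win comes from the int32-accumulate line: an i8 multiply-accumulate unit is many times smaller than an f32 FPU, so far more of them fit in the same silicon area.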

Also, the management cores don't need to be fast: it suffices to have lots of modest GHz+ cores, each controlling an ultra-wide SIMD unit (as Intel's Xeon Phi has shown).

Arguably, one could get another order of magnitude speedup by building an XNOR-net (http://arxiv.org/abs/1603.05279) hardware accelerator, but that hasn't been done yet.
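The trick XNOR-net exploits: once weights and activations are binarized to {-1, +1}, a dot product reduces to an XNOR followed by a popcount, which is trivially cheap in hardware. A small sketch of the identity (the packing helper is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
# Two {-1, +1} vectors, as produced by binarizing weights/activations
x = rng.choice([-1, 1], size=n)
w = rng.choice([-1, 1], size=n)

# Pack sign bits into a single 64-bit word (bit set = +1)
def pack(v):
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits

px, pw = pack(x), pack(w)

# XNOR + popcount replaces 64 multiply-accumulates:
# matching bits contribute +1, mismatching -1, so
# dot = 2 * popcount(XNOR(px, pw)) - n
xnor = ~(px ^ pw) & ((1 << n) - 1)
dot_xnor = 2 * bin(xnor).count("1") - n
assert dot_xnor == int(x @ w)
```

So a 64-element binary dot product costs one XOR, one NOT, and one popcount instead of 64 multiply-adds, which is where the extra order of magnitude would come from.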

Apr 18, 2016

Since this paper's primary focus seems to be performance, it'd be interesting to see how the technique stacks up against one of those fancy new binary networks (e.g. http://arxiv.org/abs/1603.05279).