Aug 31, 2017

Discussion about DeepL: https://news.ycombinator.com/item?id=15122764

Attention itself is not new. It is used almost everywhere (for translation and many related tasks) and is very much the standard right now.

Avoiding recurrent connections inside the encoder or decoder is also not completely new. That came up when people tried to use only convolutions, e.g. Facebook's convolutional sequence-to-sequence model (ConvS2S, https://arxiv.org/abs/1705.03122); see the sketch below.
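
For context, a purely convolutional encoder layer only looks at a fixed window around each position, so it needs no recurrent state and all positions can be computed in parallel. A minimal NumPy sketch of that basic idea (not the actual ConvS2S architecture, which adds gating, residual connections and attention; names and shapes are illustrative):

```python
import numpy as np

def conv1d_encoder_layer(x, w, b):
    # x: (seq_len, d_in), w: (kernel_size, d_in, d_out), b: (d_out,).
    # Each output position only sees a fixed window of its neighbours,
    # so there is no recurrent state and all positions are independent.
    kernel_size = w.shape[0]
    pad = kernel_size // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        window = x_pad[t:t + kernel_size]              # (kernel_size, d_in)
        out[t] = np.tensordot(window, w, axes=2) + b   # -> (d_out,)
    return np.maximum(out, 0.0)  # ReLU

# Toy example: sequence of 6 positions, d_in = d_out = 4, kernel width 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
w = rng.normal(size=(3, 4, 4)) * 0.1
b = np.zeros(4)
print(conv1d_encoder_layer(x, w, b).shape)  # (6, 4)
```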

Google's Transformer was made public in June 2017 in the paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762, including TensorFlow code, https://github.com/tensorflow/tensor2tensor . Note that the new thing here is that it uses neither recurrence nor convolution but relies entirely on self-attention, together with simple position-wise fully-connected layers, in both the encoder and the decoder; a minimal sketch of that self-attention follows below.
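
The core of that self-attention is just a few matrix products: every position attends to every other position, so no recurrence or convolution is needed. A minimal NumPy sketch of single-head scaled dot-product self-attention (the parameter names and toy shapes here are illustrative, not taken from the paper or the tensor2tensor code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model). Project into queries, keys and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every position with every other position, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Attention distribution per query position, then weighted sum of the values.
    weights = softmax(scores, axis=-1)
    return weights @ v

# Toy example: a "sentence" of 5 positions with model dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 8))
w_v = rng.normal(size=(8, 8))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```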

DeepL directly compares its model to the Transformer, in terms of performance (BLEU score), here: https://www.deepl.com/press.html