A related finding, from a recent paper:
"Probing Neural Network Comprehension of Natural Language Arguments
Timothy Niven, Hung-Yu Kao (Submitted on 17 Jul 2019)
We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work."
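The "spurious statistical cues" failure mode is easy to see in miniature. The sketch below uses made-up sentences (the dataset and examples are hypothetical, not from the paper) to show how a trivial rule that keys on a single leaked token can look strong on a biased dataset, then fall to chance once every cue appears equally often with both labels, which is the idea behind the paper's adversarial construction:

```python
def cue_classifier(sentence):
    # A "model" that only checks for the token "not" -- the kind of
    # shallow lexical cue Niven & Kao show models exploiting.
    return 1 if "not" in sentence.split() else 0

# Biased toy dataset: "not" co-occurs only with label 1.
biased = [("taxes are not fair", 1), ("we should not expand", 1),
          ("growth is good", 0), ("the policy helps", 0)]

# Adversarial version: each sentence now appears with both labels,
# so the cue carries no information about the answer.
adversarial = biased + [("taxes are not fair", 0), ("we should not expand", 0),
                        ("growth is good", 1), ("the policy helps", 1)]

def accuracy(data):
    return sum(cue_classifier(s) == y for s, y in data) / len(data)

print(accuracy(biased))       # 1.0 on the biased set
print(accuracy(adversarial))  # 0.5 (chance) once the cue is balanced
```

The same logic explains why every model in the paper drops to random accuracy on the adversarial set: once the shortcut is removed, nothing the model actually learned transfers.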
In computer vision there are similar suspicions that state-of-the-art models are overfit to ImageNet and similar benchmarks. A related problem is cost: even average research labs do not have the hundreds of thousands of dollars needed to reproduce extremely expensive, highly tuned models. Reproducing and advancing deep learning research is quickly becoming inaccessible to almost everyone except a handful of the best-funded industrial labs (Google, FB, OpenAI, Microsoft Research, among a few others). This is not all negative; it just means that not every reported result can be taken as gospel.