Jul 03, 2021

This kind of behavior is entirely expected from a GPT-style self-supervised sequence model. For such a model, rote memorization is indicative of correct training, not overfitting: the training objective ideally yields a representation of the training data from which complete samples can be extracted using partial samples as keys. Actual overfitting in this kind of model would require absurdly large parameter counts. See https://tilde.town/~fessus/reward_is_unnecessary.pdf
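
As a concrete illustration of extraction-by-prefix, here is a minimal sketch. It assumes the HuggingFace transformers library and the public gpt2 checkpoint (neither is referenced above; they are choices for the example): prompting a pretrained model with a prefix it saw verbatim during training and decoding greedily will often reproduce the memorized continuation.

```python
# Minimal sketch: using a partial sample (a prefix) as the key to
# retrieve a memorized continuation from a pretrained GPT-style model.
# Assumes HuggingFace `transformers` and the public "gpt2" checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A passage that appears verbatim many times in web-scale training data.
prefix = "We hold these truths to be self-evident, that all men"
inputs = tokenizer(prefix, return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,                      # greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # gpt2 defines no pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If the sequence was memorized, greedy decoding returns it verbatim; that a well-trained model does this is exactly the point above, not a pathology.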