Learned representations are more important than what you do with them
It strikes me that most of the latest advances in Machine Learning, and specifically NLP, seem to depend much more on pre-training than on fine-tuning. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding is just one example of this, going so far as to use the word “pre-training” in the title of the paper, highlighting its relative importance. The “PT” in GPT stands for pre-training, another famous example.
- Pre-training BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding takes 4 days on 16 TPUs, whereas fine-tuning takes about 1 hour on a single TPU. That is a 4 x 24 x 16 / 1 = 1,536x difference in total compute (TPU-hours), worked out below.
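As a back-of-the-envelope check on that figure, here is the same arithmetic in Python, counting TPU-hours rather than wall-clock time (the day/TPU counts are the ones reported in the BERT paper as cited above):

```python
# Rough compute comparison between BERT pre-training and fine-tuning.
pretrain_tpu_hours = 4 * 24 * 16   # 4 days x 24 hours/day x 16 TPUs = 1,536 TPU-hours
finetune_tpu_hours = 1 * 1         # ~1 hour on a single TPU

ratio = pretrain_tpu_hours / finetune_tpu_hours
print(f"Pre-training uses ~{ratio:.0f}x more TPU-hours than fine-tuning")  # ~1536x
```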
This suggests that coming up with a generic, globally useful representation of an input is a much more important part of the learning process than learning exactly how to leverage that representation for judgement in a particular context: pre-training takes far more time and a much larger dataset than fine-tuning does.
Peak corroborates this and notes:
Transclude of Peak#94e758
In this analogy, the forests are lower-dimensional representations of the trees, which are the unprocessed raw data.
Research into the idea of “chunking” has found that memorizing chunks of information within a specific domain can act as an IQ enhancement in that domain: someone with otherwise low horsepower can, with the right knowledge chunks, match the performance of a high-horsepower person in that field.
Transclude of Augmenting-Long-term-Memory#80aa23
Transclude of Augmenting-Long-term-Memory#d1ba99
This suggests that, once the right representation of an input is achieved, most of the hard work is complete. Fine-tuning is cheap. With the right representation, judgement is cheap.
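To make “fine-tuning is cheap” concrete, here is a minimal sketch of the usual pattern: keep a pre-trained encoder frozen and train only a small task-specific head on top of its representations. The encoder here is a stand-in (not BERT itself), and the dimensions and task are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder (e.g. BERT). In practice this is the
# expensive part: its weights are loaded from pre-training, not trained here.
encoder = nn.Sequential(nn.Linear(512, 768), nn.Tanh())  # hypothetical sizes
for p in encoder.parameters():
    p.requires_grad = False  # freeze the learned representation

# The "judgement" part: a tiny task-specific head trained during fine-tuning.
head = nn.Linear(768, 2)  # e.g. a binary classification task
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One fine-tuning step on dummy data: only the head's weights are updated.
x = torch.randn(8, 512)          # batch of raw inputs
y = torch.randint(0, 2, (8,))    # labels
with torch.no_grad():
    representation = encoder(x)  # cheap to compute, expensive to have learned
logits = head(representation)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The asymmetry shows up directly in the parameter counts: almost all of the capacity sits in the frozen encoder, while the part that gets trained per task is a single small layer.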