Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

Ryo Bertolissi, Jonas Hübotter, Ido Hakimi, Andreas Krause

Key contributions

  • Test-time model merging (training and merging many local expert models for a single task / domain) outperforms traditional model merging (training and merging a few models across multiple tasks)
  • Test-time model merging approaches the performance of test-time training with almost no test-time compute overhead

Inductive vs transductive learning

  • During traditional training, learn a model by inductively extracting general rules from data that can then be applied to downstream examples at test time.
  • Transductive learning directly uses test examples (with no labels) to make predictions. You predict specific examples rather than attempting to learn a more general function, which is in some sense a simpler problem. No generalization beyond the specific test examples is required or expected.

Test-time training (TTT)

  • Fine-tune the model separately for every individual task (prompt)
  • Significantly improves model performance, but at high test-time computational cost (a toy sketch follows this list)
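
A minimal, illustrative sketch of the TTT idea on a toy regression model: for each test input, retrieve its nearest training examples and take a few gradient steps on them before predicting. The model, data, retrieval rule, and hyperparameters here are hypothetical placeholders, not the paper's actual setup.

    # Toy test-time training: fine-tune a copy of the base model on the
    # test input's nearest training examples, then predict.
    import torch

    torch.manual_seed(0)
    d = 16
    X_train = torch.randn(1000, d)          # stand-ins for training inputs / embeddings
    y_train = X_train @ torch.randn(d, 1)   # toy targets

    base = torch.nn.Linear(d, 1)            # the "base model"

    def ttt_predict(x_test, k=32, steps=10, lr=1e-2):
        # Retrieve the k training examples closest to the test input.
        dists = ((X_train - x_test) ** 2).sum(dim=1)
        idx = dists.topk(k, largest=False).indices
        # Copy the base model and fine-tune it on this local neighborhood.
        local = torch.nn.Linear(d, 1)
        local.load_state_dict(base.state_dict())
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(steps):
            loss = torch.nn.functional.mse_loss(local(X_train[idx]), y_train[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            return local(x_test)

    print(ttt_predict(torch.randn(d)))

The point of the sketch is the cost profile: every prediction pays for retrieval plus several gradient steps, which is what TTMM tries to avoid.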

Test-time model merging (TTMM)

  • At train-time, cluster the training data into local neighborhoods and train a small expert LoRA adapter for each cluster
  • At test-time, dynamically select the subset of LoRA adapters most relevant to the prompt and merge their parameters to form a single task-specific model (see the sketch after this list)
  • Approaches the performance of TTT without significant test-time compute or memory cost
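
A minimal sketch of the TTMM pipeline with toy numpy "experts": cluster the training embeddings, keep one parameter vector per cluster (standing in for a flattened LoRA delta), and at test time merge the adapters of the clusters closest to the prompt, weighted by a softmax over similarities. The cluster count, top-k, temperature, and random "experts" are illustrative choices, not values or code from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    d_embed, d_params, n_clusters = 32, 1024, 8

    # Train-time: cluster training data into local neighborhoods.
    train_embeds = rng.normal(size=(5000, d_embed))   # embeddings of training documents
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(train_embeds)

    # One expert adapter per cluster (random placeholders here; in TTMM each
    # would be a LoRA adapter fine-tuned on its cluster's data).
    expert_params = rng.normal(size=(n_clusters, d_params))

    def merged_adapter(prompt_embed, top_k=3, temperature=0.1):
        # Cosine similarity between the prompt and each cluster centroid.
        centroids = km.cluster_centers_
        sims = centroids @ prompt_embed / (
            np.linalg.norm(centroids, axis=1) * np.linalg.norm(prompt_embed))
        # Keep only the top-k most relevant experts.
        idx = np.argsort(sims)[-top_k:]
        # Softmax weights over the selected experts.
        w = np.exp(sims[idx] / temperature)
        w /= w.sum()
        # Merge: weighted average of the selected experts' parameters.
        return w @ expert_params[idx]

    adapter = merged_adapter(rng.normal(size=d_embed))
    print(adapter.shape)   # (1024,): a single task-specific adapter

Because the per-prompt work is only a similarity computation and a weighted average of adapter parameters, the test-time overhead stays small compared to gradient-based TTT.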

Model merging

  • In a multitask setting, multiple expert models are merged, typically a small number of models / tasks; merging happens once at the end of training and the model is fixed thereafter
  • TTMM differs in that its expert models are local to the specific task in question: many related local models are merged for a single task at test-time (contrast sketch after this list)
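
For contrast, traditional multi-task merging in its simplest form: average the parameters of a handful of task experts once after training and use the single merged model for everything. The experts and dimensions below are illustrative placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    task_experts = rng.normal(size=(3, 1024))   # e.g. math, code, and chat experts
    merged_once = task_experts.mean(axis=0)     # merged a single time, fixed thereafter

    # TTMM instead recomputes a weighted merge of many *local* experts for
    # each prompt (see merged_adapter above), so the merged model adapts
    # to the task at hand.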
