Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging
Ryo Bertolissi, Jonas Hübotter, Ido Hakimi, Andreas Krause



Key contributions
- Test-time model merging (training and merging many expert models for a single task / domain) beats traditional model merging (training and merging a few models for multiple tasks)
- Test-time model merging approaches the performance of test-time training with almost no test-time compute overhead
Inductive vs transductive learning
- During traditional (inductive) training, a model is learned by extracting general rules from the data, which are then applied to downstream examples at test time.
- Transductive learning instead uses the (unlabeled) test examples directly to make predictions: the goal is to predict those specific examples rather than to learn a more general function, which is in some sense a simpler problem, since no generalization beyond the given test examples is required or expected.
Test-time training (TTT)
- Fine-tuning a model for every individual task (prompt); see the sketch after this list
- Significantly improves model performance at high test-time computational cost
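For contrast, a minimal sketch of per-prompt test-time training, assuming a HuggingFace-style causal LM, a tokenizer, and a placeholder `nearest_neighbors` retrieval function (all illustrative; the paper's TTT baseline may differ in its retrieval and fine-tuning details):

```python
# Test-time training sketch: fine-tune a fresh copy of the model for every prompt.
import copy
import torch

def test_time_train(base_model, tokenizer, prompt, nearest_neighbors,
                    steps=3, lr=1e-4, k=32):
    """Fine-tune a per-prompt copy of the base model on text retrieved for `prompt`."""
    model = copy.deepcopy(base_model)           # per-prompt model: the main cost of TTT
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    neighbors = nearest_neighbors(prompt, k)    # unlabeled training text near the prompt
    for _ in range(steps):
        for text in neighbors:
            batch = tokenizer(text, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()
    return model                                # specialized model used only for this prompt
```

The deep copy and gradient steps per prompt are exactly the overhead that TTMM avoids.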
Test-time model merging (TTMM)
- At train-time, cluster the training data into local neighborhoods and train a small expert LoRA adapter for each cluster
- At test-time, dynamically select a subset of LoRA adapters and merge their parameters to form a single task-specific model (see the sketch after this list)
- Approaches the performance of TTT without significant compute or memory cost
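A minimal end-to-end sketch of TTMM, assuming precomputed text embeddings, a placeholder `train_lora_adapter` routine, and experts stored as dicts mapping adapted-weight names to weight tensors; the number of experts, the softmax gate, and the top-k cutoff are illustrative choices, not the paper's exact configuration:

```python
# TTMM sketch. Train time: cluster the corpus and train one LoRA expert per cluster.
# Test time: gate over the experts whose clusters are closest to the prompt and
# merge their parameters into a single task-specific adapter.
import numpy as np
from sklearn.cluster import KMeans

def build_experts(corpus_texts, corpus_embeddings, num_experts, train_lora_adapter):
    """Train-time: returns cluster centroids and one expert adapter per cluster."""
    kmeans = KMeans(n_clusters=num_experts, n_init=10).fit(corpus_embeddings)
    adapters = []
    for c in range(num_experts):
        cluster_docs = [t for t, label in zip(corpus_texts, kmeans.labels_) if label == c]
        adapters.append(train_lora_adapter(cluster_docs))  # small expert local to this cluster
    return kmeans.cluster_centers_, adapters

def merge_experts(prompt_embedding, centroids, adapters, top_k=8, temperature=0.1):
    """Test-time: weighted-average the parameters of the top_k closest experts."""
    sims = centroids @ prompt_embedding / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(prompt_embedding) + 1e-8
    )                                              # cosine similarity of prompt to each cluster
    top = np.argsort(sims)[-top_k:]                # indices of the nearest clusters
    weights = np.exp(sims[top] / temperature)
    weights /= weights.sum()                       # sparse softmax gate over the selected experts
    merged = {}
    for name in adapters[top[0]]:                  # adapted-weight names shared by all experts
        merged[name] = sum(float(w) * adapters[i][name] for w, i in zip(weights, top))
    return merged                                  # single task-specific adapter
```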
Model merging
- In the traditional multi-task setting, a small number of expert models (one per task) are merged once at the end of training, and the merged model is fixed thereafter
- TTMM differs in that its experts are local models for the specific task in question: many related local experts are merged for a single task at test-time (see the sketch below)
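Because the merge happens in parameter space, the merged adapter can be folded into the base weights once per prompt and generation then runs as an ordinary forward pass. A sketch under the assumption that each merged entry is a materialized weight delta (with LoRA, the delta for a layer is the scaled product of its low-rank factors):

```python
# Folding a merged adapter into the base weights: one addition per adapted
# weight matrix, after which per-token inference cost equals the base model's.
def apply_merged_adapter(base_state_dict, merged_adapter):
    """Return a new state dict with the merged per-layer deltas added in."""
    new_state = dict(base_state_dict)
    for name, delta in merged_adapter.items():
        new_state[name] = base_state_dict[name] + delta
    return new_state

# Hypothetical usage: merge once per prompt, then generate with the standard model.
# merged = merge_experts(embed(prompt), centroids, adapters)
# model.load_state_dict(apply_merged_adapter(model.state_dict(), merged))
```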