Reinforcement learning

The main thing that distinguishes reinforcement learning from other kinds of model training is that the model is rewarded only for producing the right final answer. How to get to that answer is left for the model to figure out. The model may explore or search around initially to find behaviors that consistently lead to rewards, but the reward is for the final answer, not for any particular behavior that leads to it. This means the model can discover novel behaviors on its own rather than needing them to be pre-programmed by the researcher.
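To make the "reward only the final answer" idea concrete, here is a minimal sketch on a made-up toy task: the model takes two steps, each adding 1, 2, or 3 to a running total, and gets a reward of 1 only if the total ends at 4. The task, the tabular policy, and every name below (`run_episode`, `reinforce_update`, `success_probability`, and so on) are assumptions for illustration. The update is a plain REINFORCE-style policy gradient with no baseline, not a description of how any particular production system is trained.

```python
import math
import random

ACTIONS = (1, 2, 3)  # each step adds one of these to a running total
TARGET = 4           # the "right final answer" after two steps (toy assumption)


def softmax(logits):
    """Turn a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def new_policy():
    # One row of action logits per state. The state is the running total
    # before a step: 0 at the start, then 1, 2, or 3 after the first step.
    return {state: [0.0] * len(ACTIONS) for state in (0, 1, 2, 3)}


def run_episode(policy, rng):
    """Sample two actions; the reward depends only on the final total."""
    total, trajectory = 0, []
    for _ in range(2):
        probs = softmax(policy[total])
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        trajectory.append((total, idx, probs))
        total += ACTIONS[idx]
    reward = 1.0 if total == TARGET else 0.0
    return trajectory, reward


def reinforce_update(policy, trajectory, reward, lr=0.5):
    # Gradient of log-probability w.r.t. the logits is (one_hot - probs).
    # It is scaled by the single end-of-episode reward, so no individual
    # step is ever told whether it was "correct".
    for state, idx, probs in trajectory:
        for a in range(len(ACTIONS)):
            one_hot = 1.0 if a == idx else 0.0
            policy[state][a] += lr * reward * (one_hot - probs[a])


def success_probability(policy):
    """Exact probability that two sampled steps sum to TARGET."""
    first = softmax(policy[0])
    prob = 0.0
    for i, a in enumerate(ACTIONS):
        second = softmax(policy[a])
        for j, b in enumerate(ACTIONS):
            if a + b == TARGET:
                prob += first[i] * second[j]
    return prob


def train(episodes=3000, seed=0):
    rng = random.Random(seed)
    policy = new_policy()
    for _ in range(episodes):
        trajectory, reward = run_episode(policy, rng)
        reinforce_update(policy, trajectory, reward)
    return policy
```

An untrained policy succeeds one time in three here, and `success_probability(train())` should come out well above that. Nothing in the code says which of the three valid paths (1 then 3, 2 then 2, 3 then 1) to prefer; whatever the policy settles on is something it found through its own sampling.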

In contrast, standard supervised fine-tuning tells the model directly how to behave: the behaviors themselves are “rewarded” in some sense. This is partly why fine-tuning is powerful but also fragile. It provides a very dense, direct signal that strongly affects how the model behaves. If you aren’t careful, you’ll over-tune the model, overfitting it to a particular behavior pattern.
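A rough sketch of what that dense signal looks like, on a made-up toy task: the model takes two steps, each adding 1, 2, or 3 to a running total, and the goal is to end at 4. Here the model is shown a single demonstrated path (1, then 3) and every step is pushed toward it with a cross-entropy gradient. The task, the tabular policy, and the names `DEMONSTRATION`, `supervised_update`, `fine_tune`, and `path_probability` are all assumptions for illustration, not a real fine-tuning API.

```python
import math

ACTIONS = (1, 2, 3)  # each step adds one of these to a running total
# Hypothetical demonstration as (state, action) pairs: from total 0 add 1,
# then from total 1 add 3, ending at 4.
DEMONSTRATION = [(0, 1), (1, 3)]


def softmax(logits):
    """Turn a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_update(policy, demonstration, lr=0.5):
    # Cross-entropy gradient step: at every demonstrated state, the
    # demonstrated action is pushed up and all other actions are pushed
    # down, whether or not they could also have reached the goal.
    for state, action in demonstration:
        probs = softmax(policy[state])
        for i, a in enumerate(ACTIONS):
            target = 1.0 if a == action else 0.0
            policy[state][i] += lr * (target - probs[i])


def fine_tune(steps=200):
    # One row of action logits per state (the running total before a step).
    policy = {state: [0.0] * len(ACTIONS) for state in (0, 1, 2, 3)}
    for _ in range(steps):
        supervised_update(policy, DEMONSTRATION)
    return policy


def path_probability(policy, path):
    """Probability that the policy follows exactly this sequence of actions."""
    state, prob = 0, 1.0
    for action in path:
        prob *= softmax(policy[state])[ACTIONS.index(action)]
        state += action
    return prob
```

After `fine_tune()`, nearly all of the probability sits on the demonstrated path, while 2 then 2 and 3 then 1, which reach the same answer, are almost never taken. That is the overfitting to one behavior pattern in miniature: the signal is strong and fast, but it only ever teaches the behavior that was shown.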

