CILVR SEMINAR: Theory and Practice in Language Model Fine-Tuning
Fine-tuning ever larger and more capable language models (LMs) has proven to be an effective way to solve a variety of language-related tasks. Yet little is understood about what fine-tuning does, and most traditional optimization analyses cannot account for a pre-trained initialization. I will start by formalizing the common intuition that fine-tuning makes a small change to the model. Inspired by the neural tangent kernel (NTK), we propose an empirically validated and theoretically sound hypothesis that lets us begin to answer questions like "Why doesn't a giant LM overfit when fine-tuning it on a few dozen examples?" and "Why does LoRA work?" This simple mental model motivates LESS, an efficient, transferable, and optimizer-aware data selection algorithm for eliciting specific capabilities during instruction tuning. Training on the 5% of the data selected by LESS outperforms training on the full dataset, and a small model can be used to select data for other models. Finally, I will describe how insights into the dynamics of fine-tuning inspired us to design MeZO, a memory-efficient zeroth-order algorithm that can tune large LMs. MeZO frequently matches the performance of standard fine-tuning while using up to 12x less memory and half as many GPU-hours. These works were done in collaboration with researchers at Princeton University and the University of Washington.
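
To make the "small change" intuition concrete, the kernel view can be stated as a first-order Taylor expansion around the pre-trained parameters. The LaTeX sketch below gives one standard way to write it; the notation is illustrative rather than the talk's exact formulation:

f(\theta; x) \;\approx\; f(\theta_0; x) + \langle \nabla_\theta f(\theta_0; x),\, \theta - \theta_0 \rangle,
\qquad
K(x, x') = \langle \nabla_\theta f(\theta_0; x),\, \nabla_\theta f(\theta_0; x') \rangle,

where \theta_0 is the pre-trained initialization and K is the empirical NTK. The hypothesis is that fine-tuning stays close enough to \theta_0 that these linearized kernel dynamics describe training, which is consistent with both the lack of overfitting on tiny datasets and the success of low-rank updates like LoRA.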
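
As a rough illustration of the selection step, the Python sketch below scores candidate instruction-tuning examples by the similarity of their gradient features to those of a few target-task examples and keeps the top fraction. The function name, the assumption that gradients have already been projected to low-dimensional features, and the max-similarity scoring rule are all illustrative assumptions, not the exact LESS implementation:

import torch

def less_select(train_grads, target_grads, frac=0.05):
    """Sketch of LESS-style data selection, assuming `train_grads` is an
    (N, d) matrix of low-dimensional, optimizer-aware gradient features for
    the N candidate examples and `target_grads` is an (M, d) matrix for a
    few examples of the target capability."""
    train = torch.nn.functional.normalize(train_grads, dim=1)
    target = torch.nn.functional.normalize(target_grads, dim=1)
    # (N, M) cosine similarities; score each candidate by its best match
    # against the target-task gradients.
    scores = (train @ target.T).max(dim=1).values
    k = max(1, int(frac * len(scores)))
    # Indices of the top-scoring frac (e.g., 5%) of the candidate pool.
    return scores.topk(k).indices

Because the features live in a shared low-dimensional space, gradients computed with a small model can, in principle, score data for other models, which is the transferability the abstract mentions.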
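
The memory savings in a MeZO-style step come from estimating the gradient with two forward passes (no backpropagation) and regenerating the random perturbation from a stored seed instead of materializing it alongside the parameters. The sketch below shows one step under assumptions; the helper names and hyperparameters are illustrative:

import torch

def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One zeroth-order (SPSA-style) step: perturb parameters in place,
    measure the loss at theta + eps*z and theta - eps*z, and update along z.
    Only a scalar seed is stored; z is regenerated on demand."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Reseeding reproduces the same z ~ N(0, I) on every call, so the
        # full perturbation vector never has to be kept in memory.
        gen = torch.Generator(device="cpu").manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device, dtype=p.dtype)
            p.data.add_(scale * z)

    with torch.no_grad():
        perturb(+eps)                      # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2 * eps)                  # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+eps)                      # restore theta

        # Finite-difference estimate of the directional derivative along z.
        grad_scalar = (loss_plus - loss_minus).item() / (2 * eps)
        perturb(-lr * grad_scalar)         # theta <- theta - lr * g * z
    return loss_plus.item()

Since each step needs only inference-time memory plus a seed, the footprint is close to that of a forward pass, which is where the reported up-to-12x memory reduction comes from.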