Transfer Learning in the Era of Large Language Models

Speaker: Jason Phang

Location: 60 Fifth Avenue, Room 650
Videoconference link: https://nyu.zoom.us/j/97863929344

Date: Monday, January 29, 2024

Natural language processing has made great strides in recent years, largely driven by developments in large-scale neural networks trained on large quantities of textual data. This thesis is a collection of work studying transfer learning, the transfer of skills or knowledge from one or more task datasets to a related task, within large language models (LLMs).

The first half of this thesis focuses on a specific method of transfer learning, intermediate-task training, and explores why and where it is applicable. First, we introduce intermediate-task training (also known as STILTs), a simple two-stage fine-tuning method that involves fine-tuning first on a data-rich intermediate task and then on the downstream task. We show that intermediate-task training leads to better performance on the GLUE benchmark, as well as improved training stability. Next, we investigate where and why intermediate-task training is effective, focusing on identifying which intermediate tasks lead to improved performance. Using a third set of probing tasks, each of which captures a specific linguistic property or skill, we examine correlations between probing-task and downstream-task performance, and find that commonsense language tasks and tasks closest to masked language modeling perform best. Lastly, we extend intermediate-task training to the zero-shot cross-lingual transfer setup, demonstrating its effectiveness in multilingual LLMs.

The second half of this thesis examines a new approach to transfer learning, wherein task-specific model parameters are generated by a trained hypernetwork. First, we introduce HyperTuning, a method for generating task-specific model-adaptation parameters using specially trained LLMs. We introduce HyperT5, a T5-based hypernetwork that is trained to take in few-shot examples and generate either LoRA parameters or soft prefixes, allowing us to adapt frozen downstream models to a given task without backpropagation. Next, we extend HyperTuning with Gisting, an efficient method for training decoder Transformers to generate soft prefixes. We introduce HyperLlama, a Llama-2-based hypernetwork that similarly generates task-specific soft prefixes based on few-shot examples. We show that HyperLlama can compress task-specific information into soft prefixes, although its performance still falls short of multi-task instruction-tuned models trained for few-shot learning.
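As a concrete illustration of the two-stage intermediate-task training (STILTs) recipe from the first half of the thesis, the sketch below fine-tunes an encoder first on a data-rich intermediate task and then on a smaller target task, using Hugging Face transformers and datasets. The choice of roberta-base, the MNLI-to-RTE task pair, and the hyperparameters are illustrative assumptions, not the exact configuration studied in the thesis.

# Minimal sketch of two-stage intermediate-task training (STILTs).
# Model, task pair (MNLI -> RTE), and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def prepare(task, col_a, col_b):
    # Load a GLUE task and tokenize its sentence-pair columns.
    ds = load_dataset("glue", task, split="train")
    return ds.map(lambda b: tokenizer(b[col_a], b[col_b], truncation=True),
                  batched=True)

def finetune(model, dataset, output_dir):
    # One standard fine-tuning pass; both stages reuse this.
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=32)
    Trainer(model=model, args=args, train_dataset=dataset,
            tokenizer=tokenizer).train()
    model.save_pretrained(output_dir)
    return output_dir

# Stage 1: fine-tune on a data-rich intermediate task (here, MNLI).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
stage1_dir = finetune(model, prepare("mnli", "premise", "hypothesis"), "stage1-mnli")

# Stage 2: continue from the intermediate-task weights on the target task (here, RTE),
# re-initializing only the classification head for the new label space.
model = AutoModelForSequenceClassification.from_pretrained(
    stage1_dir, num_labels=2, ignore_mismatched_sizes=True)
finetune(model, prepare("rte", "sentence1", "sentence2"), "stage2-rte")

The key design point is that stage 2 starts from the stage-1 weights rather than from the original pretrained checkpoint; only the task-specific classification head is re-initialized for the new label space.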
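The hypernetwork approach in the second half can be pictured as a trainable adapter generator: an auxiliary network reads few-shot examples and emits parameters (here, a soft prefix) that condition a frozen downstream model. The PyTorch sketch below is a schematic toy version under assumed shapes and module choices; HyperT5 and HyperLlama instead build the hypernetwork from pretrained T5 and Llama-2 models and can also emit LoRA parameters.

# Schematic sketch of a prefix-generating hypernetwork (toy version).
# Shapes, module choices, and the frozen-LM interface are assumptions.
import torch
import torch.nn as nn

class PrefixHypernetwork(nn.Module):
    def __init__(self, vocab_size, d_model=512, prefix_len=16, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned queries that pool the few-shot context into prefix_len slots.
        self.prefix_queries = nn.Parameter(torch.randn(prefix_len, d_model))
        self.to_prefix = nn.Linear(d_model, d_model)

    def forward(self, fewshot_ids):
        # fewshot_ids: (batch, ctx_len) token ids of concatenated few-shot examples.
        ctx = self.encoder(self.embed(fewshot_ids))            # (batch, ctx_len, d)
        q = self.prefix_queries.expand(ctx.size(0), -1, -1)    # (batch, prefix_len, d)
        attn = torch.softmax(q @ ctx.transpose(1, 2) / ctx.size(-1) ** 0.5, dim=-1)
        return self.to_prefix(attn @ ctx)                      # soft prefix embeddings

def prepend_prefix(frozen_embed, input_ids, soft_prefix):
    # Prepend the generated soft prefix to the frozen model's input embeddings;
    # the downstream model's own weights stay frozen throughout.
    return torch.cat([soft_prefix, frozen_embed(input_ids)], dim=1)

hyper = PrefixHypernetwork(vocab_size=32000)
soft_prefix = hyper(torch.randint(0, 32000, (2, 128)))  # -> (2, 16, 512)

At adaptation time, no gradient updates to the downstream model are needed: a single forward pass of the hypernetwork over the few-shot examples produces the soft prefix, which is the sense in which frozen models are adapted without backpropagation.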