Rethinking Data Use in Large Language Models

Speaker: Sewon Min

Location: 60 Fifth Avenue, Room 150
Videoconference link: https://nyu.zoom.us/j/94537456810

Date: Monday, February 26, 2024

Large language models (LMs) such as ChatGPT have revolutionized natural language processing and artificial intelligence more broadly. In this talk, I will discuss my research on understanding and advancing these models, centered around how they use the very large text corpora they are trained on. First, I will describe our efforts to understand how these models learn to perform new tasks after training, demonstrating that their so-called in-context learning capabilities are almost entirely determined by what they learn from the training data. Next, I will introduce a new class of LMs, nonparametric LMs, that repurpose this training data as a data store from which they retrieve information for improved accuracy and updatability. I will describe my work on establishing the foundations of such models, including one of the first broadly used neural retrieval models and an approach that simplifies a traditional, two-stage pipeline into one. I will also discuss how nonparametric models open up new avenues for responsible data use, e.g., by segregating permissive and copyrighted text and using them differently. Finally, I will envision the next generation of LMs we should build, focusing on efficient scaling, improved factuality, and decentralization.