NLP and Text-as-Data Speaker Series: Troubles with Training Data for Large Language Models

Speaker: Daphne Ippolito (CMU)

Location: 60 Fifth Avenue, Room 150

Date: Thursday, February 20, 2025

Language models are trained on billions of words of text. Where does that text come from, and how is it processed and filtered on its way to becoming training data? In this talk, we will examine how seemingly small decisions made when preparing pre-training data can have a significant impact on observed model performance. We will also discuss the problems with relying on the Internet as a primary source of training data. Gradual shifts in how content is posted on the web will limit its future usefulness as training data, and because the Internet is open to all, anyone can contribute to it, including adversaries aiming to introduce undesirable model behaviours by inserting poisoned text.