SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore

Speaker: Suchin Gururangan

Location: 60 Fifth Avenue, Room 204

Date: Monday, August 28, 2023

Abstract: Language models (LMs) are under widespread legal scrutiny because they are trained on copyrighted content without the consent of data owners. However, model performance degrades significantly when training is restricted to low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. In this talk, I will describe our new language model, SILO, which is designed to mitigate copyright infringement risks without sacrificing performance. SILO is built by (1) training a parametric LM on the Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text, and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. This technique allows model designers to use copyrighted data without training on it, supports sentence-level attribution of model predictions to data owners, and enables data owners to opt out of the model entirely if they choose. I will cover the broad technical challenges in building low-risk LMs that match or outperform LMs trained without restriction on high-risk data, as well as experiments demonstrating that SILO is an effective way to manage the risk-performance tradeoff. I will also discuss exciting opportunities for future research into techniques for training and deploying high-quality language models while supporting the rights of data owners.
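For readers curious how the inference-time augmentation works, the sketch below shows a kNN-LM-style interpolation, one standard way to query a nonparametric datastore at inference time: a distribution built from the k nearest datastore entries is mixed with the parametric LM's next-token distribution. This is a minimal illustration, not the exact SILO implementation; the function name, array shapes, and the k and lam values are all assumptions for the example.

    import numpy as np

    def silo_style_next_token_probs(p_lm, query_vec, keys, values,
                                    vocab_size, k=8, lam=0.25, temp=1.0):
        # p_lm:   (vocab_size,) next-token distribution from the parametric
        #         LM, trained only on permissively licensed text (OLC).
        # keys:   (N, d) NumPy array of context embeddings for datastore
        #         entries (e.g., copyrighted books or news, never trained on).
        # values: (N,) NumPy array of the next-token id stored with each key.

        # Find the k stored contexts nearest to the current context embedding.
        dists = np.linalg.norm(keys - query_vec, axis=1)
        nn = np.argsort(dists)[:k]

        # Turn negative distances into a distribution over the neighbors...
        w = np.exp(-dists[nn] / temp)
        w /= w.sum()

        # ...and scatter it onto the vocabulary via the stored next tokens.
        p_knn = np.zeros(vocab_size)
        for weight, tok in zip(w, values[nn]):
            p_knn[tok] += weight

        # Interpolate: lam controls how much the datastore shapes the output.
        return (1.0 - lam) * p_lm + lam * p_knn

Because the datastore is an explicit set of (key, value) rows, the properties described above fall out directly: attribution means reporting which neighbors contributed probability mass to a prediction, and opting out means deleting a data owner's rows, with no retraining required.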

Bio: Suchin Gururangan is a 4th-year PhD candidate at the University of Washington, advised by Noah A. Smith and Luke Zettlemoyer. He was previously a visiting researcher at Meta AI and a pre-doctoral resident at the Allen Institute for AI, and he spent several years in industry as a data scientist. His research interests span many areas of NLP; currently he works on modular, sparse language models that are efficient to customize and scale. His work has received awards at ACL 2020 and 2021, and he is supported by the Bloomberg Data Science PhD Fellowship.