Events
CDS seminar: Decoding the Genome with Large Language Models
Speaker: Nadav Brandes
Location: 60 Fifth Avenue, Room 150
Date: Friday, April 11, 2025
Large language models are trained to predict the next token in a sequence. This seemingly simple task gives rise to remarkably powerful models with vast knowledge about the world. Like text, our genomes can be modeled as sequences of tokens. A natural question follows: what would large language models trained on genomic sequences learn about our genome? Would they absorb knowledge about the "meaning" of these sequences? The answer is a resounding yes. Models trained on protein sequences absorb critical knowledge about the structure and function of proteins, despite being trained only on raw sequences. These models, it turns out, also learn which sequence variations are tolerated by evolution, and are therefore likely benign, and which variants can cause disease. In fact, our work has shown that they are among the most accurate tools we have for identifying pathogenic mutations, which has contributed to better diagnosis of genetic disorders. I'll present these and other recent developments at the intersection of AI and genomics, and discuss some of the exciting opportunities they open up for improving our understanding of the genome and the diagnosis and treatment of disease.