Events
NLP and Text-as-Data Speaker Series: New directions in synthetic data
Speaker: Tatsu Hashimoto (Stanford)
Location: 60 Fifth Avenue, 7th Floor Open Space
Videoconference link: https://nyu.zoom.us/j/94013156633
Date: Thursday, April 30, 2026
Abstract: Synthetic data has been an effective, if boring, set of techniques: prompt some language model to restructure your corpus to match some downstream task, occasionally with some distillation. In this talk, we will take a more expansive view of synthetic data as a general algorithmic tool for generative modeling, arguing that the design space and possibilities of synthetic data are much bigger than they might seem. Through a few recent works, we will show that synthetic data has major benefits beyond transforming the data: improving in-domain perplexities and enabling unique algorithmic primitives, such as neighborhood smoothing and concatenated ‘mega’ documents. With this broader view, we will point towards a nascent but interesting possibility of treating data itself as an algorithmic object to be engineered and optimized end-to-end.
Bio: Tatsu is an assistant professor in the computer science department at Stanford University. His research uses tools from statistics to make machine learning systems more robust and trustworthy, especially in complex systems such as large language models. He uses robustness and worst-case performance as lenses to understand and make progress on several fundamental challenges in machine learning and natural language processing. Previously, Tatsu was a post-doc at Stanford working for John Duchi and Percy Liang. Before that, he was a graduate student at MIT co-advised by Tommi Jaakkola and David Gifford.