Understanding Comes After Prediction

Speaker: Yutong Bai

Videoconference link: https://nyu.zoom.us/j/97218688837?pwd=EbE7TuoSPy2MmKerjKMkuavx39Gd2N.1

Date: Thursday, November 20, 2025

Location: 31 Washington Pl (Silver Ctr) 

In this talk, Yutong will discuss prediction problems, with a focus on recent work on PEVA (Whole-Body Conditioned Egocentric Video Prediction). The work trains models to forecast first-person video from past observations and future actions represented as 3D body pose trajectories. By conditioning on kinematic pose structured by the body’s joint hierarchy, the model learns how human physical actions affect the environment from an egocentric viewpoint.
The approach uses an autoregressive conditional diffusion transformer trained on Nymeria, a large-scale dataset of real-world egocentric video paired with full-body motion capture. A hierarchical evaluation protocol with increasingly challenging tasks allows for a thorough analysis of the model’s embodied prediction and control capabilities.
This research offers an early step toward addressing the challenges of modeling complex real-world environments and embodied agent behavior through video prediction from the human perspective. Yutong will discuss the technical method, open problems, and future directions for this line of work.
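
For readers curious about the general shape of such a model, the toy sketch below is a rough illustration only, not PEVA's actual architecture: all module names, tensor sizes, and the training step are hypothetical. It shows a pose-conditioned denoiser in the spirit described above, where past frame latents and a future pose trajectory are embedded as conditioning tokens and a transformer regresses the noise added to the next frame's latent.

# Illustrative sketch only (not PEVA's actual architecture): a toy
# pose-conditioned denoiser in the spirit of an autoregressive
# conditional diffusion transformer. All dimensions are hypothetical.
import torch
import torch.nn as nn

class ToyPoseConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, pose_dim=48, d_model=128, n_layers=2):
        super().__init__()
        self.frame_embed = nn.Linear(latent_dim, d_model)   # past / noisy frame latents
        self.pose_embed = nn.Linear(pose_dim, d_model)      # per-step 3D pose vectors
        self.time_embed = nn.Embedding(1000, d_model)       # diffusion timestep
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, latent_dim)           # predicted noise

    def forward(self, noisy_next, past_frames, pose_traj, t):
        # One token sequence: [noisy target frame, past-frame tokens, pose tokens]
        tokens = torch.cat([
            self.frame_embed(noisy_next).unsqueeze(1),
            self.frame_embed(past_frames),
            self.pose_embed(pose_traj),
        ], dim=1) + self.time_embed(t).unsqueeze(1)
        # Read the noise prediction off the target-frame token
        return self.out(self.backbone(tokens)[:, 0])

# One hypothetical training step: noise the next-frame latent and regress the
# noise, conditioned on past frames and the future pose trajectory.
model = ToyPoseConditionedDenoiser()
B = 4
next_latent = torch.randn(B, 64)
past_frames = torch.randn(B, 8, 64)   # 8 past frame latents
pose_traj = torch.randn(B, 8, 48)     # 8 future pose vectors (e.g. joint angles)
t = torch.randint(0, 1000, (B,))
noise = torch.randn_like(next_latent)
noisy_next = next_latent + noise      # simplified; real noise schedules scale both terms
loss = nn.functional.mse_loss(model(noisy_next, past_frames, pose_traj, t), noise)
loss.backward()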

Bio: Yutong Bai is a Postdoctoral Researcher at UC Berkeley (Berkeley AI Research), advised by Prof. Alexei (Alyosha) Efros, Prof. Jitendra Malik, and Prof. Trevor Darrell. Prior to that, she obtained her PhD in Computer Science from Johns Hopkins University, advised by Prof. Alan Yuille. She has interned at Meta AI (FAIR Labs) and Google Brain, and was selected as a 2023 Apple Scholar and an MIT EECS Rising Star. Her work was a finalist for the CVPR 2022 Best Paper Award. Homepage: yutongbai.com
Her research aims to build intelligent systems from first principles: systems that do not merely fit patterns or follow instructions, but that gradually develop structure, abstraction, and behavior through learning itself. She is interested in how intelligence emerges not from handcrafted pipelines or task-specific heuristics, but from exposure to behaviorally rich, understructured environments where models must learn what to attend to, how to reason, and how to improve.