Global AI Frontier Lab Seminar Series: Benchmarking Visual State Tracking in Multimodal Video Understanding

Speaker: Sihyun Yu

Location: 1 MetroTech Center

Date: Monday, May 11, 2026

We are pleased to announce the next session of the Global AI Frontier Lab Seminar Series on May 11, 2026. Sihyun Yu will be presenting “Benchmarking Visual State Tracking in Multimodal Video Understanding”. Dinner & networking will begin at 6:00 PM and the seminar will start at 7:00 PM ET. The seminar will be held at the Global AI Frontier Lab at 1 MetroTech Center, Brooklyn, NY 11201. This event will be held both in person and online. In-person attendance is strongly encouraged for Lab researchers in NYC. All attendees must RSVP to participate. For online attendees, a Zoom link will be sent out prior to the event. Please reach out to global-ai-frontier-lab@nyu.edu with any questions. We hope to see you there!

Title: Benchmarking Visual State Tracking in Multimodal Video Understanding

Abstract: Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet it remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce VSTAT, a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 814 clips drawn from both synthetic and real-world videos, paired with 1,480 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. We find that, despite their strong performance on existing video benchmarks, state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare the thinking traces of MLLMs against the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that while state tracking within the textual reasoning is largely correct, visual perception emerges as the main bottleneck, particularly when questions require tracking continuous dynamics. Finally, our preliminary evaluation suggests that current agentic frameworks built on existing MLLMs do not trivially resolve these failures.

Bio: Sihyun Yu is a postdoctoral researcher at KAIST. He received his Ph.D. from KAIST, advised by Jinwoo Shin. During his doctoral studies, he was a visiting scholar at NYU Courant, working closely with Saining Xie. He has also completed several research internships at Google Research and NVIDIA Research. His research focuses on visual artificial intelligence: building models that understand complex visual worlds and can reason, plan, and imagine future states.