Are Generative Image Models Physically Grounded?

Speaker: Anand Bhattad

Location: TBA
Videoconference link: https://nyu.zoom.us/j/99300340324

Date: Thursday, December 5, 2024

Generative image models like DALL·E, Stable Diffusion, and Midjourney have taken the world by storm with their ability to generate uncannily realistic images: objects are in sensible places, lighting seems realistic, and textures appear accurate. But how do they achieve this understanding of our visual world? Probing their internal representations reveals that these models encode fundamental aspects of physical reality. Within these models, we discovered classical computer vision concepts such as intrinsic images, which decompose scenes into color, shape, and lighting, learned without explicit training. These discoveries allow us to manipulate real photographs in physically plausible ways. However, we also find surprising gaps in their understanding, such as a failure to follow the principles of projective geometry, a shortcoming that provides reliable signatures for detecting generated images. This talk explores unexpected capabilities encoded within generative image models. I will discuss how these insights drive applications that require strong scene understanding, pushing us closer to building generative models grounded in the physical world.