CS Colloquium

The Missing Science of AI Evaluation

Time and Location:

Feb. 25, 2026 at 11AM; 4 Washington Square North, Room 1201

Speaker:

Sayash Kapoor, Princeton University

Abstract:

AI evaluations inform critical decisions, from the valuations of trillion-dollar companies to policies on regulating AI. Yet evaluation methods have failed to keep pace with deployment, creating an evaluation crisis where performance in the lab fails to predict real-world utility. In this talk, I will discuss the evaluation crisis in a high-stakes domain: AI-based science. Across dozens of fields, from medicine to political science, I find that flawed evaluation practices have led to overoptimistic claims about AI's accuracy, affecting hundreds of published papers. To address these evaluation failures, I present a consensus-based checklist that identifies common pitfalls and consolidates best practices for researchers adopting AI, and a benchmark to foster the development of AI agents that can verify scientific reproducibility. AI evaluation failures affect several other applications. Beyond science, I examine how AI agent benchmarks miss many failure modes, and present systems to identify these errors. I examine inference scaling, a recent technique to boost AI capabilities, and show that claims of improvement fail to hold under realistic conditions. Finally, I discuss how better AI evaluation can inform policymaking, drawing on my work on open foundation models and my engagement with state and federal agencies. Why does the evaluation crisis persist? The AI community has poured enormous resources into building evaluations for models, but not into investigating how models impact the world. To address the crisis, we need to build a systematic science of AI evaluation to bridge the gap between benchmark performance and real-world impact.