Global AI Frontier Lab: Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation

Speaker: Radhika Dua

Location: 1 MetroTech Center

Date: Monday, October 20, 2025

Radiological imaging is central to diagnosis, treatment planning, and clinical decision-making. Vision-language foundation models have spurred interest in automated radiology report generation (RRG), but safe deployment requires reliable clinical evaluation of generated reports. Existing metrics often rely on surface-level similarity and/or behave as black boxes, lacking interpretability. We introduce ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation), an interpretable evaluation framework leveraging large language model agents and dynamic multiple-choice question answering (MCQA). Two agents, one holding the ground-truth report and the other the generated report, each generate clinically meaningful questions and quiz each other. Agreement on the answers captures the preservation and consistency of findings, yielding interpretable proxies for clinical precision and recall. By linking scores to question–answer pairs, ICARE enables transparent and interpretable assessment. Clinician studies show ICARE aligns significantly more closely with expert judgment than prior metrics. Perturbation analyses confirm sensitivity to clinical content and reproducibility, while model comparisons reveal interpretable error patterns.
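The cross-quiz scoring described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the ICARE implementation: in the actual framework, LLM agents generate the MCQA pairs and answer them from their respective reports, whereas here toy question–answer records and exact answer matching stand in for the agents. All names (`MCQAPair`, `icare_style_scores`) are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class MCQAPair:
    """One multiple-choice question with the answers chosen by each agent."""
    question: str
    answer_from_gt: str   # answer from the agent holding the ground-truth report
    answer_from_gen: str  # answer from the agent holding the generated report

def agreement(pairs: list[MCQAPair]) -> float:
    """Fraction of questions on which the two agents agree."""
    if not pairs:
        return 0.0
    return sum(p.answer_from_gt == p.answer_from_gen for p in pairs) / len(pairs)

def icare_style_scores(gt_questions: list[MCQAPair],
                       gen_questions: list[MCQAPair]) -> dict[str, float]:
    # Questions derived from the ground-truth report test whether the
    # generated report preserves true findings -> proxy for clinical recall.
    recall = agreement(gt_questions)
    # Questions derived from the generated report test whether its claims
    # are consistent with the ground truth -> proxy for clinical precision.
    precision = agreement(gen_questions)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: one preserved finding, one dropped finding, one consistent claim.
gt_qs = [
    MCQAPair("Is there a pleural effusion?", "yes", "yes"),
    MCQAPair("Is cardiomegaly present?", "yes", "no"),
]
gen_qs = [
    MCQAPair("Are the lungs clear apart from the effusion?", "yes", "yes"),
]
scores = icare_style_scores(gt_qs, gen_qs)
```

Because every score is tied to specific question–answer pairs, a disagreement (here, the missed cardiomegaly) points directly at the clinical content the generated report got wrong.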
Bio: Radhika Dua is a second-year Ph.D. student at New York University, advised by Eric Oermann and Kyunghyun Cho. Her research focuses on developing clinically-grounded AI systems that are trustworthy and can make healthcare more accessible. Before joining NYU, she was a predoctoral researcher at Google DeepMind India, where she developed a nation-scale agricultural landscape understanding system using high-resolution satellite imagery to support sustainable agriculture and land monitoring. Her work was presented at AGU 2023 (oral), the Google Research Conference, Google I/O, and Google for India. She completed her Master’s in Artificial Intelligence from KAIST, where she worked on improving the reliability of deep learning models under unseen distributions.