NLP and Text-as-Data Speaker Series: The Key Ingredients for Scaling Test-Time Compute

Speaker: Aviral Kumar (CMU)

Location: 60 Fifth Avenue, Room 150

Date: Thursday, February 27, 2025

There has been a lot of excitement around finetuning LLMs to utilize more test-time compute, which provides a new dimension for improving reasoning capabilities and performance. However, even with a number of open-source codebases and technical reports now released, several open questions around test-time scaling remain: for example, do we need reinforcement learning, or is supervised finetuning sufficient? Do process rewards matter at all? Does length matter?
In this talk, I will systematically develop a perspective on these questions from first principles. First, I’ll transform the problem of scaling test-time compute into a concrete learning objective: on easy problems, models should converge to correct answers in as few tokens as possible (avoiding overthinking); on hard problems, they should make steady “progress,” so that they maximize the chances of eventual correctness, even if this requires running the model at larger test-time compute budgets than it was trained for. Empirically, I’ll show that state-of-the-art models like DeepSeek-R1 derivatives fail to meet these desiderata: they often overthink on easy problems and get stuck on hard ones. I’ll then argue that outcome-reward training alone is insufficient to attain both of these desiderata automatically, and that LLMs must instead optimize a combination of outcome rewards and dense intermediate rewards that encourage progress.
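
To make the idea of combining outcome rewards with dense progress rewards concrete, here is a minimal sketch of one way such an objective could be scored per trajectory. It is an illustration under stated assumptions, not the speaker’s method: the helper `estimate_success_prob` and the weight `lam` are hypothetical names introduced only for this example.

```python
from typing import Callable, List

def combined_reward(
    segments: List[str],                       # chain of thought split into reasoning segments
    outcome_reward: float,                     # 1.0 if the final answer is correct, else 0.0
    estimate_success_prob: Callable[[List[str]], float],  # P(eventual correctness | prefix)
    lam: float = 0.1,                          # weight on the dense progress term (assumed)
) -> float:
    """Outcome reward plus a dense bonus for each segment that increases the
    estimated chance of eventually reaching a correct answer."""
    dense = 0.0
    prev = estimate_success_prob([])           # success probability before any reasoning
    for i in range(1, len(segments) + 1):
        cur = estimate_success_prob(segments[:i])
        dense += cur - prev                    # "progress" contributed by segment i
        prev = cur
    return outcome_reward + lam * dense
```

Under this kind of scoring, a segment earns credit only if it moves the model closer to a correct final answer, which is one way to penalize overthinking on easy problems while still rewarding steady progress on hard ones.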
In the remainder of the talk, I will present a principled approach to designing these dense rewards and show that they lead to strong performance on mathematical reasoning benchmarks when run with long chains of thought. I will also discuss how our dense reward design connects to concepts in meta reinforcement learning (RL). Finally, I will briefly discuss why using RL to optimize dense rewards is critical for attaining good test-time scaling, compared to running supervised finetuning on filtered reasoning traces obtained from off-the-shelf models.
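
As a companion to the sketch above, one plausible way to obtain the progress signal is a Monte Carlo estimate: sample rollouts from a reasoning prefix and measure how often they reach the known correct answer. This is only an assumption about how such an estimate might be computed; `sample_completion` and `extract_answer` are hypothetical helpers, not APIs from the talk or any specific codebase.

```python
from typing import Callable, List

def success_prob_from_rollouts(
    prefix_segments: List[str],
    sample_completion: Callable[[str], str],   # draws a completion given the prefix text
    extract_answer: Callable[[str], str],      # pulls the final answer out of a completion
    correct_answer: str,
    n_rollouts: int = 8,                       # number of sampled continuations (assumed)
) -> float:
    """Monte Carlo estimate of P(eventual correctness | reasoning prefix)."""
    prefix_text = "\n".join(prefix_segments)
    hits = 0
    for _ in range(n_rollouts):
        completion = sample_completion(prefix_text)
        if extract_answer(completion) == correct_answer:
            hits += 1
    return hits / n_rollouts
```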