Events
Special ECE Seminar Series on Modern Artificial Intelligence: Why does Adam work so well for LLMs? And can we find optimal per-variable step sizes?
Speaker: Mark Schmidt, University of British Columbia
Date: Tuesday, February 18, 2025
Location: 6 MetroTech, Eventspace
The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, it is unclear why the gap between Adam and SGD is often large for large language models (LLMs) but small for computer vision benchmarks. Recent work proposed that Adam works better for LLMs due to heavy-tailed noise in the stochastic gradients. We show evidence that the noise is not a major factor in the performance gap between SGD and Adam. Instead, we show that a key factor is the class imbalance found in language tasks. In particular, the large number of low-frequency classes causes SGD to converge slowly but has a smaller effect on Adam and sign descent. We show that a gap between SGD and Adam can be induced by adding a large number of low-frequency classes to computer vision models, or even to linear models. We further prove in a simple setting that gradient descent converges slowly while sign descent does not.

A key component of the Adam optimizer's success is its use of per-variable step sizes. However, neither Adam nor any other "adaptive" algorithm is known to perform within a provable factor of the optimal fixed per-variable step sizes for the textbook problem of minimizing a smooth strongly-convex function. We propose the first method for updating per-variable step sizes that provably performs within a known factor of the optimal step sizes. The method is based on a multi-dimensional backtracking procedure that adaptively uses hyper-gradients to generate cutting planes, reducing the search space for the optimal step sizes. Because black-box cutting-plane approaches such as the ellipsoid method are computationally prohibitive, we develop practical linear-time variants for this setting.
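
To make the class-imbalance claim from the first part concrete, here is a minimal, hypothetical toy experiment (not the speaker's code or data): a full-batch comparison of gradient descent and sign descent on a linear softmax model with one frequent class and many low-frequency classes. The dataset sizes, scaling, and step sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Imbalanced multiclass dataset: one frequent class plus many rare classes.
n_rare_classes = 50
n_frequent = 500      # samples in the frequent class
n_per_rare = 2        # samples in each rare class
d = 20                # feature dimension
n_classes = 1 + n_rare_classes

X_parts, y_parts = [], []
for c in range(n_classes):
    n_c = n_frequent if c == 0 else n_per_rare
    mu = rng.normal(size=d) / np.sqrt(d)             # class-specific mean, unit scale
    X_parts.append(mu + 0.3 * rng.normal(size=(n_c, d)) / np.sqrt(d))
    y_parts.append(np.full(n_c, c))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)
n = len(y)

def loss_and_grad(W):
    """Softmax cross-entropy loss and gradient for a linear model W (d x K)."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(n), y] + 1e-12))
    P[np.arange(n), y] -= 1.0
    return loss, X.T @ P / n

def run(direction, lr, steps=2000):
    """Full-batch descent with updates W <- W - lr * direction(grad)."""
    W = np.zeros((d, n_classes))
    for _ in range(steps):
        _, g = loss_and_grad(W)
        W -= lr * direction(g)
    return loss_and_grad(W)[0]

# Gradient descent vs. sign descent (a stand-in for Adam-style normalized updates).
print("gradient descent loss:", run(lambda g: g, lr=0.5))
print("sign descent loss:    ", run(np.sign, lr=0.01))

The intuition from the abstract is that the rare classes contribute only tiny gradient components, so gradient descent updates their weights very slowly, whereas sign descent moves every coordinate by the same magnitude regardless of gradient scale; comparing the printed losses after a fixed step budget is one simple way to probe the gap the abstract describes.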
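For the second part, a rough sketch (in notation of my own, not the speaker's) of why hyper-gradients can yield cutting planes over per-variable step sizes: fix an iterate x with gradient g = \nabla f(x), collect the per-variable step sizes in a vector p, and measure one-step progress by h(p) = f(x - \operatorname{diag}(p)\, g). By the chain rule,

    h(p) \;=\; f\bigl(x - \operatorname{diag}(p)\, g\bigr),
    \qquad
    \nabla_p h(p) \;=\; -\, g \odot \nabla f\bigl(x - \operatorname{diag}(p)\, g\bigr).

Since f is convex in the smooth strongly-convex setting of the abstract, h is convex in p (an affine map composed with f), so any step sizes q that make at least as much one-step progress as a rejected candidate p satisfy \langle \nabla_p h(p),\, q - p \rangle \le h(q) - h(p) \le 0; the opposite half-space can therefore be cut from the search space, which is the role the abstract attributes to hyper-gradients. The precise sufficient-decrease test and the practical linear-time variants are part of the talk, not this sketch.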