Tuning GPT-3 on a Single GPU via Zero-Shot Hyperparameter Transfer

Speaker: Greg Yang

Location: 60 Fifth Avenue, 7th Floor Open Space
Videoconference link: https://nyu.zoom.us/j/97187503557

Date: Wednesday, May 4, 2022

You can’t train GPT-3 on a single GPU, much less tune its hyperparameters (HPs). But actually: you *can* tune its HPs on a single GPU — even if you can’t train it that way!
This talk consists of one hour of generalist content for practitioners and researchers alike, followed by an optional hour of specialist, theoretically oriented content.
In the first hour, I’ll describe how, in the so-called “maximal update parametrization” (abbreviated µP), narrow and wide neural networks share the same set of optimal hyperparameters. This lets us tune any large model by tuning only a small version of it, a procedure we call µTransfer. In particular, this allowed us to tune the 6.7-billion-parameter version of GPT-3 using only 7% of its pretraining compute budget and, with some asterisks, obtain performance comparable to the original GPT-3 model with twice the parameter count.
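
As a concrete illustration of that recipe, here is a minimal sketch using the open-source `mup` package released alongside this work (https://github.com/microsoft/mup). The toy MLP, the specific widths, and the learning rate below are illustrative assumptions, not the GPT-3 configuration described above.

```python
# Minimal µTransfer sketch with the `mup` package: tune hyperparameters on a
# narrow proxy model in µP, then reuse them unchanged on a wide target model.
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam


class MLP(nn.Module):
    """Toy MLP whose hidden width is the dimension we scale."""

    def __init__(self, width: int, d_in: int = 32, d_out: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        # The output ("readout") layer must be MuReadout so µP can apply the
        # correct width-dependent output scaling.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(torch.relu(self.fc2(torch.relu(self.fc1(x)))))


def make_mup_model(width: int) -> nn.Module:
    model = MLP(width)
    # The base and delta models only tell mup which dimensions are "width"
    # (they differ in exactly those dimensions); they are never trained.
    set_base_shapes(model, MLP(width=64), delta=MLP(width=128))
    return model


# 1) Tune hyperparameters (e.g. the learning rate) on a cheap, narrow proxy.
small = make_mup_model(width=256)
opt_small = MuAdam(small.parameters(), lr=1e-2)  # sweep lr on this small model
# ... run the tuning sweep on `small` ...

# 2) Reuse the best hyperparameters, unchanged, on the wide target model:
#    under µP the optimum (approximately) does not shift with width.
large = make_mup_model(width=8192)
opt_large = MuAdam(large.parameters(), lr=1e-2)  # same lr found on the proxy
# ... train `large` once with the transferred hyperparameters ...
```

The design point the sketch highlights is that only the width changes between the tuning proxy and the target model; the readout layer and optimizer come from `mup`, and every other hyperparameter is carried over verbatim.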
In the second hour, I’ll discuss how the study of infinite-width neural networks, in particular the feature learning limit, led to the discovery of µP and µTransfer. Furthermore, going beyond width, I’ll formulate the *Optimal Scaling Thesis* that connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI.