CILVR Seminar: Language Model Alignment: Theory & Practice

Speaker: Ahmad Beirami

Location: 60 Fifth Avenue, 7th floor open space

Date: Wednesday, November 6, 2024

The goal of language model alignment (post-training) is to draw samples from an aligned model that improve a reward (e.g., safety or factuality) with little perturbation to the base model. A simple baseline for this task is best-of-N, where N responses are drawn from the base model, ranked based on a reward, and the highest-ranking one is selected. More sophisticated techniques generally solve a KL-regularized reinforcement learning (RL) objective, maximizing expected reward subject to a KL divergence constraint between the aligned model and the base (reference) model. In this talk, we give an overview of language model alignment and develop an understanding of key results in this space through simplified models. We also present a modular alignment technique, called controlled decoding, which solves the KL-regularized RL problem by learning a prefix scorer while keeping the reference model frozen, offering inference-time configurability. Finally, we shed light on the remarkable performance of best-of-N, which achieves reward-KL tradeoffs that are competitive with, or even better than, those of state-of-the-art alignment baselines.
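For concreteness, the KL-regularized RL objective referenced in the abstract, written in its regularized (Lagrangian) form with notation chosen here, is

    \max_{\pi}\ \mathbb{E}_{x \sim p,\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),

where r is the reward, \pi_{\mathrm{ref}} is the frozen base (reference) model, and \beta controls how far the aligned model may drift from the reference. A standard result is that the optimum of this objective is the exponentially tilted reference policy, \pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp\big(r(x, y)/\beta\big).

The best-of-N baseline is equally simple to state in code. Below is a minimal sketch, assuming generic generate and reward_fn callables (hypothetical names, not tied to any particular library or to the speaker's implementation):

    from typing import Callable

    def best_of_n(
        prompt: str,
        generate: Callable[[str], str],          # draws one response from the base model
        reward_fn: Callable[[str, str], float],  # scores a (prompt, response) pair
        n: int = 8,
    ) -> str:
        """Draw n responses from the base model and return the highest-reward one."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda response: reward_fn(prompt, response))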