Events
CDS Seminar: Studying representational (mis)alignment: how and why? & Representation-space interventions and their causal implications
Speakers: Ilia Sucholutsky & Shauli Ravfogel
Location: 60 Fifth Avenue, Room 150
Date: Friday, March 21, 2025
Title: Studying representational (mis)alignment: how and why?
Abstract: Do humans and machines represent the world in similar ways? Does it really matter if we do? I'll share a simple method that anyone in the audience can immediately start using to study the representational alignment of their AI systems, even black-box ones, with humans or other models. We'll then explore why we would want to do this by highlighting some examples of how representational alignment helps us learn more about both humans and machines and how it relates to key downstream properties of both machine learning and machine teaching.
Title: Representation-space interventions and their causal implications
Abstract: I will introduce a line of work that focuses on identifying and manipulating the linear representation of specific concepts in language models. The family of techniques I will cover provides effective tools for various applications, including fairness and bias mitigation as well as causal interventions in the representation space. I will formulate the problem of linearly erasing information or steering the model towards one value of the concept, and derive closed-form solutions. In the second half of the talk, I will address a key limitation of representation-space interventions, their opaqueness, and present two techniques for making these interventions interpretable by mapping them back to natural language: (1) learning a mapping from the latent space to text, and (2) deriving causally-correct counterfactual strings that correspond to a given intervention in the model.
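For readers new to the idea of linear erasure, here is a toy illustration. It is not the speaker's method: it assumes synthetic data and a binary concept, and it uses the simplest closed-form intervention, projecting out the difference between the two class means. The function name `erase_direction` and all data are hypothetical.

```python
import numpy as np


def erase_direction(X, labels):
    """Project out the class-mean-difference direction (toy linear erasure).

    Assumption: the concept is binary (labels 0/1) and is approximated by a
    single direction. This is an illustrative simplification, not the
    technique presented in the talk.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        # The class means already coincide; nothing to remove.
        return X.copy(), diff
    u = diff / norm
    # Orthogonal projection onto the complement of u: x - (x . u) u
    X_erased = X - np.outer(X @ u, u)
    return X_erased, u


# Synthetic "representations": 200 vectors in 8 dimensions, with the
# concept shifting the first coordinate for class 1.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 8))
X[:, 0] += 3.0 * labels

X_erased, u = erase_direction(X, labels)

gap_before = np.linalg.norm(
    X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
)
gap_after = np.linalg.norm(
    X_erased[labels == 1].mean(axis=0) - X_erased[labels == 0].mean(axis=0)
)
print(f"class-mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

After the projection the two class means coincide, so the difference of means no longer separates the classes. Other linear or nonlinear cues may still carry the concept; this sketch makes no stronger guarantee.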