Identifying and Neutralizing Concepts in Neural Representations

Speaker: Shauli Ravfogel

Location: 60 Fifth Avenue, Room 204

Date: Wednesday, February 21, 2024

I will introduce a line of work on locating and neutralizing specific concepts in neural models trained on textual data. The first work proposes a concept-neutralization pipeline that involves training linear classifiers and projecting the representations onto their null-space. The second work formulates the problem as a constrained, linear minimax game and derives a closed-form solution for certain objectives, while proposing efficient solutions for others. Both methods are shown to be effective in various use cases, including bias and fairness in word embeddings and multi-class classification. Beyond fairness considerations, I will discuss the promises and limitations of manipulating LMs through their representations, and the use of interventions in the representation space as an interpretability and analysis tool.
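The first approach can be illustrated with a minimal sketch: train a linear classifier to predict the concept, then project the representations onto the classifier's null-space so the concept is no longer linearly recoverable. This is an illustrative toy example assuming synthetic 2-D data and scikit-learn, not the speaker's implementation (which iterates this step over multiple classifiers).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projection_matrix(W):
    """Return the matrix projecting onto the null-space of the rows of W."""
    # Right singular vectors give an orthonormal basis for W's row-space
    Vt = np.linalg.svd(W, full_matrices=False)[2]
    # Subtracting the row-space projector leaves the null-space projector
    return np.eye(W.shape[1]) - Vt.T @ Vt

# Toy data: dimension 0 of the representation encodes the "concept"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)  # concept label

# Train a linear concept classifier, then neutralize its direction
clf = LogisticRegression().fit(X, y)
P = nullspace_projection_matrix(clf.coef_)
X_clean = X @ P  # representations with the learned concept direction removed

# A fresh classifier on the projected data should do much worse;
# one projection may leave residual signal, which is why the full
# method repeats the train-and-project step iteratively
acc_after = LogisticRegression().fit(X_clean, y).score(X_clean, y)
```

A single projection removes one linear direction; iterating until a newly trained classifier is at chance yields full linear guarding.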