CILVR SEMINAR: A Framework for Multi-modal Learning: Jointly Modeling Inter- and Intra-Modality Dependencies | Variance-Covariance Regularization Improves Representation Learning

Date: Thursday, March 14, 2024, 11AM
Location: 60FA, 7th floor common area
Speaker: Divyam Madaan, Jiachen Zhu
Notes:

Talk 1: A Framework for Multi-modal Learning: Jointly Modeling Inter- and Intra-Modality Dependencies

Abstract: Supervised multi-modal learning is a key paradigm in machine learning that involves mapping multiple input modalities to a target label. However, its effectiveness can greatly vary across different applications. In this talk, I will delve into the factors behind these performance fluctuations in multi-modal learning and introduce a framework designed to mitigate these disparities. I will demonstrate how traditional methods, typically concentrating on either the inter-modality dependencies or the intra-modality dependencies, may not reliably achieve optimal predictive results. To tackle this challenge, I will introduce our Inter+Intra-Modality (I2M) modeling, which combines inter- and intra-modality dependencies to enhance prediction accuracy. Our findings, drawn from real-world applications in healthcare and vision-and-language tasks, indicate that our approach outperforms traditional methods that focus solely on one type of modality dependency.
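The abstract contrasts models that use only intra-modality dependencies (each modality predicts the label on its own) with models that capture inter-modality dependencies (interactions between modalities). A minimal sketch of combining both, in the spirit of I2M, is below; the function names, the bilinear form of the interaction term, and all shapes are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def i2m_logits(x1, x2, W1, W2, W12):
    """Toy two-modality predictor combining intra- and inter-modality terms.

    x1: (B, d1) features from modality 1 (e.g., clinical measurements)
    x2: (B, d2) features from modality 2 (e.g., image embeddings)
    W1: (d1, C), W2: (d2, C)  -- per-modality (intra) linear heads
    W12: (d1, d2, C)          -- bilinear inter-modality interaction
    Returns (B, C) class logits.
    """
    # Intra-modality terms: each modality contributes independently.
    intra = x1 @ W1 + x2 @ W2
    # Inter-modality term: a bilinear interaction between the modalities,
    # standing in for whatever fusion the full framework uses.
    inter = np.einsum('bi,ijk,bj->bk', x1, W12, x2)
    return intra + inter

# Example with random parameters.
rng = np.random.default_rng(0)
B, d1, d2, C = 4, 3, 5, 2
logits = i2m_logits(
    rng.standard_normal((B, d1)),
    rng.standard_normal((B, d2)),
    rng.standard_normal((d1, C)),
    rng.standard_normal((d2, C)),
    rng.standard_normal((d1, d2, C)),
)
```

Dropping `inter` recovers a purely intra-modality model, and dropping `intra` a purely inter-modality one, which is the design axis the talk examines.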

Talk 2: Variance-Covariance Regularization (VCReg) Improves Representation Learning

Abstract: Transfer learning plays a key role in advancing machine learning models, yet conventional supervised pretraining often undermines feature transferability by prioritizing features that minimize the pretraining loss. This work adapts a self-supervised learning regularization technique from the VICReg method to supervised learning contexts, introducing VCReg. This adaptation encourages the network to learn high-variance, low-covariance representations, promoting the learning of more diverse features. Through extensive empirical evaluation, we demonstrate that our method significantly enhances transfer learning for images and videos and improves performance in scenarios like long-tail learning and hierarchical classification.
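The variance and covariance terms VCReg borrows from VICReg have a simple form: a hinge loss keeping each feature dimension's standard deviation above a threshold, plus a penalty on off-diagonal entries of the batch covariance matrix. A minimal NumPy sketch is below; the default threshold, the epsilon, and how the two terms are weighted against the supervised loss are assumptions, not the paper's exact settings:

```python
import numpy as np

def vc_reg(z, gamma=1.0, eps=1e-4):
    """Variance-covariance regularization terms for a batch of features.

    z: (N, D) batch of representations.
    Returns (var_loss, cov_loss):
      var_loss -- hinge loss pushing each dimension's std above gamma,
                  discouraging collapsed (low-variance) features.
      cov_loss -- mean squared off-diagonal covariance, discouraging
                  redundant (correlated) features.
    """
    n, d = z.shape
    z = z - z.mean(axis=0)                            # center per dimension
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.maximum(0.0, gamma - std).mean()    # hinge on per-dim std
    cov = (z.T @ z) / (n - 1)                         # (D, D) covariance
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss
```

In the supervised setting the abstract describes, these terms would be added to the classification loss at training time so that the pretrained features stay high-variance and decorrelated, and hence more transferable.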