CILVR Seminar: Discrete Generative Models for Designing and Sampling Atomistic Data | Practical causal representation learning with an asymmetric prior

Speakers: Nate Gruver, Taro Makino

Location: 60 Fifth Avenue, 7th floor common area

Date: Thursday, November 16, 2023

Title: Discrete Generative Models for Designing and Sampling Atomistic Data

Abstract: Generative AI is hot, but are current models on the path to solving humanity’s most pressing problems? Instead of generating images of the pope playing basketball, how do we develop models and methods for designing antibodies that fight cancer or new inorganic materials for carbon capture? In this talk I will discuss two of my recent projects on applying generative sequence models to atomistic data, in particular protein design and the discovery of stable inorganic materials. Although many people approach these problems by considering symmetries and constraints on continuous objects in 3D space, I will show that relatively small modifications of standard language models can be used to generate atomic structures with desired properties such as stability or binding to a target molecule. I will also show how language models can be combined with Bayesian optimization for applications with limited data and expensive validation. Along with introducing new methodology, I will discuss the evaluations in these two papers, including a real-life antibody design campaign with multiple stages of wet-lab synthesis, and the validation of sampled inorganic materials through extensive density-functional theory (DFT) simulations.

Bio: Nate Gruver is a PhD student at NYU advised by Andrew Gordon Wilson and working closely with Kyunghyun Cho. He is most excited about making foundation models impactful for scientific problems, particularly in biology and chemistry. Nate obtained an MS and BS in computer science from Stanford University where he worked on planning problems and generative modeling for autonomous vehicles. Outside of work, he sometimes spends weeks in the woods and enjoys making things in a kitchen or an art studio.

-----

Title: Practical causal representation learning with an asymmetric prior

Abstract: We often rely on machine learning models to generalize to unseen environments. This problem is called domain generalization, and it is extremely challenging due to the presence of environment-specific spurious correlations. We turn to causal representation learning, which aims to learn features that are invariant to the environment. Existing algorithms make unrealistic assumptions, or rely on complex statistical procedures that are impractical for real-world applications. We propose the Asymmetric Prior Variational Autoencoder (AP-VAE) to address these weaknesses. In the AP-VAE, two latent variables represent the environment-invariant and environment-specific features. Due to an asymmetry in their priors, these latents can be estimated via standard variational inference, making our approach highly practical. The inferred invariant features are then used downstream for out-of-distribution prediction. We empirically validate our method on two toy problems and a real-world histopathology dataset.
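To make the two-latent idea concrete, here is a minimal numpy sketch of a single-sample ELBO with two latents and asymmetric priors. The specific form of the asymmetry (a fixed N(0, I) prior on the invariant latent versus an environment-conditioned prior mean on the environment-specific latent), and all variable names, are illustrative assumptions, not the paper's actual construction; the "encoder" and "decoder" are stand-ins for networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, logvar, prior_mu, prior_logvar):
    """KL( N(mu, exp(logvar)) || N(prior_mu, exp(prior_logvar)) ), summed over dims."""
    return 0.5 * np.sum(
        prior_logvar - logvar
        + (np.exp(logvar) + (mu - prior_mu) ** 2) / np.exp(prior_logvar)
        - 1.0
    )

# Toy setup: a 4-d observation x; z_inv and z_env are each 2-d.
x = rng.normal(size=4)

# Stand-ins for encoder outputs q(z_inv | x) and q(z_env | x).
mu_inv, logvar_inv = rng.normal(size=2), np.zeros(2)
mu_env, logvar_env = rng.normal(size=2), np.zeros(2)

# Asymmetric priors (hypothetical choice): the invariant latent gets a fixed
# N(0, I) prior shared across environments, while the environment-specific
# latent's prior mean depends on the environment label e.
e_prior_mean = np.array([1.5, -0.5])  # per-environment prior mean for z_env
kl_inv = gaussian_kl(mu_inv, logvar_inv, np.zeros(2), np.zeros(2))
kl_env = gaussian_kl(mu_env, logvar_env, e_prior_mean, np.zeros(2))

# Reparameterized samples and a linear "decoder" reconstruction term.
z_inv = mu_inv + np.exp(0.5 * logvar_inv) * rng.normal(size=2)
z_env = mu_env + np.exp(0.5 * logvar_env) * rng.normal(size=2)
W = rng.normal(size=(4, 4))
x_hat = W @ np.concatenate([z_inv, z_env])
recon = -0.5 * np.sum((x - x_hat) ** 2)  # Gaussian log-likelihood up to a constant

elbo = recon - kl_inv - kl_env
print(f"single-sample ELBO: {elbo:.3f}")
```

Because both KL terms have closed forms, the whole objective is trainable with ordinary stochastic variational inference; at test time one would keep only z_inv for out-of-distribution prediction.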

Bio: Taro Makino is a fourth-year PhD candidate at NYU CDS advised by Kyunghyun Cho and Krzysztof Geras. He works on practical algorithms to improve out-of-distribution generalization on real-world problems, such as drug discovery. Taro holds an MS in Artificial Intelligence from the University of Edinburgh, where he worked with Amos Storkey, and a BA in Mathematics from Northwestern University.