The Case for Bayesian Deep Learning

Andrew Gordon Wilson
January 11, 2020

Prologue: I posted a response to recent misunderstandings around Bayesian deep learning. I have since been urged to collect and develop my remarks into an accessible and self-contained reference. For this purpose, I have written the note posted here. I hope this exposition will be helpful to those seeking to understand what makes Bayesian inference distinctive, and why Bayesian inference is worthwhile in deep learning. This note is also intended to help clarify the poorly understood connections between approximate Bayesian inference and deep ensembles, in light of the recent misunderstanding that deep ensembles and Bayesian methods are competing approaches. A PDF version is available at the end of this post.

Bayesian inference is especially compelling for deep neural networks. The key distinguishing property of a Bayesian approach is marginalization instead of optimization, not the prior, or Bayes rule. Neural networks are typically underspecified by the data, and can represent many different but high performing models corresponding to different settings of parameters, which is exactly when marginalization will make the biggest difference for accuracy and calibration. Moreover, deep ensembles can be seen as approximate Bayesian marginalization.

In many situations, the predictive distribution we want to compute is given by

\begin{equation*} p(y|x, \mathcal{D}) = \int p(y|x, w) p(w|\mathcal{D}) dw \,. \quad\quad\quad\quad\quad\quad\quad\quad\quad (1) \end{equation*}

The outputs are \(y\) (e.g., class labels, regression values, ...), indexed by inputs \(x\) (e.g., images, spatial locations, ...), the weights (or parameters) of the model \(f(x; w)\) are \(w\), and \(\mathcal{D}\) are the data. Eq. (1) represents a Bayesian model average (BMA). Rather than bet everything on one hypothesis — with a single setting of parameters \(w\) — we want to use every possible setting of parameters, weighted by their posterior probabilities. This process is called marginalization of the parameters \(w\), since the predictive distribution of interest no longer conditions on \(w\). This is not a controversial equation, but a direct expression of the sum and product rules of probability. The BMA represents epistemic uncertainty — that is, uncertainty over which setting of weights (hypothesis) is correct, given limited data. Epistemic uncertainty is sometimes referred to as model uncertainty, in contrast to aleatoric uncertainty coming from noise in the measurement process. One can naturally visualize epistemic uncertainty in regression by looking at the spread of the predictive distribution as we move in \(x\) space. As we move away from the data, there are many more functions that are consistent with our observations, and so our epistemic uncertainty should grow.
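
In practice, the integral in Eq. (1) is typically estimated by a simple Monte Carlo average over samples from the (approximate) posterior. Below is a minimal sketch in Python; `sample_posterior_weights` and `predict_probs` are hypothetical stand-ins for an approximate posterior sampler and the model's predictive distribution, not part of any particular library.

```python
import numpy as np

def bayesian_model_average(x, sample_posterior_weights, predict_probs, num_samples=30):
    """Simple Monte Carlo estimate of the BMA in Eq. (1):
    p(y | x, D) ~ (1/J) sum_j p(y | x, w_j), with w_j ~ p(w | D)."""
    probs = []
    for _ in range(num_samples):
        w = sample_posterior_weights()       # draw w_j from the (approximate) posterior p(w | D)
        probs.append(predict_probs(x, w))    # p(y | x, w_j), e.g. a vector of softmax outputs
    return np.mean(probs, axis=0)            # average the predictive distributions
```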

In classical training, one typically finds the regularized maximum likelihood solution

\begin{equation*} \hat w = \arg\max_w \log p(w|\mathcal{D}) = \arg\max_w (\log p(\mathcal{D}|w) + \log p(w) + \text{constant}). \quad\quad\quad\quad\quad\quad\quad\quad\quad (2) \end{equation*}

This procedure is sometimes called maximum a posteriori (MAP) optimization, as it involves maximizing a posterior. \(\log p(\mathcal{D}|w)\) is the log likelihood, formed by relating the function we want to learn \(f(x;w)\) to our observations. If we are performing classification with a softmax link function, \(-\log p(\mathcal{D}|w)\) corresponds to the cross entropy loss. If we are performing regression with Gaussian noise, such that \(p(\mathcal{D}|w) = \prod_{j=1}^{n} p(y_j | w, x_j) = \prod_{j=1}^{n} \mathcal{N}(y_j; f(x_j;w),\sigma^2)\), then \(-\log p(\mathcal{D}|w)\) is a scaled MSE loss. In this context, the prior \(p(w)\) acts as a regularizer. If we choose a flat prior, which has no preference for any setting of the parameters \(w\) (it does not assign any feasible setting any more prior density than any other), then it will have no effect on the optimization solution. On the other hand, a flat prior may have a major effect on marginalization. Indeed, even though MAP involves a posterior and a prior, and an instantiation of Bayes rule, it is not at all Bayesian, since it is performing optimization to bet everything on a single hypothesis \(f(x;\hat{w})\).
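
For example, writing out the Gaussian likelihood above, the negative log likelihood is a scaled MSE plus a term that does not depend on \(w\); and if we additionally choose a zero-mean Gaussian prior \(p(w) = \mathcal{N}(w; 0, \alpha^2 I)\) (an illustrative choice, with \(\alpha\) a prior scale), the regularizer is the familiar weight decay penalty:

\begin{equation*} -\log p(\mathcal{D}|w) = \frac{1}{2\sigma^2} \sum_{j=1}^{n} \big(y_j - f(x_j;w)\big)^2 + \frac{n}{2}\log(2\pi\sigma^2), \qquad -\log p(w) = \frac{1}{2\alpha^2} \|w\|^2 + \text{constant}. \end{equation*}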

We can view classical training as performing approximate Bayesian inference, using the approximate posterior \(p(w | \mathcal{D}) \approx \delta(w=\hat{w})\), where \(\delta\) is a Dirac delta function that is zero everywhere except at \(\hat{w}\). In this case, we recover the standard predictive distribution \(p(y|x,\hat{w})\). From this perspective, many alternatives, albeit imperfect, will be preferable — including impoverished Gaussian posterior approximations for \(p(w|\mathcal{D})\), even if the posterior or likelihood are actually highly non-Gaussian and multimodal.

The difference between a classical and Bayesian approach will depend on how sharply peaked the posterior \(p(w|\mathcal{D})\) becomes. If the posterior is sharply peaked, there may be almost no difference, since a point mass may then be a reasonable approximation of the posterior. However, deep neural networks are typically very underspecified by the available data, and will thus have diffuse likelihoods \(p(\mathcal{D}|w)\). Not only are the likelihoods diffuse, but different settings of the parameters correspond to a diverse variety of compelling explanations for the data. Indeed, Garipov et al. [5] show that there are large valleys in the loss landscape of neural networks, over which parameters incur very little loss, but give rise to high performing functions that make meaningfully different predictions on test data. Izmailov et al. [9] and Zołna et al. [33] also demonstrate the variety of good solutions that can be expressed by a neural network posterior. This is exactly the setting in which we most want to perform a Bayesian model average, which will lead to an ensemble containing many different but high performing models, for better accuracy and better calibration than classical training.

The recent success of deep ensembles [12] is not discouraging, but indeed strong motivation for following a Bayesian approach. Deep ensembling involves MAP training of the same architecture many times, starting from different random initializations, to find different local optima. Thus using these models in an ensemble is an approximate Bayesian model average, with point masses placed at parameter settings that have high likelihood and give rise to diverse predictions. Instead of using a single point mass to approximate our posterior, as with classical training, we are now using multiple point masses in good locations, enabling a better approximation to the integral in Eq. (1) that we are trying to compute. The functional diversity is important for a good approximation to the BMA integral, because we are summing together terms of the form \(p(y|x,w)\); if two settings of the weights \(w_i\) and \(w_j\) each provide high likelihood (and consequently high posterior density), but give rise to similar models, then they will be largely redundant in the model average, and the second setting of parameters will not contribute much to the integral estimate.
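
As a sketch of this view, a deep ensemble places equally weighted point masses at several independently trained solutions and averages their predictive distributions. The PyTorch-style snippet below is illustrative only; `make_model`, `train_map`, and `train_loader` are hypothetical placeholders for an architecture constructor, a standard (regularized maximum likelihood) training routine, and a data loader.

```python
import torch

def deep_ensemble_predict(x, make_model, train_map, train_loader, num_members=5):
    """Approximate the BMA of Eq. (1) with K point masses: train the same
    architecture from K random initializations and average the softmax outputs."""
    member_probs = []
    for _ in range(num_members):
        model = make_model()                          # fresh random initialization
        train_map(model, train_loader)                # classical (MAP) training to a local optimum
        model.eval()
        with torch.no_grad():
            member_probs.append(torch.softmax(model(x), dim=-1))  # p(y | x, w_k)
    return torch.stack(member_probs).mean(dim=0)      # (1/K) sum_k p(y | x, w_k)
```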

While a recent report [22] shows that deep ensembles appear to outperform some particular approaches to Bayesian neural networks, there are two key reasons behind these results that are actually optimistic for Bayesian approaches. First, the deep ensembles being used are finding many different basins of attraction, corresponding to diverse solutions, which enables a better approximation to a Bayesian model average than the specific Bayesian methods considered in Ovadia et al. [22], which focus their modelling effort on a single basin of attraction. The second is that the deep ensembles require retraining a network from scratch many times, which incurs a great computational expense. If one were to control for computation, the approaches which focus on a single basin may be preferred.

There is an important distinction between a Bayesian model average and some approaches to ensembling. The Bayesian model average assumes that one hypothesis (one setting of the weights) is correct, and averages over models due to an inability to distinguish between hypotheses given limited data [19]. As we observe more data, the posterior collapses, and the Bayesian model average converges to the maximum likelihood solution. If the true explanation for the data is actually a combination of hypotheses, the Bayesian model average may then perform worse as we observe more data. Some ensembling methods instead work by enriching the hypothesis space, and thus do not collapse in this way. Deep ensembles, however, are finding different MAP or maximum likelihood solutions, corresponding to different basins of attraction, starting from different random initializations. Therefore the deep ensemble will collapse when the posterior concentrates, as with a Bayesian model average. Since the hypothesis space is highly expressive for a modern neural network, posterior collapse in many cases is desirable.

Regarding priors, the prior that matters is the prior in function space, not parameter space. In the case of a Gaussian process [e.g., 28], a vague prior would be disastrous, as it is a prior directly in function space and would correspond to white noise. However, when we combine a vague prior over parameters \(p(w)\) with a structured functional form \(f(x;w)\), such as a convolutional neural network (CNN), we induce a structured prior distribution over functions \(p(f(x;w))\). Indeed, the inductive biases and equivariance constraints in such models are why they work well in classical settings. We can sample from this induced prior over functions by first sampling parameters from \(p(w)\) and then conditioning on these parameters in \(f(x;w)\) to form a sample from \(p(f(x;w))\) [e.g., 29, Ch 2]. Alternatively, we can use a neural network kernel with a Gaussian process, to induce a structured distribution over functions [30].
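
This sampling procedure is easy to illustrate with a toy example: draw all parameters from a broad Gaussian prior and evaluate the network on a grid of inputs, so that each draw gives one sample function from \(p(f(x;w))\). A minimal sketch, using a small fully connected network rather than a CNN purely for brevity:

```python
import numpy as np

def sample_prior_functions(x_grid, hidden=100, num_functions=5, prior_std=1.0, seed=0):
    """Sample functions from the prior over functions induced by p(w) = N(0, prior_std^2 I)
    pushed through a one-hidden-layer tanh network f(x; w)."""
    rng = np.random.default_rng(seed)
    functions = []
    for _ in range(num_functions):
        w1 = rng.normal(0.0, prior_std, size=(1, hidden))   # sample every weight from p(w)
        b1 = rng.normal(0.0, prior_std, size=hidden)
        w2 = rng.normal(0.0, prior_std, size=(hidden, 1))
        h = np.tanh(x_grid[:, None] @ w1 + b1)              # hidden activations on the grid
        functions.append((h @ w2).ravel())                  # f(x; w): one sample function
    return np.array(functions)                              # each row is a draw from p(f(x; w))
```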

Bayesian or not, the prior, just like the functional form of a model or the likelihood, will certainly be imperfect, and making unassailable assumptions is impossible. Attempting to avoid an important part of the modelling process because one has to make assumptions, however, will often be a worse alternative than an imperfect assumption. There are many considerations one might have in selecting a prior. One such consideration is invariance under reparametrization. But parametrization invariance is also a question in considering regularizers, optimization procedures, model specification, and so on, and is not specific to whether or not one should follow a Bayesian approach. Nonetheless, I will make some brief additional remarks on these questions.

If we truly have a vague prior over parameters, perhaps subject to some constraint for normalization, then our posterior reflects essentially the same relative preferences between models as our likelihood, for it is a likelihood scaled by a factor that does not depend on \(w\) outside some basic constraints. In computing the integral for a Bayesian model average, each setting of parameters is weighted by the quality of the associated function, as measured by the likelihood. Thus the model average is happening in function space, and is invariant to reparametrization. In the context of many standard architectural specifications, there are also some additional benefits to using relatively broad zero-mean Gaussian priors, which help provide smoothness in function space by bounding the norm of the weights. But this smoothness is not a central reason to follow a Bayesian approach, as one could realize similar advantages in performing MAP optimization. Bayesian methods are fundamentally about marginalization as an alternative to optimization.
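
To spell out the first point above: if \(p(w)\) is essentially constant over the feasible region of parameter settings, Bayes rule gives

\begin{equation*} p(w|\mathcal{D}) = \frac{p(\mathcal{D}|w)\,p(w)}{p(\mathcal{D})} \propto p(\mathcal{D}|w), \end{equation*}

so in Eq. (1) each parameter setting is weighted, up to a constant, by how well the corresponding function explains the data.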

Moreover, vague priors over parameters are often a reasonable description of our a priori subjective beliefs. We want to use a given functional form, which is by no means vague, but we often do not have any strong a priori preference for a setting of the parameters. It is worth reiterating that a vague prior in parameter space combined with a highly structured model such as a convolutional neural network does not imply a vague prior in function space, which is also why classical training of neural networks provides good results. Indeed, vague parameter priors are often preferable to entirely ignoring epistemic uncertainty, which would be the standard alternative. In fact, ignoring epistemic uncertainty is a key reason that standard neural network training is miscalibrated. By erroneously assuming that the model (the parameter setting we want to use) is completely determined by a finite dataset, the predictive distribution becomes overconfident: for example, the highest softmax output of a CNN that has undergone standard training (e.g., MAP optimization) will typically be much higher than the probability of the corresponding class label [7]. Importantly, ignoring epistemic uncertainty also leads to worse accuracy in point predictions, because we are now ignoring all the other compelling explanations for the data. While improvements in calibration are an empirically recognized benefit of a Bayesian approach, the enormous potential for gains in accuracy through Bayesian marginalization with neural networks is a largely overlooked advantage.
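
This overconfidence is straightforward to check empirically: bin test predictions by their confidence (the maximum softmax output) and compare the average confidence with the accuracy in each bin, as in a reliability diagram [7]. A rough sketch, assuming `probs` is an array of predicted class probabilities and `labels` holds the true class indices:

```python
import numpy as np

def reliability_table(probs, labels, num_bins=10):
    """Compare confidence (max softmax output) with accuracy in each confidence bin.
    An overconfident model has confidence exceeding accuracy, especially in the top bins."""
    confidences = probs.max(axis=1)                 # highest softmax output per example
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            rows.append((hi, confidences[mask].mean(), correct[mask].mean()))
    return rows                                     # (bin upper edge, average confidence, accuracy)
```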

There are also many examples where flat priors over parameters combined with marginalization sidestep pathologies of maximum likelihood training. Priors without marginalization are simply regularization, but Bayesian methods are not about regularization [17, Ch 28]. And there is a large body of work considering approximate Bayesian methods with uninformative priors over parameters (but not functions) [e.g., 3, 2, 21, 1, 6, 17, 14, 20]. This approach is well-motivated, marginalization is still compelling, and the results are often better than regularized optimization.

By accounting for epistemic uncertainty through uninformative parameter (but not function) priors, we, as a community, have developed Bayesian deep learning methods with improved calibration, reliable predictive distributions, and improved accuracy [e.g., 15, 20, 4, 25, 10, 24, 11, 18, 27, 9, 32]. MacKay [16] and Neal [20] are particularly notable early works considering Bayesian inference for neural networks. Seeger [26] also provides a clear tutorial on Bayesian methods in machine learning. Of course, we can always make better assumptions — Bayesian or not. We should strive to build more interpretable parameter priors. There are works that consider building more informative parameter priors for neural networks by reasoning in function space [e.g., 27, 31, 13, 8]. And we should also build better posterior approximations. Deep ensembles are a promising step in this direction.

But we should not undermine the progress we are making so far. Bayesian inference is especially compelling for deep neural networks. Bayesian deep learning is gaining visibility because we are making progress, with good and increasingly scalable practical results. We should not discourage these efforts. If we are shying away from an approximate Bayesian approach because of some challenge or imperfection, we should always ask: what is the alternative? The alternative may indeed be a more impoverished representation of the predictive distribution we want to compute.

There are certainly many challenges to computing the integral of Eq. (1) for modern neural networks, including a posterior landscape which is difficult to navigate, and an enormously high (e.g., 30 million) dimensional parameter space. Many of the above papers are working towards addressing such challenges. In particular, we have been working on recycling geometric information in the SGD trajectory for scalable approximate Bayesian inference [9, 18], exploiting large loss valleys [5], and creating subspaces of low dimensionality that capture much of the variability of the network [9]. Pradier et al. [23] also consider different approaches to dimensionality reduction, based on non-linear transformations. For exploring multiple distinct basins of attraction, we have been developing cyclical stochastic gradient MCMC approaches [32], which could be seen as sharing many of the advantages of deep ensembles, but with an added attempt to also marginalize within basins of attraction.
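
As one concrete illustration of reusing geometric information in the SGD trajectory, the approach of Maddox et al. [18] fits a Gaussian to the weights visited by SGD after convergence and then samples from it to form the model average. The sketch below is a simplified, diagonal-only caricature of that idea, not the exact procedure from the paper; `sgd_epoch` and `get_flat_weights` are hypothetical helpers that run one epoch of SGD and read out the current weights as a flat vector.

```python
import numpy as np

def fit_gaussian_to_sgd_trajectory(model, sgd_epoch, get_flat_weights, num_collect=20):
    """Fit a diagonal Gaussian to the weights visited along the SGD trajectory
    (a simplified, diagonal-only variant of the idea in [18])."""
    mean, sq_mean = None, None
    for i in range(1, num_collect + 1):
        sgd_epoch(model)                       # one further epoch of SGD from the converged solution
        w = get_flat_weights(model)            # snapshot of the current weights
        mean = w if mean is None else mean + (w - mean) / i                     # running first moment
        sq_mean = w**2 if sq_mean is None else sq_mean + (w**2 - sq_mean) / i   # running second moment
    variance = np.maximum(sq_mean - mean**2, 1e-8)
    return mean, variance                      # sample w ~ N(mean, diag(variance)) and plug into Eq. (1)
```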

If you wish to cite this note in your work, here is the bibliographic information:
@article{wilson2020case,
title={The case for {B}ayesian deep learning},
author={Wilson, Andrew Gordon},
journal={arXiv preprint arXiv:2001.10995},
year={2020}
}
A PDF version is available here.

References

[1] James Berger et al. The case for objective Bayesian analysis. Bayesian analysis, 1(3): 385–402, 2006.

[2] James O Berger and Luis R Pericchi. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122, 1996.

[3] Merlise Clyde and Edward I George. Model uncertainty. Statistical science, pages 81–94, 2004.

[4] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.

[5] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Neural Information Processing Systems, 2018.

[6] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.

[7] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.

[8] Danijar Hafner, Dustin Tran, Alex Irpan, Timothy Lillicrap, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.

[9] Pavel Izmailov, Wesley J Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Subspace inference for Bayesian deep learning. In Uncertainty in Artificial Intelligence, 2019.

[10] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.

[11] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018.

[12] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[13] Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The functional neural process. In Advances in Neural Information Processing Systems, 2019.

[14] D. J. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

[15] David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.

[16] David JC MacKay. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.

[17] David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.

[18] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.

[19] Thomas P Minka. Bayesian model averaging is not model combination. Available electronically at http://www.stat.cmu.edu/minka/papers/bma.html, 2000.

[20] R.M. Neal. Bayesian Learning for Neural Networks. Springer Verlag, 1996. ISBN 0387947248.

[21] Anthony O’Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):99–118, 1995.

[22] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua V Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.

[23] Melanie F Pradier, Weiwei Pan, Jiayu Yao, Soumya Ghosh, and Finale Doshi-Velez. Latent projection bnns: Avoiding weight-space pathologies by learning latent representations of neural network weights. arXiv preprint arXiv:1811.07006, 2018.

[24] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations (ICLR), 2018.

[25] Yunus Saatci and Andrew G Wilson. Bayesian GAN. In Advances in neural information processing systems, pages 3622–3631, 2017.

[26] Matthias Seeger. Bayesian modelling in machine learning: A tutorial review. Technical report, 2006.

[27] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.

[28] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. The MIT Press, 2006.

[29] Andrew Gordon Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.

[30] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

[31] Wanqian Yang, Lars Lorch, Moritz A Graule, Srivatsan Srinivasan, Anirudh Suresh, Jiayu Yao, Melanie F Pradier, and Finale Doshi-Velez. Output-constrained bayesian neural networks. arXiv preprint arXiv:1905.06287, 2019.

[32] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.

[33] Konrad Zołna, Krzysztof J Geras, and Kyunghyun Cho. Classifier-agnostic saliency map extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 10087–10088, 2019.