Scalable Kernel Learning and
Gaussian Processes
Update: KISS-GP and Deep Kernel Learning code has just been
released! (Oct 24, 2016)
Update: GP-LSTM paper and code has just been released! (Oct
28, 2016)
Update: Expanded the README in the Deep Kernel Learning
source (Oct 28, 2016)
Update: The SV-DKL paper and code have been posted! (Nov 2,
2016)
More updates to follow soon!
Kernel methods, such as Gaussian processes, have had an
exceptionally consequential impact on machine learning theory
and practice. However, these methods face two fundamental
open questions:
(1) Kernel Selection: The generalisation properties
of a kernel method depend entirely on the choice of kernel function.
However, one often defaults to the RBF kernel, which can
only discover very limited representations of the data.
(2) Scalability: Gaussian processes, for example,
scale as O(n^3) for training, O(n^2) for storage, and O(n^2) per
test prediction,
for n training points. This scalability typically limits
GP applicability to at most a few thousand datapoints.
Moreover, standard approaches to GP scalability typically only
work for up to about fifty thousand training points, and
sacrifice too much model structure for kernel learning.
We view kernel learning and scalability as two sides of the same
coin. It is on large datasets that we most want flexible models,
because such datasets often contain the information needed to learn
rich statistical representations through kernel learning -- for strong
predictive performance, and new scientific insights into our
modelling problems.
There has been exciting recent work targeted at addressing these
limitations.
On this page, we have code, tutorials, demos, and resources for
our work on scalable kernel learning. We introduce
approaches for scalable inference and automatically learning
kernel functions. Several of these approaches have the
inductive biases of deep learning architectures, while retaining
a probabilistic non-parametric representation, and allowing for
O(n) training and O(1) testing (compared
to O(n^3) and O(n^2) for standard GPs). Our approaches to
scalability are largely grounded in exploiting algebraic
structure in kernel
matrices in combination with local kernel interpolation, for
efficient and highly accurate numerical linear algebra
operations.
Moreover, we view kernel methods and deep learning as
complementary
rather than competing approaches. Kernel methods provide
an elegant mathematical approach to representing an infinite
number of basis functions using a finite amount of computational
power. Deep learning approaches provide a way to
create adaptive basis functions which have powerful inductive
biases. With deep kernel learning, we can create infinite
expansions with basis functions that have the structural
properties and inductive biases of neural networks. The
information capacity of such models grows automatically with the
size of the data, and the complexity can be automatically
calibrated through Bayesian inference. And by viewing
neural networks through the lens of
metric learning,
deep learning approaches become more
interpretable.
Many of these themes are described in my talk "
Scalable
Gaussian Processes for Scientific Discovery".
We always welcome constructive feedback. Please feel free
to reach out with bug reports or suggestions. To
understand this material
you must have a basic understanding of Gaussian processes
as
described in the first five chapters of GPML (particularly
chapters 2, 4, 5), or in the
first
two chapters of my PhD thesis.
All of the code here builds upon the
GPML
library. Before using this code, I suggest following
some of the tutorials on the
GPML
website.
To jump to the various sections:
[Spectral Mixture Kernels |
Kronecker Inference |
Kronecker non-Gaussian Likelihoods |
KISS-GP |
Deep Kernel Learning |
Stochastic Variational Deep Kernel Learning |
GP-LSTM |
Datasets]
Andrew Gordon
Wilson
November 2, 2016
The spectral mixture (SM) kernel is an expressive
basis for all stationary covariance functions. It is
formed by modelling the spectral
density (Fourier transform) of a kernel as a scale-location
Gaussian mixture. One can analytically take the inverse
Fourier transform to recover the closed form expression for this
kernel. Learning the kernel amounts to learning the
hyperparameters of this kernel function, using the same
techniques as one would learn the hyperparameters of an RBF
kernel, but with special care for how these parameters are
initialized. This kernel has been used to discover a wide
range of flexible statistical representations, enabling long
range crime prediction, time series, image and video
extrapolation, as well as new interpretable scientific insights
into our modelling problems.
Key Reference:
Gaussian process kernels for pattern discovery and
extrapolation
Andrew Gordon Wilson and Ryan Prescott Adams
International Conference on Machine Learning (ICML),
2013.
[
arXiv,
PDF,
Supplementary,
Slides,
Video
Lecture,
Typo
correction, BibTeX]
Typo
correction regarding the spectral mixture kernel for
multidimensional inputs
Our other papers discussing spectral mixture kernels:
Covariance kernels for fast automatic pattern discovery and
extrapolation with Gaussian processes
Andrew Gordon Wilson
PhD Thesis, January 2014
[
PDF,
BibTeX]
Fast kernel learning for multidimensional pattern
extrapolation
Andrew Gordon Wilson*, Elad Gilboa*, Arye Nehorai, and John
P. Cunningham
Advances in Neural Information Processing Systems (NIPS)
2014
[
PDF,
BibTeX,
Code,
Slides]
Kernel Interpolation for Scalable Structured Gaussian
Processes (KISS-GP)
Andrew Gordon Wilson and Hannes Nickisch
International Conference on Machine
Learning (ICML), 2015
[
PDF,
Supplement,
arXiv,
BibTeX,
Theme Song]
Fast kronecker inference in Gaussian processes with
non-Gaussian likelihoods
Seth Flaxman, Andrew Gordon Wilson, Daniel Neill, Hannes
Nickisch, and Alexander J. Smola
International Conference on Machine
Learning (ICML), 2015
[
PDF,
Supplement,
Code]
À la carte - learning fast kernels
Zichao Yang, Alexander J. Smola, Le Song, and Andrew Gordon
Wilson
Artificial Intelligence and Statistics (AISTATS), 2015
Oral Presentation
[
PDF,
BibTeX]
A process over all stationary covariance kernels
Andrew Gordon Wilson
Technical Report, University of Cambridge. June
2012.
[
PDF]
Scalable Gaussian Processes for
Characterizing Multidimensional Change Surfaces
William Herlands, Andrew Gordon Wilson, Seth Flaxman,
Daniel Neill, Wilbert van Panhuis, and Eric P. Xing
Artificial Intelligence and Statistics
(AISTATS), 2016
[PDF,
BibTeX]
Deep Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*, Ruslan
Salakhutdinov, and Eric P. Xing
Artificial Intelligence and Statistics
(AISTATS), 2016
[PDF,
arXiv,
BibTeX]
The Human Kernel
Andrew Gordon Wilson, Christoph Dann, Christopher G. Lucas,
and Eric P. Xing
Neural Information Processing Systems (NIPS), 2015
Spotlight
[
PDF,
arXiv,
Supplement,
BibTeX]
Code:
-
GPML
v3.6 [older version]
-
covSM.m
-
covSMfast.m
[only works if the inputs are 1D]
-
initSMhypers.m
-
initSMhypersadvanced.m
-
smspect.m
-
empspect.m
-
plotspectral.m
-
testSMairline.m
-
testSMCO2.m
-
testSMCO2_adv.m
-
testSMaudio_adv.m
-
airline
dataset
-
CO2
dataset
-
audio
dataset
-
everything
for this demo (except GPML) in one zip
The spectral mixture kernel has a simple form:

k(x, x') = sum_{q=1}^{Q} w_q exp(-2 pi^2 tau' Sigma_q tau) cos(2 pi mu_q' tau),   where tau = x - x'.

Here theta = {w_q, Sigma_q, mu_q} contains all of the kernel
hyperparameters, and Sigma_q in one dimension simply represents a
variance parameter v_q, so that
k(tau) = sum_q w_q exp(-2 pi^2 v_q tau^2) cos(2 pi mu_q tau)
(see the typo correction linked above for the conventions with
multidimensional inputs).
In some papers (and code) the
weights,
w_q, are
squared, in others they are not. Just be careful to keep
track of whether the weights are squared and adjust your
interpretation accordingly.
One can learn the hyperparameters by maximizing the log marginal
likelihood objective (with training data y):

log p(y | theta) = -(1/2) y' (K_theta + sigma^2 I)^{-1} y
                   - (1/2) log |K_theta + sigma^2 I| - (n/2) log(2 pi).

Or one can integrate away these hyperparameters using sampling
techniques.
Running a demo:
Step 1: Download the above files.
Step 2: Run testSMairline.m to produce the following time series
extrapolation results
(I have included an example run result in
airlinerun.mat):
Step 3: Try the same experiment with different random
seeds. Do you find consistent results?
Replace 'covSM.m' with 'covSMfast.m'. How does it affect
the runtime of testSMairline?
[The newer versions of covSM.m are fast and you may not notice a
difference].
Understanding the parameters:
Step 4: The spectral mixture kernel is flexible, capable of
discovering many powerful representations, but the learning
surface for a spectral mixture kernel is often highly
multimodal, making initialization very important for good
predictive success. Thankfully there are several
initialization routines for the spectral mixture kernel which
are reasonably simple and work on a variety of datasets with
different properties.
There are four key hyperparameters: the
weights (
w_i^2),
variances (
v_i), and
means (µ_i) of the
Gaussians in frequency space, and the noise variance (sigma^2)
in the time domain.
The variances can be thought of as inverse length-scales, and
means as frequencies, and the weights as signal variances.
Step 5: Set the number of components Q = 1. Hard-code the
variance to zero, the frequency to 2, the weight to 1, and plot
a draw from a GP with this spectral mixture kernel. Notice
that it's a strictly periodic function with a period of
0.5. Now take a draw with the variance set to 1. The
resulting draws should become quasiperiodic.
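A minimal sketch for this step, in plain Matlab (no GPML required), using the 1-D form of the kernel given above; the jitter term is only there to keep the Cholesky factorization numerically stable:
x   = linspace(0, 5, 500)';
w   = 1;  mu = 2;  v = 0;                 % weight, frequency, variance (try v = 1 afterwards)
tau = bsxfun(@minus, x, x');              % pairwise differences x - x'
K   = w * exp(-2*pi^2 * v * tau.^2) .* cos(2*pi * mu * tau);
f   = chol(K + 1e-8*eye(numel(x)), 'lower') * randn(numel(x), 1);   % one GP draw
plot(x, f)                                % with v = 0 this is strictly periodic, period 1/mu = 0.5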
Step 6: We can gain more insights into the parameters of the
spectral mixture kernel by looking at its associated spectral
density. Plot the (log) spectral densities, with help from
the
smspect.m associated with these two models.
See what happens, both in function space, and in the spectral
density, as you vary the frequency, weight, and variance
(inverse length-scale) parameters to their extreme values, and
add more components to the model.
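For reference, the spectral density associated with the SM kernel above (the quantity smspect.m helps you plot) is a symmetrized scale-location Gaussian mixture; up to the weight parameterization noted earlier, in one dimension it is

S(s) = sum_{q=1}^{Q} (w_q / 2) [ N(s ; mu_q, v_q) + N(s ; -mu_q, v_q) ],

so the means mu_q locate the frequency peaks, the variances v_q control how spread out (and hence how quasiperiodic) each component is, and the weights w_q set the contribution of each peak.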
Run
plotspectral.m on the learned kernel for airline to
reproduce this plot:
Notice that the empirical spectrum looks like a very noisy
version of the learned spectral mixture spectral density, with
major frequency peaks in the same places [we will use this later
for initialization].
Initialization:
Step 7: The SM kernel can match a wide range of different
covariance functions corresponding to different settings of its
hyperparameters. However, learning a good set of
hyperparameters can be difficult, because the learning objective
is typically quite multimodal in the frequency parameters.
On the airline dataset, try looking at the log marginal
likelihood as a function of the frequency parameters, fixing the
variances and weights.
Step 8: In step 2 we used the script initSMhypers.m to
initialize the spectral mixture kernel. It's a relatively
simple procedure, where we initialize the
weights: the weights are related to the signal variance
of the data. If we look at k(0,0) we get the sum of the
weights. This value is also the signal variance of our
distribution over functions. We therefore initialize the
weights as the standard deviation of the data divided by the
number of components (assuming that the noise standard deviation
is negligible compared to the signal). All the weights are
initialized to the same value here.
variances: I find it helpful to think of the variances as
inverse length-scales. We take the range of the data
(which is roughly an upper bound on the length-scale we might
expect), and then sample from a truncated Gaussian distribution
with a mean equal to the range to get these parameters.
frequencies: the hardest part. I often think of the
frequency parameters in terms of periods (1/frequency). The
marginal likelihood will be most multimodal in these parameters.
However, our approach is simple: we initialize using a uniform
distribution from 0 to the
Nyquist
frequency. When the data are not
regularly spaced, the "Nyquist frequency" typically doesn't
exist. However, in these cases we would not be likely to
see a frequency bigger than Fs = 1/[smallest spacing between two
points in the data per dimension], as there would be no
justification for a period smaller than this spacing. This
parameter, Fs, turns out to be critical to finding a good
initialization.
For any given problem, you should put some thought into a
good setting for Fs, and try varying Fs to see how it
affects the results.
We can then evaluate the marginal likelihood a number of times
(e.g. 100) sampling from this initialization script each time,
and start optimizing the marginal likelihood with the
initialization that has the highest marginal likelihood [
this is what we do in the next section on
image inpainting]. Alternatively, we can try a
smaller number (e.g. 10) of short optimization runs, and then a
longer run, starting with the initialization that had the
highest marginal likelihood after the short runs. Try both
approaches and see which you prefer! In particular, in
testSMairline.m, try playing with the trade-off between the
parameters
numinit and
shortrun.
Remember
to use a different random seed for each run! (The default is
set to 42). Try to find a better
initialization by tuning the initialization routine to a setting
which has more consistently good results across different random
seeds.
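Putting Step 8 together, here is a stripped-down sketch of the whole sample-then-restart strategy (the real logic lives in initSMhypers.m and testSMairline.m; in particular, the hyperparameter ordering for {@covSM, Q} is assumed here to be [log(w); log(mu); log(sigma)] -- check the documentation in covSM.m for the exact convention):
Q  = 10;  x = x(:);  y = y(:);                  % 1-D training data assumed given
Fs = 1 / min(diff(sort(x)));                    % largest plausible frequency
range = max(x) - min(x);                        % rough upper bound on the length-scale
covfunc = {@covSM, Q};  likfunc = @likGauss;  meanfunc = {@meanZero};
numinit = 100;  bestnlml = Inf;
for i = 1:numinit
  w   = (std(y)/Q) * ones(Q,1);                 % weights: signal std split over components
  mu  = Fs * rand(Q,1);                         % frequencies: uniform on [0, Fs]
  ell = abs(range * (1 + randn(Q,1)));          % length-scales sampled around the data range
  hyp.cov  = log([w; mu; 1./ell]);              % treat sqrt(variances) as inverse length-scales
  hyp.lik  = log(std(y)/10);  hyp.mean = [];
  nlml = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y);
  if nlml < bestnlml, bestnlml = nlml; besthyp = hyp; end
end
besthyp = minimize(besthyp, @gp, -1000, @infExact, meanfunc, covfunc, likfunc, x, y);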
Step 9: In Step 8, we considered a simple initialization
strategy. It won't always work nicely, particularly if you
can't find a good setting for Fs. Take a look at
initSMhypersadvanced.m. Here we take a look at the
empirical spectral density (see
Chapter
2, page 62, of my thesis), which is proportional to
the squared magnitude of the FFT of the data. Recall in Step 6 we looked at
the log empirical spectral density (compared to the successful
learned spectral mixture kernel log spectral density) for the
airline dataset, by running
plotspectral.m.
Notice that all of the major peaks in the empirical spectrum
appear to be in the right places. We can use this property
to find a good initialization for the frequency parameters.
Step 10: The empirical spectral density is often too noisy to
use to directly create a good kernel function. However, it often
has its sharp peaks near the right frequencies. We can
treat the empirical spectral density as observed, and then
directly fit a mixture of Gaussians to this empirical spectral
density, using a standard Gaussian mixture density estimator (by
taking samples from the empirical inverse CDF of the spectral
density). We can then use the fitted parameters as an
initialization
for the spectral mixture kernel. We do this in the
initSMhypersadvanced.m.
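An illustrative sketch of this idea (the real routine is initSMhypersadvanced.m, which differs in its details); it assumes a 1-D, regularly spaced signal y, a sampling rate Fs, and the Statistics Toolbox for randsample and fitgmdist:
n    = numel(y);
P    = abs(fft(y - mean(y))).^2 / n;            % periodogram = empirical spectral density
freq = (0:n-1)' / n * Fs;
half = 2:floor(n/2);                            % non-redundant positive frequencies
s    = randsample(freq(half), 1e4, true, P(half)) ...
       + 1e-3*Fs*randn(1e4, 1);                 % frequencies drawn from the (jittered) empirical density
gmm  = fitgmdist(s, Q);                         % fit a Q-component Gaussian mixture
mu0  = gmm.mu;                                  % initial SM frequencies
v0   = squeeze(gmm.Sigma);                      % initial SM variances
w0   = gmm.ComponentProportion(:) * var(y);     % initial SM weights, scaled by the signal variance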
For variety, try running
testSMCO2_adv.m. (try running a
few times to see any differences in results). I've
included the output run shown here in
CO2run.mat.
Step 11: In addition to fitting the empirical spectral density,
script
testSMCO2_adv.m starts the initialization with a
stochastic mini-batch approach (using gradient descent with
momentum). The marginal likelihood of a GP does not factorize,
so stochastic gradient descent will not be provably
convergent. However, SGD is a reasonable way to initialize
the full batch optimization procedure for spectral mixture
kernels. The idea is that the bad local optima for
frequencies will be in different places for different
mini-batches, but the good global optima will be in similar
places in each batch.
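A hedged sketch of what such a warm-up loop looks like (the actual routine is in testSMCO2_adv.m; the variables x, y, hyp, meanfunc, covfunc, likfunc are assumed to be set up as in the earlier demos, and the step size, momentum, and batch size here are illustrative):
rate = 1e-3;  mom = 0.9;  B = 200;  vel = 0*hyp.cov;
for iter = 1:500
  idx = randperm(numel(y), B);                  % a fresh random mini-batch each step
  [~, dnlZ] = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x(idx,:), y(idx));
  vel = mom*vel - rate*dnlZ.cov;                % momentum update on the kernel hyperparameters
  hyp.cov = hyp.cov + vel;
end
% then switch to full-batch optimization, e.g. hyp = minimize(hyp, @gp, -1000, @infExact, ...)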
Step 12: Try the included audio dataset, and look at an
initialization that works well for these data. In general,
try writing your own initialization scripts. There are
many ways to initialize the spectral mixture kernel, and I
haven't found a magic bullet (but you might!). There are
some more advanced ideas
in
our work on change surfaces. I have found
initialization gets easier with more data. If you are
interested in using the SM kernel, I highly recommend following
the next tutorial, which considers new
initialization scripts for more data and multidimensional
inputs.
Kronecker
Inference with Spectral Mixture Kernels
When inputs are on a multidimensional grid
(e.g., a Cartesian product across grid dimensions), and
the kernel decomposes as a product across grid
dimensions, then the data covariance matrix decomposes
as a Kronecker product of kernels operating on each
individual grid dimension.
Consider, for example, modelling a 1000 x 1000 pixel
image, using a GP with an RBF kernel. The inputs
are pixel locations (in R^2), and the outputs are pixel
intensities. There are therefore 1 million
datapoints, and the corresponding 10^6 x 10^6 covariance
matrix could not normally even be represented on a
computer. However, in this case, the inputs are on
a fully connected two dimensional grid: every horizontal
pixel coordinate has an associated vertical pixel
coordinate. Moreover, the RBF (or, more generally,
ARD) kernel decomposes as a product of RBF kernels
acting on each input dimension. This decomposition
corresponds to a soft a priori assumption of
independence across input dimensions, but it does not
rule out posterior correlations, and in fact, provides a
useful inductive bias.
Therefore the 10^6 x 10^6 covariance matrix will
decompose into a Kronecker product of two 10^3 x 10^3
covariance matrices, enabling exact inference
and learning at the computational cost of working with
about 10^3 datapoints!
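The computational savings come from never forming the full matrix: a matrix-vector product with kron(K1, K2) can be computed through the identity kron(K1, K2) * vec(X) = vec(K2 * X * K1'). A small sketch (illustrative sizes and kernel):
n1 = 1000;  n2 = 1000;
x1 = linspace(0, 1, n1)';  x2 = linspace(0, 1, n2)';
k  = @(a, b) exp(-0.5 * bsxfun(@minus, a, b').^2 / 0.1^2);   % RBF on a single dimension
K1 = k(x1, x1);  K2 = k(x2, x2);                             % two 10^3 x 10^3 factors
v  = randn(n1*n2, 1);                                        % a vector with 10^6 entries
Kv = reshape(K2 * reshape(v, n2, n1) * K1', [], 1);          % = kron(K1, K2)*v, without the 10^6 x 10^6 matrix
Solves and log determinants then follow from the eigendecompositions of the small factors, which is what the grid inference code below exploits.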
In this section, we will demo a method which allows
Kronecker inference even in the case of partial grid
structure: suppose we have missing pixels in an image,
or water covering up part of a map in a spatiotemporal
statistics problem. Moreover, we will combine this
scalable and exact inference with spectral mixture
kernels, with demos on image inpainting problems.
The main reference for this section is:
Fast kernel learning for
multidimensional pattern extrapolation
Andrew Gordon Wilson*, Elad Gilboa*, Arye Nehorai,
and John P. Cunningham
Advances in Neural Information Processing Systems (NIPS),
2014
[PDF,
BibTeX,
Code,
Slides]
Other references which use this methodology include:
Covariance
kernels for fast automatic pattern discovery and
extrapolation with Gaussian processes
Andrew Gordon Wilson
PhD Thesis, January 2014
[PDF,
BibTeX]
Fast
kronecker inference in Gaussian processes with
non-Gaussian likelihoods
Seth Flaxman, Andrew Gordon Wilson, Daniel Neill,
Hannes Nickisch, and Alexander J. Smola
International Conference on Machine Learning (ICML),
2015
[PDF,
Supplement,
Code]
Scalable
Gaussian Processes for Characterizing Multidimensional
Change Surfaces
William Herlands, Andrew Gordon Wilson, Seth Flaxman,
Daniel Neill, Wilbert van Panhuis, and Eric P. Xing
Artificial Intelligence and Statistics
(AISTATS), 2016
[PDF,
BibTeX]
Video Resource:
Scalable
Gaussian Processes for Scientific Discovery (EPFL,
February 2016)
Audio Resource:
Abstract,
Slides, and Audio, for my November 2014 Oxford lecture
"Kernel Methods for Large Scale Representation Learning"
Download files and add to
path
1) Download
GPML v3.5 and add the files to the Matlab (or
Octave) path.
2) For the tutorials, the only required external
files are 'vincent.m', 'spectral_init.m',
'test_covGrid.m', 'test_infGrid.m', and the
datasets. 'covSMfast.m' is an optional faster
implementation of covSM. All
tutorial files are available in this zip file.
Unzip these tutorial files into one directory.
Example 1: Try running
'test_covGrid.m' and 'test_infGrid.m' . These
show the accuracy of the grid methods compared to
standard methods involving Cholesky decompositions.
covGrid is used to generate a covariance matrix with
Kronecker product structure, which can be exploited
by infGrid. This test shows the
generated covariance matrix is identical to what
would be obtained using standard methods.
infGrid exploits Kronecker structure for scalable
inference and learning. In the case where the
inputs are on a grid, with no missing inputs, the
inference and learning is also exact. In this
demo, the data only has partial grid
structure. Every second data point is missing,
as indicated by idx = (1:2:N)'. The idx
variable contains the locations of the training
points. We use the extensions in Wilson
et al. (2014) to handle partial grids
efficiently. With partial grids, inference
requires linear conjugate gradients (LCG).
With a tight tolerance (e.g. 1e-4 or smaller) and a large maximum
number of iterations (e.g. more than 400), inference is exact to within
numerical precision. Typically, a tolerance of
1e-2 and maximum iterations of 200 is sufficient.
With partial grids, the log determinant of the
covariance matrix, used for marginal likelihood
evaluations, undergoes a small approximation.
Exercise 1: Try adjusting opt.cg_maxit and
opt.cg_tol, and observe the effect on inference (a sketch of how
to pass these options appears after Exercise 3).
Exercise 2: Try changing idx
to idx = (1:1:N)' and see what
happens.
Exercise 3: Cycle between the three test
cases, cs = 1, 2, 3.
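Returning to Exercise 1: the options can be passed to infGrid in the same way that the Laplace demo further down passes them to infGrid_Laplace. A minimal sketch (the variables hyp, meanfunc, covfunc, likfunc, idx, y are assumed from test_infGrid.m):
opt.cg_maxit = 400;                              % maximum number of LCG iterations
opt.cg_tol   = 1e-4;                             % LCG convergence tolerance
inf_method   = @(varargin) infGrid(varargin{:}, opt);
nlZ = gp(hyp, inf_method, meanfunc, covfunc, likfunc, idx, y);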
Example 2: Try running 'vincent.m'. You
will be given a choice of patterns, and the code
will extrapolate these patterns automatically.
This script reconstructs some of the results in
Fast
Kernel Learning for Multidimensional Pattern
Extrapolation.
Do not worry if there are occasional warnings
that LCG has not converged. The optimization
will proceed normally. The warnings can be
resolved by adjusting the tolerance and maximum
number of iterations. We have set these values
for a good balance of accuracy and efficiency, not
minding if LCG occasionally does not converge to
within the specified precision.
Some example output:
Note the learned spectral mixture kernels. It
would be difficult to know how to hard-code kernels
like these a priori. The learned structure is
interpretable and enables good predictions and
extrapolation.
Exercise 1: Extrapolations are produced
with spectral mixture kernels and SE
kernels. Try adding a Matern kernel to this
demo. It will do better on the letters
example than an SE kernel, but not as well as the
spectral mixture kernel on larger missing regions.
Exercise 2: Study the initialisation script
spectral_init.m. This is a basic and
robust initialisation for spectral mixture
kernels. We sample initial parameters from
several basic distributions multiple times, and
then start the optimization run from the
initialization with the highest marginal
likelihood. One of the critical parameters
here is Fs, which is used to bracket the uniform
distribution on frequencies used to sample initial
frequency parameters. This parameter must be
adjusted intelligently to the problem at hand, as
described in the previous tutorial. Also
note that we use a product of 1D spectral mixture
kernels (covSMfast.m), and initialize parameters
separately from each input dimension. Can
you think of other initialisations which might
offer even better performance? For an in
depth initialisation procedure, see Scalable
Gaussian Processes for Characterizing
Multidimensional Change Surfaces.
Exercise 3: Reproduce Figure 1j in
Wilson et al., NIPS 2014, showing the initial and
learned weights against the initial and learned
frequencies.
Exercise 4: Adjust this script to your own
patterns and general purpose regression problems,
such as the land surface temperature forecasting
in Wilson et al., NIPS 2014.
Kronecker
Inference with non-Gaussian Likelihoods
When we use a non-Gaussian observation model (for
example, a Student-t noise model, or Poisson process
with a GP intensity, or classification), then GP
inference becomes approximate. Popular approaches
to approximate GP inference include MCMC (such as
Elliptical Slice Sampling), and deterministic
approximations (Laplace, EP, Variational Methods).
In these settings, we are no longer simply working with the
Kronecker decomposition of (K + sigma^2 I), but rather
typically encounter (K + A), where A is a diagonal
matrix with non-constant entries, which cannot undergo a
straightforward eigendecomposition via Kronecker algebra.
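Matrix-vector products with (K + A) nevertheless remain fast, so the linear systems can still be solved with conjugate gradients; a minimal sketch, assuming K = kron(K1, K2) and A = diag(a), with the log-determinant term handled separately (via a bound) as in the paper below:
n1  = size(K1, 1);  n2 = size(K2, 1);
mvm = @(v) reshape(K2 * reshape(v, n2, n1) * K1', [], 1) + a .* v;   % fast product with (K + A)
alpha = pcg(mvm, b, 1e-2, 200);                  % tolerance and iteration cap as in the demos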
We show how to approach these linear systems for scalable
inference and learning by exploiting the Kronecker structure in
K:
Fast
kronecker inference in Gaussian processes with
non-Gaussian likelihoods
Seth Flaxman, Andrew Gordon Wilson, Daniel
Neill, Hannes Nickisch, and Alexander J. Smola
International Conference on
Machine Learning (ICML), 2015
[
PDF,
Supplement,
Video Lecture, BibTeX]
This paper also uses
spectral mixture
kernels, to make long range crime rate forecasts, as part
of a Poisson process (and negative Binomial process) observation
model, where the GP is modelling the rate function. The
spectral mixture kernels are able to automatically discover
interpretable structure in the data, which helps inform new
policy decisions, as well as make crime predictions on
unprecedented time scales.
Here we include a simple demo on Hickory data.
Files: infGrid_Laplace.m, Hickory data
Author: Seth Flaxman, 4 March 2015
clear all, close all
xy = csvread('hickories.csv');                  % hickory tree locations in the unit square
plot(xy(:,1),xy(:,2),'.')
n1 = 32; n2 = 32;                               % grid resolution
x1 = linspace(0,1.0001,n1+1); x2 = linspace(-0.0001,1,n2+1);
xg = {(x1(1:n1)+x1(2:n1+1))'/2, (x2(1:n2)+x2(2:n2+1))'/2};   % per-dimension grid midpoints
i = ceil((xy(:,1)-x1(1))/(x1(2)-x1(1)));
j = ceil((xy(:,2)-x2(1))/(x2(2)-x2(1)));
counts = full(sparse(i,j,1));                   % bin the points into a 32 x 32 array of counts
cv = {@covProd,{{@covMask,{1,@covSEiso}},{@covMask,{2,@covSEisoU}}}};   % standard product kernel
cvGrid = {@covGrid, { @covSEiso, @covSEisoU}, xg};                      % same kernel, Kronecker (grid) form
hyp0.cov = log([.1 1 .1]);                      % [log ell1, log sf1, log ell2]
y = counts(:);
X = covGrid('expand', cvGrid{3});               % expand the grid into an N x 2 matrix of inputs
Idx = (1:length(y))';                           % indices of the training points on the grid
lik = {@likPoisson, 'exp'};                     % Poisson observation model with exp link
hyp0.mean = .5;
Laplace approximation without grid structure
tic
hyp1 = minimize(hyp0, @gp, -100, @infLaplace, @meanConst, cv, lik, X, y);
toc
exp(hyp1.cov)
Function evaluation 0; Value 1.114380e+03
Function evaluation 5; Value 1.104834e+03
Function evaluation 8; Value 1.098415e+03
Function evaluation 12; Value 1.097589e+03
Function evaluation 16; Value 1.097445e+03
Function evaluation 18; Value 1.097417e+03
Function evaluation 23; Value 1.096718e+03
Function evaluation 26; Value 1.096703e+03
Function evaluation 30; Value 1.096424e+03
Function evaluation 33; Value 1.096419e+03
Function evaluation 35; Value 1.096419e+03
Function evaluation 37; Value 1.096419e+03
Function evaluation 46; Value 1.096419e+03
Elapsed time is 300.589011 seconds.
ans =
0.0815 0.6522 0.0878
Laplace approximation with grid structure and
Kronecker inference
tic
hyp2 = minimize(hyp0, @gp, -100, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y);
toc
exp(hyp2.cov)
Function evaluation 0; Value 1.151583e+03
Function evaluation 12; Value 1.124633e+03
Function evaluation 14; Value 1.116309e+03
Function evaluation 17; Value 1.115676e+03
Function evaluation 20; Value 1.115373e+03
Function evaluation 23; Value 1.115308e+03
Elapsed time is 41.778548 seconds.
ans =
0.2346 0.5824 0.1100
Model comparison
The models learned slightly different hyperparameters. Are
the likelihoods different?
fprintf('Log-likelihood learned w/out Kronecker: %.04f\n', gp(hyp1, @infLaplace, @meanConst, cv, lik, X, y));
fprintf('Log-likelihood learned with Kronecker: %.04f\n', gp(hyp2, @infLaplace, @meanConst, cv, lik, X, y));
fprintf('Log-likelihood with Kronecker + Fiedler bound: %.04f\n', gp(hyp2, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y));
[Ef1 Varf1 fmu1 fs1 ll1] = gp(hyp1, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y, Idx, y);
[Ef2 Varf2 fmu2 fs2 ll2] = gp(hyp2, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y, Idx, y);
figure(1); imagesc([0 1], [0 1], (reshape(Ef1,32,32)')); set(gca,'YDir','normal'); colorbar;
figure(2); imagesc([0 1], [0 1], (reshape(Ef2,32,32)')); set(gca,'YDir','normal'); colorbar;
Log-likelihood learned w/out Kronecker: 1096.4188
Log-likelihood learned with Kronecker: 1100.3819
Log-likelihood with Kronecker + Fiedler bound: 1115.3084
Kronecker + grid enables a finer resolution
Now let's exploit the grid structure to get a finer-resolution
map. We'll also switch covariances, just for fun.
hyp.cov = log([.3 1 .2 1]);
lik = {@likPoisson, 'exp'}; hyp.lik = [];
mean = {@meanConst}; hyp.mean = .5;
n1 = 64; n2 = 64;
x1 = linspace(0,1.0001,n1+1); x2 = linspace(-0.0001,1,n2+1);
xg = {(x1(1:n1)+x1(2:n1+1))'/2, (x2(1:n2)+x2(2:n2+1))'/2};
i = ceil((xy(:,1)-x1(1))/(x1(2)-x1(1)));
j = ceil((xy(:,2)-x2(1))/(x2(2)-x2(1)));
counts = full(sparse(i,j,1));
y = counts(:);
xi = (1:numel(y))';
cov = {@covGrid, { {@covMaterniso,5}, {@covMaterniso,5}}, xg};
opt.cg_maxit = 400; opt.cg_tol = 1e-2;
inf_method = @(varargin) infGrid_Laplace(varargin{:},opt);
tic
hyp2 = minimize(hyp, @gp, -100, inf_method, mean, cov, lik, xi, y(:));
toc
exp(hyp2.cov)
[post nlZ dnlZ] = infGrid_Laplace(hyp2, mean, cov, lik, xi, y(:));
post.L = @(x) 0*x;
ymu = gp(hyp2, @infGrid_Laplace, mean, cov, lik, xi, post, xi);
figure(3), imagesc([0 1], [0 1], reshape(ymu,size(counts))'),
hold on;
set(gca,'YDir','normal');
plot(xy(:,1),xy(:,2),'w.');
colorbar
hold off
Function evaluation 0; Value 1.954834e+03
Function evaluation 6; Value 1.946841e+03
Function evaluation 8; Value 1.940877e+03
Function evaluation 11; Value 1.940012e+03
Function evaluation 14; Value 1.938218e+03
Function evaluation 17; Value 1.936884e+03
Function evaluation 20; Value 1.934900e+03
Function evaluation 22; Value 1.933195e+03
Elapsed time is 187.843026 seconds.
ans =
0.3051 0.7512 0.1636 0.7512
Published
with MATLAB® R2013a
Kernel
Interpolation for Scalable Structured Gaussian Processes
(KISS-GP)
Structure-exploiting approaches (such as Kronecker and
Toeplitz methods) are extremely scalable and
exact, but require
grid structure.
these approaches to account for partial grids, such as images
with missing regions, or spatiotemporal statistics problems with
some non-grid data due to lakes, political boundaries,
etc. However, data are often arbitrarily indexed, without
any type of grid structure, and thus these structure exploiting
methods are not generally applicable.
Conversely,
inducing
point methods are popular for their general purpose "out
of the box" applicability, without requiring any special
structure in the data. However, these methods are limited by
requiring a small number (
m << n) of
inducing inputs, which can cause a deterioration in predictive
performance, and an
inability
to perform expressive kernel learning.
With KISS-GP, we combine the scalability and accuracy of
structure-exploiting methods with the general applicability of
inducing point approaches. These approaches are O(n + g(m)) for
inference and learning, and O(1) per test prediction, where m is
the number of inducing points and g(m) ~ m. The ability
to use a large number of inducing points
m with little
computational expense enables near-exact GP inference, kernel
learning, and predictions on large datasets (with up to billions
of datapoints).
The basic idea behind KISS-GP is to create a multidimensional
grid of latent inducing points, and exploit the resulting
Kronecker, Toeplitz, and Circulant structure in the resulting
covariance matrix evaluated over all pairs of these inducing
points. We then perform local kernel interpolation to go
back into the observed input space, which may have arbitrary
structure. We can view this approach as creating an
approximate kernel, given a user-specified kernel, which admits fast
numerical linear algebra operations for inference and learning
with Gaussian processes. Because KISS-GP can use a very large
number of inducing points, this approximation is often exact
to within numerical precision.
In short, given a user-specified kernel
k(x,z),
we introduce the approximate kernel

k_SKI(x, z) = w_x' K_{U,U} w_z,

where K_{U,U} is the user-specified kernel evaluated on a set of
m virtual
inducing points U, located to create structure which allows
K_{U,U} to decompose as a Kronecker product of Toeplitz
matrices, and
w_x and
w_z are sparse local
interpolation vectors. This approximate kernel admits fast
numerical linear algebra operations for inference and learning
with Gaussian processes.
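To make the interpolation concrete, here is a 1-D illustration with linear interpolation (covGrid.m and infGrid.m use local cubic interpolation together with Toeplitz/Kronecker algebra, so treat this only as a sketch; x denotes the n x 1 training inputs and the base kernel here is an RBF):
n    = numel(x);  m = 100;  ell = 0.5;
kfun = @(a, b) exp(-0.5 * bsxfun(@minus, a, b').^2 / ell^2);   % user-specified kernel k(x,z)
U    = linspace(min(x), max(x), m)';                           % regular grid of inducing points
h    = U(2) - U(1);
W    = sparse(n, m);                                           % rows are the sparse vectors w_x
for i = 1:n
  j = min(max(floor((x(i) - U(1))/h) + 1, 1), m - 1);          % left grid neighbour
  t = (x(i) - U(j)) / h;                                       % fractional position within the cell
  W(i, j) = 1 - t;  W(i, j+1) = t;                             % two nonzeros per row
end
Kuu   = kfun(U, U);                                            % m x m, Toeplitz for a regular grid
K_ski = W * Kuu * W';                                          % approximates K(x,x); in practice only fast
                                                               % matrix-vector products with this form are used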
The full details are described in these
Main references:
Kernel interpolation for scalable structured Gaussian
processes (KISS-GP)
Andrew Gordon Wilson and Hannes Nickisch
International Conference on Machine Learning (ICML),
2015
Oral Presentation
[
PDF,
Supplement,
arXiv,
BibTeX,
Theme Song,
Video
Lecture]
Thoughts on Massively Scalable Gaussian Processes
Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch
arXiv pre-print, 2015
(See
KISS-GP
and
Deep Kernel
Learning for more empirical demonstrations).
[
arXiv,
PDF,
BibTeX,
Music]
Video Lecture:
KISS-GP
Introduction at ICML 2015
Tutorial:
Note: KISS-GP has been added to the new
GPML v4.0! To see a new demo of KISS-GP in the
new GPML, demonstrating much of the functionality, download
GPML
v4.0 and follow demoGrid1d.m and demogrid2d.m.
For more experience and understanding of KISS-GP, you can
follow the tutorial below, which relies on an older version
of GPML that uses a different interface. Make sure
that the correct files are in your Matlab path for each
demo.
Step 1: Download
kissgpdemo.zip
Step 2: Add all folders to your path.
The key files are covGrid.m and infGrid.m.
Read the documentation in these files closely: it will tell you
how to
switch on/off Toeplitz structure, Circulant structure, BTTB
structure, Kronecker structure, etc.,
and how to switch interpolation strategies.
Step 3: Run all the files in /test
Step 4: Run the files in /examples. In
tea_hn_agw.m, vary:
n: the number of training points
ng:
the number of grid (inducing) points
ns: the number of test points
In constructing the grid, pay close attention to whether single
braces { or double braces {{ are used.
Double braces enable the BTTB functionality.
Step 5: Run the supervised projection example in experiment_2d.m.
Pay close attention to the helper function
covGrid('create',...)
in both
covGrid.m, but also in how it is called in
proj_gauss.m.
This helper function is extremely valuable for creating grids of
inducing points based on standard input data
x in
n
x
D format, where
n is the number of input
points, and
D is the number of input dimensions.
This helper function can specify whether we want a uniform grid,
how many grid points we want, the boundaries of the grid, and
whether we use, for instance,
k-means, to find locations
for the grid points (in this case we could just exploit
Kronecker structure but not Toeplitz structure).
For any problem with greater than 1D inputs, I would highly
recommend using covGrid('create'), and getting practice with its
use.
One of the defining features of KISS-GP is performing local
kernel interpolation. You can see the routines for local
kernel interpolation, including multidimensional interpolation,
in covGrid.m. However, to get an intuition
of how we use local kernel interpolation to create a fast
approximate kernel, given a user specified kernel, take a look
at the stripped down 1D demo
fitc_grid_poly.m,
with associated interpolation script
intermat.m.
For higher dimensional inputs, and KISS-GP with kernel learning,
see
Deep Kernel Learning. This is the code I
would recommend using for good performance on multidimensional
input problems, with a reasonable number
(e.g. n > 1000)
datapoints.
KISS-GP allows us to naturally create and
exploit algebraic structure for highly scalable
inference and learning, without requiring input data
on a grid. However, to exploit Kronecker
structure, the inducing points in KISS-GP must live
in a low dimensional space (e.g. D<5). The
paper "Thoughts on massively scalable Gaussian
processes" discusses how to learn supervised
projections of the data inputs into this low
dimensional space, focusing on linear projections.
We can also use very flexible non-linear projections
of the kernel inputs, such as deep feedforward or
convolutional networks. Note that in
regression, for example, deep feed-forward neural
networks will nearly always project into a one
dimensional output space.
By learning a deep transformation of the kernel
inputs, we learn a very flexible (and
interpretable) deep kernel, with all
the structural properties and inductive biases of
deep learning models, while retaining the
non-parametric flexibility and probabilistic
representation of a Gaussian process.
Moreover, these supervised deep projections bring us
into a low dimensional space where we can apply the
scalability ideas of KISS-GP, for O(n) inference and
O(1) predictions.
We test deep kernel learning on a wide range of
regression problems (the entire UCI repository),
with a variety of different input dimensionalities,
data sizes, and properties, showing the general
applicability of this approach. We also
combine deep kernel learning with spectral mixture kernels.
The basic idea is to start with a base kernel k(x, x' | theta), with
parameters theta, and to transform its inputs with a non-linear
function g with its own parameters w:

k_deep(x, x' | theta, w) = k( g(x, w), g(x', w) | theta ).

We take g to be given by a deep
architecture.
We then replace the exact deep kernel with its
approximate KISS-GP
representation, and learn
both the parameters theta and w jointly using
the marginal likelihood of the Gaussian process.
We often find using spectral mixture base kernels
provides a good boost in performance.
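As a toy illustration of the construction (the released code uses a Caffe network for g and the KISS-GP approximation for scalability; W1, b1, W2, b2 are hypothetical network weights and covSEard is used as the base kernel here):
g     = @(X) bsxfun(@plus, max(bsxfun(@plus, X*W1, b1), 0) * W2, b2);   % feed-forward feature map g(x, w)
kdeep = @(Xa, Xb, hypcov) feval(@covSEard, hypcov, g(Xa), g(Xb));       % base kernel applied to the features
Both the network weights w = {W1, b1, W2, b2} and the base-kernel hyperparameters are then learned jointly by maximizing the Gaussian process marginal likelihood.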
Main references:
Deep Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*, Ruslan
Salakhutdinov, and Eric P. Xing
Artificial Intelligence and Statistics
(AISTATS), 2016
[PDF,
arXiv,
BibTeX]
Thoughts on Massively Scalable Gaussian Processes
Andrew Gordon Wilson, Christoph Dann, and Hannes
Nickisch
arXiv pre-print, 2015
(See KISS-GP
and Deep
Kernel Learning for more empirical
demonstrations).
[arXiv,
PDF,
BibTeX,
Music]
Here we include the code to implement Deep Kernel
Learning, and to reproduce many of the results
in
the paper. It requires GPML 3.6 and
Caffe as dependencies. The GPML version used
is included in the zip. Note that this code
is based on an older version of
covGrid and infGrid for KISS-GP, which includes
flags for circulant embeddings.
You
will need to use these versions in your
path instead of the newer versions for
the deep kernel learning code to
run. Please also pay close
attention to how the spectral mixture
base kernels are initialized:
initialization is fundamental to good
performance for these base kernels.
We provide 3 examples (under examples/):
- UCI Kin40K regression (section 5.1 of the paper)
- MNIST
digit magnitude extraction (section
5.3)
- Olivetti
face orientation extraction (section
5.2)
The modified caffe library is under caffe/. Please refer to the
installation instructions for prerequisites and configurations.
Run the following commands for installation:
We take the Olivetti face task as an example.
cd caffe
unzip examples/face/data/face_images.zip
sh examples/face/create_face.sh # create databases for caffe CNN training and DKL training
Databases will be created under caffe/examples/face/data.
Then, under caffe/, run:
sh examples/face/train_face.sh
Please
refer to the tutorial for
CNN configurations, etc.
After training you will see the trained CNN model
caffe/examples/face/face_iter_6000.caffemodel.
We have also provided a pre-trained model
caffe/examples/face/face_iter_6000.caffemodel.example,
so that you can skip this step and go straight to DKL model
training.
The DKL code is under examples/face, where dkl_rbf_face.m uses
the RBF kernel as the base kernel, and dkl_sm_face.m uses the
spectral mixture kernel as the base kernel. Training parameters
are configured at the beginning of the respective files,
including the pre-trained CNN model for initialization, the
number of training iterations, the learning rate, and the grid
size.
Under examples/face, run the models in Matlab:
run ../../gpml/startup.m % setup gpml
run dkl_rbf_face.m % or dkl_sm_face.m
- For preparing the MNIST dataset, run
cd caffe
sh examples/mnist/data/get_mnist.sh
sh examples/mnist/create_mnist.sh
- For preparing the Kin40k dataset, simply run
sh examples/kin40k/create_kin40k.sh
under caffe/
- Use a GPU for the DNN part to speed up training
Stochastic Variational
Deep Kernel Learning
We may want to use Gaussian processes and KISS-GP
with stochastic gradient descent and mini-batches,
for further scalability. However, the Gaussian
process marginal likelihood does not
factorize. Here we introduce stochastic
variational deep kernel learning,
which allows for (1) stochastic gradient mini-batch
training in combination with KISS-GP scalability,
and (2) deep kernel learning on non-Gaussian
likelihood problems and classification. We
apply this approach to a variety of benchmark
classification problems, including a large airline
passenger dataset, CIFAR and ImageNet.
Main reference:
Stochastic Variational Deep Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*, Ruslan
Salakhutdinov, Eric P. Xing
To appear in Neural Information Processing
Systems (NIPS), 2016
[arXiv,
PDF]
Here we provide
code to reproduce results in the paper.
We focus on classification problems.
The
GP-LSTM
Many applications in speech, robotics, finance, and biology
deal with sequential data, where ordering matters and
recurrent structures are common. However, this structure
cannot be easily captured by standard kernel functions. To
model such structure, we propose expressive closed-form
kernel functions for Gaussian processes. The resulting
model, GP-LSTM, fully encapsulates the inductive biases of
long short-term memory (LSTM) recurrent networks, while
retaining the non-parametric probabilistic advantages of
Gaussian processes. We learn the properties of the proposed
kernels by optimizing the Gaussian process marginal
likelihood using a new provably convergent semi-stochastic
procedure and exploit the structure of these kernels for
fast and scalable training and prediction. We demonstrate
state-of-the-art performance on several benchmarks, and
investigate a consequential autonomous driving application,
where the predictive uncertainties provided by GP-LSTM are
uniquely valuable.
For scalability, we introduce a new provably convergent
semi-stochastic training procedure for kernel learning via
GP marginal likelihoods, and combine this procedure with
KISS-GP.
Learning Scalable Deep Kernels with Recurrent
Structure
Maruan Al-Shedivat, Andrew Gordon Wilson,
Yunus Saatchi, Zhiting Hu, Eric P. Xing
arXiv
pre-print
The codebase
contains tutorial examples. It also provides
an alternative implementation of Deep
Kernel Learning using Keras with Theano or
Tensorflow, and shows how to call KISS-GP
functionality from Python. However, it is
still in alpha testing, and may not be as stable as
the Caffe based Deep Kernel Learning code.
In several papers, we have compared against all UCI regression
datasets:
À la carte - learning fast kernels
Zichao Yang, Alexander J. Smola, Le Song, and
Andrew Gordon Wilson
Artificial Intelligence and Statistics (AISTATS),
2015
Oral Presentation
[PDF, BibTeX]
Deep
Kernel Learning
Andrew Gordon Wilson*, Zhiting Hu*,
Ruslan Salakhutdinov, and Eric P. Xing
Artificial Intelligence and
Statistics (AISTATS), 2016
[PDF,
arXiv,
BibTeX]
We
provide these data here (kindly compiled by
Zichao Yang) with the normalization and
pre-processing that we used, so that you have a way
to test whether this code is working for you: can
you reproduce the results in Table 1 of Deep Kernel
Learning?
Be sure to run the included pre-processing
scripts on the raw data and verify some of the
results!
These data also provide an easy way to benchmark
your own methods on all of these problems. If
you use the data for this purpose, please provide a
reference to this webpage. Since the code for
many of the methods we used is available (e.g.
DKL), please verify some of these results using the
code here before taking the results in the table for
granted in your own comparison. You may, for
instance, have forgotten to run the pre-processing
or normalization routines, or the pre-processing may
have changed between posting these data and the
writing of those papers. You may also not be
applying the code on this page correctly on other
problems, so testing on these data provides a sanity
check.
Many thanks to my wonderful collaborators. Particular
thanks to Hannes Nickisch, Zhiting Hu, and Maruan Al-Shedivat,
for their contributions to the implementations of KISS-GP, Deep
Kernel Learning, and the GP-LSTM, respectively.