Bayesian Machine Learning
Andrew Gordon Wilson
The project is an opportunity to become involved in machine
learning research. It is an opportunity to be
creative about solving the problems that you find most
interesting. Ideally the process will involve a lot of
fun, hard work, independence, frustration, self-motivation, and
ultimately, a great sense of achievement. The project is
far and away the largest component of the course
evaluation. Therefore it is in your interest to do it really well.
You can work in groups of three. If you are
looking for a partner, please post on our Piazza forum, or
e-mail me. Smaller groups are acceptable, but I encourage
you to have two teammates. Part of the point is to work
together in a team that is greater than the sum of its
parts. For that reason, I recommend you find partners who
complement your abilities. If you have a team of
three similar people, you may get along very well, but will also
be weak in the same respects, and it could be boring.
At the end of this page, I have listed some ideas that could
turn into great projects. You are also welcome to propose your
own ideas. Discuss the project with the TAs or me. It's a good idea
to get some feedback on the proposal before spending a month
working on it.
The final report should involve some novel methodological
ideas and/or a strong application with an interesting
model. At minimum, there will need to be a clearly
defined idea, and some novel meaningful exploration of that
idea, either theoretical or experimental.
The project has five parts:
Proposal: November 5
Midway Report: November 27
Presentations: December 2-4
Reviews Due: December 7
Final Report: December 17
Reviews Due: December 20
Reviews: You will each review 2 midway reports and 2 final reports, and the presentations. Details about reviews will be sent in a separate post.
The project is worth 55% of your course grade. The breakdown of the project grade is as follows:
Midway Report: 20/100
Peer Reviews: 20/100
Final Report: 35/100
Each part is described in more detail as follows:
The proposal should be one to two pages in ICML
format. It should contain the title, the group members'
names, email addresses, and net IDs. You must clearly
describe your idea and an organized outline of all that is to be
completed with a rough timeline for completion. It is fine
to build on prior work, but you must clearly state what part of
your project is novel. Novelty can take many forms:
methodological innovation, application, new theoretical
insights, etc. Describe what you would ideally like to
complete for a perfect project, and what at minimum you could
complete and still consider your project a success.
Include at least 3 relevant papers each team member has read at
the time of proposal submission, and a dataset you will use in
your project. Also include a midway milestone: describe
what you will complete by the midway report. Describe
what each teammate will be doing in the project. When writing
the proposal, it may help to answer the following questions (the
Heilmeier catechism):
- What are you trying to do? Articulate your objectives
using absolutely no jargon. What is the problem? Why
is it hard?
- How is it done today, and what are the limits of current practice?
- What's new in your approach and why do you think it will be successful?
- Who cares?
- If you're successful, what difference will it make? What
impact will success have? How will it be measured?
- What are the risks and the payoffs?
- How much will it cost?
- How long will it take?
- What are the midterm and final criteria for success? How
will progress be measured?
These criteria are amazingly useful in making sure that you
won't try the impossible, or solve a really boring problem.
The midway report should be a 4-5 page document, which serves as a
checkpoint. It should have roughly the same sections as
your final report (introduction, related work, methodology,
experiments, discussion), where some sections or parts of
sections may be in progress. The introduction and related
work sections should be well developed.
The grading scheme for the midway report is: 50% for the
method, 40% for the design of experiments and current progress,
10% for the planned activities.
All project teams are to present their work at the
end of the semester. Live demonstrations of your
software are highly encouraged. These will be like
machine learning conference presentations. To get an
idea of the style, see the ICML
videos. However, there should be
more introductory content, clearly defining the context of the
problem and the scope of the project. Each presentation
should be 18 minutes long, plus 2 minutes for Q&A, equally divided
across group members.
- All students should attend all presentations.
- Each member in each team will be evaluated separately.
(Members of the same team can get different scores for the final
presentation.)
- More details to follow.
Your final report is expected to be 8
pages, with optional extra space in appendices for
proofs, supplemental results, etc. You should
submit both an electronic and a hardcopy version of
the report, with the sections (introduction,
related work, methodology, experiments, discussion)
defined in your report.
Project ideas (please see the updated list on Piazza):
Here are some high level
ideas that can get you started. If you
would like to follow up on these ideas, please let me know.
- The spectral mixture kernel forms an expressive
basis for all stationary kernels. But it is
fundamentally a parametric object, and it may be
hard to control its inductive biases. Instead
imagine using a (transformed) Gaussian process with
a standard kernel (e.g. RBF) to model the spectral
density. The resulting model induces a
distribution over all stationary kernels and can
have a mean kernel centred at some reasonable
choice, such as the RBF kernel. In effect, we
turn interpolation in the frequency domain into
extrapolation in the original input space.
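A minimal numpy sketch of this idea (the grid sizes, mean function, and
all other choices here are illustrative assumptions): place a GP prior
on the log spectral density over a frequency grid, then map each sample
to a stationary kernel via Bochner's theorem.

    import numpy as np

    freqs = np.linspace(0.0, 5.0, 200)            # frequency grid
    dw = freqs[1] - freqs[0]

    def rbf(x, y, ls=1.0):
        return np.exp(-0.5 * (x[:, None] - y[None, :])**2 / ls**2)

    # GP prior over log S(w); the mean is Gaussian in w (up to constants),
    # so the mean kernel is roughly RBF.
    mean_log_S = -0.5 * freqs**2
    cov = rbf(freqs, freqs, ls=0.5) + 1e-8 * np.eye(len(freqs))
    log_S = np.random.multivariate_normal(mean_log_S, cov)
    S = np.exp(log_S)                             # one sampled spectral density

    def kernel(tau):
        # Bochner: k(tau) = integral S(w) cos(w tau) dw, symmetric density
        return np.sum(S * np.cos(np.outer(np.atleast_1d(tau), freqs)) * dw,
                      axis=-1)

    print(kernel(np.linspace(0, 3, 50))[:5])      # the induced kernel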
- Exploiting algebraic structure promises to enhance
scalability over a wide range of modelling
paradigms: kernel methods, determinantal processes,
deep learning... For example, the kernel
matrix used with Gaussian processes often has strong
structure imposed by the choice of kernel. We
can exploit that structure for scalable and exact inference.
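For example, when inputs lie on a Cartesian product grid, many kernels
give K = K1 ⊗ K2, and an MVM with the full matrix reduces to small
matrix products. A toy sketch under that assumed Kronecker structure:

    import numpy as np

    def kron_mvm(K1, K2, v):
        # (K1 ⊗ K2) v via the identity (K1 ⊗ K2) vec(V) = vec(K1 V K2^T)
        # for row-major vec, without ever forming the full matrix
        n1, n2 = K1.shape[0], K2.shape[0]
        V = v.reshape(n1, n2)
        return (K1 @ V @ K2.T).reshape(-1)

    K1, K2 = np.eye(3) + 0.1, np.eye(4) + 0.2
    v = np.random.randn(12)
    assert np.allclose(kron_mvm(K1, K2, v), np.kron(K1, K2) @ v)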
- Quite often in machine learning we can do fast
matrix vector multiplications (MVMs), but for
inference or learning we need to compute the log
determinant of a matrix. Can we efficiently compute
the log determinant of a matrix and its derivatives
relying only on fast MVMs?
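One assumed route, sketched below: if A = I - B with the spectral radius
of B below one, then log det(A) = tr(log A) = -sum_k tr(B^k)/k, and each
trace can be estimated with Hutchinson probes using MVMs alone.

    import numpy as np

    def logdet_mvm(mvm_B, n, n_probes=50, n_terms=30):
        # Hutchinson estimator: tr(B^k) ≈ E[z^T B^k z] for Rademacher z
        est = 0.0
        for _ in range(n_probes):
            z = np.random.choice([-1.0, 1.0], size=n)
            w = z.copy()
            for k in range(1, n_terms + 1):
                w = mvm_B(w)               # only MVMs with B are needed
                est -= z @ w / k
        return est / n_probes

    n = 200
    B = 0.3 * np.eye(n) + 0.001 * np.random.randn(n, n)
    A = np.eye(n) - B
    print(logdet_mvm(lambda v: B @ v, n))  # stochastic estimate
    print(np.linalg.slogdet(A)[1])         # exact value, for comparison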
- Knowledge distillation has been used to compress
large neural networks into smaller networks for the
purpose of efficiency. But the frameworks
designed for network compression can in principle be
used to transfer knowledge between entirely
different models. For example, we can use this
paradigm to encode logic rules into the weights of a
neural network. This project is about
exploring how to transfer and combine structure
from one model to another, both for better
performance and for enhanced
interpretability. A successful project may
wish to eventually go in the reverse direction:
e.g., can we derive new logic rules from the trained
parameters of a neural network?
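To make the starting point concrete, here is a minimal sketch of a
soft-target distillation loss in the style of Hinton et al. (the
temperature T and the T^2 scaling follow that work; the rest is
illustrative):

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=4.0):
        # cross-entropy between softened distributions, scaled by T^2
        p_teacher = softmax(teacher_logits, T)
        log_p_student = np.log(softmax(student_logits, T) + 1e-12)
        return -T**2 * np.mean(np.sum(p_teacher * log_p_student, axis=-1))

    teacher = np.random.randn(8, 10)   # hypothetical logits, batch of 8
    student = np.random.randn(8, 10)
    print(distillation_loss(student, teacher))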
- Dropout has been proposed as an effective way of
regularizing neural networks. However, looking
at the plots in the JMLR
dropout paper we see that dropout performs
worse than no dropout for the smallest datasets,
which is not the behaviour that we expect from a
regularizer. Instead, dropout may be helping
to navigate a highly multimodal likelihood surface
by injecting noise into the hidden layers. In
the limit of the learning procedure, we would want
to remove this noise. This project is about
exploring principled dropout schedules for learning
the solutions to complex multimodal objective
functions -- providing both a more effective way to
train neural networks, and the beginnings of a
procedure that may be powerful for multimodal optimisation in general.
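As a trivial illustration of what a schedule might look like (the linear
form below is purely a placeholder assumption; designing a principled
schedule is the project):

    def dropout_rate(t, p0=0.5, t_anneal=10000):
        # anneal the dropout noise toward zero as training converges
        return max(0.0, p0 * (1.0 - t / t_anneal))

    for t in [0, 2500, 5000, 10000, 20000]:
        print(t, dropout_rate(t))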
- There are many astronomy datasets, such as from
the Kepler and K2 missions, which have fascinating
structure, but are difficult to analyze due to a
severe degree of systematic noise. This
project is about developing kernel learning methods
which can uncover this structure to make new discoveries.
- Can we derive new kernels corresponding to
infinite basis expansions which have interesting
properties? What would happen if we replaced
the Gaussian basis functions in the derivation with
some other functions? What would happen if we
had many layers in our neural network in the proof
for the neural network kernel?
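For reference, the derivation this idea asks you to modify takes a limit
of Gaussian basis functions (a sketch, following the standard Gaussian
process treatment):

    \phi_c(x) = \exp\!\left(-\frac{(x-c)^2}{2\ell^2}\right), \qquad
    k(x, x') \propto \int_{-\infty}^{\infty} \phi_c(x)\, \phi_c(x')\, dc
    = \sqrt{\pi \ell^2}\, \exp\!\left(-\frac{(x-x')^2}{4\ell^2}\right),

so a uniform spread of infinitely many Gaussian bumps induces the RBF
kernel; the question is what kernels other basis functions, or deeper
compositions, induce.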
- Many machine learning algorithms, such as
Hamiltonian Monte Carlo, have undesirable tuning
parameters. Can we use approaches such as
Bayesian optimisation to automate the learning of these tuning parameters?
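A compact sketch of the kind of loop this would involve (the objective f
below is a made-up stand-in for a real measure of sampler quality, such
as effective sample size per gradient evaluation):

    import numpy as np

    def f(eps):                               # toy objective, peaks at e^-2
        return -(np.log(eps) + 2.0)**2

    def rbf(a, b, ls=0.5):
        return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

    cand = np.linspace(0.01, 1.0, 200)        # candidate step sizes
    X = list(np.random.uniform(0.01, 1.0, 3)) # random initial evaluations
    y = [f(x) for x in X]
    for _ in range(10):
        Xa, ya = np.array(X), np.array(y)
        Kinv = np.linalg.inv(rbf(Xa, Xa) + 1e-6 * np.eye(len(Xa)))
        ks = rbf(cand, Xa)
        mu = ks @ Kinv @ ya                   # GP posterior mean
        var = 1.0 - np.sum((ks @ Kinv) * ks, axis=1)
        ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0.0))
        x_next = cand[np.argmax(ucb)]         # upper-confidence-bound rule
        X.append(x_next); y.append(f(x_next))
    print("best step size:", X[int(np.argmax(y))])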
- Alternatively, can we generalize fundamental
classical statistical models to probabilistic
models, such as PCA -> Probabilistic PCA?
Can we then place distributions over the parameters
of these models, and (approximately) integrate away
these parameters, to develop objective functions for
learning tuning parameters (such as the
dimensionality of PCA), in
the vein of Minka, which are classically
learned through heuristic approaches such as cross-validation?
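For concreteness, probabilistic PCA replaces the classical projection
with a latent variable model whose parameters can be integrated over
(notation follows Tipping and Bishop):

    \mathbf{z} \sim \mathcal{N}(\mathbf{0}, I), \qquad
    \mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(W\mathbf{z} + \mu,\, \sigma^2 I)
    \;\Longrightarrow\;
    \mathbf{x} \sim \mathcal{N}(\mu,\, W W^{\top} + \sigma^2 I),

and the marginal likelihood under this model gives an objective that
scores the latent dimensionality directly.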
- Variational Bayes minimises KL(Q||P) and
Expectation Propagation minimises KL(P||Q).
But might there be a better metric for creating
deterministic approximations? Both KL
divergences belong to the more general family of
alpha divergences. Perhaps we could use this
more general metric, and learn the parameters of
this metric from the data, to find the best possible
(or, a much better!) approximation.
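For normalised p and q, the alpha divergence family referred to above
can be written (in the parameterisation used by Minka):

    D_\alpha(p \,\|\, q) = \frac{1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx}{\alpha (1 - \alpha)},

which recovers KL(Q||P) as alpha -> 0 and KL(P||Q) as alpha -> 1, so
both variational Bayes and EP sit inside one continuous family whose
alpha could itself be learned.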
- Can we engineer Deep
Kernel Learning to work really
fast on GPUs by exploiting all of the
fast matrix vector multiplications
(MVMs) and matrix matrix multiplications available?
- Can we define a kernel over neural network
architectures? This problem involves many
fundamental challenges: how do we represent the
neural network as a data structure? How do we define
similarity somewhat independently of a particular
dataset? An idea to get you started: we could
consider the information capacity of neural
network (or, more generally, model) architectures, using
criteria such as mutual information. Answering
this question would be a foundational step towards
automating neural network (deep learning)
architecture design and understanding deep learning.
- Why has deep learning worked so well? Can we
philosophically understand the success of deep
learning using physics, probability, and information
theory, such as in https://arxiv.org/abs/1608.08225v2?
Some resources from my colleague Alex Smola:
Here's a very incomplete and
short list of ideas and datasets. This is really just to get you started,
and I encourage you to think beyond the scope of pre-made datasets.
Design a streaming algorithm to
find frequent items. Note that the distribution might
change over time. A possible strategy is to modify an
existing sketch, such as Count-Min, to discount old counts.
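A hypothetical starting point (the decay mechanism below is an assumed
design, not a reference implementation): a Count-Min sketch whose counts
decay exponentially, so estimates track a drifting distribution.

    import numpy as np

    class DecayingCountMin:
        def __init__(self, width=1024, depth=4, decay=0.999, seed=0):
            self.w, self.decay = width, decay
            rng = np.random.RandomState(seed)
            self.salts = [int(s) for s in rng.randint(0, 2**31 - 1, depth)]
            self.table = np.zeros((depth, width))

        def update(self, item):
            # age all counts (a real implementation would decay lazily)
            self.table *= self.decay
            for i, s in enumerate(self.salts):
                self.table[i, hash((s, item)) % self.w] += 1.0

        def estimate(self, item):
            return min(self.table[i, hash((s, item)) % self.w]
                       for i, s in enumerate(self.salts))

    cm = DecayingCountMin()
    for x in ["a"] * 50 + ["b"] * 10:
        cm.update(x)
    print(cm.estimate("a"), cm.estimate("b"))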
Use secondary information to
improve collaborative filtering, e.g. for the Netflix
problem you could incorporate IMDB and Wikipedia.
Financial forecasting as a
high-dimensional multivariate regression problem. E.g. you
could try predicting the prices of a very large number of
securities at the same time. Possibly using news, tweets,
and financial data releases to improve the estimates
beyond a simple technical analysis.
Detect trends e.g. in the Tweet
stream. Forecast tomorrow's keywords today. How quickly
can you detect new events (earthquakes, assassinations, etc.)?
Nonlinear function classes. Can
you find efficient sets of basis functions that are both
fast to compute and sufficiently nonlinear to address a
large set of estimation problems?
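One well-known instance of such a basis, as a starting point: random
Fourier features (Rahimi and Recht), which approximate the RBF kernel
with cheap cosines. A minimal sketch:

    import numpy as np

    def rff(X, n_features=500, ls=1.0, seed=0):
        # w ~ N(0, I/ls^2) samples the RBF kernel's spectral density
        rng = np.random.RandomState(seed)
        W = rng.randn(X.shape[1], n_features) / ls
        b = rng.uniform(0, 2 * np.pi, n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    X = np.random.randn(100, 3)
    Z = rff(X)
    K_approx = Z @ Z.T            # approximates exp(-||x - x'||^2 / (2 ls^2))
    K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
    print(np.abs(K_approx - K_exact).max())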
Parallel decision trees. Can you
design a data parallel decision tree / boosted decision
tree algorithm? The published results are essentially
sequential in the construction of the trees. One
suggestion would be to take the Random Forests algorithm,
re-interpret it as a Pitman estimator sampling from the
version space of consistent trees, and then extend it to
a data-parallel setting.