Bayesian Machine Learning

Fall 2020


Instructor
Andrew Gordon Wilson



 


Project

The project is an opportunity to become involved in machine learning research, and to be creative about solving the problems you find most interesting.  Ideally the process will involve a lot of fun, hard work, independence, frustration, self-motivation, and ultimately a great sense of achievement.  The project is by far the largest component of the course evaluation, so it is in your interest to do it really well.

You can work in groups of three.  If you are looking for a partner, please post on our Piazza forum, or e-mail me.  Smaller groups are acceptable, but I encourage you to have two teammates.  Part of the point is to work together in a team that is greater than the sum of its parts.  For that reason, I recommend you find partners who complement your abilities.  If you have a team of three similar people, you may get along very well, but will also be weak in the same respects, and it could be boring.

At the end of this page, I have listed some ideas that could turn into great projects.  You are also welcome to propose an idea.

Discuss the project with the TAs or me.  It's a good idea to get some feedback on the proposal before spending a month working on it.

The final report should involve some novel methodological ideas and/or a strong application with an interesting model.  At the least, there will need to be a clearly defined idea and some novel, meaningful exploration of that idea, either theoretical or experimental.

The project has five parts:
Proposal: November 5
Midterm Report: November 27
Presentations: December 2-4
Reviews Due: December 7
Final Report: December 17
Reviews Due: December 20
Reviews: You will each review two midterm reports, two final reports, and the presentations. Details about reviews will be sent in a separate post.

The project is worth 55% of your course grade. The breakdown of the project grade is as follows:
Proposal: 10/100
Midterm Report: 20/100
Presentation: 15/100
Peer Reviews: 20/100
Final Report: 35/100

Each part is described in more detail as follows:

Proposal


The proposal should be one to two pages in ICML format.  It should contain the title, group members' names, email addresses, and net ids.  You must clearly describe your idea and provide an organized outline of all that is to be completed, with a rough timeline for completion.  It is fine to build on prior work, but you must clearly state what part of your project is novel.  Novelty can take many forms: methodological innovation, application, new theoretical insights, etc.  Describe what you would ideally like to complete for a perfect project, and what at minimum you could complete and still consider your project a success.  Include at least 3 relevant papers each team member has read at the time of proposal submission, and a dataset you will use in your project.  Also include a midterm milestone: describe what you will complete by the midterm report.  Describe what each teammate will be doing in the project.

Heilmeier's Criteria:
- What are you trying to do?  Articulate your objectives using absolutely no jargon.  What is the problem?  Why is it hard?
- How is it done today, and what are the limits of current practice?
- What's new in your approach and why do you think it will be successful?
- Who cares?
- If you're successful, what difference will it make?  What impact will success have?  How will it be measured?
- What are the risks and the payoffs?
- How much will it cost?
- How long will it take?
- What are the midterm and final criteria for success?  How will progress be measured?

These criteria are amazingly useful in making sure that you won't try the impossible, or solve a really boring problem.



Midterm Report

This should be a 4-5 page report, which serves as a check-point.  It should have roughly the same sections as your final report (introduction, related work, methodology, experiments, discussion), where some sections or parts of sections may be in progress.  The introduction and related work sections should be well developed.

The grading scheme for the midterm report is: 50% for the method, 40% for the design of experiments and current progress, 10% for the planned activities.


Presentation

All project teams are to present their work at the end of the semester.  Live demonstrations of your software are highly encouraged.  These will be like machine learning conference presentations.  To get an idea of the style, see the ICML videos.  However, there should be more introductory content, clearly defining the context of the problem and the scope of the project.  Each presentation should be 18 minutes in length + 2 min Q&A, equally divided across group members.

- All students should attend all presentations.
- Each member in each team will be evaluated separately. (Members in the same team can get different scores for the final presentation).

- More details to follow

Final Report

Your final report is expected to be 8 pages, with optional extra space in appendices for proofs, supplemental results, etc.  You should submit both an electronic and a hardcopy version of the report, with the sections
(introduction, related work, methodology, experiments, discussion) defined in your report.



Project ideas (please see the updated list on Piazza):

Here are some high level ideas that can get you started.   If you would like to follow up on these ideas, please let me know.

- The spectral mixture kernel forms an expressive basis for all stationary kernels.  But it is fundamentally a parametric object, and it may be hard to control its inductive biases.  Instead imagine using a (transformed) Gaussian process with a standard kernel (e.g. RBF) to model the spectral density.  The resulting model induces a distribution over all stationary kernels and can have a mean kernel centred at some reasonable choice, such as the RBF kernel.  In effect, we turn interpolation in the frequency domain into extrapolation in the original input space.
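A small numerical sketch of this idea (the frequency grid, lengthscale, and truncation below are arbitrary choices of mine, not part of the proposal): draw a sample from a log-Gaussian process as a spectral density, and recover the induced stationary kernel via Bochner's theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frequency grid; the spectral density is modelled only on [0, 5] (a truncation I chose)
omega = np.linspace(0.0, 5.0, 200)
dw = omega[1] - omega[0]

# Latent GP over frequency with an RBF kernel; exp() keeps the sampled density positive
K = np.exp(-0.5 * (omega[:, None] - omega[None, :]) ** 2 / 0.5 ** 2)
f = np.linalg.cholesky(K + 1e-6 * np.eye(len(omega))) @ rng.standard_normal(len(omega))
S = np.exp(f)

# Bochner's theorem (symmetric density): k(tau) = integral of S(w) cos(w * tau) dw
def induced_kernel(tau):
    return np.sum(S * np.cos(omega * tau)) * dw
```

Any density drawn this way yields a valid stationary covariance, with |k(tau)| bounded by k(0).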

- Exploiting algebraic structure promises to enhance scalability over a wide range of modelling paradigms: kernel methods, determinantal processes, deep learning...  For example, the kernel matrix used with Gaussian processes often has strong structure imposed by the choice of kernel.  We can exploit that structure for scalable and exact inference.
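As one concrete instance of exploiting structure (a sketch; Kronecker structure is just one example of what "structure" can mean here): a matrix-vector product with A &#8855; B never requires forming the Kronecker product explicitly.

```python
import numpy as np

def kron_mvm(A, B, x):
    """Compute (A kron B) x without materialising the Kronecker product,
    using the identity (A kron B) vec(X) = vec(A X B^T) for row-major vec.
    Cost is O(p^2 q + p q^2) instead of O(p^2 q^2)."""
    p, q = A.shape[0], B.shape[0]
    return (A @ x.reshape(p, q) @ B.T).reshape(-1)
```

For a product kernel on a Cartesian grid of inputs, the Gram matrix has exactly this form, which is what makes exact inference scalable.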

- Quite often in machine learning we can do fast matrix vector multiplications (MVMs), but for inference or learning we need to compute the log determinant of a matrix. Can we efficiently compute the log determinant of a matrix and its derivatives relying only on fast MVMs?
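One plausible starting point (a sketch under the assumption that bounds [a, b] on the eigenvalues are known or estimated): combine Hutchinson's stochastic trace estimator with a Chebyshev approximation of log, so that the entire computation touches the matrix only through MVMs.

```python
import numpy as np

def logdet_mvm(mvm, n, a, b, degree=40, num_probes=50, seed=0):
    """Estimate log det(A) for SPD A with eigenvalues in [a, b],
    touching A only through the matrix-vector product `mvm`."""
    rng = np.random.default_rng(seed)
    # Chebyshev interpolation of log on [a, b], mapped to [-1, 1]
    k = np.arange(degree + 1)
    nodes = np.cos(np.pi * (k + 0.5) / (degree + 1))
    fvals = np.log(0.5 * (b - a) * nodes + 0.5 * (b + a))
    c = np.array([2.0 / (degree + 1)
                  * np.sum(fvals * np.cos(np.pi * j * (k + 0.5) / (degree + 1)))
                  for j in range(degree + 1)])
    c[0] *= 0.5
    shifted = lambda v: (2.0 * mvm(v) - (a + b) * v) / (b - a)  # spectrum -> [-1, 1]
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)        # Rademacher probe (Hutchinson)
        w0, w1 = z, shifted(z)
        acc = c[0] * w0 + c[1] * w1
        for j in range(2, degree + 1):             # Chebyshev three-term recurrence
            w0, w1 = w1, 2.0 * shifted(w1) - w0
            acc = acc + c[j] * w1
        total += z @ acc                           # approximates z^T log(A) z
    return total / num_probes
```

The estimator uses (degree x num_probes) MVMs in total, so its cost is linear in the cost of one MVM; extending it to derivatives of the log determinant is part of the open question.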

- Knowledge distillation has been used to compress large neural networks into smaller networks for the purpose of efficiency.  But the frameworks designed for network compression can in principle be used to transfer knowledge between entirely different models.  For example, we can use this paradigm to encode logic rules into the weights of a neural network.  This project is about exploring the transferring and combining structure from one model into another, both for better performance, and for enhanced interpretability.  A successful project may wish to eventually go in the reverse direction: e.g., can we derive new logic rules from the trained parameters of a neural network?
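The standard distillation objective this builds on can be sketched in a few lines (a minimal version; the temperature value and function names are my own choices):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened class probabilities,
    averaged over the batch; zero iff the student matches the teacher."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

Nothing here assumes the teacher is a neural network: the teacher logits could just as well come from a rule-based model, which is what makes the cross-model transfer direction plausible.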

- Dropout has been proposed as an effective way of regularizing neural networks.  However, looking at the plots in the JMLR dropout paper we see that dropout performs worse than no dropout for the smallest datasets, which is not the behaviour that we expect from a regularizer.  Instead, dropout may be helping to navigate a highly multimodal likelihood surface by injecting noise into the hidden layers.  In the limit of the learning procedure, we would want to remove this noise.  This project is about exploring principled dropout schedules for learning the solutions to complex multimodal objective functions -- providing both a more effective way to train neural networks, and the beginnings of a procedure that may be powerful in general for non-convex optimization.
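A trivially simple baseline schedule to start from (linear annealing is my own arbitrary choice; the project is about finding principled schedules):

```python
def annealed_dropout(epoch, total_epochs, p0=0.5):
    """Linearly anneal the dropout rate from p0 down to 0, so the
    injected noise vanishes in the limit of the learning procedure."""
    return p0 * max(0.0, 1.0 - epoch / total_epochs)
```

Comparing schedules like this (linear, exponential, or adapted to the loss landscape) against a fixed rate on small datasets would be a natural first experiment.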

- There are many astronomy datasets, such as from the Kepler and K2 missions, which have fascinating structure, but are difficult to analyze due to a severe degree of systematic noise.  This project is about developing kernel learning methods which can uncover this structure to make new scientific discoveries. 

- Can we derive new kernels corresponding to infinite basis expansions which have interesting properties?  What would happen if we replaced the Gaussian basis functions in the derivation with some other functions?  What would happen if we had many layers in our neural network in the proof for the neural network kernel?
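The Gaussian-basis case can be checked numerically before trying other basis functions (a sketch; the grid range and lengthscale are arbitrary choices of mine): a dense sum of Gaussian bumps converges to the closed-form limit sqrt(pi) * ell * exp(-(x - x')^2 / (4 ell^2)).

```python
import numpy as np

ell = 1.0
centers = np.linspace(-10.0, 10.0, 2000)   # dense grid of basis-function centres
dc = centers[1] - centers[0]

def phi(x):
    """Gaussian basis functions evaluated at x."""
    return np.exp(-(x - centers) ** 2 / (2 * ell ** 2))

def k_finite(x, xp):
    """Riemann-sum kernel from the finite basis expansion."""
    return dc * np.sum(phi(x) * phi(xp))

def k_limit(x, xp):
    """Closed-form infinite-basis limit (an RBF-type kernel)."""
    return np.sqrt(np.pi) * ell * np.exp(-(x - xp) ** 2 / (4 * ell ** 2))
```

Swapping phi for another family of functions and seeing which integrals remain tractable is exactly the first question above.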

- Many machine learning algorithms, such as Hamiltonian Monte Carlo, have undesirable tuning parameters.  Can we use approaches such as Bayesian optimisation to automate the learning of these parameters?

- Alternatively, can we generalize fundamental classical statistical models to probabilistic models, such as PCA -> Probabilistic PCA?  Can we then place distributions over the parameters of these models, and (approximately) integrate away these parameters, to develop objective functions for learning tuning parameters (such as the dimensionality of PCA), in the vein of Minka, which are classically learned through heuristic approaches such as cross-validation?

- Variational Bayes minimises KL(Q||P) and Expectation Propagation minimises KL(P||Q).  But might there be a better metric for creating deterministic approximations?  Both KL divergences belong to the more general family of alpha divergences.  Perhaps we could use this more general family, and learn its parameters from the data, to find the best possible (or, a much better!) approximation.
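For reference, one common parameterisation of the alpha-divergence family (following Amari; conventions for the sign and scaling of alpha vary across papers, so treat this as one choice rather than the definition):

```latex
D_\alpha(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)}
  \left( 1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \right),
\qquad
\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = \mathrm{KL}(p \,\|\, q),
\quad
\lim_{\alpha \to 0} D_\alpha(p \,\|\, q) = \mathrm{KL}(q \,\|\, p).
```

In this parameterisation, varying alpha interpolates between the VB and EP objectives, which is what makes learning alpha from data a sensible knob.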

- Can we engineer Deep Kernel Learning to work really fast on GPUs by exploiting all of the fast matrix vector multiplications (MVMs) and matrix matrix multiplications?

- Can we define a kernel over neural network architectures?  This problem involves many fundamental challenges: how do we represent the neural network as a data structure? How do we define similarity somewhat independently of a particular dataset?  An idea to get you started: we could consider the information capacity of neural network (or generally model) architectures, using criteria such as mutual information.  Answering this question would be a foundational step towards automating neural network (deep learning) architecture design and understanding deep learning.

- Why has deep learning worked so well?  Can we philosophically understand the success of deep learning using physics, probability, and information theory, such as in https://arxiv.org/abs/1608.08225v2


Some resources from my colleague Alex Smola:

Datasets

Here's a very incomplete and short list of datasets. This is really just to get you started and I encourage you to think beyond the scope of pre-made datasets.

  • Yahoo webscope datasets. There are plenty of them free for download. However, you need to sign up individually since the datasets typically come with noncommercial restrictions.

  • Netflix challenge data is not officially available any more. However, a quick web search will help you locate it.

  • IMDB data

  • Twitter gardenhose

  • AOL query log

  • GigaDB bioinformatics database. Try e.g. searching for homo sapiens.

  • TREC datasets (text retrieval).

  • Linguistic Data Consortium homepage

  • Stanford Social Networks datasets

  • Frequent itemset mining data

  • Wikipedia

Some problems

  • Design a streaming algorithm to find frequent items. Note that the distribution might change over time. A possible strategy is to modify the Apriori algorithm.
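As a baseline for this problem (a sketch of the classical Misra-Gries summary; it does not yet handle the drifting-distribution case, which is the interesting part):

```python
def misra_gries(stream, k):
    """One-pass summary using at most k-1 counters.  Every item occurring
    more than n/k times in a stream of length n is guaranteed to survive
    in the returned dict (possibly alongside some false positives)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters; drop any that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass (or a time-decayed variant of the counters) is needed to get exact counts or to track a changing distribution.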

  • Use secondary information to improve collaborative filtering, e.g. for the Netflix problem you could incorporate IMDB and Wikipedia.

  • Financial forecasting as a high-dimensional multivariate regression problem. E.g. you could try predicting the prices of a very large number of securities at the same time. Possibly using news, tweets, and financial data releases to improve the estimates beyond a simple technical analysis.

  • Detect trends e.g. in the Tweet stream. Forecast tomorrow's keywords today. How quickly can you detect new events (earthquakes, assassinations, elections)?

  • Nonlinear function classes. Can you find efficient sets of basis functions that are both fast to compute and sufficiently nonlinear to address a large set of estimation problems?
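One candidate family to benchmark against (a sketch of Rahimi-Recht random Fourier features; the parameter names are mine): features that cost a single matrix product to compute, yet whose inner products approximate a nonlinear kernel.

```python
import numpy as np

def random_fourier_features(X, D=500, gamma=1.0, seed=0):
    """Map rows of X to D random features z(x) such that
    z(x) . z(y) approximates the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

With the features in hand, any fast linear estimator becomes a nonlinear one; the approximation error shrinks as O(1/sqrt(D)).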

  • Parallel decision trees. Can you design a data parallel decision tree / boosted decision tree algorithm? The published results are essentially sequential in the construction of the trees. One suggestion would be to take the Random Forests algorithm, re-interpret it as a Pitman estimator sampling from the version space of consistent trees, and then extend it to other objectives.