Deep Learning
Yann LeCun (CIMS)

A long-term goal of Machine Learning (ML) research is to help solve highly complex "intelligent" tasks, such as visual perception, auditory perception, and language understanding. To reach that goal, the ML community must solve two problems: the Deep Learning Problem, and the Normalization Problem.

The Deep Learning Problem is the difficulty of training "deep architectures" composed of many non-linear layers of trainable modules. There is considerable theoretical and practical evidence that complex tasks, such as invariant object recognition in vision, require such "deep" architectures. This is in contrast with much of the ML research of the last 10 years, which has primarily focused on "shallow" models that are essentially linear functions of the parameters to be learned.

Deep learning amounts to optimizing a highly non-convex function in a very high-dimensional space. Several methods have recently been proposed to train (or pre-train) such deep architectures in an unsupervised fashion. Each layer of the deep architecture is composed of an encoder, which computes a feature vector from the input, and a decoder, which reconstructs the input from the features. A large number of such layers can be stacked and trained sequentially, thereby learning a deep hierarchy of features (or representations). Each layer is trained in an unsupervised fashion to minimize the reconstruction error under certain constraints on the features, such as sparsity. This class of learning methods is called "energy-based", because it amounts to shaping a high-dimensional energy landscape (a.k.a. an un-normalized negative log-likelihood function).
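The layer-by-layer procedure described above can be illustrated with a minimal sketch. It is not the specific method of the talk: it assumes a tanh encoder, a linear decoder, an L1 sparsity penalty, and plain stochastic gradient descent, and all names (init_layer, encode, decode, energy, train_layer, train_stack) and hyperparameters (lam, lr, epochs) are hypothetical choices made for illustration.

```python
import math
import random

def init_layer(n_in, n_feat, rng):
    """One encoder/decoder module with small random weights (assumed init)."""
    we = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_feat)]
    wd = [[rng.uniform(-0.1, 0.1) for _ in range(n_feat)] for _ in range(n_in)]
    return {"we": we, "b": [0.0] * n_feat, "wd": wd}

def encode(layer, x):
    """Encoder: feature vector z = tanh(We x + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(layer["we"], layer["b"])]

def decode(layer, z):
    """Decoder: linear reconstruction of the input, Wd z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in layer["wd"]]

def energy(layer, x, lam):
    """Squared reconstruction error plus an L1 sparsity penalty on z."""
    z = encode(layer, x)
    err = sum((a - b) ** 2 for a, b in zip(decode(layer, z), x))
    return err + lam * sum(abs(v) for v in z)

def mean_energy(layer, data, lam):
    return sum(energy(layer, x, lam) for x in data) / len(data)

def train_layer(layer, data, lam=0.01, lr=0.02, epochs=200):
    """Unsupervised training of one module by SGD on the energy."""
    for _ in range(epochs):
        for x in data:
            z = encode(layer, x)
            r = [a - b for a, b in zip(decode(layer, z), x)]  # residual
            # Gradient w.r.t. the features, then through the tanh.
            gz = [sum(2 * r[i] * layer["wd"][i][j] for i in range(len(x)))
                  + lam * ((z[j] > 0) - (z[j] < 0)) for j in range(len(z))]
            ga = [g * (1 - zj * zj) for g, zj in zip(gz, z)]
            for i in range(len(x)):
                for j in range(len(z)):
                    layer["wd"][i][j] -= lr * 2 * r[i] * z[j]
            for j in range(len(z)):
                for i in range(len(x)):
                    layer["we"][j][i] -= lr * ga[j] * x[i]
                layer["b"][j] -= lr * ga[j]
    return layer

def train_stack(data, feature_sizes, seed=0):
    """Greedy stacking: each new layer is trained on the previous features."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_feat in feature_sizes:
        layer = train_layer(init_layer(len(inputs[0]), n_feat, rng), inputs)
        layers.append(layer)
        inputs = [encode(layer, x) for x in inputs]
    return layers, inputs
```

For example, train_stack(data, [6, 8]) on 4-dimensional inputs would train an overcomplete 6-feature layer first, then an 8-feature layer on top of its outputs. Only the encoders are needed to compute the final representation; the decoders serve to define the training energy.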

A particular class of methods for deep energy-based unsupervised learning will be described that can learn sparse and overcomplete representations of data. When applied to natural image patches, the method produces filters similar to those found in the mammalian primary visual cortex. Stacking multiple layers of this simple module produces a hierarchical vision system that extracts high-level features suitable for computer vision applications. Applications to invariant object recognition in images and to visual navigation for mobile robots will be shown.