Deep Learning
A long-term goal of Machine Learning (ML) research is to help solve
highly complex "intelligent" tasks, such as visual perception, auditory
perception, and language understanding. To reach that goal, the ML
community must solve two problems: the Deep Learning Problem, and the
Normalization Problem.
The Deep Learning Problem is related to the difficulty of training
"deep architectures" composed of many non-linear layers of trainable
modules. There is considerable theoretical and practical evidence that
complex tasks, such as invariant object recognition in vision, require
such architectures. This is in contrast with much of ML research of
the last 10 years, which has primarily focused on "shallow" models that
are essentially linear functions of the parameters to be learned.
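The distinction between "shallow" and "deep" models can be illustrated with a minimal sketch. The function names, the polynomial feature map, and the tanh non-linearity below are illustrative assumptions, not taken from the work described here. The shallow model is a fixed feature map followed by a trainable linear layer, so its output is linear in the parameters; the two-layer model composes trainable non-linear layers, so it is not.

```python
import numpy as np

def shallow(x, w):
    """Fixed (non-trainable) features, then a trainable linear layer."""
    phi = np.array([1.0, x, x**2])  # hypothetical hand-chosen feature map
    return w @ phi

def deep(x, params):
    """Two trainable layers with a non-linearity in between."""
    W1, w2 = params
    return w2 @ np.tanh(W1 * x)

x = 0.7
w_a = np.array([0.5, -1.0, 2.0])
w_b = np.array([1.5, 0.3, -0.4])

# Linear in the parameters: combining parameters combines outputs.
lhs = shallow(x, 2 * w_a + 3 * w_b)
rhs = 2 * shallow(x, w_a) + 3 * shallow(x, w_b)
shallow_is_linear = bool(np.isclose(lhs, rhs))

# Not linear in the parameters: scaling them does not scale the output.
W1 = np.array([1.0, -0.5])
w2 = np.array([1.0, 2.0])
deep_is_linear = bool(np.isclose(deep(x, (2 * W1, 2 * w2)),
                                 2 * deep(x, (W1, w2))))
```

Because the shallow output is linear in `w`, fitting it with a convex loss is a convex problem; the deep model loses that property.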
Deep learning amounts to optimizing a highly non-convex function
in a very high-dimensional space. Several methods have recently been
proposed to train (or pre-train) such deep architectures in an
unsupervised fashion. In these methods, each layer of the deep
architecture is composed of an encoder, which computes a feature vector
from the input, and a decoder, which reconstructs the input from the
features. A large
number of such layers can be stacked and trained sequentially, thereby
learning a deep hierarchy of features (or representations). Each
layer is trained in an unsupervised fashion to minimize the
reconstruction error under certain constraints on the features, such as
sparsity. This class of learning methods is called
"energy-based", because it amounts to shaping a high-dimensional energy
landscape (a.k.a. an un-normalized negative log-likelihood function).
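The layer-wise procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the specific method presented here: the tanh encoder, linear decoder, L1 sparsity penalty, plain gradient descent, and all function names and hyperparameters are choices made for the example.

```python
import numpy as np

def encode(X, enc):
    """Encoder: compute a feature vector from the input."""
    W, b = enc
    return np.tanh(X @ W + b)

def train_layer(X, n_hidden, l1=0.1, lr=0.05, epochs=300, seed=0):
    """Train one encoder/decoder pair to minimize reconstruction
    error plus an L1 sparsity penalty on the features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    We = rng.normal(0.0, 0.1, (d, n_hidden))
    be = np.zeros(n_hidden)
    Wd = rng.normal(0.0, 0.1, (n_hidden, d))
    bd = np.zeros(d)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ We + be)        # encoder output (features)
        err = Z @ Wd + bd - X           # decoder reconstruction error
        losses.append(0.5 * np.sum(err**2) / n
                      + l1 * np.sum(np.abs(Z)) / n)
        # Gradients of the average loss (backpropagation by hand).
        dZ = (err @ Wd.T + l1 * np.sign(Z)) / n
        dA = dZ * (1.0 - Z**2)          # through the tanh
        gWd = Z.T @ err / n
        gbd = err.sum(axis=0) / n
        gWe = X.T @ dA
        gbe = dA.sum(axis=0)
        We -= lr * gWe
        be -= lr * gbe
        Wd -= lr * gWd
        bd -= lr * gbd
    return (We, be), losses

def train_stack(X, layer_sizes, **kw):
    """Greedy stacking: train each layer on the features
    produced by the previously trained layer."""
    encoders, all_losses, H = [], [], X
    for n_hidden in layer_sizes:
        enc, losses = train_layer(H, n_hidden, **kw)
        encoders.append(enc)
        all_losses.append(losses)
        H = encode(H, enc)              # input to the next layer
    return encoders, H, all_losses
```

For example, `train_stack(X, [16, 12])` on an 8-dimensional input `X` trains an overcomplete 16-unit layer first, then a 12-unit layer on its features; the decoders are discarded after training and only the encoders are kept.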
A particular class of deep energy-based unsupervised learning methods
will be described that can learn sparse and overcomplete
representations of data. When applied to natural image patches, these
methods produce filters similar to those found in the mammalian primary
visual cortex. A hierarchical vision system that extracts high-level
features suitable for computer vision applications can be produced by
stacking multiple layers of this simple module. Applications to
invariant object recognition in images and to visual navigation for
mobile robots will be shown.