Learning Representations of Images and High-Dimensional Data
Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, NYU
Abstract:
One of the big mysteries in neuroscience is how the brain manages to learn good representations of the perceptual world. How does it discover the underlying structure of the high-dimensional signals coming from the eyes and ears? How does it identify and disentangle the explanatory factors of variation in natural images? How does it learn to extract relevant features while eliminating irrelevant variabilities in visual percepts so as to detect, locate, and recognize objects in images?
These questions go beyond neuroscience: with the deluge of data produced by the digital world, information is increasingly processed by machines, not humans. Extracting knowledge from data automatically, discovering structure in high-dimensional data, and learning good representations of such data are among the great challenges facing machine learning, statistics, and artificial intelligence in the coming decades. Arguably, they may also be among the great challenges for mathematics, particularly geometry and applied mathematics.
While machine learning researchers have long assumed that representations of the data were given, and have concentrated on learning classifiers and predictors from these representations, an increasingly large number of researchers are devising methods that automatically learn good representations of the data. Perceptual signals are often best represented by a multi-stage hierarchy in which features in successive stages are increasingly global, abstract, and invariant to irrelevant transformations of the input. Devising learning algorithms for such multi-stage architectures has come to be known as the "deep learning problem". Deep learning poses many unsolved mathematical puzzles, such as how to shape high-dimensional parameterized surfaces whose level sets distinguish regions of high data density from regions of low data density.
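As a rough illustration of that last formulation (a toy sketch of my own choosing, not a method presented in the talk), one can train a small parameterized energy surface E(x; theta) to be low on observed data points and high on broadly scattered contrastive points, using a simple logistic discrimination loss; the level sets of E then separate the high-density region from the rest of the space. The network architecture, the noise distribution, and the loss below are all illustrative assumptions.

```python
# Toy sketch: shape a scalar surface E(x; theta) so that its level sets
# separate high data density from low data density. Illustrative only.
import torch

# Small MLP energy function E: R^2 -> R (architecture chosen arbitrarily).
energy = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
opt = torch.optim.Adam(energy.parameters(), lr=1e-2)

data = torch.randn(512, 2) * 0.5 + 2.0    # samples from the "true" density
noise = torch.rand(512, 2) * 8.0 - 2.0    # broad contrastive samples covering the plane

for step in range(200):
    # Logistic loss: low energy should mean "data", high energy "noise".
    logits = torch.cat([-energy(data), -energy(noise)]).squeeze(-1)
    labels = torch.cat([torch.ones(len(data)), torch.zeros(len(noise))])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, energy(x) is low near the data cluster and high elsewhere,
# so a level set {x : E(x) <= c} roughly encloses the high-density region.
```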
I will describe a number of unsupervised methods for learning representations of images. They are based on learning over-complete dictionaries for sparse decoders and sparse auto-encoders. Invariance of the representations to small variations of the input, similar to that of complex cells in the visual cortex, is obtained through group sparsity, lateral inhibition, or temporal constancy criteria.
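To make the dictionary-learning idea concrete, here is a minimal sketch of sparse coding with an over-complete dictionary: codes are inferred with ISTA under an L1 penalty, and the dictionary columns are updated by gradient steps on the reconstruction error. The patch size, dictionary size, penalty weight, and optimization schedule are arbitrary choices for the example, not those used in the methods above, and the sketch omits the invariance (group sparsity / lateral inhibition / temporal constancy) terms.

```python
# Minimal sparse coding sketch: over-complete dictionary + L1-penalized codes.
import numpy as np

def soft_threshold(z, t):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_codes(x, D, lam=0.1, n_steps=100):
    """Infer a sparse code z minimizing 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x)              # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z

def dictionary_update(X, Z, D, lr=0.01):
    """One gradient step on the dictionary, followed by column normalization."""
    R = X - Z @ D.T                            # reconstruction residuals, one sample per row
    D += lr * (R.T @ Z) / X.shape[0]
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)
    return D

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))             # toy "image patches" (64-dim)
D = rng.standard_normal((64, 128))             # over-complete: 128 atoms for 64 dims
D /= np.linalg.norm(D, axis=0, keepdims=True)

for epoch in range(5):
    Z = np.stack([ista_codes(x, D) for x in X])  # infer sparse codes for all patches
    D = dictionary_update(X, Z, D)               # then improve the dictionary
```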
The methods have been used to train convolutional networks (ConvNets). ConvNets are biologically-inspired architectures consisting of multiple stages of filter banks, interspersed with non-linear operations, spatial pooling, and contrast normalization operations. Several applications will be shown through videos and live demos, including a pedestrian detector, a category-level object recognition system that can be trained on the fly, and a "scene parsing" system that can label every pixel in an image with the category of the object it belongs to.
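One such stage can be sketched as follows (a minimal PyTorch illustration; the kernel sizes, tanh non-linearity, max pooling, and local response normalization as a stand-in for contrast normalization are assumptions for the example, not the exact operations used in the systems demoed).

```python
# Minimal sketch of one ConvNet stage: filter bank -> non-linearity ->
# spatial pooling -> contrast normalization (illustrative sizes only).
import torch
import torch.nn as nn

class ConvNetStage(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.filter_bank = nn.Conv2d(in_channels, out_channels,
                                     kernel_size=7, padding=3)    # learned filter bank
        self.nonlinearity = nn.Tanh()                             # point-wise non-linear op
        self.pooling = nn.MaxPool2d(kernel_size=2, stride=2)      # spatial pooling
        self.contrast_norm = nn.LocalResponseNorm(size=5)         # contrast-normalization stand-in

    def forward(self, x):
        return self.contrast_norm(self.pooling(self.nonlinearity(self.filter_bank(x))))

# Stacking a few stages and a linear classifier gives a small ConvNet.
model = nn.Sequential(
    ConvNetStage(3, 32),
    ConvNetStage(32, 64),
    nn.Flatten(),
    nn.LazyLinear(10),                         # e.g. 10 object categories
)
scores = model(torch.randn(1, 3, 96, 96))      # one 96x96 RGB image -> 10 category scores
```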