Learning Representations of Images and High-Dimensional Data
Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, NYU


Abstract:
One of the big mysteries in neuroscience is how the brain manages to learn good representations of the perceptual world. How does it discover the underlying structure of the high-dimensional signals coming from the eyes and ears? How does it identify and disentangle the explanatory factors of variation in natural images? How does it learn to extract relevant features while eliminating irrelevant variability in visual percepts so as to detect, locate, and recognize objects in images?

These questions go beyond neuroscience: with the deluge of data produced by the digital world, information is increasingly processed by machines rather than humans. Extracting knowledge from data automatically, discovering structure in high-dimensional data, and learning good representations of such data are among the great challenges facing machine learning, statistics, and artificial intelligence in the coming decades. Arguably, they may also be among the great challenges for mathematics, particularly geometry and applied mathematics.

While machine learning researchers have long assumed that representations of the data were given, and concentrated on learning classifiers and predictors from these representations, an increasingly large number of researchers are devising methods that automatically learn good representations of data. Perceptual signals are often best represented by a multi-stage hierarchy in which features in successive stages are increasingly global, abstract, and invariant to irrelevant transformations of the input. Devising learning algorithms for such multi-stage architectures has come to be known as the "deep learning problem". Deep learning poses many unsolved mathematical puzzles, such as how to shape high-dimensional parameterized surfaces whose level sets distinguish regions of high data density from regions of low data density.

I will describe a number of unsupervised methods for learning representations of images. They are based on learning over-complete dictionaries for sparse decoders and sparse auto-encoders. Invariance of the representations to small variations of the input, similar to complex cells in the visual cortex, is obtained through group sparsity, lateral inhibition, or temporal constancy criteria.
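As a rough illustration of the sparse-coding idea behind these methods (not the specific algorithms covered in the talk), the following numpy sketch infers sparse codes over an over-complete dictionary with ISTA and takes a simple gradient step on the dictionary; the function names and hyper-parameters are invented for the example.

    import numpy as np

    def ista(x, D, lam=0.1, n_iter=100):
        """Infer a sparse code z minimizing 0.5*||x - D z||^2 + lam*||z||_1 via ISTA."""
        L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the quadratic term
        z = np.zeros(D.shape[1])
        for _ in range(n_iter):
            grad = D.T @ (D @ z - x)               # gradient of the reconstruction error
            z = z - grad / L
            z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
        return z

    def dictionary_epoch(X, D, lam=0.1, lr=0.01):
        """One pass of dictionary learning: sparse-code each sample, then update D."""
        for x in X:
            z = ista(x, D, lam)
            residual = x - D @ z
            D += lr * np.outer(residual, z)        # gradient step on the reconstruction error
            D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)  # unit-norm atoms
        return D

    # Toy usage: 64-dimensional patches, 128 dictionary atoms (2x over-complete).
    rng = np.random.default_rng(0)
    D = rng.standard_normal((64, 128))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    X = rng.standard_normal((500, 64))
    D = dictionary_epoch(X, D)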

The methods have been used to train convolutional networks (ConvNets). ConvNets are biologically-inspired architectures consisting of multiple stages of filter banks interspersed with non-linear, spatial pooling, and contrast normalization operations; a sketch of one such stage follows below. Several applications will be shown through videos and live demos, including a pedestrian detector, a category-level object recognition system that can be trained on the fly, and a "scene parsing" system that can label every pixel in an image with the category of the object it belongs to.
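To make the stage structure concrete, here is a minimal PyTorch sketch of one ConvNet stage; the 7x7 filters, tanh non-linearity, max pooling, and the use of local response normalization in place of contrast normalization are illustrative assumptions, not the exact architecture behind the demos.

    import torch
    import torch.nn as nn

    class ConvNetStage(nn.Module):
        """One stage: filter bank -> non-linearity -> spatial pooling -> normalization."""
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.filter_bank = nn.Conv2d(in_channels, out_channels, kernel_size=7, padding=3)
            self.nonlinearity = nn.Tanh()
            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
            # Local response normalization stands in for contrast normalization here.
            self.normalize = nn.LocalResponseNorm(size=5)

        def forward(self, x):
            return self.normalize(self.pool(self.nonlinearity(self.filter_bank(x))))

    # A small three-stage feature extractor followed by a linear classifier.
    model = nn.Sequential(
        ConvNetStage(3, 32),
        ConvNetStage(32, 64),
        ConvNetStage(64, 128),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, 10),                      # e.g., 10 object categories
    )

    scores = model(torch.randn(1, 3, 96, 96))    # one 96x96 RGB image -> class scores
    print(scores.shape)                          # torch.Size([1, 10])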