Learning Representations of Images and High-Dimensional Data
Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, NYU
Abstract:
One of the big mysteries in neuroscience is how the brain manages to learn good representations of the perceptual world. How does it discover the underlying structure of the high-dimensional signals coming from the eyes and ears? How does it identify and disentangle the explanatory factors of variation in natural images? How does it learn to extract relevant features while eliminating irrelevant variabilities in visual percepts so as to detect, locate, and recognize objects in images?
These questions go beyond neuroscience: with the deluge of data produced by the digital world, information is increasingly processed by machines, not humans. Extracting knowledge from data automatically, discovering structure in high-dimensional data, and learning good representations of such data are among the great challenges facing machine learning, statistics, and artificial intelligence in the coming decades. Arguably, they may also be among the great challenges for mathematics, particularly geometry and applied mathematics.
While machine learning researchers have long assumed that representations of the data were given, and have concentrated on learning classifiers and predictors from these representations, an increasingly large number of researchers are devising methods that automatically learn good representations of the data. Perceptual signals are often best represented by a multi-stage hierarchy in which features in successive stages are increasingly global, abstract, and invariant to irrelevant transformations of the input. Devising learning algorithms for such multi-stage architectures has come to be known as the "deep learning problem". Deep learning poses many unsolved mathematical puzzles, such as how to shape high-dimensional parameterized surfaces whose level sets distinguish regions of high data density from regions of low data density.
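As a rough illustration of that last formulation (a toy sketch of my own choosing, not a method presented in the talk), one can train a small parameterized energy surface E(x; theta) to be low on observed data points and high on broadly scattered contrastive points, using a simple logistic discrimination loss; the level sets of E then separate the high-density region from the rest of the space. The network architecture, the noise distribution, and the loss below are all illustrative assumptions.

```python
# Toy sketch: shape a scalar surface E(x; theta) so that its level sets
# separate high data density from low data density. Illustrative only.
import torch

# Small MLP energy function E: R^2 -> R (architecture chosen arbitrarily).
energy = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
opt = torch.optim.Adam(energy.parameters(), lr=1e-2)

data = torch.randn(512, 2) * 0.5 + 2.0    # samples from the "true" density
noise = torch.rand(512, 2) * 8.0 - 2.0    # broad contrastive samples covering the plane

for step in range(200):
    # Logistic loss: low energy should mean "data", high energy "noise".
    logits = torch.cat([-energy(data), -energy(noise)]).squeeze(-1)
    labels = torch.cat([torch.ones(len(data)), torch.zeros(len(noise))])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, energy(x) is low near the data cluster and high elsewhere,
# so a level set {x : E(x) <= c} roughly encloses the high-density region.
```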
I will describe a number of unsupervised methods for learning representations of images. They are based on learning over-complete dictionaries for sparse decoders and sparse auto-encoders. Invariance of the representations to small variations of the input, similar to that of complex cells in the visual cortex, is obtained through group sparsity, lateral inhibition, or temporal constancy criteria.
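To make the dictionary-learning idea concrete, here is a minimal sketch of sparse coding with an over-complete dictionary: codes are inferred with ISTA under an L1 penalty, and the dictionary columns are updated by gradient steps on the reconstruction error. The patch size, dictionary size, penalty weight, and optimization schedule are arbitrary choices for the example, not those used in the methods above, and the sketch omits the invariance (group sparsity / lateral inhibition / temporal constancy) terms.

```python
# Minimal sparse coding sketch: over-complete dictionary + L1-penalized codes.
import numpy as np

def soft_threshold(z, t):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_codes(x, D, lam=0.1, n_steps=100):
    """Infer a sparse code z minimizing 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x)              # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z

def dictionary_update(X, Z, D, lr=0.01):
    """One gradient step on the dictionary, followed by column normalization."""
    R = X - Z @ D.T                            # reconstruction residuals, one sample per row
    D += lr * (R.T @ Z) / X.shape[0]
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)
    return D

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))             # toy "image patches" (64-dim)
D = rng.standard_normal((64, 128))             # over-complete: 128 atoms for 64 dims
D /= np.linalg.norm(D, axis=0, keepdims=True)

for epoch in range(5):
    Z = np.stack([ista_codes(x, D) for x in X])  # infer sparse codes for all patches
    D = dictionary_update(X, Z, D)               # then improve the dictionary
```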
The methods have been used to train convolutional networks (ConvNets). ConvNets are biologically-inspired architectures consisting of multiple stages of filter banks, interspersed with non-linear operations, spatial pooling, and contrast normalization operations. Several applications will be shown through videos and live demos, including a pedestrian detector, a category-level object recognition system that can be trained on the fly, and a "scene parsing" system that can label every pixel in an image with the category of the object it belongs to.
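One such stage can be sketched as follows (a minimal PyTorch illustration; the kernel sizes, tanh non-linearity, max pooling, and local response normalization as a stand-in for contrast normalization are assumptions for the example, not the exact operations used in the systems demoed).

```python
# Minimal sketch of one ConvNet stage: filter bank -> non-linearity ->
# spatial pooling -> contrast normalization (illustrative sizes only).
import torch
import torch.nn as nn

class ConvNetStage(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.filter_bank = nn.Conv2d(in_channels, out_channels,
                                     kernel_size=7, padding=3)    # learned filter bank
        self.nonlinearity = nn.Tanh()                             # point-wise non-linear op
        self.pooling = nn.MaxPool2d(kernel_size=2, stride=2)      # spatial pooling
        self.contrast_norm = nn.LocalResponseNorm(size=5)         # contrast-normalization stand-in

    def forward(self, x):
        return self.contrast_norm(self.pooling(self.nonlinearity(self.filter_bank(x))))

# Stacking a few stages and a linear classifier gives a small ConvNet.
model = nn.Sequential(
    ConvNetStage(3, 32),
    ConvNetStage(32, 64),
    nn.Flatten(),
    nn.LazyLinear(10),                         # e.g. 10 object categories
)
scores = model(torch.randn(1, 3, 96, 96))      # one 96x96 RGB image -> 10 category scores
```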