Obtaining information from data by
mixing them well
Esteban Tabak, CIMS
A central problem
in data mining is the assignment of a joint-probability distribution to
a set of variables, given a sample of independent joint
observations. With such a distribution in hand, one can answer all kinds
of questions about the variables. In particular, one can diagnose the
state of one variable when the others are observed, a problem of
great relevance in medicine (and in many other fields).
This talk will describe a new methodology to perform such an assignment,
developed in the context of diagnosing the state of a transplanted
heart through the observation of the patient's gene expression in a
microarray.
The central algorithm proposed performs this assignment by mapping the
original variables onto a jointly-Gaussian set, which can be made
independent through a principal component analysis. The map is built
iteratively, through a series of steps that normalize the
marginal distributions along a random set of orthogonal directions.
These can be thought of as "mixing" steps that paradoxically reveal
detailed information in the data by mapping them to a Gaussian soup
that has maximal entropy, and hence no information left at all.
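
To make the iteration concrete, here is a minimal sketch of the kind of
"mixing" loop described above. It is an illustration under stated
assumptions, not the algorithm presented in the talk: the function names
(mix_to_gaussian, gaussianize_marginal), the number of steps, and the
rank-based (empirical CDF) normalization of each marginal are all
hypothetical choices made here for simplicity. The sketch only transforms
the sample; it does not record the map, so it cannot evaluate a density
at new points, and the final principal component analysis is left out.

```python
import numpy as np
from statistics import NormalDist


def gaussianize_marginal(column, gaussian_quantiles):
    """Map one coordinate to standard-normal scores through its ranks.

    Assumption: the marginal is normalized with the empirical CDF
    (a rank transform), chosen here only for illustration.
    """
    ranks = np.argsort(np.argsort(column))  # rank 0..n-1 of each sample
    return gaussian_quantiles[ranks]


def mix_to_gaussian(data, n_steps=50, seed=0):
    """Alternate random orthogonal rotations with marginal normalization."""
    x = np.asarray(data, dtype=float).copy()
    n, d = x.shape
    rng = np.random.default_rng(seed)

    # Standard-normal quantiles at the mid-rank probabilities (k + 0.5) / n.
    std_normal = NormalDist()
    quantiles = np.array(
        [std_normal.inv_cdf((k + 0.5) / n) for k in range(n)]
    )

    for _ in range(n_steps):
        # Orthogonal matrix from the QR factorization of a Gaussian matrix;
        # its columns are the orthogonal directions for this step.
        q, _r = np.linalg.qr(rng.standard_normal((d, d)))
        x = x @ q  # coordinates of the samples along those directions
        for j in range(d):
            x[:, j] = gaussianize_marginal(x[:, j], quantiles)
    return x


if __name__ == "__main__":
    # Strongly non-Gaussian toy data: a noisy parabola.
    rng = np.random.default_rng(1)
    t = rng.standard_normal(1000)
    sample = np.column_stack([t, t**2 + 0.3 * rng.standard_normal(1000)])
    mixed = mix_to_gaussian(sample)
    print(np.cov(mixed, rowvar=False))
```

After each step the marginals along the current directions are Gaussian by
construction; whether the joint distribution approaches a Gaussian depends
on repeating the step over many different sets of directions, which is
what the random rotations are meant to provide.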