tabak

A central problem in data mining is the assignment of a joint probability distribution to a set of variables, given a sample of independent joint observations. With such a distribution in hand, one can answer all kinds of questions about the variables. In particular, one can diagnose the state of one variable when the others are observed, a problem of great relevance in medicine (and in many other fields).

This talk will describe a new methodology to perform such an assignment, developed in the context of diagnosing the state of a transplanted heart through the observation of the patient's gene expression in a microarray.

The central algorithm proposed performs this assignment by mapping the original variables onto a jointly Gaussian set, whose components can be made independent through a principal component analysis. The map is built iteratively, through a series of steps that normalize the marginal distributions along a random set of orthogonal directions. These can be thought of as "mixing" steps that, paradoxically, reveal detailed information in the data by mapping them to a Gaussian soup that has maximal entropy, and hence no information left at all.
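
The iteration described above can be illustrated with a minimal sketch. It is not the talk's actual algorithm: it assumes a simple rank-based normalization of each marginal (empirical quantiles mapped to standard normal scores), draws the orthogonal directions by Gram-Schmidt on Gaussian vectors, and transforms only the given sample without recording the map needed to evaluate a density at new points. The names random_orthogonal, gaussianize_marginal, gaussianize, n_steps and seed are hypothetical, chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def random_orthogonal(d, rng):
    """Return d random orthonormal directions (Gram-Schmidt on Gaussian vectors)."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the directions already chosen.
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # discard a (very unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def gaussianize_marginal(values):
    """Map a 1-D sample to standard normal scores by rank (assumed normalization)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf is defined.
        out[i] = std.inv_cdf((rank + 0.5) / n)
    return out


def gaussianize(data, n_steps=50, seed=0):
    """Alternate random rotations with marginal normalizations of the sample."""
    rng = random.Random(seed)
    d = len(data[0])
    x = [list(row) for row in data]
    for _ in range(n_steps):
        q = random_orthogonal(d, rng)
        # Project every observation onto the random orthogonal directions.
        x = [[sum(qk[j] * row[j] for j in range(d)) for qk in q] for row in x]
        # Normalize the marginal distribution along each direction.
        cols = [gaussianize_marginal([row[k] for row in x]) for k in range(d)]
        x = [[cols[k][i] for k in range(d)] for i in range(len(x))]
    return x
```

In this sketch each step leaves every marginal along the chosen directions with standard normal scores; repeating the step over many random rotations is what is meant to drive the joint sample toward a Gaussian. A density estimate would additionally require storing each step's rotation and marginal map, together with their Jacobians, which the sketch omits.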