An underutilized statistic: the Euclidean distance (instead of chi-square)
Mark Tygert, CIMS
Abstract:
A basic task in statistics is to ascertain whether a given model agrees with a set of independent and identically distributed experiments suspected of being draws from the model probability distribution. This task is known as testing "goodness of fit." When there are only finitely many possible values that the draws can take, the canonical approach to this task is the chi-squared test (together with related, asymptotically equivalent variations, such as the likelihood-ratio, "G," or power-divergence tests). The chi-squared test is based on a root-mean-square difference between the model probability distribution and the empirical distribution estimated from the experimental draws, with the weights in the (weighted) average underlying the root-mean-square being the reciprocals of the model probabilities. Thus, the canonical approach via chi-squared involves dividing by the model probabilities. This can be a bad idea when the model probabilities are small (and is especially bad when many of them are small). Unsurprisingly, dividing by nearly zero causes severe trouble even in the absence of round-off errors.

Fortunately, with the now widespread availability of computers, it is no longer necessary to divide by small numbers in order to simplify the computation of statistical significance; a more useful measure of the size of the discrepancy between the empirical and model distributions is simply their standard root-mean-square difference (using the usual, uniformly weighted average in the root-mean-square). Using chi-squared alone should be deprecated; the unadulterated Euclidean metric is more powerful and easier to use. This is joint work with Will Perkins and Rachel Ward; technical details are available at http://cims.nyu.edu/~tygert/rms.pdf
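To make the contrast concrete, the following is a minimal sketch in Python with numpy; the language, the function names, the toy model, and the brute-force Monte Carlo computation of significance are illustrative choices of this writeup, not taken from the talk or the linked paper. It computes the classical chi-squared statistic (whose division by the model probabilities is the weighting the abstract warns about), the plain root-mean-square statistic, and a simulated p-value for the latter.

import numpy as np

def chisq_stat(counts, model):
    # Classical chi-squared: sum over bins of (observed - expected)^2 / expected.
    # The division by the expected counts blows up when model probabilities are tiny.
    n = counts.sum()
    expected = n * model
    return np.sum((counts - expected) ** 2 / expected)

def rms_stat(counts, model):
    # Plain root-mean-square difference between the empirical and model
    # distributions, using the usual, uniformly weighted average.
    empirical = counts / counts.sum()
    return np.sqrt(np.mean((empirical - model) ** 2))

def rms_pvalue(counts, model, trials=100_000, rng=None):
    # Monte Carlo p-value: simulate draws from the model itself and count how
    # often the simulated RMS statistic meets or exceeds the observed one.
    rng = np.random.default_rng() if rng is None else rng
    n = counts.sum()
    observed = rms_stat(counts, model)
    sims = rng.multinomial(n, model, size=trials)
    stats = np.sqrt(np.mean((sims / n - model) ** 2, axis=1))
    return np.mean(stats >= observed)

# Toy usage: a model with many tiny probabilities, where chi-squared divides
# by nearly zero but the RMS statistic does not.
model = np.array([0.49, 0.49] + [0.002] * 10)
rng = np.random.default_rng(0)
counts = rng.multinomial(200, model)
print("chi-squared:", chisq_stat(counts, model))
print("rms:", rms_stat(counts, model))
print("rms p-value:", rms_pvalue(counts, model, rng=rng))

Monte Carlo simulation is only the simplest stand-in for the machine computation of statistical significance that the abstract alludes to; the linked paper gives the technical details of how significance for the root-mean-square statistic is actually calibrated.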