
## Mixture Density Estimation

In statistical pattern recognition, the data is considered a set of observations drawn from a parametric probability distribution [Fuk72,DH73]. In a two-stage process, the parameters of the relevant distributions are learned first and then applied to predict the behavior or origin of a new observation. In Maximum Likelihood (ML) estimation, the parameters are chosen such that the probability of the observed samples is maximized.

$$\hat{\Theta} = \arg\max_{\Theta} \; p(x_1, \ldots, x_n \mid \Theta) \tag{2.8}$$

Assuming that all samples are pairwise independent yields

$$p(x_1, \ldots, x_n \mid \Theta) = \prod_{j=1}^{n} p(x_j \mid \Theta) \tag{2.9}$$
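As a concrete illustration of the independence assumption, the following sketch evaluates the joint likelihood as a product of per-sample densities and finds the ML parameter by grid search. The unit-variance Gaussian model, the sample values, and the grid are hypothetical choices, not taken from the text; for this model the ML estimate of the mean coincides with the sample mean.

```python
import numpy as np

# Hypothetical 1-d example: i.i.d. samples from a unit-variance Gaussian
# with unknown mean mu.
def likelihood(samples, mu):
    # Independence (eq. 2.9): the joint probability is the product of
    # the per-sample densities.
    densities = np.exp(-0.5 * (samples - mu) ** 2) / np.sqrt(2 * np.pi)
    return np.prod(densities)

samples = np.array([0.9, 1.1, 1.3, 0.7])

# ML estimation (eq. 2.8): pick the parameter value maximizing the
# likelihood; a coarse grid search suffices for the illustration.
candidates = np.linspace(-2.0, 4.0, 601)
mu_ml = candidates[np.argmax([likelihood(samples, m) for m in candidates])]
# For a Gaussian with known variance, this equals the sample mean (1.0 here).
```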

Since each sample is drawn from a mixture distribution [EH81,Pri94], we have

$$p(x \mid \Theta) = \sum_{\ell=1}^{k} P(\ell) \, p(x \mid \ell, \Theta) \tag{2.10}$$

The cluster-conditional density may be assumed to be a multivariate Gaussian, which is defined as follows:

$$p(x \mid \ell, \Theta) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_\ell|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_\ell)^{\top} \Sigma_\ell^{-1} (x - \mu_\ell) \right) \tag{2.11}$$
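A direct transcription of the multivariate Gaussian density can be sketched as follows; the mean vector and covariance matrix below are placeholder values for illustration.

```python
import numpy as np

# Multivariate Gaussian density (eq. 2.11) for one cluster with mean mu
# and covariance Sigma.
def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    # Normalization constant 1 / ((2*pi)^(d/2) * |Sigma|^(1/2)).
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # Quadratic form diff^T Sigma^{-1} diff via a linear solve.
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Placeholder values: standard normal in 2 dimensions, evaluated at its mode.
x = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
p = gaussian_pdf(x, mu, Sigma)  # peak density is 1/(2*pi) in this case
```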

If domain knowledge about the parameters' distribution is available, or a particular behavior of the parameters is desired, Bayesian learning should be used instead of ML estimation.

$$\hat{\Theta} = \arg\max_{\Theta} \; p(\Theta \mid x_1, \ldots, x_n) \tag{2.12}$$

Again, we expand using Bayes' rule.

$$p(\Theta \mid x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_n \mid \Theta) \, p(\Theta)}{p(x_1, \ldots, x_n)} \tag{2.13}$$

The prior distribution of the parameters can be known from the domain or estimated, the likelihood term is the ML quantity described above, and the denominator can be ignored for optimization purposes, since it is constant with respect to the parameters.
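This can be made concrete with a small maximum-a-posteriori sketch. The model below is an illustrative assumption, not taken from the text: a unit-variance Gaussian likelihood with a Gaussian prior on the mean, for which the posterior mode has a closed form; the evidence term is dropped as noted above.

```python
import numpy as np

# Hypothetical MAP example: unit-variance Gaussian likelihood, Gaussian
# prior N(mu0, tau2) on the mean. Posterior ~ likelihood * prior; the
# evidence is constant in the parameter and is dropped.
def map_mean(samples, mu0, tau2):
    n = len(samples)
    # Posterior mode of the mean: precision-weighted average of the
    # prior mean and the data.
    return (mu0 / tau2 + samples.sum()) / (1.0 / tau2 + n)

samples = np.array([2.0, 2.0, 2.0, 2.0])
# The estimate is shrunk from the sample mean 2.0 toward the prior mean 0.0.
est = map_mean(samples, mu0=0.0, tau2=1.0)
```

With only a few samples the prior pulls the estimate noticeably; as more data arrives, the likelihood dominates and the MAP estimate approaches the ML estimate.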

The learned distributions can now be used for categorization and prediction of a sample's cluster label. The Bayes classifier is optimal in terms of prediction error, assuming that the distribution of the data is known precisely:

$$\hat{\ell}(x) = \arg\max_{\ell} \; P(\ell \mid x) = \arg\max_{\ell} \; P(\ell) \, p(x \mid \ell, \Theta) \tag{2.14}$$
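The Bayes classifier can be sketched directly from the learned distributions: assign a sample to the label maximizing prior times cluster-conditional density. The two unit-variance 1-d Gaussian clusters below are hypothetical, chosen only to make the decision rule concrete.

```python
import numpy as np

# Bayes classifier (eq. 2.14): argmax over labels of P(l) * p(x | l).
# Assumed model: two 1-d Gaussian clusters with unit variance.
def bayes_classify(x, priors, means):
    densities = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)
    return int(np.argmax(priors * densities))

priors = np.array([0.5, 0.5])   # equal cluster priors P(l)
means = np.array([-1.0, 1.0])   # cluster means

label = bayes_classify(0.8, priors, means)   # nearer the second cluster
```

With equal priors and equal variances, the rule reduces to nearest-mean assignment; unequal priors shift the decision boundary toward the less likely cluster.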

Often, using the log-likelihood (equation 2.15) instead of the actual probability values has advantages for optimization: among others, products of very small probabilities, which are problematic for fixed-precision arithmetic, are avoided.

$$\log p(x_1, \ldots, x_n \mid \Theta) = \sum_{j=1}^{n} \log p(x_j \mid \Theta) \tag{2.15}$$
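The numerical advantage is easy to demonstrate. In the sketch below (illustrative values only), a product of many small per-sample probabilities underflows to zero in double precision, while the equivalent sum of logarithms remains finite and usable.

```python
import numpy as np

# 1000 per-sample probabilities of 1e-5 each (illustrative values).
probs = np.full(1000, 1e-5)

# Direct product (eq. 2.9): 1e-5000 is far below the smallest double,
# so the result underflows to exactly 0.0.
direct = np.prod(probs)

# Log-likelihood (eq. 2.15): the sum of logs stays representable.
log_lik = np.sum(np.log(probs))  # = 1000 * ln(1e-5), about -11513
```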

The theory behind statistical models is very well understood, and explicit computation of error bounds is advantageous. Statistical formulations are advantageous for soft clustering problems with a moderate number of dimensions $d$. The very powerful Expectation-Maximization (EM) algorithm [DLR77] has been applied to $k$-means [FRB98]. However, these parametric models tend to impose structure on the data that may not be there, and the selected distribution family may not be appropriate. In fact, high-dimensional data as found in data mining is often distributed in a strongly non-Gaussian fashion. Also, the number of parameters increases rapidly with $d$, so that the estimation problem becomes more and more ill-posed. Non-parametric models, like $k$-nearest-neighbor, have been found preferable in many tasks where a lot of data is available.
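The EM algorithm mentioned above can be sketched compactly for the mixture model of equations 2.9–2.11. The version below is a deliberately minimal illustration, not the formulation of [DLR77] in full generality: a two-component 1-d Gaussian mixture with the variances fixed to one, so only the mixing weights and means are re-estimated.

```python
import numpy as np

# Minimal EM sketch for a two-component 1-d Gaussian mixture with fixed
# unit variances (illustrative simplification).
def em_mixture(x, mu, n_iter=50):
    pi = np.array([0.5, 0.5])                     # mixing weights P(l)
    for _ in range(n_iter):
        # E-step: posterior responsibilities P(l | x_j) via Bayes' rule.
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and cluster means.
        pi = resp.mean(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return pi, mu

# Synthetic data: two well-separated components (illustrative values).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu = em_mixture(x, mu=np.array([-1.0, 1.0]))
# mu converges near the true component means (-2 and 3); pi near (0.5, 0.5).
```

The soft E-step responsibilities are what distinguish this from hard $k$-means assignment; replacing them with 0/1 indicators recovers a $k$-means-like procedure.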

Alexander Strehl 2002-05-03