In statistical pattern recognition, the data is considered a set of observations drawn from a parametric probability distribution [Fuk72,DH73]. In a two-stage process, the parameters of the relevant distributions are *learned* and later applied to *predict* the behavior or origin of a new observation. In Maximum Likelihood (ML) estimation, the parameters $\theta$ are chosen such that the probability of the observed samples $X = \{x_1, \ldots, x_n\}$ is maximized:

$$ \hat{\theta}_{\mathrm{ML}} = \operatorname*{argmax}_{\theta}\; p(X \mid \theta). $$
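
To make this concrete, here is a minimal sketch (hypothetical data, NumPy only) of ML estimation for a single multivariate Gaussian, where the maximizing parameters have a closed form: the sample mean and the $1/n$ sample covariance.

```python
import numpy as np

# Hypothetical data: n = 500 observations in d = 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=(500, 3))

# For a multivariate Gaussian the ML estimates are known in
# closed form: the sample mean and the (1/n) sample covariance.
mu_ml = X.mean(axis=0)
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)
```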

Assuming that all samples are pairwise independent yields

$$ p(X \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta). $$
Since each sample is drawn from a mixture distribution [EH81,Pri94] we have

$$ p(x_i \mid \theta) = \sum_{j=1}^{k} P(C_j)\, p(x_i \mid C_j, \theta_j), $$

where $P(C_j)$ is the prior probability of cluster $C_j$.
The cluster-conditional probability may be assumed to be a multivariate Gaussian, which is defined as follows:

$$ p(x \mid C_j, \theta_j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_j)^{\top} \Sigma_j^{-1} (x - \mu_j) \right) $$

with parameters $\theta_j = (\mu_j, \Sigma_j)$.
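
As an illustration, the mixture density can be evaluated directly from these two formulas; the following sketch uses hypothetical weights, means, and covariances, with `scipy.stats.multivariate_normal` supplying the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in d = 2 dimensions.
weights = np.array([0.3, 0.7])               # P(C_j), must sum to 1
means = [np.zeros(2), np.array([3.0, 3.0])]  # mu_j
covs = [np.eye(2), 2.0 * np.eye(2)]          # Sigma_j

def mixture_pdf(x):
    # p(x | theta) = sum_j P(C_j) * N(x | mu_j, Sigma_j)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_pdf(np.array([1.0, 1.0])))
```
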
If there is domain knowledge about, or a desired behavior of, the parameter $\theta$'s distribution, Bayesian learning should be used instead of ML estimation.

Again, we expand using Bayes' rule:

$$ p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}. $$
The prior distribution $p(\theta)$ can be known from the domain or estimated, the likelihood $p(X \mid \theta)$ is the same quantity that is maximized in the ML estimate described above, and the evidence $p(X)$ can be ignored for optimization purposes since it is constant with respect to $\theta$.
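
A minimal sketch of this idea, assuming the textbook conjugate case (all numbers hypothetical): a Gaussian likelihood with known variance and a Gaussian prior on the mean, where the MAP estimate shrinks the ML estimate toward the prior mean.

```python
import numpy as np

# Hypothetical 1-d example: Gaussian likelihood with known variance
# sigma2 and a conjugate Gaussian prior N(mu0, tau2) on the mean theta.
x = np.array([2.1, 1.9, 2.4, 2.2])
sigma2 = 1.0           # known observation variance
mu0, tau2 = 0.0, 0.5   # prior mean and variance

n, xbar = len(x), x.mean()
theta_ml = xbar
# The MAP estimate is a precision-weighted compromise between the
# prior mean and the ML estimate (the sample mean).
theta_map = (n * tau2 * xbar + sigma2 * mu0) / (n * tau2 + sigma2)
print(theta_ml, theta_map)
```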

The learned distributions can now be used for categorization and prediction of a sample's cluster label. The Bayes classifier is optimal in terms of prediction error, assuming that the distribution of the data is known precisely:

$$ C^{*}(x) = \operatorname*{argmax}_{C_j} P(C_j \mid x) = \operatorname*{argmax}_{C_j} P(C_j)\, p(x \mid C_j). $$
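
A sketch of the resulting decision rule, reusing hypothetical mixture parameters as above; since $p(x)$ is the same for every cluster, maximizing $P(C_j)\,p(x \mid C_j)$ suffices.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical cluster models: priors P(C_j), means, covariances.
weights = np.array([0.3, 0.7])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def bayes_classify(x):
    # argmax_j P(C_j) * p(x | C_j); the evidence p(x) cancels out.
    scores = [w * multivariate_normal.pdf(x, mean=m, cov=c)
              for w, m, c in zip(weights, means, covs)]
    return int(np.argmax(scores))

print(bayes_classify(np.array([2.5, 2.0])))  # index of the winning cluster
```
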
Often, using the log-likelihood (equation 2.15) instead of the actual probability values has advantages for optimization (e.g., the negative log-likelihood is convex for many common model families, and products of very small probabilities, which are problematic for fixed-precision numerics, are avoided).
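
The following sketch (hypothetical parameters again) shows the standard log-domain evaluation: per-component log-densities are combined with `logsumexp`, so no intermediate product can underflow.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Hypothetical mixture parameters, as before.
weights = np.array([0.3, 0.7])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def log_likelihood(X):
    # log p(x_i) = logsumexp_j [log P(C_j) + log N(x_i | mu_j, Sigma_j)];
    # working in the log domain avoids underflowing products.
    per_comp = np.stack([np.log(w) + multivariate_normal.logpdf(X, m, c)
                         for w, m, c in zip(weights, means, covs)])
    return logsumexp(per_comp, axis=0).sum()

X = np.random.default_rng(1).normal(size=(1000, 2))
print(log_likelihood(X))
```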

The theory behind statistical models is very well understood, and explicit error bounds can be computed. Statistical formulations are particularly advantageous for soft clustering problems with a moderate number of dimensions $d$. The very powerful Expectation-Maximization (EM) algorithm [DLR77] has been applied to $k$-means [FRB98]. However, these parametric models tend to impose structure on the data that may not be there; the selected distribution family may not really be appropriate. In fact, high-dimensional data as found in data mining is often distributed in a strongly non-Gaussian fashion. Also, the number of parameters increases rapidly with $d$, so the estimation problem becomes more and more ill-posed. Non-parametric models, like $k$-nearest-neighbor, have been found preferable in many tasks where a lot of data is available.
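
For reference, a bare-bones EM sketch for the Gaussian mixture model above (hypothetical initialization, fixed iteration count, a small ridge on the covariances for numerical stability); it illustrates the E/M alternation rather than a robust implementation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """Bare-bones EM for a Gaussian mixture (sketch, not robust)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Crude initialization: k random samples as means, shared covariance.
    mu = X[rng.choice(n, size=k, replace=False)]
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r_ij = P(C_j | x_i), in the log domain.
        log_r = np.stack([np.log(w[j]) +
                          multivariate_normal.logpdf(X, mu[j], cov[j])
                          for j in range(k)], axis=1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: responsibility-weighted ML re-estimates.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / nk[j] \
                     + 1e-6 * np.eye(d)
    return w, mu, cov

# Example usage with hypothetical two-cluster data:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(4.0, 1.0, (200, 2))])
w, mu, cov = em_gmm(X, k=2)
```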