In statistical pattern recognition, the data is considered a set of observations drawn from a parametric probability distribution [Fuk72,DH73]. In a two-stage process, the parameters of the relevant distributions are *learned* and later applied to *predict* the behavior or origin of a new observation. In Maximum Likelihood (ML) estimation, the parameters $\theta$ are chosen such that the probability of the observed samples $X = \{x_1, \ldots, x_n\}$ is maximized:

$$ \hat{\theta}_{\mathrm{ML}} = \operatorname*{argmax}_{\theta}\; p(X \mid \theta). $$
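
To make this concrete, here is a minimal sketch (hypothetical data, NumPy only) of ML estimation for a single multivariate Gaussian, where the maximizing parameters have a closed form: the sample mean and the $1/n$ sample covariance.

```python
import numpy as np

# Hypothetical data: n = 500 observations in d = 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=(500, 3))

# For a multivariate Gaussian the ML estimates are known in
# closed form: the sample mean and the (1/n) sample covariance.
mu_ml = X.mean(axis=0)
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)
```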

Assuming that all samples are pairwise independent yields

$$ p(X \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta). $$
Since each sample is drawn from a mixture distribution [EH81,Pri94] we have

$$ p(x_i \mid \theta) = \sum_{j=1}^{k} P(C_j)\, p(x_i \mid C_j, \theta_j), $$

where $P(C_j)$ is the prior probability of cluster $C_j$.
The cluster-conditional probability may be assumed to be a multivariate Gaussian, which is defined as follows:

$$ p(x \mid C_j, \theta_j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_j)^{\top} \Sigma_j^{-1} (x - \mu_j) \right) $$

with parameters $\theta_j = (\mu_j, \Sigma_j)$.
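
As an illustration, the mixture density can be evaluated directly from these two formulas; the following sketch uses hypothetical weights, means, and covariances, with `scipy.stats.multivariate_normal` supplying the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in d = 2 dimensions.
weights = np.array([0.3, 0.7])               # P(C_j), must sum to 1
means = [np.zeros(2), np.array([3.0, 3.0])]  # mu_j
covs = [np.eye(2), 2.0 * np.eye(2)]          # Sigma_j

def mixture_pdf(x):
    # p(x | theta) = sum_j P(C_j) * N(x | mu_j, Sigma_j)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_pdf(np.array([1.0, 1.0])))
```
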
If there is domain knowledge about, or a desired behavior of, the parameter $\theta$'s distribution, Bayesian learning should be used instead of ML estimation.

Again, we expand using Bayes' rule:

$$ p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}. $$
The prior distribution $p(\theta)$ can be known from the domain or estimated, the likelihood $p(X \mid \theta)$ is the same quantity that is maximized in the ML estimate described above, and the evidence $p(X)$ can be ignored for optimization purposes since it is constant with respect to $\theta$.
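
A minimal sketch of this idea, assuming the textbook conjugate case (all numbers hypothetical): a Gaussian likelihood with known variance and a Gaussian prior on the mean, where the MAP estimate shrinks the ML estimate toward the prior mean.

```python
import numpy as np

# Hypothetical 1-d example: Gaussian likelihood with known variance
# sigma2 and a conjugate Gaussian prior N(mu0, tau2) on the mean theta.
x = np.array([2.1, 1.9, 2.4, 2.2])
sigma2 = 1.0           # known observation variance
mu0, tau2 = 0.0, 0.5   # prior mean and variance

n, xbar = len(x), x.mean()
theta_ml = xbar
# The MAP estimate is a precision-weighted compromise between the
# prior mean and the ML estimate (the sample mean).
theta_map = (n * tau2 * xbar + sigma2 * mu0) / (n * tau2 + sigma2)
print(theta_ml, theta_map)
```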

The learned distributions can now be used for categorization and prediction of a sample's cluster label. The Bayes classifier is optimal in terms of prediction error, assuming that the distribution of the data is known precisely:

$$ C^{*}(x) = \operatorname*{argmax}_{C_j} P(C_j \mid x) = \operatorname*{argmax}_{C_j} P(C_j)\, p(x \mid C_j). $$
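
A sketch of the resulting decision rule, reusing hypothetical mixture parameters as above; since $p(x)$ is the same for every cluster, maximizing $P(C_j)\,p(x \mid C_j)$ suffices.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical cluster models: priors P(C_j), means, covariances.
weights = np.array([0.3, 0.7])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def bayes_classify(x):
    # argmax_j P(C_j) * p(x | C_j); the evidence p(x) cancels out.
    scores = [w * multivariate_normal.pdf(x, mean=m, cov=c)
              for w, m, c in zip(weights, means, covs)]
    return int(np.argmax(scores))

print(bayes_classify(np.array([2.5, 2.0])))  # index of the winning cluster
```
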
Often, using the log-likelihood (equation 2.15) instead of the actual probability values has advantages for optimization (e.g., the negative log-likelihood is convex for many common model families, and products of very small probabilities, which are problematic for fixed-precision numerics, are avoided).
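
The following sketch (hypothetical parameters again) shows the standard log-domain evaluation: per-component log-densities are combined with `logsumexp`, so no intermediate product can underflow.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Hypothetical mixture parameters, as before.
weights = np.array([0.3, 0.7])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def log_likelihood(X):
    # log p(x_i) = logsumexp_j [log P(C_j) + log N(x_i | mu_j, Sigma_j)];
    # working in the log domain avoids underflowing products.
    per_comp = np.stack([np.log(w) + multivariate_normal.logpdf(X, m, c)
                         for w, m, c in zip(weights, means, covs)])
    return logsumexp(per_comp, axis=0).sum()

X = np.random.default_rng(1).normal(size=(1000, 2))
print(log_likelihood(X))
```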

The theory behind statistical models is very well understood, and explicit error bounds can be computed. Statistical formulations are particularly advantageous for soft clustering problems with a moderate number of dimensions $d$. The very powerful Expectation-Maximization (EM) algorithm [DLR77] has been applied to $k$-means [FRB98]. However, these parametric models tend to impose structure on the data that may not be there; the selected distribution family may not really be appropriate. In fact, high-dimensional data as found in data mining is often distributed in a strongly non-Gaussian fashion. Also, the number of parameters increases rapidly with $d$, so the estimation problem becomes more and more ill-posed. Non-parametric models, like $k$-nearest-neighbor, have been found preferable in many tasks where a lot of data is available.
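
For reference, a bare-bones EM sketch for the Gaussian mixture model above (hypothetical initialization, fixed iteration count, a small ridge on the covariances for numerical stability); it illustrates the E/M alternation rather than a robust implementation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """Bare-bones EM for a Gaussian mixture (sketch, not robust)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Crude initialization: k random samples as means, shared covariance.
    mu = X[rng.choice(n, size=k, replace=False)]
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r_ij = P(C_j | x_i), in the log domain.
        log_r = np.stack([np.log(w[j]) +
                          multivariate_normal.logpdf(X, mu[j], cov[j])
                          for j in range(k)], axis=1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: responsibility-weighted ML re-estimates.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / nk[j] \
                     + 1e-6 * np.eye(d)
    return w, mu, cov

# Example usage with hypothetical two-cluster data:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(4.0, 1.0, (200, 2))])
w, mu, cov = em_gmm(X, k=2)
```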