

Normalized Symmetric Mutual Information

Let $ X$ and $ Y$ be the random variables described by the cluster labeling $ \mathbf{\lambda}$ and category labeling $ \mathbf{\kappa}$, respectively. Let $ H(X)$ denote the entropy of a random variable $ X$. Mutual information between two random variables is defined as

$\displaystyle I(X,Y) = H(X) + H(Y) - H(X,Y) .$ (8.1)

Also,

$\displaystyle I(X,Y) = H(X) - H(X\vert Y) \leq H(X)$ (8.2)

and

$\displaystyle I(X,Y) = H(Y) - H(Y\vert X) \leq H(Y) .$ (8.3)

So

$\displaystyle I(X,Y) \leq \min(H(X),H(Y)) .$ (8.4)
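
The chain above can be checked numerically. The following short Python sketch (illustrative only, not part of the derivation) evaluates the three expressions for $ I(X,Y)$ on an arbitrary joint distribution, using the chain rules $ H(X\vert Y) = H(X,Y) - H(Y)$ and $ H(Y\vert X) = H(X,Y) - H(X)$, and confirms the bound of equation 8.4.

import numpy as np

p_xy = np.array([[0.20, 0.05, 0.05],    # arbitrary joint distribution p(x, y)
                 [0.10, 0.30, 0.30]])
p_x = p_xy.sum(axis=1)                  # marginal of X
p_y = p_xy.sum(axis=0)                  # marginal of Y

def H(p):
    """Entropy of a distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

I_a = H(p_x) + H(p_y) - H(p_xy.ravel())    # eq. 8.1
I_b = H(p_x) - (H(p_xy.ravel()) - H(p_y))  # eq. 8.2, with H(X|Y) = H(X,Y) - H(Y)
I_c = H(p_y) - (H(p_xy.ravel()) - H(p_x))  # eq. 8.3, with H(Y|X) = H(X,Y) - H(X)
assert np.isclose(I_a, I_b) and np.isclose(I_a, I_c)
assert I_a <= min(H(p_x), H(p_y))          # eq. 8.4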

Since $ \min(H(X),H(Y)) \leq \frac{H(X)+H(Y)}{2}$, the arithmetic mean $ \frac{H(X)+H(Y)}{2}$ is also an upper bound on $ I(X,Y)$. Thus, a worst-case upper bound over all possible labelings $ A$ (with labels from 1 to $ k$) and all possible categorizations $ B$ (with labels from 1 to $ g$) is given by

$\displaystyle I(X,Y) \leq \max_{A \in \{1,\ldots,k\},B \in \{1,\ldots,g\}} \left( \frac{H(A)+H(B)}{2} \right) .$ (8.5)

Hence, we define [0,1]-normalized mutual information-based quality as

$\displaystyle NI(X,Y) = \frac{2\cdot I(X,Y)}{\max_{A \in \{1,\ldots,k\}} (H(A)) + \max_{B \in \{1,\ldots,g\}} (H(B))} .$ (8.6)
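
Since the entropy of a labeling with $ k$ possible labels is at most $ \log(k)$ (attained when all labels are equally frequent), and likewise at most $ \log(g)$ for a categorization with $ g$ labels, the maxima in the denominator evaluate to

$\displaystyle \max_{A \in \{1,\ldots,k\}} (H(A)) = \log(k) \qquad \textrm{and} \qquad \max_{B \in \{1,\ldots,g\}} (H(B)) = \log(g) ,$

so the denominator of $ NI(X,Y)$ equals $ \log(k) + \log(g)$.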

Using

$\displaystyle I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x) \cdot p(y)} ,$ (8.7)

and approximating the probabilities by relative frequencies yields our quality measure $ \phi^{(\mathrm{NMI})}$. Here, $ n_{\ell}^{(h)}$ denotes the number of objects in cluster $ \ell$ that belong to category $ h$, $ n_{\ell}$ the number of objects in cluster $ \ell$, $ n^{(h)}$ the number of objects in category $ h$, and $ n$ the total number of objects:

$\displaystyle \phi^{(\mathrm{NMI})} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{ 2 \sum_{\ell=1}^{k} \sum_{h=1}^{g} \frac{n_{\ell}^{(h)}}{n} \log \frac{n_{\ell}^{(h)} / n}{(n^{(h)} / n) \cdot (n_{\ell} / n)} }{ \log(k) + \log(g) } .$ (8.8)

Basic simplifications yield:

$\displaystyle \phi^{(\mathrm{NMI})} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{2}{n} \sum_{\ell=1}^{k} \sum_{h=1}^{g} n_{\ell}^{(h)} \log_{k \cdot g} \left( \frac{ n_{\ell}^{(h)} \, n }{ n^{(h)} \, n_{\ell} } \right) .$ (8.9)
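
As a concrete illustration, the following minimal Python sketch (not code from the dissertation; the function names and the example contingency table are hypothetical) computes $ \phi^{(\mathrm{NMI})}$ from a $ k \times g$ table of counts $ n_{\ell}^{(h)}$, once via equation 8.8 and once via the simplified form 8.9; the two results agree up to floating-point error.

import numpy as np

def phi_nmi(counts):
    """phi^(NMI) via eq. 8.8; counts[l, h] = objects in cluster l with category h."""
    counts = np.asarray(counts, dtype=float)
    k, g = counts.shape
    n = counts.sum()
    p_lh = counts / n                       # joint relative frequencies
    p_l = p_lh.sum(axis=1, keepdims=True)   # cluster marginals
    p_h = p_lh.sum(axis=0, keepdims=True)   # category marginals
    mask = p_lh > 0                         # treat 0 * log(0) as 0
    mi = np.sum(p_lh[mask] * np.log((p_lh / (p_l * p_h))[mask]))
    return 2.0 * mi / (np.log(k) + np.log(g))

def phi_nmi_simplified(counts):
    """phi^(NMI) via eq. 8.9 (logarithm to base k*g)."""
    counts = np.asarray(counts, dtype=float)
    k, g = counts.shape
    n = counts.sum()
    n_l = counts.sum(axis=1, keepdims=True)
    n_h = counts.sum(axis=0, keepdims=True)
    mask = counts > 0
    logs = np.log((counts * n / (n_h * n_l))[mask]) / np.log(k * g)
    return 2.0 / n * np.sum(counts[mask] * logs)

counts = [[30, 2, 1], [3, 25, 4], [1, 5, 29]]       # hypothetical 3x3 example
print(phi_nmi(counts), phi_nmi_simplified(counts))  # the two values agree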

This derivation is used to obtain equations 4.25 and 5.2.

Instead of using the worst-case upper bound over all possible labelings and categorizations, one can assume that the categorization priors are given. This allows a less aggressive denominator for normalization: one can use the actual entropies $ H(X)$ and $ H(Y)$ from the labelings in equation 8.4. However, this results in a bias towards high $ k$.

Another variant of normalization uses the actual entropies $ H(X)$ and $ H(Y)$ instead of $ \max_{A \in \{1,\ldots,k\}} (H(A))$ and $ \max_{B \in \{1,\ldots,g\}} (H(B))$ in equation 8.6. For the results presented in this dissertation, we use the worst-case normalization (equation 8.9); the sketch below contrasts the three denominators.
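
As one possible reading of these variants (a minimal Python sketch, not code from the dissertation; the helper name and the example counts are hypothetical), the three denominators can be compared on the same contingency table: the worst-case bound $ \log(k)+\log(g)$, the actual-entropy bound $ \min(H(X),H(Y))$ from equation 8.4, and the actual-entropy average $ \frac{H(X)+H(Y)}{2}$ in place of the maxima.

import numpy as np

def entropies_and_mi(counts):
    """Return H(X), H(Y), I(X,Y) estimated from a k x g contingency table."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p_x, p_y = p.sum(axis=1), p.sum(axis=0)
    h = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
    return h(p_x), h(p_y), h(p_x) + h(p_y) - h(p.ravel())

counts = [[30, 2, 1], [3, 25, 4], [1, 5, 29]]   # hypothetical 3x3 example
hx, hy, mi = entropies_and_mi(counts)
k, g = 3, 3
print(2 * mi / (np.log(k) + np.log(g)))  # worst-case normalization (eqs. 8.8/8.9)
print(mi / min(hx, hy))                  # actual entropies in the bound of eq. 8.4
print(2 * mi / (hx + hy))                # actual entropies in place of the maxima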

