

Normalized Symmetric Mutual Information

Let $ X$ and $ Y$ be the random variables described by the cluster labeling $ \mathbf{\lambda}$ and category labeling $ \mathbf{\kappa}$, respectively. Let $ H(X)$ denote the entropy of a random variable $ X$. Mutual information between two random variables is defined as

$\displaystyle I(X,Y) = H(X) + H(Y) - H(X,Y) .$ (8.1)

Also,

$\displaystyle I(X,Y) = H(X) - H(X\vert Y) \leq H(X)$ (8.2)

and

$\displaystyle I(X,Y) = H(Y) - H(Y\vert X) \leq H(Y) .$ (8.3)

So

$\displaystyle I(X,Y) \leq \min(H(X),H(Y)) .$ (8.4)
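
The chain above can be checked numerically. The following short Python sketch (illustrative only, not part of the derivation) evaluates the three expressions for $ I(X,Y)$ on an arbitrary joint distribution, using the chain rules $ H(X\vert Y) = H(X,Y) - H(Y)$ and $ H(Y\vert X) = H(X,Y) - H(X)$, and confirms the bound of equation 8.4.

import numpy as np

p_xy = np.array([[0.20, 0.05, 0.05],    # arbitrary joint distribution p(x, y)
                 [0.10, 0.30, 0.30]])
p_x = p_xy.sum(axis=1)                  # marginal of X
p_y = p_xy.sum(axis=0)                  # marginal of Y

def H(p):
    """Entropy of a distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

I_a = H(p_x) + H(p_y) - H(p_xy.ravel())    # eq. 8.1
I_b = H(p_x) - (H(p_xy.ravel()) - H(p_y))  # eq. 8.2, with H(X|Y) = H(X,Y) - H(Y)
I_c = H(p_y) - (H(p_xy.ravel()) - H(p_x))  # eq. 8.3, with H(Y|X) = H(X,Y) - H(X)
assert np.isclose(I_a, I_b) and np.isclose(I_a, I_c)
assert I_a <= min(H(p_x), H(p_y))          # eq. 8.4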

Since $ \min(H(X),H(Y)) \leq \frac{H(X)+H(Y)}{2}$, the arithmetic mean $ \frac{H(X)+H(Y)}{2}$ is also an upper bound on $ I(X,Y)$. Thus, a worst-case upper bound over all possible labelings $ A$ (with labels from 1 to $ k$) and all possible categorizations $ B$ (with labels from 1 to $ g$) is given by

$\displaystyle I(X,Y) \leq \max_{A \in \{1,\ldots,k\},B \in \{1,\ldots,g\}} \left( \frac{H(A)+H(B)}{2} \right) .$ (8.5)

Hence, we define [0,1]-normalized mutual information-based quality as

$\displaystyle NI(X,Y) = \frac{2\cdot I(X,Y)}{\max_{A \in \{1,\ldots,k\}} (H(A)) + \max_{B \in \{1,\ldots,g\}} (H(B))} .$ (8.6)
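
Since the entropy of a labeling with $ k$ possible labels is at most $ \log(k)$ (attained when all labels are equally frequent), and likewise at most $ \log(g)$ for a categorization with $ g$ labels, the maxima in the denominator evaluate to

$\displaystyle \max_{A \in \{1,\ldots,k\}} (H(A)) = \log(k) \qquad \textrm{and} \qquad \max_{B \in \{1,\ldots,g\}} (H(B)) = \log(g) ,$

so the denominator of $ NI(X,Y)$ equals $ \log(k) + \log(g)$.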

Using

$\displaystyle I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x) \cdot p(y)} ,$ (8.7)

and approximating the probabilities by relative frequencies yields our quality measure $ \phi^{(\mathrm{NMI})}$. Here, $ n_{\ell}^{(h)}$ denotes the number of objects in cluster $ \ell$ that belong to category $ h$, $ n_{\ell}$ the number of objects in cluster $ \ell$, $ n^{(h)}$ the number of objects in category $ h$, and $ n$ the total number of objects:

$\displaystyle \phi^{(\mathrm{NMI})} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{ 2 \sum_{\ell=1}^{k} \sum_{h=1}^{g} \frac{n_{\ell}^{(h)}}{n} \log \frac{n_{\ell}^{(h)} / n}{(n^{(h)} / n) \cdot (n_{\ell} / n)} }{ \log(k) + \log(g) } .$ (8.8)

Basic simplifications yield:

$\displaystyle \phi^{(\mathrm{NMI})} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{2}{n} \sum_{\ell=1}^{k} \sum_{h=1}^{g} n_{\ell}^{(h)} \log_{k \cdot g} \left( \frac{ n_{\ell}^{(h)} \, n }{ n^{(h)} \, n_{\ell} } \right) .$ (8.9)
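
As a concrete illustration, the following minimal Python sketch (not code from the dissertation; the function names and the example contingency table are hypothetical) computes $ \phi^{(\mathrm{NMI})}$ from a $ k \times g$ table of counts $ n_{\ell}^{(h)}$, once via equation 8.8 and once via the simplified form 8.9; the two results agree up to floating-point error.

import numpy as np

def phi_nmi(counts):
    """phi^(NMI) via eq. 8.8; counts[l, h] = objects in cluster l with category h."""
    counts = np.asarray(counts, dtype=float)
    k, g = counts.shape
    n = counts.sum()
    p_lh = counts / n                       # joint relative frequencies
    p_l = p_lh.sum(axis=1, keepdims=True)   # cluster marginals
    p_h = p_lh.sum(axis=0, keepdims=True)   # category marginals
    mask = p_lh > 0                         # treat 0 * log(0) as 0
    mi = np.sum(p_lh[mask] * np.log((p_lh / (p_l * p_h))[mask]))
    return 2.0 * mi / (np.log(k) + np.log(g))

def phi_nmi_simplified(counts):
    """phi^(NMI) via eq. 8.9 (logarithm to base k*g)."""
    counts = np.asarray(counts, dtype=float)
    k, g = counts.shape
    n = counts.sum()
    n_l = counts.sum(axis=1, keepdims=True)
    n_h = counts.sum(axis=0, keepdims=True)
    mask = counts > 0
    logs = np.log((counts * n / (n_h * n_l))[mask]) / np.log(k * g)
    return 2.0 / n * np.sum(counts[mask] * logs)

counts = [[30, 2, 1], [3, 25, 4], [1, 5, 29]]       # hypothetical 3x3 example
print(phi_nmi(counts), phi_nmi_simplified(counts))  # the two values agree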

This derivation is used to obtain equations 4.25 and 5.2.

Instead of using the worst-case upper bound over all possible labelings and categorizations, one can assume that the categorization priors are given. This allows a less aggressive denominator for normalization: one can use the actual entropies $ H(X)$ and $ H(Y)$ from the labelings in equation 8.4. However, this results in a bias towards high $ k$.

Another variant of normalization uses the actual entropies $ H(X)$ and $ H(Y)$ instead of $ \max_{A \in \{1,\ldots,k\}} (H(A))$ and $ \max_{B \in \{1,\ldots,g\}} (H(B))$ in equation 8.6. For the results presented in this dissertation, we use the worst-case normalization (equation 8.9); the sketch below contrasts the three denominators.
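
As one possible reading of these variants (a minimal Python sketch, not code from the dissertation; the helper name and the example counts are hypothetical), the three denominators can be compared on the same contingency table: the worst-case bound $ \log(k)+\log(g)$, the actual-entropy bound $ \min(H(X),H(Y))$ from equation 8.4, and the actual-entropy average $ \frac{H(X)+H(Y)}{2}$ in place of the maxima.

import numpy as np

def entropies_and_mi(counts):
    """Return H(X), H(Y), I(X,Y) estimated from a k x g contingency table."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p_x, p_y = p.sum(axis=1), p.sum(axis=0)
    h = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
    return h(p_x), h(p_y), h(p_x) + h(p_y) - h(p.ravel())

counts = [[30, 2, 1], [3, 25, 4], [1, 5, 29]]   # hypothetical 3x3 example
hx, hy, mi = entropies_and_mi(counts)
k, g = 3, 3
print(2 * mi / (np.log(k) + np.log(g)))  # worst-case normalization (eqs. 8.8/8.9)
print(mi / min(hx, hy))                  # actual entropies in the bound of eq. 8.4
print(2 * mi / (hx + hy))                # actual entropies in place of the maxima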

