FASTOPOSSUM

Next: Parallel Implementation Up: System Issues Previous: Synergy between OPOSSUM and Contents

FASTOPOSSUM

Since OPOSSUM aims to achieve balanced clusters, random sampling is effective for obtaining adequate examples of each cluster. If the clusters are perfectly balanced, the distribution of the number of samples from a specific cluster in a subsample of size $\underline{n}$ taken from the entire population is binomial with mean $\underline{n}/k$ and variance $\underline{n}(k-1)/k^2$ . For finite population, the variance will be even less. Thus, if we require at least

representatives from this cluster, then the number of samples is given by: $\underline{n}/k \geq z_\alpha \sqrt{\underline{n} ( k - 1)} + r$ , where $z_\alpha = 1.96$ or 2.81 for 97.5% and 99.5% confidence levels respectively. This is

. For example, if we have 10 clusters and need to ensure at least 20 representatives from a given cluster with probability 0.995, about 400 samples are adequate. Note that this number is independent of

is adequately large (at least 400 in this case), so even for over one million customers, only 400 representatives are required. This suggests a simple and effective way to scale OPOSSUM to very large number of objects

, using the following four-step process called FASTOPOSSUM:

Pick a boot-sample of size $\underline{n}$ so that the corresponding value is adequate to define each cluster.
Apply OPOSSUM to the boot-sample to get initial clusters.
Find the centroid for each of the clusters.
Assign each of the remaining $n-\underline{n}$ points to the cluster with the nearest centroid.

Using $\underline{n} = \sqrt{n}$ reduces the complexity of FASTOPOSSUM to

. Note that the above algorithm may not result in balanced clusters. We can enforce balancing by allocating the remaining points to the

clusters in groups, each time solving a stable marriage problem [GI89], but this will increase the computation time. Figure 3.8 illustrates the behavior of FASTOPOSSUM for the drugstore customer data-set from subsection 3.5.1. The remaining edge weight fraction indicates how much of the cumulative edge weight remains after the edge separator has been removed: $\frac{\sum_{\ell=1}^k \sum_{\lambda_a=\ell} \sum_{\lambda_b=\ell,b>a} s(\mathbf{x}_a,\mathbf{x}_b)}{\sum_{a=1}^n \sum_{b=a+1}^n s(\mathbf{x}_a,\mathbf{x}_b)}$ The better the partitioning, the smaller the edge-separator, and thus the larger the remaining edge weight fraction. Surprisingly the speedup does not result in a significant decreased quality in terms of remaining edge weight (figure 3.8(a)). However, the balancing property is progressively relaxed as the boot-sample becomes smaller in comparison to the full data-set (figure 3.8(b)). Using $\underline{n} = 100$ initial points reduces the original computation time to less than 1% at comparable remaining edge weight but at an imbalance of 3.5 in the worst of 10 random trials.

Figure 3.8: Effect of sub-sampling on OPOSSUM. Cluster quality as measured by remaining edge weight fraction (a) and imbalance (b) of total graph with 2466 vertices (customers from subsection 3.5.1) for various boot-sample sizes in FASTOPOSSUM. For each setting the results' range and mean of 10 trials are depicted. Using all 2466 customers as the boot-sample (i.e., no sub-sampling) results in balancing within the 1.05 imbalance requirement and approximately 40% of edge weight remaining (as compared to 5% baseline for random clustering). As the boot-sample becomes smaller the remaining edge weight stays approximately the same (a), however the imbalance increases (b).

$\resizebox{.45\columnwidth}{!}{\includegraphics*{eps/scaleopossumedgerem2}}$	$\resizebox{.45\columnwidth}{!}{\includegraphics*{eps/scaleopossumbalance2}}$
(a)	(b)

These results indicate that scaling to large

is easily possible, if one is willing to relax the balancedness constraints.

Next: Parallel Implementation Up: System Issues Previous: Synergy between OPOSSUM and Contents

Alexander Strehl 2002-05-03