In Feature-Distributed Clustering (FDC), we show how cluster ensembles can be used to boost the quality of results by combining a set of clusterings obtained from partial views of the data. As mentioned in the introduction, such scenarios arise in distributed databases and federated systems whose data cannot be pooled into one big flat file because of proprietary data aspects, privacy concerns, performance issues, etc. In such situations, it is more realistic to have one clusterer for each database and to transmit only the cluster labels (but not the attributes of each record) to a central location, where they are combined using the supra-consensus function.
For our experiments, we simulate such a scenario by running several clusterers, each having access to only a restricted, small subset of the features (a subspace). Each clusterer thus has only a partial view of the data, but it has access to all objects. The clusterers find groups in their respective views/subspaces using the same clustering technique. In the combining stage, the individual results are integrated using our supra-consensus function to recover the full structure of the data. As this is a knowledge-reuse framework, the ensemble has no access to the original features.
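To make the setup concrete, the sketch below simulates such feature-distributed clusterers, assuming a data matrix X with one row per object. It is only an illustration: k-means stands in for the Euclidean-based graph partitioning used in our experiments, and the names simulate_fdc_views, n_clusterers, view_dim, and k are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def simulate_fdc_views(X, n_clusterers=5, view_dim=2, k=5, seed=0):
    """Cluster random feature subspaces; return only the label vectors."""
    rng = np.random.default_rng(seed)
    labelings = []
    for _ in range(n_clusterers):
        # Each clusterer sees a random subset of features (a subspace)
        # but all objects; only its cluster labels are transmitted.
        view = rng.choice(X.shape[1], size=view_dim, replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[:, view])
        labelings.append(labels)
    return np.array(labelings)  # shape: (n_clusterers, n_objects)
```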
First, let us discuss experimental results for the 8D5K data,
since they lend themselves well to illustration.
We create 5 random 2D views (through selection of a pair of features)
of the 8D data, and use Euclidean-based graph partitioning with k = 5
in each view to obtain 5 individual clusterings. The 5 individual
clusterings are then combined using our supra-consensus function proposed in
the previous section.
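As a rough illustration of the combining stage (the full supra-consensus function is defined in the previous section), the sketch below builds a co-association matrix from the label vectors and reclusters it, in the spirit of CSPA; the average-linkage step and the helper name coassociation_consensus are assumptions made here for simplicity.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def coassociation_consensus(labelings, k):
    """Combine r label vectors (array of shape r x n) into one labeling."""
    r, n = labelings.shape
    # S[i, j] = fraction of clusterings in which objects i and j co-occur.
    S = np.zeros((n, n))
    for labels in labelings:
        S += (labels[:, None] == labels[None, :])
    S /= r
    # Recluster the induced distances (1 - S) into k consensus groups.
    Z = linkage(squareform(1.0 - S, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust") - 1
```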
The clusters are linearly separable in the full 8D space; clustering
in the full 8D space recovers the original generative labels and
is referred to as the reference clustering. Using PCA to project the
data into 2D separates all 5 clusters fairly well (figure
5.10). In figure 5.10(a), the
reference clustering is illustrated by coloring the data points in the
space spanned by the first and second principal components (PCs).
Figure 5.10(b) shows the final FDC result after
combining the 5 subspace clusterings, each computed
from a random feature pair. These subspaces are shown in figure
5.9. Each of the rows corresponds to a random
selection of 2 out of 8 feature dimensions. For each of the 5 chosen
feature pairs, a row shows the 2D clustering (left, with the feature
pair on the axes) and the same cluster labels on the data projected
onto the first two global principal components (right).
For consistent appearance of clusters across rows, the dot
colors / shapes have been matched using meta-clusters. All points in
clusters of the same meta-cluster share the same color / shape amongst
all plots.
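The projections used in these figures can be reproduced along the following lines, assuming scikit-learn and matplotlib; plot_on_global_pcs is a hypothetical helper, not code from our experiments.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_on_global_pcs(X, labels, title=""):
    """Project the data onto its first two PCs and color points by label."""
    pcs = PCA(n_components=2).fit_transform(X)
    plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, s=10)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title(title)
    plt.show()
```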
In any single subspace, the clusters cannot be segregated well due to strong
overlaps. The supra-consensus function can combine the partial knowledge of
the 5 clusterings into a far superior clustering.
The FDC results (figure 5.10(b)) are
clearly superior to any of the 5 individual results (figure
5.9, right) and are almost flawless compared to
the reference clustering (figure 5.10(a)). The best
individual result has 120 `mislabeled' points, while the consensus
has only 3 `mislabeled' points.
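One way to obtain such `mislabeled' counts is to match consensus clusters to reference clusters one-to-one so that agreement is maximized, and to count the points that fall outside the matching; a sketch, assuming SciPy's assignment solver and integer labels in 0..k-1, follows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def count_mislabeled(reference, predicted):
    """Count points not covered by the best cluster-to-cluster matching."""
    k = int(max(reference.max(), predicted.max())) + 1
    # Contingency table: rows = predicted clusters, columns = reference clusters.
    C = np.zeros((k, k), dtype=int)
    for ref, pred in zip(reference, predicted):
        C[pred, ref] += 1
    rows, cols = linear_sum_assignment(-C)   # maximize total agreement
    return len(reference) - C[rows, cols].sum()
```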
We also conducted FDC experiments on the other three data-sets. Table
5.3 summarizes the results and several comparison benchmarks.
The choice of the number of random subspaces and their
dimensionality is currently driven by the user.
For example, in the YAHOO case, 20 clusterings were performed, each in a
128-dimensional subspace (occurrence frequencies of 128 random words). The
average quality amongst the results was 0.17 and the best quality was
0.21. Using the supra-consensus function to combine all 20 labelings yields
a quality of 0.38, or 124% higher mutual information than the
average individual clustering.
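A sketch of this computation, assuming the quality measure is the normalized mutual information with the known category labels (as available in scikit-learn), is given below; relative_gain is an illustrative helper.

```python
from sklearn.metrics import normalized_mutual_info_score as nmi

def relative_gain(true_labels, individual_labelings, consensus_labels):
    """Return average, best, and consensus quality plus the relative gain."""
    quals = [nmi(true_labels, lab) for lab in individual_labelings]
    avg, best = sum(quals) / len(quals), max(quals)
    cons = nmi(true_labels, consensus_labels)
    # e.g. avg = 0.17, cons = 0.38  ->  (0.38 - 0.17) / 0.17 ~ 124% gain
    return avg, best, cons, 100.0 * (cons - avg) / avg
```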
In all scenarios, the consensus clustering is as good or better than
the best individual input clustering and always better than the
average quality of individual clusterings.
When processing of all features is not possible and only
limited views exist, a cluster ensemble can boost results
significantly compared to individual clusterings.
Also, since the combiner has no feature information, the consensus clustering tends to
be poorer than the clustering on all features. However, as discussed
in the introduction, in knowledge-reuse application scenarios, the original
features are unavailable, so a comparison to an all-feature clustering is
only done as a reference.
The supra-consensus function chooses either the MCLA or the CSPA result, but the difference between the two is not statistically significant. As noted before, MCLA is much faster and should be the method of choice if only one is needed. HGPA delivers poor ANMI for 2D2K and 8D5K but improves with the more complex data in PENDIG and YAHOO; nevertheless, MCLA and CSPA performed significantly better than HGPA on all four data-sets.
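The selection step itself can be sketched as follows: among the candidate consensus labelings (e.g., from CSPA, HGPA, and MCLA), the supra-consensus function keeps the one with the highest average normalized mutual information (ANMI) with the ensemble of input labelings. The helper names below are illustrative.

```python
from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(candidate, labelings):
    """Average NMI between a candidate labeling and all input labelings."""
    return sum(nmi(candidate, lab) for lab in labelings) / len(labelings)

def supra_consensus(candidates, labelings):
    """Pick the candidate consensus labeling with the highest ANMI."""
    return max(candidates, key=lambda cand: anmi(cand, labelings))
```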