In this dissertation, we propose cluster ensembles and show how they enable distributed clustering without sharing the actual features. The ability to perform this analysis without access to the primary data opens opportunities for mining knowledge maintained by multiple clients without any client revealing its data to the others. Despite these restrictions, the combiner can discover knowledge that no single client could discover alone. Cluster ensembles can thus be used to build a federated data mining system that ensures privacy and security without requiring clients to send their data to the system or to other clients.
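As a minimal illustration of this property, consider a combiner that receives only the clients' cluster-label vectors and never sees the underlying features. The sketch below is hypothetical and uses a simple co-association rule (the fraction of clients that co-cluster a pair of objects) with greedy single-link merging, rather than the hypergraph-based combiners developed in this dissertation; the function name and threshold are illustrative choices.

```python
def consensus_clusters(labelings, threshold=0.5):
    """Combine per-client labelings into one consensus partition.

    Each client contributes only a cluster id per object; no
    features are ever exchanged. Illustrative sketch, not the
    dissertation's hypergraph-based consensus functions.
    """
    n = len(labelings[0])
    r = len(labelings)

    # Co-association: fraction of clients placing i and j together.
    def coassoc(i, j):
        return sum(lab[i] == lab[j] for lab in labelings) / r

    # Greedy single-link merging via union-find over strong pairs.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if coassoc(i, j) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Three clients label six objects from different feature views.
# The second client uses different cluster ids for the same split,
# which the co-association measure is invariant to.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 2],
]
print(consensus_clusters(labelings))
```

Note that the consensus ids themselves are arbitrary; only the induced grouping (objects 0-2 versus objects 3-5 above) is meaningful.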
For federated systems, we would like to extend our application scenarios: in real settings, a variety of hybrids of the investigated RCC, FDC, and ODC scenarios can be encountered. Cluster ensembles could enable federated data mining systems that operate on top of distributed and heterogeneous databases.
In particular, one can study the impact of coordinated subsampling strategies on the performance and quality of object-distributed clustering. The open question is to determine which types of overlap and object ownership structures lend themselves particularly well to knowledge reuse.
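One way overlap enables knowledge reuse can be sketched as follows: when two clusterers own overlapping object sets, a combiner can align their (arbitrary) cluster ids through the shared objects alone, by majority vote. This is a hypothetical illustration under an assumed two-clusterer setup; the function and variable names are not from the dissertation.

```python
from collections import Counter

def align_labels(objs_a, labels_a, objs_b, labels_b):
    """Map clusterer B's cluster ids onto A's using shared objects.

    Illustrative sketch: the overlap between the two object sets
    is the only bridge needed to stitch the partitions together.
    """
    a = dict(zip(objs_a, labels_a))
    # For each B-cluster, count which A-clusters its shared objects fall in.
    votes = {}
    for obj, lb in zip(objs_b, labels_b):
        if obj in a:
            votes.setdefault(lb, Counter())[a[obj]] += 1
    mapping = {lb: c.most_common(1)[0][0] for lb, c in votes.items()}
    # A's labels win on the overlap; B's objects get the mapped id,
    # or keep their own id if their B-cluster had no overlap with A.
    merged = dict(a)
    for obj, lb in zip(objs_b, labels_b):
        merged.setdefault(obj, mapping.get(lb, lb))
    return merged


# Clusterer A owns objects 0-5, clusterer B owns 4-9; objects 4 and 5
# are shared, which suffices to identify B's cluster 7 with A's cluster 1.
merged = align_labels([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1],
                      [4, 5, 6, 7, 8, 9], [7, 7, 8, 8, 8, 8])
print(merged)
```

Varying how large and how structured this overlap must be for reliable alignment is exactly the kind of question the coordinated subsampling study would address.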
A key requirement for data mining techniques is that they scale to large data sets. Distributed processing can serve as a tool for scaling complex algorithms to such data sets. As discussed in chapter 5, cluster ensembles can be used in this scenario to trade off quality for speed. The proposed cluster ensemble requires no communication during the individual groupings. However, when extending cluster ensembles so that individual clusterers share object features in the inner loop as well as with the combining stage, several inter-processor communication designs become possible. This might make it interesting to revisit the speedup-versus-quality trade-off.
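The communication-free structure of the individual grouping stage can be sketched as an embarrassingly parallel map: each clusterer works on its own shard in isolation, and the only traffic is the final label transfer to the combiner. The shard layout and the toy one-dimensional clusterer below are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def local_cluster(shard):
    """Toy stand-in for a clusterer: group 1-D values by integer part.

    Each call is fully independent -- no inter-clusterer communication,
    mirroring the proposed ensemble's individual grouping stage.
    """
    return [(obj, int(value)) for obj, value in shard]

def distributed_labels(shards):
    """Run all clusterers in parallel; only labels reach the combiner."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(local_cluster, shards)
    labels = {}
    for part in results:
        labels.update(dict(part))
    return labels


# Two shards of (object id, feature value) pairs, clustered independently.
shards = [
    [(0, 0.1), (1, 0.4), (2, 1.7)],
    [(3, 1.9), (4, 0.2), (5, 2.5)],
]
print(distributed_labels(shards))
```

Sharing features in the inner loop would replace the isolated `local_cluster` calls with communicating workers, which is where the revisited speedup-versus-quality trade-off would arise.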