In distributed data mining, adopting a flat node distribution model can limit scalability. To address the problems of modularity, flexibility, and scalability, we propose a Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher-level neighborhoods. Within a given level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, and then successively combine clusterings up the hierarchy, where increasingly global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing an interpretation of the clusters. Results show good speedup, up to 165 times faster than centralized clustering for a 250-node simulated network, with clustering quality comparable to the centralized approach. We also provide a comparison with the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy.
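The idea of clustering within neighborhoods and then combining results up the hierarchy can be illustrated with a minimal single-process sketch. This is not the HP2PC algorithm itself: it assumes each neighborhood runs plain Lloyd's K-means on its own points, and that each higher level merges the centroids reported by its children with a weighted K-means, using cluster sizes as weights. The names `kmeans`, `cluster_hierarchy`, and `fanout` are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, weights, k, iters=20, seed=0):
    """Weighted Lloyd's K-means (assumed stand-in for the paper's variant).

    Returns (centroids, cluster_weights), where cluster_weights[j] is the
    total weight assigned to centroid j in the last assignment step.
    Requires len(points) >= k.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = rng.sample(points, k)  # initialize from k distinct positions
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest current centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each centroid to the weighted mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, totals


def cluster_hierarchy(neighborhoods, k, fanout=2):
    """Cluster each neighborhood locally, then merge summaries level by level.

    `neighborhoods` is a list of point lists (one per level-0 neighborhood).
    `fanout` (assumed, must be >= 2) is how many child summaries each
    higher-level neighborhood combines.
    """
    # Level 0: every neighborhood clusters only its own data.
    summaries = [kmeans(pts, [1.0] * len(pts), k) for pts in neighborhoods]
    # Higher levels: supernodes exchange centroids and sizes, not raw data.
    while len(summaries) > 1:
        merged = []
        for i in range(0, len(summaries), fanout):
            group = summaries[i:i + fanout]
            cents = [c for cs, _ in group for c in cs]
            wts = [w for _, ws in group for w in ws]
            merged.append(kmeans(cents, wts, k))
        summaries = merged
    return summaries[0]
```

Calling `cluster_hierarchy(neighborhoods, k=2)` on a list of per-neighborhood point lists returns the root-level centroids together with the number of points each one represents. Only centroids and weights travel upward in this sketch, which is what keeps the per-level work small compared with clustering all points centrally.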