Distributed approximate spectral clustering for large-scale datasets

Fei Gao, Wael Abd-Almageed, Mohamed Hefeeda

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem as they require computing a kernel matrix, which takes O(N^2) time and space to compute and store. The proposed algorithm yields a substantial reduction in the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed such that it is independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
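To make the O(N^2) cost concrete, the following is a minimal sketch of one way a kernel matrix can be approximated by computing similarities only within groups of nearby points. The abstract does not say how the approximation is constructed, so everything here is an assumption for illustration: the bucketing rule (one sign bit per random hyperplane), the Gaussian kernel, and all function names (`hash_signature`, `bucketed_kernel`, `stored_entries`) are hypothetical and are not taken from the paper.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def full_kernel_entries(n):
    """Number of entries in the exact N x N kernel matrix."""
    return n * n


def hash_signature(point, hyperplanes):
    """Assumed bucketing rule: one sign bit per random hyperplane."""
    return tuple(
        1 if sum(w * v for w, v in zip(plane, point)) >= 0 else 0
        for plane in hyperplanes
    )


def bucketed_kernel(points, num_bits=2, sigma=1.0, seed=0):
    """Compute kernel blocks only among points that share a bucket.

    Returns a dict mapping bucket signature -> (indices, block), where
    block[a][b] is the similarity of points indices[a] and indices[b].
    Similarities between points in different buckets are never computed.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    hyperplanes = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)
    ]

    # Group point indices by their bit signature.
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(hash_signature(point, hyperplanes), []).append(idx)

    # Build one small dense block per bucket.
    blocks = {}
    for sig, indices in buckets.items():
        block = [
            [gaussian_kernel(points[i], points[j], sigma) for j in indices]
            for i in indices
        ]
        blocks[sig] = (indices, block)
    return blocks


def stored_entries(blocks):
    """Total number of kernel entries held across all bucket blocks."""
    return sum(len(indices) ** 2 for indices, _ in blocks.values())


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
    blocks = bucketed_kernel(data, num_bits=4)
    print("exact entries:   ", full_kernel_entries(len(data)))
    print("bucketed entries:", stored_entries(blocks))
```

In this sketch, `num_bits` plays the role of an approximation knob: more bits give more, smaller buckets, so fewer entries are computed and stored, at the cost of ignoring more cross-bucket similarities. Because each bucket's block is independent of the others, the per-bucket work could in principle be distributed, for example as separate reduce tasks, though that mapping is also an assumption here rather than the paper's stated design.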

Original language: English
Title of host publication: HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing
Pages: 223-234
Number of pages: 12
DOI: https://doi.org/10.1145/2287076.2287111
Publication status: Published - 23 Jul 2012
Event: 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12 - Delft, Netherlands
Duration: 18 Jun 2012 to 22 Jun 2012

Other

Other: 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12
Country: Netherlands
City: Delft
Period: 18/6/12 to 22/6/12

Keywords

  • Distributed clustering
  • Kernel-based algorithms
  • Large data sets
  • Spectral clustering

ASJC Scopus subject areas

  • Software

Cite this

Gao, F., Abd-Almageed, W., & Hefeeda, M. (2012). Distributed approximate spectral clustering for large-scale datasets. In HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing (pp. 223-234). https://doi.org/10.1145/2287076.2287111