Distributed approximate spectral clustering for large-scale datasets

Fei Gao, Wael Abd-Almageed, Mohamed Hefeeda

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem, as they require computing a kernel matrix that takes O(N^2) time and space to compute and store. The proposed algorithm yields a substantial reduction in the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed such that it is independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
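
To make the O(N^2) cost concrete, here is a minimal Python sketch. The abstract does not describe the approximation itself, so this sketch swaps in a generic stand-in: it groups points by a random-hyperplane signature and evaluates the kernel only for pairs that share a bucket. The function names, the Gaussian kernel, and the `n_bits` and `gamma` parameters are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from collections import defaultdict


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def full_kernel_entries(n):
    """Number of entries in the exact N x N kernel matrix."""
    return n * n


def bucket_points(points, n_bits=4, seed=0):
    """Group point indices by a random-hyperplane signature.

    Hypothetical stand-in for the paper's grouping step: each bit records
    on which side of a random hyperplane through the origin a point lies.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        signature = tuple(
            sum(w * v for w, v in zip(plane, p)) >= 0.0 for plane in planes
        )
        buckets[signature].append(idx)
    return list(buckets.values())


def approx_kernel_entries(buckets):
    """Number of kernel values computed when only same-bucket pairs are kept."""
    return sum(len(members) ** 2 for members in buckets)


def blockwise_kernel(points, buckets, gamma=1.0):
    """Compute kernel values only for pairs inside the same bucket.

    Returns a dict mapping (i, j) to the kernel value; cross-bucket pairs
    are never computed and are implicitly treated as zero.
    """
    entries = {}
    for members in buckets:
        for i in members:
            for j in members:
                entries[(i, j)] = rbf(points[i], points[j], gamma)
    return entries
```

Calling `bucket_points(points, n_bits=4)` and then `blockwise_kernel(points, buckets)` on a list of points stores `approx_kernel_entries(buckets)` values instead of `full_kernel_entries(len(points))`. Raising `n_bits` produces smaller buckets, so fewer kernel values are computed and more pairwise similarities are dropped. This is one simple way to picture the controllable trade-off between accuracy and resources that the abstract mentions.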

Original language: English
Title of host publication: HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing
Pages: 223-234
Number of pages: 12
DOIs: https://doi.org/10.1145/2287076.2287111
Publication status: Published - 23 Jul 2012
Event: 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12 - Delft, Netherlands
Duration: 18 Jun 2012 to 22 Jun 2012

Other

Other: 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12
Country: Netherlands
City: Delft
Period: 18/6/12 to 22/6/12

Fingerprint

Learning algorithms
Learning systems
Approximation algorithms
Data storage equipment
Clustering algorithms
Scalability

Keywords

  • Distributed clustering
  • Kernel-based algorithms
  • Large data sets
  • Spectral clustering

ASJC Scopus subject areas

  • Software

Cite this

Gao, F., Abd-Almageed, W., & Hefeeda, M. (2012). Distributed approximate spectral clustering for large-scale datasets. In HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing (pp. 223-234) https://doi.org/10.1145/2287076.2287111

Distributed approximate spectral clustering for large-scale datasets. / Gao, Fei; Abd-Almageed, Wael; Hefeeda, Mohamed.

HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing. 2012. p. 223-234.

Gao, F, Abd-Almageed, W & Hefeeda, M 2012, Distributed approximate spectral clustering for large-scale datasets. in HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing. pp. 223-234, 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12, Delft, Netherlands, 18/6/12. https://doi.org/10.1145/2287076.2287111
Gao F, Abd-Almageed W, Hefeeda M. Distributed approximate spectral clustering for large-scale datasets. In HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing. 2012. p. 223-234 https://doi.org/10.1145/2287076.2287111
Gao, Fei ; Abd-Almageed, Wael ; Hefeeda, Mohamed. / Distributed approximate spectral clustering for large-scale datasets. HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing. 2012. pp. 223-234
@inproceedings{92bd5747dfb74875941105e2edc9b638,
title = "Distributed approximate spectral clustering for large-scale datasets",
abstract = "Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem, as they require computing a kernel matrix that takes $O(N^2)$ time and space to compute and store. The proposed algorithm yields a substantial reduction in the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed such that it is independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.",
keywords = "Distributed clustering, Kernel-based algorithms, Large data sets, Spectral clustering",
author = "Fei Gao and Wael Abd-Almageed and Mohamed Hefeeda",
year = "2012",
month = "7",
day = "23",
doi = "10.1145/2287076.2287111",
language = "English",
isbn = "9781450308052",
pages = "223--234",
booktitle = "HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing",

}

TY - GEN

T1 - Distributed approximate spectral clustering for large-scale datasets

AU - Gao, Fei

AU - Abd-Almageed, Wael

AU - Hefeeda, Mohamed

PY - 2012/7/23

Y1 - 2012/7/23

N2 - Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem, as they require computing a kernel matrix that takes O(N^2) time and space to compute and store. The proposed algorithm yields a substantial reduction in the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed such that it is independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.

KW - Distributed clustering

KW - Kernel-based algorithms

KW - Large data sets

KW - Spectral clustering

UR - http://www.scopus.com/inward/record.url?scp=84863889471&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863889471&partnerID=8YFLogxK

U2 - 10.1145/2287076.2287111

DO - 10.1145/2287076.2287111

M3 - Conference contribution

SN - 9781450308052

SP - 223

EP - 234

BT - HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing

ER -