### Abstract

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process with many current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem, as they require computing a kernel matrix, which takes $O(N^2)$ time and space to compute and store. The proposed algorithm substantially reduces the computation and memory overhead required to compute the kernel matrix without significantly affecting the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed to be independent of the kernel-based machine learning algorithm subsequently applied, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
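As a rough illustration of the scalability problem the abstract describes, the sketch below contrasts the number of entries in an exact kernel matrix with the number stored when kernel values are computed only inside groups of points. It is not the paper's algorithm: the abstract does not say how the approximation is built, so the random-hyperplane grouping in `bucket_points`, the Gaussian kernel, and all function and parameter names (`num_bits`, `sigma`, `seed`) are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def full_kernel_entries(n):
    """Number of entries in the exact N x N kernel matrix."""
    return n * n


def bucket_points(points, num_bits=2, seed=0):
    """Group point indices by a sign signature over random hyperplanes.

    Hypothetical partitioning step (an assumption, not taken from the abstract).
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        signature = tuple(
            sum(w * v for w, v in zip(plane, point)) >= 0.0 for plane in planes
        )
        buckets[signature].append(idx)
    return dict(buckets)


def approximate_kernel_blocks(points, buckets, sigma=1.0):
    """Compute one small kernel block per bucket; cross-bucket entries are never computed."""
    blocks = {}
    for signature, members in buckets.items():
        blocks[signature] = [
            [gaussian_kernel(points[i], points[j], sigma) for j in members]
            for i in members
        ]
    return blocks


def stored_entries(blocks):
    """Total number of kernel values held across all per-bucket blocks."""
    return sum(len(block) ** 2 for block in blocks.values())


if __name__ == "__main__":
    rng = random.Random(1)
    points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
    buckets = bucket_points(points, num_bits=3)
    blocks = approximate_kernel_blocks(points, buckets)
    print("exact entries:", full_kernel_entries(len(points)))
    print("approximate entries:", stored_entries(blocks))
```

In this sketch, `num_bits` plays the role of an approximation knob: more bits give more, smaller buckets, so fewer kernel values are stored, but more pairwise similarities are ignored. Because each bucket is processed independently, the per-bucket work could in principle be distributed across workers, which is the kind of structure a MapReduce-style implementation can exploit.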

Original language | English |
---|---|
Title of host publication | HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing |
Pages | 223-234 |
Number of pages | 12 |
DOIs | https://doi.org/10.1145/2287076.2287111 |
Publication status | Published - 23 Jul 2012 |
Event | 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12 - Delft, Netherlands. Duration: 18 Jun 2012 → 22 Jun 2012 |

### Other

Other | 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12 |
---|---|
Country | Netherlands |
City | Delft |
Period | 18/6/12 → 22/6/12 |

### Keywords

- Distributed clustering
- Kernel-based algorithms
- Large data sets
- Spectral clustering

### ASJC Scopus subject areas

- Software

### Cite this

**Distributed approximate spectral clustering for large-scale datasets.** / Gao, Fei; Abd-Almageed, Wael; Hefeeda, Mohamed.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing.* pp. 223-234, 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12, Delft, Netherlands, 18/6/12. https://doi.org/10.1145/2287076.2287111

TY - GEN

T1 - Distributed approximate spectral clustering for large-scale datasets

AU - Gao, Fei

AU - Abd-Almageed, Wael

AU - Hefeeda, Mohamed

PY - 2012/7/23

Y1 - 2012/7/23

KW - Distributed clustering

KW - Kernel-based algorithms

KW - Large data sets

KW - Spectral clustering

UR - http://www.scopus.com/inward/record.url?scp=84863889471&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863889471&partnerID=8YFLogxK

U2 - 10.1145/2287076.2287111

DO - 10.1145/2287076.2287111

M3 - Conference contribution

SN - 9781450308052

SP - 223

EP - 234

BT - HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing

ER -