C-Cube

Elastic continuous clustering in the cloud

Zhenjie Zhang, Hu Shu, Zhihong Chong, Hua Lu, Yin Yang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)

Abstract

Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the workload fluctuates, often unpredictably. However, most existing solutions for continuous clustering assume either a central server or a distributed setting with a fixed number of dedicated servers. In other words, they are not elastic: they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they waste considerable resources, as the servers are under-utilized when the workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations and sends its results to a lightweight aggregator. This design enables dynamically adding and removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.
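
The routing pattern described in the abstract (hash each record to a processing unit, then merge partial results in a lightweight aggregator) can be illustrated with a minimal sketch. This is not the C-Cube algorithm itself, which the abstract does not specify. The names `route`, `ProcessingUnit`, `aggregate`, and `run` are hypothetical, and the per-unit computation is reduced to a running centroid of 2-D points purely for illustration.

```python
import hashlib


def route(record_key: str, num_units: int) -> int:
    """Map a record key to a processing-unit index using a stable hash."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_units


class ProcessingUnit:
    """Hypothetical worker: keeps a running count and coordinate sum."""

    def __init__(self):
        self.count = 0
        self.total = [0.0, 0.0]

    def process(self, point):
        self.count += 1
        self.total[0] += point[0]
        self.total[1] += point[1]

    def summary(self):
        return self.count, tuple(self.total)


def aggregate(summaries):
    """Hypothetical aggregator: merge partial sums into one centroid."""
    n = sum(count for count, _ in summaries)
    if n == 0:
        return None
    sx = sum(total[0] for _, total in summaries)
    sy = sum(total[1] for _, total in summaries)
    return (sx / n, sy / n)


def run(stream, num_units):
    """Route every (key, point) record to a unit, then merge the results."""
    units = [ProcessingUnit() for _ in range(num_units)]
    for key, point in stream:
        units[route(key, num_units)].process(point)
    return aggregate([unit.summary() for unit in units])
```

Because each unit only reports a mergeable summary, the merged result in this sketch does not depend on how many units the records were spread over, which is the property that makes adding or removing units cheap. Note that a plain modulo hash reassigns most keys when `num_units` changes; whether C-Cube uses a different routing scheme is not stated in the abstract.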

Original language: English
Title of host publication: ICDE 2013 - 29th International Conference on Data Engineering
Pages: 577-588
Number of pages: 12
DOI: https://doi.org/10.1109/ICDE.2013.6544857
Publication status: Published - 2013
Externally published: Yes
Event: 29th International Conference on Data Engineering, ICDE 2013 - Brisbane, QLD, Australia
Duration: 8 Apr 2013 to 11 Apr 2013

Other

Other: 29th International Conference on Data Engineering, ICDE 2013
Country: Australia
City: Brisbane, QLD
Period: 8/4/13 to 11/4/13

Fingerprint

Servers
Processing
Elasticity
Monitoring
Virtual machine

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Zhang, Z., Shu, H., Chong, Z., Lu, H., & Yang, Y. (2013). C-Cube: Elastic continuous clustering in the cloud. In ICDE 2013 - 29th International Conference on Data Engineering (pp. 577-588). [6544857] https://doi.org/10.1109/ICDE.2013.6544857

C-Cube : Elastic continuous clustering in the cloud. / Zhang, Zhenjie; Shu, Hu; Chong, Zhihong; Lu, Hua; Yang, Yin.

ICDE 2013 - 29th International Conference on Data Engineering. 2013. p. 577-588 6544857.

Zhang, Z, Shu, H, Chong, Z, Lu, H & Yang, Y 2013, C-Cube: Elastic continuous clustering in the cloud. in ICDE 2013 - 29th International Conference on Data Engineering., 6544857, pp. 577-588, 29th International Conference on Data Engineering, ICDE 2013, Brisbane, QLD, Australia, 8/4/13. https://doi.org/10.1109/ICDE.2013.6544857
Zhang Z, Shu H, Chong Z, Lu H, Yang Y. C-Cube: Elastic continuous clustering in the cloud. In ICDE 2013 - 29th International Conference on Data Engineering. 2013. p. 577-588. 6544857 https://doi.org/10.1109/ICDE.2013.6544857
Zhang, Zhenjie ; Shu, Hu ; Chong, Zhihong ; Lu, Hua ; Yang, Yin. / C-Cube : Elastic continuous clustering in the cloud. ICDE 2013 - 29th International Conference on Data Engineering. 2013. pp. 577-588
@inproceedings{e2694b090680431f8aa55faaa903dd0e,
title = "C-Cube: Elastic continuous clustering in the cloud",
abstract = "Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the workload fluctuates, often unpredictably. However, most existing solutions for continuous clustering assume either a central server or a distributed setting with a fixed number of dedicated servers. In other words, they are not elastic: they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they waste considerable resources, as the servers are under-utilized when the workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations and sends its results to a lightweight aggregator. This design enables dynamically adding and removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.",
author = "Zhenjie Zhang and Hu Shu and Zhihong Chong and Hua Lu and Yin Yang",
year = "2013",
doi = "10.1109/ICDE.2013.6544857",
language = "English",
isbn = "9781467349086",
pages = "577--588",
booktitle = "ICDE 2013 - 29th International Conference on Data Engineering",

}

TY - GEN

T1 - C-Cube

T2 - Elastic continuous clustering in the cloud

AU - Zhang, Zhenjie

AU - Shu, Hu

AU - Chong, Zhihong

AU - Lu, Hua

AU - Yang, Yin

PY - 2013

Y1 - 2013

N2 - Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the workload fluctuates, often unpredictably. However, most existing solutions for continuous clustering assume either a central server or a distributed setting with a fixed number of dedicated servers. In other words, they are not elastic: they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they waste considerable resources, as the servers are under-utilized when the workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations and sends its results to a lightweight aggregator. This design enables dynamically adding and removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.

UR - http://www.scopus.com/inward/record.url?scp=84881362360&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881362360&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2013.6544857

DO - 10.1109/ICDE.2013.6544857

M3 - Conference contribution

SN - 9781467349086

SP - 577

EP - 588

BT - ICDE 2013 - 29th International Conference on Data Engineering

ER -