When two choices are not enough

Balancing at scale in Distributed Stream Processing

Muhammad Anis Uddin Nasir, Gianmarco Morales, Nicolas Kourtellis, Marco Serafini

Research output: Chapter in Book/Report/Conference proceedingConference contribution

29 Citations (Scopus)

Abstract

Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to average out the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ≥ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150% and 60% respectively over the previous state-of-the-art when deployed on Apache Storm.

Original languageEnglish
Title of host publication2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages589-600
Number of pages12
ISBN (Electronic)9781509020195
DOIs
Publication statusPublished - 22 Jun 2016
Event32nd IEEE International Conference on Data Engineering, ICDE 2016 - Helsinki, Finland
Duration: 16 May 201620 May 2016

Other

Other32nd IEEE International Conference on Data Engineering, ICDE 2016
CountryFinland
CityHelsinki
Period16/5/1620/5/16

Fingerprint

Resource allocation
Processing
Throughput
Mathematical operators
Costs
Data storage equipment
Load balancing
Workers
Workload
Latency
Operator
Evaluation
Replication
Routing

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Computer Graphics and Computer-Aided Design
  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management

Cite this

Nasir, M. A. U., Morales, G., Kourtellis, N., & Serafini, M. (2016). When two choices are not enough: Balancing at scale in Distributed Stream Processing. In 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016 (pp. 589-600). [7498273] Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICDE.2016.7498273

When two choices are not enough : Balancing at scale in Distributed Stream Processing. / Nasir, Muhammad Anis Uddin; Morales, Gianmarco; Kourtellis, Nicolas; Serafini, Marco.

2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016. Institute of Electrical and Electronics Engineers Inc., 2016. p. 589-600 7498273.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Nasir, MAU, Morales, G, Kourtellis, N & Serafini, M 2016, When two choices are not enough: Balancing at scale in Distributed Stream Processing. in 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016., 7498273, Institute of Electrical and Electronics Engineers Inc., pp. 589-600, 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, 16/5/16. https://doi.org/10.1109/ICDE.2016.7498273
Nasir MAU, Morales G, Kourtellis N, Serafini M. When two choices are not enough: Balancing at scale in Distributed Stream Processing. In 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016. Institute of Electrical and Electronics Engineers Inc. 2016. p. 589-600. 7498273 https://doi.org/10.1109/ICDE.2016.7498273
Nasir, Muhammad Anis Uddin ; Morales, Gianmarco ; Kourtellis, Nicolas ; Serafini, Marco. / When two choices are not enough : Balancing at scale in Distributed Stream Processing. 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016. Institute of Electrical and Electronics Engineers Inc., 2016. pp. 589-600
@inproceedings{e4da7478f7b449a3820268ad558130ff,
title = "When two choices are not enough: Balancing at scale in Distributed Stream Processing",
abstract = "Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to average out the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ≥ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150{\%} and 60{\%} respectively over the previous state-of-the-art when deployed on Apache Storm.",
author = "Nasir, {Muhammad Anis Uddin} and Gianmarco Morales and Nicolas Kourtellis and Marco Serafini",
year = "2016",
month = "6",
day = "22",
doi = "10.1109/ICDE.2016.7498273",
language = "English",
pages = "589--600",
booktitle = "2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - When two choices are not enough

T2 - Balancing at scale in Distributed Stream Processing

AU - Nasir, Muhammad Anis Uddin

AU - Morales, Gianmarco

AU - Kourtellis, Nicolas

AU - Serafini, Marco

PY - 2016/6/22

Y1 - 2016/6/22

N2 - Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to average out the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ≥ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150% and 60% respectively over the previous state-of-the-art when deployed on Apache Storm.

AB - Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to average out the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ≥ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150% and 60% respectively over the previous state-of-the-art when deployed on Apache Storm.

UR - http://www.scopus.com/inward/record.url?scp=84980322422&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84980322422&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2016.7498273

DO - 10.1109/ICDE.2016.7498273

M3 - Conference contribution

SP - 589

EP - 600

BT - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -