Spread-n-share: Improving application performance and cluster throughput with resource-aware job placement

Xiongchao Tang, Haojie Wang, Xiaosong Ma, Nosayba El-Sayed, Jidong Zhai, Wenguang Chen, Ashraf Aboulnaga

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often brings self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
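The contrast between compact and spread placement described in the abstract can be illustrated with a toy calculation. The sketch below is not Uberun's algorithm: it models only memory bandwidth (not shared-cache capacity), and the node size, bandwidth budget, job parameters, and all function names are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from math import ceil

# Hypothetical node parameters, assumed for illustration only.
CORES_PER_NODE = 16
BW_PER_NODE = 100.0  # memory bandwidth budget per node, in GB/s


@dataclass
class Job:
    name: str
    procs: int          # number of processes in the parallel job
    bw_per_proc: float  # assumed bandwidth demand per process, in GB/s


def compact_nodes(job):
    """Compact placement: as few nodes as the core count allows."""
    return ceil(job.procs / CORES_PER_NODE)


def spread_nodes(job):
    """Spread placement: limit processes per node so that the job's
    per-node bandwidth demand stays within the node's budget."""
    if job.bw_per_proc > 0:
        max_by_bw = int(BW_PER_NODE // job.bw_per_proc)
    else:
        max_by_bw = CORES_PER_NODE
    procs_per_node = max(1, min(CORES_PER_NODE, max_by_bw))
    return ceil(job.procs / procs_per_node)


def per_node_demand(job, nodes):
    """Bandwidth demand on the most loaded node for a given node count."""
    return ceil(job.procs / nodes) * job.bw_per_proc


def fits_together(placements):
    """True if the given (job, procs_on_node) pairs fit on one node
    in terms of both cores and bandwidth."""
    cores = sum(n for _, n in placements)
    bandwidth = sum(job.bw_per_proc * n for job, n in placements)
    return cores <= CORES_PER_NODE and bandwidth <= BW_PER_NODE


# Two hypothetical jobs: one bandwidth-bound, one compute-bound.
mem_job = Job("mem_bound", procs=32, bw_per_proc=10.0)
cpu_job = Job("cpu_bound", procs=32, bw_per_proc=1.0)

for job in (mem_job, cpu_job):
    c, s = compact_nodes(job), spread_nodes(job)
    print(job.name, "compact:", c, "nodes,", per_node_demand(job, c), "GB/s per node;",
          "spread:", s, "nodes,", per_node_demand(job, s), "GB/s per node")

# After spreading, each node runs 8 processes of each job.
print("co-location fits:", fits_together([(mem_job, 8), (cpu_job, 8)]))
```

Under these assumed numbers, compact placement oversubscribes each node's bandwidth budget for the bandwidth-bound job, while spreading it over more nodes leaves spare cores that the compute-bound job can share without exceeding the budget.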

Original language: English
Title of host publication: Proceedings of SC 2019
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781450362290
DOIs: https://doi.org/10.1145/3295500.3356152
Publication status: Published - 17 Nov 2019
Event: 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 - Denver, United States
Duration: 17 Nov 2019 to 22 Nov 2019

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
Country: United States
City: Denver
Period: 17/11/19 to 22/11/19

Fingerprint

Throughput
Data storage equipment
Bandwidth
Telecommunication networks
Program processors
Containers
Scheduling

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this

Tang, X., Wang, H., Ma, X., El-Sayed, N., Zhai, J., Chen, W., & Aboulnaga, A. (2019). Spread-n-share: Improving application performance and cluster throughput with resource-aware job placement. In Proceedings of SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis [a12] (International Conference for High Performance Computing, Networking, Storage and Analysis, SC). IEEE Computer Society. https://doi.org/10.1145/3295500.3356152

@inproceedings{aafefcc8924e476398db81b2aa18ae74,
title = "Spread-n-share: Improving application performance and cluster throughput with resource-aware job placement",
abstract = "Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often brings self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locate jobs in a resource compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8{\%} on average over CE, while achieving an average individual job speedup of 1.8{\%}.",
author = "Xiongchao Tang and Haojie Wang and Xiaosong Ma and Nosayba El-Sayed and Jidong Zhai and Wenguang Chen and Ashraf Aboulnaga",
year = "2019",
month = "11",
day = "17",
doi = "10.1145/3295500.3356152",
language = "English",
series = "International Conference for High Performance Computing, Networking, Storage and Analysis, SC",
publisher = "IEEE Computer Society",
booktitle = "Proceedings of SC 2019",

}

TY - GEN

T1 - Spread-n-share

T2 - Improving application performance and cluster throughput with resource-aware job placement

AU - Tang, Xiongchao

AU - Wang, Haojie

AU - Ma, Xiaosong

AU - El-Sayed, Nosayba

AU - Zhai, Jidong

AU - Chen, Wenguang

AU - Aboulnaga, Ashraf

PY - 2019/11/17

Y1 - 2019/11/17

N2 - Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often brings self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locate jobs in a resource compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.

AB - Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often brings self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locate jobs in a resource compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.

UR - http://www.scopus.com/inward/record.url?scp=85076128948&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85076128948&partnerID=8YFLogxK

U2 - 10.1145/3295500.3356152

DO - 10.1145/3295500.3356152

M3 - Conference contribution

AN - SCOPUS:85076128948

T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC

BT - Proceedings of SC 2019

PB - IEEE Computer Society

ER -