Automatic Cloud I/O Configurator for I/O Intensive Parallel Applications

Jidong Zhai, Mingliang Liu, Ye Jin, Xiaosong Ma, Wenguang Chen

Research output: Contribution to journal, Article

1 Citation (Scopus)

Abstract

As the cloud platform becomes a promising alternative to traditional HPC (high performance computing) centers or in-house clusters, the I/O bottleneck problem is highlighted in this new environment, which typically offers top-of-the-line compute instances but sub-par communication and I/O facilities. It has been observed that changing the cloud I/O system configuration, such as the choice of file system, the number of I/O servers, and their placement strategy, leads to considerable variation in the performance and cost efficiency of I/O intensive parallel applications. However, storage system configuration is tedious and error-prone to do manually, even for expert users, leading to solutions that are grossly over-provisioned (low cost efficiency), substantially under-performing (poor performance) or, in the worst case, both. This paper proposes ACIC, a system which automatically searches for optimized I/O system configurations from many candidates for each individual application running on a given cloud platform. ACIC uses machine learning models to predict performance and cost. To tackle the high-dimensional parameter exploration space, we enable affordable, reusable, and incremental training on cloud platforms, guided by Plackett-Burman matrices for experiment design. Our evaluation results with four representative parallel applications indicate that ACIC consistently identifies optimal or near-optimal configurations among a large group of candidate settings. Compared with a commonly used baseline I/O configuration, the top ACIC-recommended configuration improves the applications' performance by a factor of up to 10.5 (3.1 on average) and reduces cost by up to 89 percent (51 percent on average). In addition, we carried out a small-scale user study for one of the test applications, which found that ACIC consistently beat the user and even the application's developer, often by a significant margin, in selecting optimized configurations.
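
The abstract names Plackett-Burman matrices as the guide for experiment design but does not spell out the procedure. The following is a minimal sketch of generic Plackett-Burman screening, not ACIC's implementation: it builds the standard 12-run design for up to 11 two-level factors and ranks factors by main effect. The factor names, the `fake_runtime` response, and all function names are hypothetical and exist only for illustration.

```python
# Standard first row of the 12-run Plackett-Burman design (11 two-level factors).
PB12_GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

# Hypothetical factor names; columns with no real factor act as dummies.
FACTORS = ["file_system", "io_server_count", "server_placement"] + [
    f"unused_{i}" for i in range(8)
]


def pb12_design():
    """Return the 12 x 11 design: 11 cyclic shifts of the generator plus an all-low run."""
    k = len(PB12_GENERATOR)
    rows = [PB12_GENERATOR[-i:] + PB12_GENERATOR[:-i] for i in range(k)]
    rows.append([-1] * k)
    return rows


def main_effects(design, responses):
    """Effect of factor j: mean response at its high level minus mean at its low level."""
    n = len(design)
    return [
        2.0 * sum(row[j] * y for row, y in zip(design, responses)) / n
        for j in range(len(design[0]))
    ]


def rank_factors(names, effects):
    """Sort (name, effect) pairs by absolute effect, largest first."""
    return sorted(zip(names, effects), key=lambda pair: abs(pair[1]), reverse=True)


def fake_runtime(row):
    """Synthetic response (assumption): only the first two factors matter."""
    return 100.0 + 20.0 * row[0] - 5.0 * row[1]


if __name__ == "__main__":
    design = pb12_design()
    effects = main_effects(design, [fake_runtime(r) for r in design])
    for name, effect in rank_factors(FACTORS, effects):
        print(f"{name:18s} {effect:+.1f}")
```

With 12 runs instead of the 2^11 = 2048 needed for a full factorial sweep, a design like this only screens for the factors with the largest main effects; it does not capture interactions between factors.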

Original language: English
Article number: 6977978
Pages (from-to): 3275-3288
Number of pages: 14
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 26
Issue number: 12
DOI: 10.1109/TPDS.2014.2378277
Publication status: Published - 1 Dec 2015

Fingerprint

Costs
Learning systems
Servers
Communication
Experiments

Keywords

  • Cloud Computing
  • Parallel Applications
  • Performance Tool
  • Storage Configuration

ASJC Scopus subject areas

  • Hardware and Architecture
  • Signal Processing
  • Computational Theory and Mathematics

Cite this

Automatic Cloud I/O Configurator for I/O Intensive Parallel Applications. / Zhai, Jidong; Liu, Mingliang; Jin, Ye; Ma, Xiaosong; Chen, Wenguang.

In: IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 12, 6977978, 01.12.2015, p. 3275-3288.

@article{9a58ec2e78d647dd958f10198dad4615,
title = "Automatic Cloud I/O Configurator for I/O Intensive Parallel Applications",
keywords = "Cloud Computing, Parallel Applications, Performance Tool, Storage Configuration",
author = "Jidong Zhai and Mingliang Liu and Ye Jin and Xiaosong Ma and Wenguang Chen",
year = "2015",
month = "12",
day = "1",
doi = "10.1109/TPDS.2014.2378277",
language = "English",
volume = "26",
pages = "3275--3288",
journal = "IEEE Transactions on Parallel and Distributed Systems",
issn = "1045-9219",
publisher = "IEEE Computer Society",
number = "12",

}

TY - JOUR

T1 - Automatic Cloud I/O Configurator for I/O Intensive Parallel Applications

AU - Zhai, Jidong

AU - Liu, Mingliang

AU - Jin, Ye

AU - Ma, Xiaosong

AU - Chen, Wenguang

PY - 2015/12/1

Y1 - 2015/12/1

KW - Cloud Computing

KW - Parallel Applications

KW - Performance Tool

KW - Storage Configuration

UR - http://www.scopus.com/inward/record.url?scp=84961710284&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961710284&partnerID=8YFLogxK

U2 - 10.1109/TPDS.2014.2378277

DO - 10.1109/TPDS.2014.2378277

M3 - Article

VL - 26

SP - 3275

EP - 3288

JO - IEEE Transactions on Parallel and Distributed Systems

JF - IEEE Transactions on Parallel and Distributed Systems

SN - 1045-9219

IS - 12

M1 - 6977978

ER -