MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Ashwin M. Aji, Lokendra S. Panwar, Feng Ji, Karthik Murthy, Milind Chabbi, Pavan Balaji, Keith R. Bisset, James Dinan, Wu Chun Feng, John Mellor-Crummey, Xiaosong Ma, Rajeev Thakur

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimal contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.
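
As a quick illustration of the gap the abstract describes, the sketch below contrasts the conventional pattern, in which the application manually stages GPU data through host memory before calling MPI, with the accelerator-aware pattern in which a device buffer is handed directly to the MPI library (the style supported by MPI-ACC and CUDA-aware MVAPICH). This is a minimal sketch using only standard MPI and CUDA runtime calls; the buffer size, message tags, and two-rank setup are illustrative assumptions, and the specific MPI-ACC interface for designating a buffer as GPU-resident is not reproduced here.

/* Minimal two-rank sketch (run with mpiexec -n 2); assumes a CUDA GPU and,
 * for the second exchange, an accelerator-aware MPI library of the kind
 * discussed in the abstract. Illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)                          /* number of doubles exchanged */

int main(int argc, char **argv)
{
    int rank;
    double *d_buf;                           /* device (GPU) buffer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, N * sizeof(double));

    /* (1) Conventional MPI: the application stages GPU data through
     *     host memory by hand before and after communication. */
    double *h_buf = (double *)malloc(N * sizeof(double));
    if (rank == 0) {
        cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, N * sizeof(double), cudaMemcpyHostToDevice);
    }

    /* (2) Accelerator-aware MPI: the device pointer is passed directly,
     *     and the library pipelines the device-host-network transfer
     *     internally (end-to-end data movement). */
    if (rank == 0) {
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

The second exchange is where the runtime optimizations mentioned in the abstract (transfer pipelining, scalable memory management, architecture-aware balancing) would apply, since the library rather than the application owns the staging path.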

Original language: English
Article number: 7127020
Pages (from-to): 1401-1414
Number of pages: 14
Journal: IEEE Transactions on Parallel and Distributed Systems
ISSN: 1045-9219
Publisher: IEEE Computer Society
Volume: 27
Issue number: 5
DOI: 10.1109/TPDS.2015.2446479
Publication status: Published - 1 May 2016
Externally published: Yes


Keywords

  • concurrent programming
  • distributed architectures
  • heterogeneous (hybrid) systems
  • parallel systems

ASJC Scopus subject areas

  • Hardware and Architecture
  • Signal Processing
  • Computational Theory and Mathematics

Cite this

Aji, A. M., Panwar, L. S., Ji, F., Murthy, K., Chabbi, M., Balaji, P., ... Thakur, R. (2016). MPI-ACC: Accelerator-Aware MPI for Scientific Applications. IEEE Transactions on Parallel and Distributed Systems, 27(5), 1401-1414. [7127020]. https://doi.org/10.1109/TPDS.2015.2446479
