Efficient intranode communication in GPU-accelerated systems

Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu Chun Feng, Xiaosong Ma

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

Current implementations of MPI are unaware of accelerator memory (i.e., GPU device memory) and require programmers to explicitly move data between memory spaces. This approach is inefficient, especially for intranode communication, where it can result in several extra copy operations. In this work, we integrate GPU-awareness into a popular MPI runtime system and develop techniques to significantly reduce the cost of intranode communication involving one or more GPUs. Experimental results show up to a 2x increase in bandwidth, resulting in an average 4.3% improvement in the total execution time of a halo exchange benchmark.
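To make the staging overhead described in the abstract concrete, the sketch below (not taken from the paper) contrasts the explicit host-staging pattern that a GPU-unaware MPI forces on the programmer with the direct device-pointer path that a GPU-aware runtime enables. It assumes two MPI ranks on the same node, one CUDA device buffer per rank, and a CUDA-aware MPI build for the second exchange; the buffer size and message tags are illustrative only.

/*
 * Minimal sketch: explicit host staging vs. a GPU-aware MPI path.
 * Assumes ranks 0 and 1 share a node; the second exchange requires
 * a CUDA-aware MPI build. Not code from the paper.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* number of doubles exchanged (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double *d_buf;    /* GPU device buffer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, N * sizeof(double));
    if (rank == 0)
        cudaMemset(d_buf, 0, N * sizeof(double));   /* dummy payload */

    /* Traditional path: the programmer stages data through host memory,
     * adding an extra copy on each side of the intranode transfer. */
    double *h_buf = (double *)malloc(N * sizeof(double));
    if (rank == 0) {
        cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, N * sizeof(double), cudaMemcpyHostToDevice);
    }

    /* GPU-aware path: with a GPU-aware MPI build, the device pointer is
     * passed to MPI directly and the runtime moves the data itself, which
     * is where intranode optimizations of the kind described above apply. */
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

In the first exchange the application pays for a device-to-host copy, a host-side MPI transfer, and a host-to-device copy; in the second, the runtime sees the device pointer and can choose a cheaper intranode route, which is the setting the paper's techniques target.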

Original language: English
Title of host publication: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012
Pages: 1838-1847
Number of pages: 10
DOI: 10.1109/IPDPSW.2012.227
ISBN (Print): 9780769546766
Publication status: Published - 18 Oct 2012
Externally published: Yes
Event: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012 - Shanghai, China
Duration: 21 May 2012 - 25 May 2012

Keywords

  • CUDA
  • GPU
  • Intranode communication
  • MPI
  • MPICH2
  • Nemesis

ASJC Scopus subject areas

  • Software

Cite this

Ji, F., Aji, A. M., Dinan, J., Buntinas, D., Balaji, P., Feng, W. C., & Ma, X. (2012). Efficient intranode communication in GPU-accelerated systems. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012 (pp. 1838-1847). [6270862] https://doi.org/10.1109/IPDPSW.2012.227
