Using shared memory to accelerate MapReduce on graphics processing units

Feng Ji, Xiaosong Ma

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

33 Citations (Scopus)

Abstract

Modern General Purpose Graphics Processing Units (GPGPUs) provide high degrees of parallelism in computation and memory access, making them suitable for data parallel applications such as those using the elastic MapReduce model. Yet designing a MapReduce framework for GPUs faces significant challenges brought by their multi-level memory hierarchy. Due to the absence of atomic operations in the earlier generations of GPUs, existing GPU MapReduce frameworks have problems in handling input/output data with varied or unpredictable sizes. Also, existing frameworks utilize mostly a single level of memory, i.e., the relatively spacious yet slow global memory. In this work, we attempt to explore the potential benefit of enabling a GPU MapReduce framework to use multiple levels of the GPU memory hierarchy. We propose a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy. Centering around the efficient utilization of the fast but very small shared memory, we designed and implemented a GPU MapReduce framework, whose key techniques include (1) shared memory staging area management, (2) thread-role partitioning, and (3) intra-block thread synchronization. We carried out evaluation with five popular MapReduce workloads and studied their performance under different GPU memory usage choices. Our results reveal that exploiting GPU shared memory is highly promising for the Map phase (with an average 2.85x speedup over using global memory only), while in the Reduce phase the benefit of using shared memory is much less pronounced, due to the high input-to-output ratio. In addition, when compared to Mars, an existing GPU MapReduce framework, our system is shown to bring a significant speedup in Map/Reduce phases.
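The staging scheme described in the abstract can be sketched very loosely as follows. This is a host-side Python illustration, not the paper's CUDA implementation: map outputs are first buffered in a small, fast "staging area" (standing in for GPU shared memory) and flushed to a large "global" output buffer in bulk when the staging area fills. All names, sizes, and the word-count map function are illustrative assumptions.

```python
# Hypothetical sketch of shared-memory staging for the Map phase.
# The staging list stands in for a thread block's shared memory;
# global_output stands in for the slow but spacious global memory.
# Capacity is illustrative; real GPU shared memory is a few KB per block.

STAGING_CAPACITY = 4

def map_with_staging(records, map_fn, global_output):
    staging = []  # small, fast buffer (the "shared memory" stand-in)
    for record in records:
        for kv in map_fn(record):
            staging.append(kv)
            if len(staging) == STAGING_CAPACITY:
                # Flush: one bulk write to the "global memory" buffer,
                # amortizing the cost of slow-memory accesses.
                global_output.extend(staging)
                staging.clear()
    global_output.extend(staging)  # flush any remainder

# Word count, a classic MapReduce workload, as an example map function:
def word_count_map(line):
    return [(word, 1) for word in line.split()]

out = []
map_with_staging(["a b a", "c a"], word_count_map, out)
print(out)  # [('a', 1), ('b', 1), ('a', 1), ('c', 1), ('a', 1)]
```

On a real GPU, the flush step is where the paper's other two techniques would come in: designated threads within a block cooperate on the bulk copy (thread-role partitioning), coordinated by a barrier such as intra-block synchronization, so that no thread reads a staging slot before it is written.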

Original language: English
Title of host publication: Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Pages: 805-816
Number of pages: 12
ISBN (Print): 9780769543857
DOI: 10.1109/IPDPS.2011.80
Publication status: Published - 3 Oct 2011
Externally published: Yes
Event: 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 - Anchorage, AK, United States
Duration: 16 May 2011 - 20 May 2011


Fingerprint

  • Data storage equipment
  • Graphics processing unit
  • Synchronization

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications

Cite this

Ji, F., & Ma, X. (2011). Using shared memory to accelerate mapreduce on graphics processing units. In Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 (pp. 805-816). [6012890] https://doi.org/10.1109/IPDPS.2011.80

@inproceedings{e43c6d2e987e4c48906af32fb8a77adb,
title = "Using shared memory to accelerate mapreduce on graphics processing units",
abstract = "Modern General Purpose Graphics Processing Units (GPGPUs) provide high degrees of parallelism in computation and memory access, making them suitable for data parallel applications such as those using the elastic MapReduce model. Yet designing a MapReduce framework for GPUs faces significant challenges brought by their multi-level memory hierarchy. Due to the absence of atomic operations in the earlier generations of GPUs, existing GPU MapReduce frameworks have problems in handling input/output data with varied or unpredictable sizes. Also, existing frameworks utilize mostly a single level of memory, i.e., the relatively spacious yet slow global memory. In this work, we attempt to explore the potential benefit of enabling a GPU MapReduce framework to use multiple levels of the GPU memory hierarchy. We propose a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy. Centering around the efficient utilization of the fast but very small shared memory, we designed and implemented a GPU MapReduce framework, whose key techniques include (1) shared memory staging area management, (2) thread-role partitioning, and (3) intra-block thread synchronization. We carried out evaluation with five popular MapReduce workloads and studied their performance under different GPU memory usage choices. Our results reveal that exploiting GPU shared memory is highly promising for the Map phase (with an average 2.85x speedup over using global memory only), while in the Reduce phase the benefit of using shared memory is much less pronounced, due to the high input-to-output ratio. In addition, when compared to Mars, an existing GPU MapReduce framework, our system is shown to bring a significant speedup in Map/Reduce phases.",
author = "Feng Ji and Xiaosong Ma",
year = "2011",
month = "10",
day = "3",
doi = "10.1109/IPDPS.2011.80",
language = "English",
isbn = "9780769543857",
pages = "805--816",
booktitle = "Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011",

}
