Exploiting locality in graph analytics through hardware-accelerated traversal scheduling

Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

Graph processing is increasingly bottlenecked by main memory accesses. On-chip caches are of little help because the irregular structure of graphs causes seemingly random memory references. However, most real-world graphs offer significant potential locality: it is just hard to predict ahead of time. In practice, graphs have well-connected regions where relatively few vertices share edges with many common neighbors. If these vertices were processed together, graph processing would enjoy significant data reuse. Hence, a graph's traversal schedule largely determines its locality. This paper explores online traversal scheduling strategies that exploit the community structure of real-world graphs to improve locality. Software graph processing frameworks use simple, locality-oblivious scheduling because, on general-purpose cores, the benefits of locality-aware scheduling are outweighed by its overheads. Software frameworks rely on offline preprocessing to improve locality. Unfortunately, preprocessing is so expensive that its costs often negate any benefits from improved locality. Recent graph processing accelerators have inherited this design. Our insight is that this misses an opportunity: hardware acceleration allows for more sophisticated, online locality-aware scheduling than can be realized in software, letting systems significantly improve locality without any preprocessing. To exploit this insight, we present bounded depth-first scheduling (BDFS), a simple online locality-aware scheduling strategy. BDFS restricts each core to explore one small, connected region of the graph at a time, improving locality on graphs with good community structure. We then present HATS, a hardware-accelerated traversal scheduler that adds just 0.4% area and 0.2% power over general-purpose cores. We evaluate BDFS and HATS on several algorithms using large real-world graphs. On a simulated 16-core system, BDFS reduces main memory accesses by up to 2.4x and by 30% on average. However, BDFS is too expensive in software and degrades performance by 21% on average. HATS eliminates these overheads, allowing BDFS to improve performance by 83% on average (up to 3.1x) over a locality-oblivious software implementation and by 31% on average (up to 2.1x) over specialized prefetchers.
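
The abstract describes BDFS only at a high level, so the following is a minimal single-threaded sketch of the general idea rather than the authors' implementation. It assumes an adjacency-list graph, a hypothetical function name `bdfs_schedule`, and a hypothetical `max_depth` parameter as the bound. How the paper chooses the bound, divides work among cores, or applies algorithm-specific updates is not stated in the abstract and is not modeled here.

```python
from typing import Dict, List


def bdfs_schedule(adj: Dict[int, List[int]], max_depth: int) -> List[int]:
    """Return a vertex processing order from a depth-bounded DFS.

    Illustrative assumption: roots are taken in vertex-ID order, and from
    each root the traversal follows unvisited neighbors at most
    `max_depth` hops before backing up.
    """
    visited = set()
    order: List[int] = []
    for root in sorted(adj):
        if root in visited:
            continue
        visited.add(root)
        order.append(root)
        # Each stack entry: (vertex, depth, iterator over its neighbors).
        stack = [(root, 0, iter(adj[root]))]
        while stack:
            _, depth, neighbors = stack[-1]
            child = None
            if depth < max_depth:
                # Take the next neighbor that has not been scheduled yet.
                child = next((n for n in neighbors if n not in visited), None)
            if child is None:
                stack.pop()  # depth bound reached or neighbors exhausted
                continue
            visited.add(child)
            order.append(child)
            stack.append((child, depth + 1, iter(adj.get(child, []))))
    return order


# Two triangles ("communities") whose vertex IDs are interleaved.
graph = {0: [2, 4], 2: [0, 4], 4: [0, 2], 1: [3, 5], 3: [1, 5], 5: [1, 3]}
print(bdfs_schedule(graph, 0))  # [0, 1, 2, 3, 4, 5]
print(bdfs_schedule(graph, 2))  # [0, 2, 4, 1, 3, 5]
```

With a bound of 0 the sketch degenerates to plain vertex-ID order, which alternates between the two communities. With a bound of 2 each community is scheduled contiguously, which is the kind of reuse opportunity the abstract attributes to community structure.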

Original language: English
Title of host publication: Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
Publisher: IEEE Computer Society
Pages: 1-14
Number of pages: 14
Volume: 2018-October
ISBN (Electronic): 9781538662403
DOIs: https://doi.org/10.1109/MICRO.2018.00010
Publication status: Published - 12 Dec 2018
Event: 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018 - Fukuoka, Japan
Duration: 20 Oct 2018 to 24 Oct 2018

Keywords

  • Caches
  • Graph analytics
  • Locality
  • Multicore
  • Prefetching
  • Scheduling

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Mukkara, A., Beckmann, N., Abeydeera, M., Ma, X., & Sanchez, D. (2018). Exploiting locality in graph analytics through hardware-accelerated traversal scheduling. In Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018 (Vol. 2018-October, pp. 1-14). [8574527] IEEE Computer Society. https://doi.org/10.1109/MICRO.2018.00010

@inproceedings{9570a6f937274c73846dbd95ac1c1f93,
title = "Exploiting locality in graph analytics through hardware-accelerated traversal scheduling",
keywords = "Caches, Graph analytics, Locality, Multicore, Prefetching, Scheduling",
author = "Anurag Mukkara and Nathan Beckmann and Maleen Abeydeera and Xiaosong Ma and Daniel Sanchez",
year = "2018",
month = "12",
day = "12",
doi = "10.1109/MICRO.2018.00010",
language = "English",
volume = "2018-October",
pages = "1--14",
booktitle = "Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018",
publisher = "IEEE Computer Society",
}

TY - GEN

T1 - Exploiting locality in graph analytics through hardware-accelerated traversal scheduling

AU - Mukkara, Anurag

AU - Beckmann, Nathan

AU - Abeydeera, Maleen

AU - Ma, Xiaosong

AU - Sanchez, Daniel

PY - 2018/12/12

Y1 - 2018/12/12

KW - Caches

KW - Graph analytics

KW - Locality

KW - Multicore

KW - Prefetching

KW - Scheduling

UR - http://www.scopus.com/inward/record.url?scp=85060057243&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85060057243&partnerID=8YFLogxK

U2 - 10.1109/MICRO.2018.00010

DO - 10.1109/MICRO.2018.00010

M3 - Conference contribution

VL - 2018-October

SP - 1

EP - 14

BT - Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018

PB - IEEE Computer Society

ER -