LRC

Dependency-aware cache management for data analytics clusters

Yinghao Yu, Wei Wang, Jun Zhang, Khaled Letaief

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

7 Citations (Scopus)

Abstract

Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies - notably the Least Recently Used (LRU) policy - that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by 'guessing' the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference count is defined, for each data block, as the number of dependent child blocks that have not been computed yet. We demonstrate the efficacy of LRC through both empirical analysis and cluster deployments against popular benchmarking workloads. Our Spark implementation shows that, compared with LRU, LRC speeds up typical applications by 60%.
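The abstract states the eviction rule precisely: each cached block carries a reference count, defined as the number of its dependent child blocks (per the application's DAG) not yet computed, and the block with the smallest count is evicted first. The following is a minimal illustrative sketch of that rule, not the authors' Spark implementation; the class and method names (`LRCCache`, `put`, `get`) and the block-level interface are assumptions for illustration.

```python
class LRCCache:
    """Toy sketch of Least Reference Count (LRC) eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}     # block_id -> cached data
        self.ref_count = {}  # block_id -> number of uncomputed child blocks

    def put(self, block_id, data, num_children):
        # Evict the block with the smallest reference count until there is room.
        while len(self.blocks) >= self.capacity:
            victim = min(self.blocks, key=lambda b: self.ref_count[b])
            del self.blocks[victim]
            del self.ref_count[victim]
        self.blocks[block_id] = data
        self.ref_count[block_id] = num_children

    def get(self, block_id):
        # A hit means one more dependent child has now been computed from
        # this block, so its remaining reference count drops by one.
        if block_id in self.blocks:
            self.ref_count[block_id] -= 1
            return self.blocks[block_id]
        return None  # cache miss
```

Note the contrast with LRU: eviction depends only on how many future accesses the DAG still implies for a block, not on how recently it was touched.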

Original language: English
Title of host publication: INFOCOM 2017 - IEEE Conference on Computer Communications
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509053360
DOI: 10.1109/INFOCOM.2017.8057007
Publication status: Published - 2 Oct 2017
Externally published: Yes
Event: 2017 IEEE Conference on Computer Communications, INFOCOM 2017 - Atlanta, United States
Duration: 1 May 2017 - 4 May 2017

Other

Other: 2017 IEEE Conference on Computer Communications, INFOCOM 2017
Country: United States
City: Atlanta
Period: 1/5/17 - 4/5/17

ASJC Scopus subject areas

  • Computer Science (all)
  • Electrical and Electronic Engineering

Cite this

Yu, Y., Wang, W., Zhang, J., & Letaief, K. (2017). LRC: Dependency-aware cache management for data analytics clusters. In INFOCOM 2017 - IEEE Conference on Computer Communications [8057007]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/INFOCOM.2017.8057007

@inproceedings{a000bf565beb4c6d8c0f359b7b57d5ff,
title = "LRC: Dependency-aware cache management for data analytics clusters",
abstract = "Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies - notably the Least Recently Used (LRU) policy - that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by 'guessing' the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference count is defined, for each data block, as the number of dependent child blocks that have not been computed yet. We demonstrate the efficacy of LRC through both empirical analysis and cluster deployments against popular benchmarking workloads. Our Spark implementation shows that, compared with LRU, LRC speeds up typical applications by 60{\%}.",
author = "Yinghao Yu and Wei Wang and Jun Zhang and Khaled Letaief",
year = "2017",
month = oct,
day = "2",
doi = "10.1109/INFOCOM.2017.8057007",
language = "English",
isbn = "9781509053360",
booktitle = "INFOCOM 2017 - IEEE Conference on Computer Communications",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - LRC

T2 - Dependency-aware cache management for data analytics clusters

AU - Yu, Yinghao

AU - Wang, Wei

AU - Zhang, Jun

AU - Letaief, Khaled

PY - 2017/10/2

Y1 - 2017/10/2

N2 - Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies - notably the Least Recently Used (LRU) policy - that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by 'guessing' the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference count is defined, for each data block, as the number of dependent child blocks that have not been computed yet. We demonstrate the efficacy of LRC through both empirical analysis and cluster deployments against popular benchmarking workloads. Our Spark implementation shows that, compared with LRU, LRC speeds up typical applications by 60%.

AB - Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies - notably the Least Recently Used (LRU) policy - that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by 'guessing' the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference count is defined, for each data block, as the number of dependent child blocks that have not been computed yet. We demonstrate the efficacy of LRC through both empirical analysis and cluster deployments against popular benchmarking workloads. Our Spark implementation shows that, compared with LRU, LRC speeds up typical applications by 60%.

UR - http://www.scopus.com/inward/record.url?scp=85034073995&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85034073995&partnerID=8YFLogxK

U2 - 10.1109/INFOCOM.2017.8057007

DO - 10.1109/INFOCOM.2017.8057007

M3 - Conference contribution

BT - INFOCOM 2017 - IEEE Conference on Computer Communications

PB - Institute of Electrical and Electronics Engineers Inc.

ER -