Improving data availability for better access performance

A study on caching scientific data on distributed desktop workstations

Xiaosong Ma, Sudharshan S. Vazhkudai, Z. Zhang

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Client-side data caching serves as an excellent mechanism to store and analyze the rapidly growing scientific data, motivating distributed, client-side caches built from unreliable desktop storage contributions to store and access large scientific data. They offer several desirable properties, such as performance impedance matching, improved space utilization, and high parallel I/O bandwidth. In this context, we are faced with two key challenges: (1) the finite amount of contributed cache space is stretched by the ever increasing scientific dataset sizes and (2) the transient nature of volunteered storage nodes impacts data availability. In this article, we address these challenges by exploiting the existence of external, primary copies of datasets. We propose a novel combination of prefix caching, collective download, and remote partial data recovery (RPDR), to deal with optimal cache space consumption and storage node volatility. Our evaluation, performed on our FreeLoader prototype, indicates that prefix caching can significantly improve the cache hit rate and partial data recovery is better than (or comparable to) many persistent-data availability techniques.
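The abstract names prefix caching and partial data recovery without giving details. As a rough illustration only, the Python sketch below shows the general idea under stated assumptions: only a leading fraction of each dataset is kept in the desktop cache, the rest stays on the external primary copy, and chunks that sat on a departed node are the only ones marked for re-fetching. The names `ByteRange`, `split_prefix`, `chunks_to_recover`, and the `prefix_fraction` parameter are hypothetical and are not taken from the FreeLoader prototype or the article.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ByteRange:
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset


def split_prefix(dataset_size: int, prefix_fraction: float) -> Tuple[ByteRange, ByteRange]:
    """Split a dataset into a cached prefix and a suffix left on the primary copy.

    Assumption: the prefix is a fixed fraction of the dataset size.
    """
    if not 0.0 <= prefix_fraction <= 1.0:
        raise ValueError("prefix_fraction must be in [0, 1]")
    cut = int(dataset_size * prefix_fraction)
    return ByteRange(0, cut), ByteRange(cut, dataset_size)


def chunks_to_recover(chunk_owner: Dict[int, str], live_nodes: Set[str]) -> List[int]:
    """Return ids of cached chunks whose storage node is currently unavailable.

    Only these chunks would need to be re-fetched from the primary copy;
    chunks on live nodes are left untouched.
    """
    return sorted(c for c, node in chunk_owner.items() if node not in live_nodes)
```

With this split, a reader could start on the cached prefix while the suffix is fetched from the primary copy, and a node departure triggers recovery of only the chunks that node held rather than the whole dataset.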

Original language: English
Pages (from-to): 419-438
Number of pages: 20
Journal: Journal of Grid Computing
Volume: 7
Issue number: 4
DOI: 10.1007/s10723-009-9122-7
Publication status: Published - 1 Nov 2009
Externally published: Yes

Fingerprint

Availability
Recovery
Bandwidth

Keywords

  • Desktop grids
  • Scientific data
  • Storage scavenging

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Improving data availability for better access performance: A study on caching scientific data on distributed desktop workstations. / Ma, Xiaosong; Vazhkudai, Sudharshan S.; Zhang, Z.

In: Journal of Grid Computing, Vol. 7, No. 4, 01.11.2009, p. 419-438.

@article{5c753bcdd28c46118a8f59a1c05d5791,
title = "Improving data availability for better access performance: A study on caching scientific data on distributed desktop workstations",
abstract = "Client-side data caching serves as an excellent mechanism to store and analyze the rapidly growing scientific data, motivating distributed, client-side caches built from unreliable desktop storage contributions to store and access large scientific data. They offer several desirable properties, such as performance impedance matching, improved space utilization, and high parallel I/O bandwidth. In this context, we are faced with two key challenges: (1) the finite amount of contributed cache space is stretched by the ever increasing scientific dataset sizes and (2) the transient nature of volunteered storage nodes impacts data availability. In this article, we address these challenges by exploiting the existence of external, primary copies of datasets. We propose a novel combination of prefix caching, collective download, and remote partial data recovery (RPDR), to deal with optimal cache space consumption and storage node volatility. Our evaluation, performed on our FreeLoader prototype, indicates that prefix caching can significantly improve the cache hit rate and partial data recovery is better than (or comparable to) many persistent-data availability techniques.",
keywords = "Desktop grids, Scientific data, Storage scavenging",
author = "Xiaosong Ma and Vazhkudai, {Sudharshan S.} and Z. Zhang",
year = "2009",
month = "11",
day = "1",
doi = "10.1007/s10723-009-9122-7",
language = "English",
volume = "7",
pages = "419--438",
journal = "Journal of Grid Computing",
issn = "1570-7873",
publisher = "Springer Netherlands",
number = "4"
}

TY - JOUR

T1 - Improving data availability for better access performance

T2 - A study on caching scientific data on distributed desktop workstations

AU - Ma, Xiaosong

AU - Vazhkudai, Sudharshan S.

AU - Zhang, Z.

PY - 2009/11/1

Y1 - 2009/11/1

N2 - Client-side data caching serves as an excellent mechanism to store and analyze the rapidly growing scientific data, motivating distributed, client-side caches built from unreliable desktop storage contributions to store and access large scientific data. They offer several desirable properties, such as performance impedance matching, improved space utilization, and high parallel I/O bandwidth. In this context, we are faced with two key challenges: (1) the finite amount of contributed cache space is stretched by the ever increasing scientific dataset sizes and (2) the transient nature of volunteered storage nodes impacts data availability. In this article, we address these challenges by exploiting the existence of external, primary copies of datasets. We propose a novel combination of prefix caching, collective download, and remote partial data recovery (RPDR), to deal with optimal cache space consumption and storage node volatility. Our evaluation, performed on our FreeLoader prototype, indicates that prefix caching can significantly improve the cache hit rate and partial data recovery is better than (or comparable to) many persistent-data availability techniques.

KW - Desktop grids

KW - Scientific data

KW - Storage scavenging

UR - http://www.scopus.com/inward/record.url?scp=77949656389&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77949656389&partnerID=8YFLogxK

U2 - 10.1007/s10723-009-9122-7

DO - 10.1007/s10723-009-9122-7

M3 - Article

VL - 7

SP - 419

EP - 438

JO - Journal of Grid Computing

JF - Journal of Grid Computing

SN - 1570-7873

IS - 4

ER -