Coupling prefix caching and collective downloads for remote dataset access

Xiaosong Ma, Vincent W. Freeh, Tao Yang, Sudharshan S. Vazhkudai, Tyler A. Simon, Stephen L. Scott

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
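The abstract says the prefix size is chosen from the initial download overhead and the downloading speed. The sketch below is not the paper's analytical model, which this record does not reproduce. It is a minimal illustration under assumed simplifications: the client reads the dataset sequentially at a constant local rate, and the remote suffix arrives at a constant rate after a fixed startup delay. The function name and all parameters are hypothetical.

```python
def min_prefix_bytes(size, local_rate, remote_rate, startup):
    """Smallest prefix (in bytes) that lets a sequential reader avoid stalling.

    Illustrative assumptions, not taken from the paper:
      size        -- dataset size in bytes
      local_rate  -- rate at which the client consumes cached data (bytes/s)
      remote_rate -- aggregate rate of the suffix download (bytes/s)
      startup     -- fixed delay before the first remote byte arrives (s)
    """
    if size < 0 or local_rate <= 0 or remote_rate <= 0 or startup < 0:
        raise ValueError("size/startup must be >= 0 and rates must be > 0")

    # The reader must still be inside the prefix when the remote
    # transfer delivers its first byte.
    need_for_startup = startup * local_rate

    # The last byte must have arrived by the time the reader gets to it.
    # This only binds when the remote source is slower than the reader.
    need_for_tail = size - remote_rate * (size / local_rate - startup)

    # The prefix can never be negative or larger than the dataset.
    return min(size, max(0.0, need_for_startup, need_for_tail))


if __name__ == "__main__":
    # Slow remote source: most of the dataset has to be cached.
    print(min_prefix_bytes(1000, 100, 50, 2))
    # Fast remote source: the prefix only has to cover the startup delay.
    print(min_prefix_bytes(1000, 100, 200, 2))
```

Under these assumptions, a faster remote source shrinks the required prefix towards the startup term alone, which matches the intuition in the abstract that only part of a dataset needs to be cached.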

Original language: English
Title of host publication: Proceedings of the International Conference on Supercomputing
Pages: 229-238
Number of pages: 10
DOIs: https://doi.org/10.1145/1183401.1183435
Publication status: Published - 1 Dec 2006
Externally published: Yes
Event: 20th Annual International Conference on Supercomputing, ICS 2006 - Cairns, Queensland, Australia
Duration: 28 Jun 2006 to 1 Jul 2006

Other

Other: 20th Annual International Conference on Supercomputing, ICS 2006
Country: Australia
City: Cairns, Queensland
Period: 28/6/06 to 1/7/06

Fingerprint

Supercomputers
Data transfer
Analytical models
Costs

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Ma, X., Freeh, V. W., Yang, T., Vazhkudai, S. S., Simon, T. A., & Scott, S. L. (2006). Coupling prefix caching and collective downloads for remote dataset access. In Proceedings of the International Conference on Supercomputing (pp. 229-238) https://doi.org/10.1145/1183401.1183435

Coupling prefix caching and collective downloads for remote dataset access. / Ma, Xiaosong; Freeh, Vincent W.; Yang, Tao; Vazhkudai, Sudharshan S.; Simon, Tyler A.; Scott, Stephen L.

Proceedings of the International Conference on Supercomputing. 2006. p. 229-238.

Ma, X, Freeh, VW, Yang, T, Vazhkudai, SS, Simon, TA & Scott, SL 2006, Coupling prefix caching and collective downloads for remote dataset access. in Proceedings of the International Conference on Supercomputing. pp. 229-238, 20th Annual International Conference on Supercomputing, ICS 2006, Cairns, Queensland, Australia, 28/6/06. https://doi.org/10.1145/1183401.1183435
Ma X, Freeh VW, Yang T, Vazhkudai SS, Simon TA, Scott SL. Coupling prefix caching and collective downloads for remote dataset access. In Proceedings of the International Conference on Supercomputing. 2006. p. 229-238 https://doi.org/10.1145/1183401.1183435
Ma, Xiaosong ; Freeh, Vincent W. ; Yang, Tao ; Vazhkudai, Sudharshan S. ; Simon, Tyler A. ; Scott, Stephen L. / Coupling prefix caching and collective downloads for remote dataset access. Proceedings of the International Conference on Supercomputing. 2006. pp. 229-238
@inproceedings{4e18227e22214291a2a7053ccde93bfa,
title = "Coupling prefix caching and collective downloads for remote dataset access",
abstract = "Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.",
author = "Xiaosong Ma and Freeh, {Vincent W.} and Tao Yang and Vazhkudai, {Sudharshan S.} and Simon, {Tyler A.} and Scott, {Stephen L.}",
year = "2006",
month = "12",
day = "1",
doi = "10.1145/1183401.1183435",
language = "English",
isbn = "1595932828",
pages = "229--238",
booktitle = "Proceedings of the International Conference on Supercomputing",

}

TY  - GEN
T1  - Coupling prefix caching and collective downloads for remote dataset access
AU  - Ma, Xiaosong
AU  - Freeh, Vincent W.
AU  - Yang, Tao
AU  - Vazhkudai, Sudharshan S.
AU  - Simon, Tyler A.
AU  - Scott, Stephen L.
PY  - 2006/12/1
Y1  - 2006/12/1
N2  - Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
AB  - Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
UR  - http://www.scopus.com/inward/record.url?scp=34547489494&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=34547489494&partnerID=8YFLogxK
U2  - 10.1145/1183401.1183435
DO  - 10.1145/1183401.1183435
M3  - Conference contribution
AN  - SCOPUS:34547489494
SN  - 1595932828
SN  - 9781595932822
SP  - 229
EP  - 238
BT  - Proceedings of the International Conference on Supercomputing
ER  -