Probabilistic communication and I/O tracing with deterministic replay at scale

Xing Wu, Karthik Vijayakumar, Frank Mueller, Xiaosong Ma, Philip C. Roth

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

With today's petascale supercomputers, applications often exhibit low efficiency, such as poor communication and I/O performance, that can be diagnosed by analysis tools. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information using crude averaging. This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior. Scala-H-Trace uses histograms expressing the probabilistic distribution of arbitrary communication and I/O parameters to capture variations. Yet, where other tools fail to scale, Scala-H-Trace guarantees trace files of near constant size, even for variable communication and I/O patterns, producing trace files orders of magnitude smaller than prior approaches. We demonstrate the ability to collect traces of applications running on thousands of processors with the potential to scale well beyond this level. We further present the first approach to deterministically replay such probabilistic traces (a) without deadlocks and (b) in a manner closely resembling the original applications. Our results show either near constant sized traces or only sub-linear increases in trace file sizes irrespective of the number of nodes utilized. Even with the aggressively compressed histogram-based traces, our replay times are within 12% to 15% of the runtime of original codes. Such concise traces resembling the behavior of production-style codes closely and our approach of deterministic replay of probabilistic traces are without precedent.
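The abstract's central idea, replacing individual parameter values with a histogram and then replaying from it deterministically, can be illustrated with a minimal sketch. This is not Scala-H-Trace's algorithm: the function names `build_histogram` and `replay_values`, the fixed-width binning, the bin-midpoint replay, and the use of a seeded random number generator are all assumptions made for illustration only.

```python
import random
from collections import Counter


def build_histogram(values, bin_width):
    """Count observed parameter values (e.g. message sizes in bytes) per fixed-width bin.

    Hypothetical helper: the histogram has one entry per occupied bin,
    so its size does not grow with the number of recorded events.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def replay_values(histogram, bin_width, n_events, seed):
    """Draw n_events values from the histogram.

    Assumption: a fixed seed stands in for "deterministic replay";
    the same seed always yields the same sequence of values.
    """
    rng = random.Random(seed)
    bins = list(histogram)
    weights = list(histogram.values())
    chosen = rng.choices(bins, weights=weights, k=n_events)
    # Replay each event at the midpoint of its bin.
    return [b + bin_width // 2 for b in chosen]


observed = [100, 120, 130, 1000, 1010, 5000]
hist = build_histogram(observed, bin_width=256)
first_run = replay_values(hist, bin_width=256, n_events=6, seed=42)
second_run = replay_values(hist, bin_width=256, n_events=6, seed=42)
```

In this sketch the six observed values collapse into three bins, and `first_run` equals `second_run` because both draws use the same seed. A real replay engine would additionally have to keep matching sends and receives consistent across processes to avoid the deadlocks the abstract mentions, which this sketch does not attempt.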

Original language: English
Title of host publication: Proceedings of the International Conference on Parallel Processing
Pages: 196-205
Number of pages: 10
DOIs: https://doi.org/10.1109/ICPP.2011.50
Publication status: Published - 7 Nov 2011
Externally published: Yes
Event: 40th International Conference on Parallel Processing, ICPP 2011 - Taipei City, Taiwan, Province of China
Duration: 13 Sep 2011 to 16 Sep 2011

Fingerprint

Tracing
Trace
Communication
Supercomputers
Histogram
Supercomputer
Deadlock
Performance Analysis
Averaging
Compression
Regularity

ASJC Scopus subject areas

  • Software
  • Mathematics(all)
  • Hardware and Architecture

Cite this

Wu, X., Vijayakumar, K., Mueller, F., Ma, X., & Roth, P. C. (2011). Probabilistic communication and I/O tracing with deterministic replay at scale. In Proceedings of the International Conference on Parallel Processing (pp. 196-205). [6047188] https://doi.org/10.1109/ICPP.2011.50

@inproceedings{88e3a9f4cbc64477b67ca5480e6404d0,
title = "Probabilistic communication and I/O tracing with deterministic replay at scale",
abstract = "With today's petascale supercomputers, applications often exhibit low efficiency, such as poor communication and I/O performance, that can be diagnosed by analysis tools. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information using crude averaging. This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior. Scala-H-Trace uses histograms expressing the probabilistic distribution of arbitrary communication and I/O parameters to capture variations. Yet, where other tools fail to scale, Scala-H-Trace guarantees trace files of near constant size, even for variable communication and I/O patterns, producing trace files orders of magnitude smaller than prior approaches. We demonstrate the ability to collect traces of applications running on thousands of processors with the potential to scale well beyond this level. We further present the first approach to deterministically replay such probabilistic traces (a) without deadlocks and (b) in a manner closely resembling the original applications. Our results show either near constant sized traces or only sub-linear increases in trace file sizes irrespective of the number of nodes utilized. Even with the aggressively compressed histogram-based traces, our replay times are within 12{\%} to 15{\%} of the runtime of original codes. Such concise traces resembling the behavior of production-style codes closely and our approach of deterministic replay of probabilistic traces are without precedent.",
author = "Xing Wu and Karthik Vijayakumar and Frank Mueller and Xiaosong Ma and Roth, {Philip C.}",
year = "2011",
month = "11",
day = "7",
doi = "10.1109/ICPP.2011.50",
language = "English",
isbn = "9780769545103",
pages = "196--205",
booktitle = "Proceedings of the International Conference on Parallel Processing",

}

TY - GEN

T1 - Probabilistic communication and I/O tracing with deterministic replay at scale

AU - Wu, Xing

AU - Vijayakumar, Karthik

AU - Mueller, Frank

AU - Ma, Xiaosong

AU - Roth, Philip C.

PY - 2011/11/7

Y1 - 2011/11/7

N2 - With today's petascale supercomputers, applications often exhibit low efficiency, such as poor communication and I/O performance, that can be diagnosed by analysis tools. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information using crude averaging. This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior. Scala-H-Trace uses histograms expressing the probabilistic distribution of arbitrary communication and I/O parameters to capture variations. Yet, where other tools fail to scale, Scala-H-Trace guarantees trace files of near constant size, even for variable communication and I/O patterns, producing trace files orders of magnitude smaller than prior approaches. We demonstrate the ability to collect traces of applications running on thousands of processors with the potential to scale well beyond this level. We further present the first approach to deterministically replay such probabilistic traces (a) without deadlocks and (b) in a manner closely resembling the original applications. Our results show either near constant sized traces or only sub-linear increases in trace file sizes irrespective of the number of nodes utilized. Even with the aggressively compressed histogram-based traces, our replay times are within 12% to 15% of the runtime of original codes. Such concise traces resembling the behavior of production-style codes closely and our approach of deterministic replay of probabilistic traces are without precedent.

UR - http://www.scopus.com/inward/record.url?scp=80155183450&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80155183450&partnerID=8YFLogxK

U2 - 10.1109/ICPP.2011.50

DO - 10.1109/ICPP.2011.50

M3 - Conference contribution

SN - 9780769545103

SP - 196

EP - 205

BT - Proceedings of the International Conference on Parallel Processing

ER -