End-to-end I/O monitoring on a leading supercomputer

Bin Yang, Xu Ji, Xiaosong Ma, Xiyang Wang, Tianyu Zhang, Xiupeng Zhu, Nosayba El-Sayed, Haidong Lan, Yibo Yang, Jidong Zhai, Weiguo Liu, Wei Xue

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

This paper presents an effort to overcome the complexities of production system I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40,960-node Sunway TaihuLight supercomputer, currently ranked No. 3 in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online+offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification. Drawing on its deployment on TaihuLight for around 18 months, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are currently being addressed. In addition, we demonstrate Beacon's generality through its recent extension to monitor interconnection networks, another contention point on supercomputers. Both the Beacon code and part of the collected monitoring data have been released.

Original language: English
Title of host publication: Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
Publisher: USENIX Association
Pages: 379-394
Number of pages: 16
ISBN (Electronic): 9781931971492
Publication status: Published - 1 Jan 2019
Event: 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019 - Boston, United States
Duration: 26 Feb 2019 to 28 Feb 2019

Publication series

Name: Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019

Conference

Conference: 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
Country: United States
City: Boston
Period: 26/2/19 to 28/2/19

Fingerprint

Supercomputers
Monitoring
Metadata
Servers
Visualization
Statistics
Defects

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Computer Networks and Communications

Cite this

Yang, B., Ji, X., Ma, X., Wang, X., Zhang, T., Zhu, X., ... Xue, W. (2019). End-to-end I/O monitoring on a leading supercomputer. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019 (pp. 379-394). (Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019). USENIX Association.

@inproceedings{e30339ca27324ae9bc16fef3a6a9125b,
title = "End-to-end I/O monitoring on a leading supercomputer",
author = "Bin Yang and Xu Ji and Xiaosong Ma and Xiyang Wang and Tianyu Zhang and Xiupeng Zhu and Nosayba El-Sayed and Haidong Lan and Yibo Yang and Jidong Zhai and Weiguo Liu and Wei Xue",
year = "2019",
month = "1",
day = "1",
language = "English",
series = "Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019",
publisher = "USENIX Association",
pages = "379--394",
booktitle = "Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019",

}

TY - GEN

T1 - End-to-end I/O monitoring on a leading supercomputer

AU - Yang, Bin

AU - Ji, Xu

AU - Ma, Xiaosong

AU - Wang, Xiyang

AU - Zhang, Tianyu

AU - Zhu, Xiupeng

AU - El-Sayed, Nosayba

AU - Lan, Haidong

AU - Yang, Yibo

AU - Zhai, Jidong

AU - Liu, Weiguo

AU - Xue, Wei

PY - 2019/1/1

Y1 - 2019/1/1

UR - http://www.scopus.com/inward/record.url?scp=85076144499&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85076144499&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85076144499

T3 - Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019

SP - 379

EP - 394

BT - Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019

PB - USENIX Association

ER -