End-to-end I/O monitoring on a leading supercomputer

Bin Yang, Xu Ji, Xiaosong Ma, Xiyang Wang, Tianyu Zhang, Xiupeng Zhu, Nosayba El-Sayed, Haidong Lan, Yibo Yang, Jidong Zhai, Weiguo Liu, Wei Xue

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

This paper presents an effort to overcome the complexities of production system I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40,960-node Sunway TaihuLight supercomputer, currently ranked No. 3 in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online+offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification. With its deployment on TaihuLight for around 18 months, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are currently being addressed. In addition, we demonstrate Beacon's generality by its recent extension to monitor interconnection networks, another contention point on supercomputers. Both the Beacon code and part of the collected monitoring data have been released.

Original language: English
Title of host publication: Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
Publisher: USENIX Association
Pages: 379-394
Number of pages: 16
ISBN (Electronic): 9781931971492
Publication status: Published - 1 Jan 2019
Event: 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019 - Boston, United States
Duration: 26 Feb 2019 - 28 Feb 2019

Publication series

Name: Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019

Conference

Conference: 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
Country: United States
City: Boston
Period: 26/2/19 - 28/2/19

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Computer Networks and Communications

Cite this

Yang, B., Ji, X., Ma, X., Wang, X., Zhang, T., Zhu, X., El-Sayed, N., Lan, H., Yang, Y., Zhai, J., Liu, W., & Xue, W. (2019). End-to-end I/O monitoring on a leading supercomputer. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019 (pp. 379-394). (Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019). USENIX Association.