Scalable error isolation for distributed systems

Diogo Behrens, Marco Serafini, Sergei Arnautov, Flavio P. Junqueira, Christof Fetzer

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

3 Citations (Scopus)

Abstract

In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

Original language: English
Title of host publication: Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015
Publisher: USENIX
Pages: 605-620
Number of pages: 16
ISBN (Print): 9781931971218
Publication status: Published - 2015
Event: 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015 - Oakland, United States
Duration: 4 May 2015 to 6 May 2015

Other

Other: 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015
Country: United States
City: Oakland
Period: 4/5/15 to 6/5/15

Fingerprint

Data storage equipment
Outages
Hardening
Hardware
Processing
Costs
Experiments

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Computer Networks and Communications

Cite this

Behrens, D., Serafini, M., Arnautov, S., Junqueira, F. P., & Fetzer, C. (2015). Scalable error isolation for distributed systems. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015 (pp. 605-620). USENIX.

Scalable error isolation for distributed systems. / Behrens, Diogo; Serafini, Marco; Arnautov, Sergei; Junqueira, Flavio P.; Fetzer, Christof.

Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015. USENIX, 2015. p. 605-620.

Behrens, D, Serafini, M, Arnautov, S, Junqueira, FP & Fetzer, C 2015, Scalable error isolation for distributed systems. in Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015. USENIX, pp. 605-620, 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015, Oakland, United States, 4/5/15.
Behrens D, Serafini M, Arnautov S, Junqueira FP, Fetzer C. Scalable error isolation for distributed systems. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015. USENIX. 2015. p. 605-620
Behrens, Diogo ; Serafini, Marco ; Arnautov, Sergei ; Junqueira, Flavio P. ; Fetzer, Christof. / Scalable error isolation for distributed systems. Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015. USENIX, 2015. pp. 605-620
@inproceedings{a1ac9520d2a143eeb614450d96b8f319,
title = "Scalable error isolation for distributed systems",
abstract = "In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44{\%} down to only 0.15{\%} of the software-injected computation errors in our experiments.",
author = "Diogo Behrens and Marco Serafini and Sergei Arnautov and Junqueira, {Flavio P.} and Christof Fetzer",
year = "2015",
language = "English",
isbn = "9781931971218",
pages = "605--620",
booktitle = "Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015",
publisher = "USENIX",

}

TY - GEN

T1 - Scalable error isolation for distributed systems

AU - Behrens, Diogo

AU - Serafini, Marco

AU - Arnautov, Sergei

AU - Junqueira, Flavio P.

AU - Fetzer, Christof

PY - 2015

Y1 - 2015

N2 - In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

AB - In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

UR - http://www.scopus.com/inward/record.url?scp=84966762423&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84966762423&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781931971218

SP - 605

EP - 620

BT - Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015

PB - USENIX

ER -