Whole-Program Adaptive Error Detection and Mitigation (AEDAM)


PI: S. Krishnamoorthy, Pacific Northwest National Laboratory

CoPIs: S. Mahlke, University of Michigan; S. Amarasinghe, Massachusetts Institute of Technology; P. Sadayappan, Ohio State University; M. Erez, University of Texas, Austin; G. Gopalakrishnan, University of Utah; G. Agrawal, Ohio State University; M. Carbin, Massachusetts Institute of Technology.

Errors in application state resulting from hardware faults are an increasing concern on extreme-scale computing systems. Errors that escape detection and lead to silent data corruption are particularly problematic. Detecting errors is an important first step toward fault-tolerant program execution. In contrast to performance optimization, there is limited understanding of resilience strategies for scientific applications. Existing approaches to handling errors are often “point studies”: techniques that address a particular class of errors (errors in memory, instruction execution, control flow, etc.) under specific assumptions about hardware vulnerability and for a narrow class of applications.

To improve application resilience strategies, there is a pressing need to investigate: (1) how errors affecting different portions of the execution state of a scientific application can be effectively detected, (2) how individual detectors and hardware can be characterized and composed in an automated fashion to design the most efficient full-application solution, (3) how detectors and their composition can be evaluated to provide the most comprehensive insights into their effectiveness, and (4) what errors and fault rates must be tackled primarily in hardware for effective execution of scientific applications. We propose a comprehensive approach to error detection and mitigation for scientific applications (Topics 1 and 2 in the solicitation) that combines configurable error detectors, a unified reliability specification, and whole-program detector composition.

We will design and characterize configurable error detection techniques while accounting for hardware vulnerability characteristics, application resilience requirements, and the cost and capabilities of individual detector configurations. We will describe the fault behavior of scientific applications and target hardware in terms of a unified reliability specification. This specification will be used to compose individual detectors, factoring in the cost and coverage of each, and to develop an end-to-end error detection approach characterized by the best detector composition for the entire application with respect to the classes of errors being handled.
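
To make the idea of cost- and coverage-driven detector composition concrete, the following is a minimal illustrative sketch, not the project's actual method. It assumes a hypothetical catalog of detectors, each with an invented runtime overhead and an invented per-error-class detection probability, and it assumes detectors miss errors independently. It then searches all detector subsets for the cheapest one that meets a per-class coverage target.

```python
from itertools import combinations

# Hypothetical detector catalog. Names, costs (fractional runtime overhead),
# and coverage values (detection probability per error class) are invented
# for illustration only.
DETECTORS = {
    "checksum_abft": {"cost": 0.05, "coverage": {"memory": 0.90, "compute": 0.60}},
    "instr_duplication": {"cost": 0.40, "coverage": {"compute": 0.95, "control": 0.50}},
    "cfg_signature": {"cost": 0.10, "coverage": {"control": 0.90}},
}


def combined_coverage(names, error_class):
    """Probability that at least one chosen detector catches the error,
    assuming detectors miss independently (a simplifying assumption)."""
    miss = 1.0
    for name in names:
        miss *= 1.0 - DETECTORS[name]["coverage"].get(error_class, 0.0)
    return 1.0 - miss


def cheapest_composition(targets):
    """Exhaustively search detector subsets and return (cost, subset) for the
    lowest-cost subset meeting every per-class target, or None if none does."""
    best = None
    names = sorted(DETECTORS)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if all(combined_coverage(subset, c) >= t for c, t in targets.items()):
                cost = sum(DETECTORS[n]["cost"] for n in subset)
                if best is None or cost < best[0]:
                    best = (cost, subset)
    return best
```

With this toy catalog, `cheapest_composition({"memory": 0.85, "control": 0.85})` selects the checksum and control-flow signature detectors, while a target no combination can reach, such as `{"compute": 0.99}`, yields `None`. Exhaustive search is only practical for a handful of detectors; a whole-program setting would need a more scalable formulation.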

Publications

E. Atkinson and M. Carbin. “Towards Correct-By-Construction Probabilistic Programming.” NIPS Workshop on Machine Learning Systems 2017.
J. Liu and G. Agrawal. “Supporting Fault-Tolerance in Presence of In-Situ Analytics.” CCGRID 2017.
M. Laurenzano, P. Hill, M. Samadi, S. Mahlke, J. Mars, and L. Tang. “Input Responsive Approximation: Using Canary Inputs to Dynamically Steer Software Approximation.” PLDI 2016.
W. Bao, S. Krishnamoorthy, L.-N. Pouchet, F. Rastello, and P. Sadayappan. “Polycheck: Dynamic verification of iteration space transformations on affine programs.” POPL 2016.
V. Sharma, G. Gopalakrishnan, and S. Krishnamoorthy. “Towards resiliency evaluation of vector programs.” DPDNS 2016.
D. Tao, S. Song, S. Krishnamoorthy, P. Wu, X. Liang, E. Zhang, D. Kerbyson, and Z. Chen. “New-Sum: a novel online ABFT scheme for general iterative methods.” HPDC 2016.
V. Sharma, G. Gopalakrishnan, and S. Krishnamoorthy. “PRESAGE: protecting structured address generation against soft errors.” International Conference on High Performance Computing, Data, and Analytics, December 2016.
J. Liu and G. Agrawal. “Soft Error Detection for Iterative Applications Using Offline Training.” International Conference on High Performance Computing, Data, and Analytics, December 2016.