A short course given at Beihang University, June/July 2015.


The march toward supercomputers that reach exaflop performance, and toward scientific applications that scale to use them, continues at a steady pace. Along the way, resilience concerns are growing in importance and prominence. In this short course I will give an overview of the different fault, error, and failure modes, and discuss their importance along with some common, widely diverging estimates of their expected rates. I will then discuss potential solutions, ranging from incremental to more revolutionary. I will describe the memory system in detail, because it plays a major role in current resilient platforms, and then describe proposed techniques that span multiple system layers, from hardware to runtimes to the programmer, algorithms, and tools. This short course will be research-oriented in nature, as this is a relatively new and rapidly evolving topic.

Topics to be covered (tentatively):

Lectures / syllabus

Slides that are marked with © Mattan Erez are licensed under the Creative Commons CC BY license (https://creativecommons.org/licenses/by/4.0/). This means you can freely share this material (copy and redistribute the material in any medium or format) as well as adapt it (remix, transform, and build upon the material for any purpose, even commercially), provided that you credit the authors (generally Mattan Erez) and do not in any way suggest that I endorse you or your use of this material. Some of the images used were taken from the web, papers, and other presentations and are used for educational purposes; they should not be redistributed or modified. Those slides do not carry my copyright notice and include the correct attribution when known.

  • Part 1: Introduction and trends (pptx|pdf).
  • Part 2: Resilience terminology and fundamentals (pptx|pdf).
  • Part 3: Fault/Error Modes and Models (pptx|pdf).
  • Part 4: Resilient Memories (pptx|pdf).
  • Part 5: Resilient Processors and Networks (basics) (pptx|pdf).
  • Part 6: Resilient Systems: Redundancy + Checkpoint-Restart (pptx|pdf).
  • Part 7: Level of Paranoia and Cloud/HPC Divergence? (pptx|pdf).
  • Part 8: Containment Domains and Other Cross-Layer Approaches (pptx|pdf).