Description: As supercomputers become larger and more powerful, they are growing increasingly complex. This is reflected both in the exponentially increasing numbers of components in HPC systems (LLNL is currently installing the 1.6 million core Sequoia system) as well as the wide variety of software and hardware components that a typical system includes. At this scale it becomes infeasible to make each component sufficiently reliable to prevent regular faults somewhere in the system or to account for all possible cross-component interactions. The resulting faults and instability cause HPC applications to crash, perform sub-optimally or even produce erroneous results. As supercomputers continue to approach Exascale performance and full system reliability becomes prohibitively expensive, we will require novel techniques to bridge the gap between the lower reliability provided by hardware systems and users unchanging need for consistent performance and reliable results. Previous research on HPC system reliability has developed various techniques for tolerating and detecting various types of faults. However, these techniques have seen very limited real applicability because of our poor understanding of how real systems are affected by complex faults such as soft fault-induced bit flips or performance degradations. Prior work on such techniques has had very limited practical utility because it has generally focused on analyzing the behavior of entire software/hardware systems both during normal operation and in the face of faults. Because such behaviors are extremely complex, such studies have only produced coarse behavioral models of limited sets of software/hardware system stacks. Since this provides little insight into the many different system stacks and applications used in practice, this work has had little real-world impact. My project addresses this problem by developing a modular methodology to analyze the behavior of applications and systems during both normal and faulty operation. By synthesizing models of individual components into a whole-system behavior ...
Date: April 2, 2012
Creator: Bronevetsky, G
Item Type: Refine your search to only Report
Partner: UNT Libraries Government Documents Department