Automatic Fault Characterization via Abnormality-Enhanced Classification Page: 4 of 14
This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to Digital Library by the UNT Libraries Government Documents Department.
latency of fault diagnosis and resolution can be roughly
that long. When presented with a specific execution, the
model can then determine whether the execution is more
consistent with normal or abnormal behavior and, if
abnormal, the class of faulty runs to which it is most similar.
This paper makes two fundamental contributions. First,
we show that this intuitive statistical modeling fails for
common types of system faults. Second, we overcome
this limitation by using event probability information that
requires no additional monitoring. Intuitively, the naive
application of machine learning classification algorithms
cannot accurately detect complex system faults. Consider,
for example, a fault that causes a reduction in CPU
performance, such as a CPU-hang or a change in core frequency.
Traditional classification algorithms cannot detect or
characterize this fault because it affects software inconsistently.
This fault will affect CPU-bound code regions significantly
but will have little impact on memory-bound regions.
Further, if a misbehaving piece of software is the root
cause of the fault, the operating system will schedule this
software into discrete time periods, leading to sporadic
effects. The fact that many events during faulty execution
behave normally can cause traditional techniques, which
label all events during the faulty time period as faulty, to
train inaccurate models that cannot differentiate between
normal and faulty behavior.
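
The labeling problem described above can be illustrated with a minimal sketch. The event records, the region kinds ("cpu", "mem"), the durations and the slowdown factor are all made-up assumptions for illustration; they are not data or code from this paper.

```python
# Hypothetical event records: (region_kind, duration_ms). All values are invented.
NORMAL_RUN = [("cpu", 10.0), ("mem", 20.0)] * 50

def inject_cpu_slowdown(events, factor=2.0):
    """Simulate a CPU fault: only CPU-bound regions slow down."""
    return [(kind, dur * factor if kind == "cpu" else dur) for kind, dur in events]

def naive_labels(normal_run, faulty_run):
    """Naive labeling: every event observed during a faulty run is labeled 'faulty'."""
    return ([(e, "normal") for e in normal_run] +
            [(e, "faulty") for e in faulty_run])

def mislabeled_fraction(labeled, normal_run):
    """Fraction of 'faulty'-labeled events that are identical to a normal event."""
    normal_set = set(normal_run)
    faulty = [e for e, label in labeled if label == "faulty"]
    return sum(e in normal_set for e in faulty) / len(faulty)

faulty_run = inject_cpu_slowdown(NORMAL_RUN)
labeled = naive_labels(NORMAL_RUN, faulty_run)
print(mislabeled_fraction(labeled, NORMAL_RUN))  # prints 0.5
```

In this toy setting, half of the events labeled faulty are indistinguishable from normal events, so a classifier trained on these labels receives contradictory examples.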
Our novel solution can employ traditional statistical
classifiers for complex fault detection and analysis. It
enhances the quality of the information being classified
by building a secondary statistical model that captures the
probability that a given event came from a normal or faulty
run. We use these probabilities to filter the original labeling
presented to the classification algorithms, which focuses
their power on the abnormal events. Specifically, it enables
the classifier to correctly identify many more faulty events
at the cost of a small number of false positive predictions,
while reducing the number of false negatives. Figure 1
shows a diagram that compares the naive classification
approach with our refined approach.
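
The label-filtering idea can be sketched as follows. This is only an illustration under stated assumptions: the secondary model here is a per-region Gaussian fit of event durations from non-faulty runs, and the names `fit_normal_model`, `tail_probability`, `filtered_labels` and the 0.01 threshold are hypothetical. The paper's actual probability model is not specified on this page.

```python
from math import erfc, sqrt
from statistics import mean, stdev

def fit_normal_model(normal_events):
    """Fit a per-region (mean, stddev) of duration from non-faulty runs."""
    by_kind = {}
    for kind, dur in normal_events:
        by_kind.setdefault(kind, []).append(dur)
    return {kind: (mean(durs), stdev(durs)) for kind, durs in by_kind.items()}

def tail_probability(model, event):
    """Two-sided Gaussian tail probability of a duration at least this far
    from the normal mean (an assumed stand-in for event probability)."""
    kind, dur = event
    mu, sigma = model[kind]
    z = abs(dur - mu) / sigma
    return erfc(z / sqrt(2))

def filtered_labels(model, faulty_run_events, threshold=0.01):
    """Keep the 'faulty' label only for events that are improbable under the
    normal model; relabel the rest as 'normal'."""
    return ["faulty" if tail_probability(model, e) < threshold else "normal"
            for e in faulty_run_events]

# Invented training data from non-faulty runs.
normal_events = [("cpu", d) for d in (9.0, 10.0, 11.0, 10.0, 9.5, 10.5)] + \
                [("mem", d) for d in (19.0, 20.0, 21.0, 20.0, 19.5, 20.5)]
model = fit_normal_model(normal_events)

# Two events from a faulty run: a slowed CPU-bound event and an unaffected one.
print(filtered_labels(model, [("cpu", 20.0), ("mem", 20.5)]))
```

The filtered labels, rather than the original per-run labels, would then be passed to the classifier for training.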
The naive model, which only classifies individual
events, can overwhelm system administrators with many
individual reports that correspond to the same fault. Our
approach eliminates this problem by clustering fault
detections to provide administrators with just one notification
for each system fault. In another major result, we show
that, even when traditional classifiers correctly differentiate
most normal and faulty events, a small rate of false fault
detections can cause clustering to produce incorrect
fault predictions. Our technique addresses this problem
by focusing attention on fault detections that correspond
to very low probability events, which improves accuracy
of fault detection from 5% to 65% on faulty runs, while
maintaining a 5% false positive rate.
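
As a rough sketch of the aggregation idea, the code below keeps only detections tied to very low probability events and merges detections that are close in time into a single interval. The function name `cluster_detections`, the probability cutoff and the time-gap rule are assumptions made for illustration; the clustering method the paper actually uses is described in a later section and may differ.

```python
def cluster_detections(detections, max_prob=1e-3, max_gap=5.0):
    """Merge fault detections into (start, end) intervals.

    detections: iterable of (timestamp, event_probability) pairs.
    Only detections with probability <= max_prob are kept; kept detections
    no more than max_gap time units apart are merged into one interval.
    """
    times = sorted(t for t, p in detections if p <= max_prob)
    clusters = []
    for t in times:
        if clusters and t - clusters[-1][1] <= max_gap:
            clusters[-1][1] = t          # extend the current interval
        else:
            clusters.append([t, t])      # start a new interval
    return [tuple(c) for c in clusters]

# Invented detections: (timestamp in seconds, event probability).
detections = [(1.0, 1e-6), (2.0, 1e-5), (3.5, 1e-4),
              (40.0, 0.2),               # likely a false detection, filtered out
              (100.0, 1e-7), (102.0, 1e-6)]
print(cluster_detections(detections))    # prints [(1.0, 3.5), (100.0, 102.0)]
```

Each resulting interval would correspond to one notification, instead of one report per detected event.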
[Figure 1 diagram not recoverable from extraction. Labels: Non-faulty Runs, Faulty Runs, Event Probability, Training, Classifier, Faulty, Non-faulty. Legend: Normal event, Faulty event, Naive classifier steps.]
Figure 1. Naive- versus new-classification approach.
This paper is organized as follows. Section II presents
our experimental setup, describing the applications and
statistical methods that our analysis uses and our behavior
monitoring infrastructure. Section III describes our general
approach to model application behavior and shows that the
intuitive approach results in very poor accuracy. Section IV
then explains the causes of this inaccuracy and shows how
to combine event abnormality information with
classification algorithms to improve detection accuracy significantly.
We examine this approach in detail in Section V to show
that information about the distance of an event from
its normal behavior is useful for fault characterization.
Section VI then shows how to aggregate individual fault
predictions into a single statement of the fault's starting
and ending times, location and type.
II. Experimental Setup
For our experimental analyses we have chosen to
focus on detecting faults that occur on High-Performance
Computing (HPC) systems during the execution of the
scientific applications that typically run on these systems.
HPC systems are among the largest and most powerful
in the world, with the Top 500 most powerful systems
capable of sustained 31 to 2,500 TeraFlops of
computational power. The largest systems have over 200,000
processors, 200TB of RAM and Petabytes of disk storage.
Even though HPC systems are built from high-quality
components and use light-weight software stacks, the
very large scale of these machines means that they fail
very frequently. Major systems like the ASCI Q machine
experienced 26.1 CPU failures per week, and the
100,000 node BlueGene/L machine at Lawrence Livermore
National Laboratory suffers from one L1 cache bit flip
every 4 hours. From the perspective of applications, HPC
systems fail 10-20 times each day due to failures in system
hardware and software.
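
To put these rates on a common scale, the short calculation below converts them into mean hours between failures. It uses only the figures quoted above; the helper name `mean_hours_between` is introduced here for illustration.

```python
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY

def mean_hours_between(events, period_hours):
    """Average number of hours between events, given a count per period."""
    return period_hours / events

# ASCI Q: 26.1 CPU failures per week.
asci_q_cpu = mean_hours_between(26.1, HOURS_PER_WEEK)

# Application-visible failures: 10 to 20 per day.
app_slowest = mean_hours_between(10, HOURS_PER_DAY)
app_fastest = mean_hours_between(20, HOURS_PER_DAY)

print(f"ASCI Q CPU failures: one every {asci_q_cpu:.1f} hours")
print(f"Application failures: one every {app_fastest:.1f} to {app_slowest:.1f} hours")
```

Under these figures, an application sees a failure roughly every one to two and a half hours, which is more frequent than the CPU failure rate or the L1 bit-flip rate (one every 4 hours) taken alone.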
HPC systems are primarily used to run large-scale
scientific applications written using the Message Passing
Bronevetsky, G; Laguna, I & de Supinski, B R. Automatic Fault Characterization via Abnormality-Enhanced Classification, article, December 20, 2010; Livermore, California. (https://digital.library.unt.edu/ark:/67531/metadc837078/m1/4/: accessed March 25, 2019), University of North Texas Libraries, Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.