Inside Out | Paper - Understanding, Detecting and Localizing Partial Failures

We’ll talk about a computer science paper today: Understanding, Detecting and Localizing Partial Failures in Large System Software↗ by Chang Lou et al. This paper was awarded the best paper in NSDI, 2020. You can find the slides and talk here↗.

Taxonomy of failures

Total failures are straightforward and result in a crash, an abort, network disconnection etc. Such errors are easier to detect because they’ll bring down an entire system or a component.

Gray failures↗ or partial failures are subtle, and often go unnoticed. They affect a part of the system, overall health may still appear to be not affected, thus masking the reality.

The paper studies 100 such partial failures and tries to draw patterns around the symptoms and impact. As an example, see blocked write issue↗ in Zookeeper. Later the authors propose a novel mechanism to detect and localize such partial failures.

Properties of Partial Failures

48% of the failures studied can be categorized into a) uncaught errors, b) indefinite blocking, and c) buggy error handling. Hangs or extreme slowness is the usual manifestation.

15% of the errors are silent including cases of data loss, corruption, inconsistency and incorrect results.

Analyzing the cause, the authors found 71% of failures to be caused by a) environment condition, b) bad input, c) faults in other processes. This makes the reproduction of an issue hard in a local pre-production environment.

Recovery patterns are more alarming. 68% of the failures are sticky i.e. program will not recover from the fault by itself. Median diagnosis time is 6 days and 5 hours. Issues may have confusing symptoms that mislead diagnosis or insufficient information causes delays for root cause analysis.

Such faults affect the liveness (i.e hangs), and safety characteristics of the program. An automated detector may be able to capture these. Second, given the complexity around diagnosability of such issues, it will be interesting if the solution can localize the fault to a smaller radius e.g. instruction, or a method in the call stack.

Solution approaches

Extrinsic detectors like heartbeats or health checks are disjoint and generic, thus may not be able to model the entire program surface area. Signal based checkers (e.g. look for errors in logs or metrics) either have poor accuracy or come with environment noise. None of these can localize the error.

Intrinsic detectors run within the program. They can be enhanced clients which report metrics on calls to a downstream system (see Panorama↗). Or a probe based detector like Falcon↗ which uses spies to monitor various parts of the program. We can also have daemon resource monitor threads that keep reporting the characteristics from within the program (e.g. thread count, GC stats etc.). Such detectors are good at detection, and come with various levels of fault localization. We will cover this more later.

This paper proposes a watchdog based intrinsic detector mechanism. Watchdogs are mimic checkers i.e. they will observe the context of certain critical calls, and try to mimic them to reproduce faults. Since they capture the context, and the calls, localization of the fault is straightforward.

Designing a Watchdog

Given the dependency of Watchdogs with the program internals, it is hard to author and maintain them. Authors choose to generate Watchdogs based on the source code analysis of the program. Authors name this generation program as OmegaGen.

Requirements for the Watchdog are not quite simple. It must identify the right critical paths in the program. It needs to capture the state for those calls, and execute them concurrently without significantly altering the runtime characteristics of the program. The algorithm may be enumerated as follows.

Input is the source code of a Program P.
Find the long running regions in P.
Encapsulate them into executable checkers, generate watchdogs W.
Insert hooks into P to capture the context state of calls into W’s context.
Package a driver in P to execute W based on heuristics.
Watchdog finds liveness or safety violations along with the location of fault.

Let’s peep inside the internals of a Watchdog.

Finding long running methods requires a traversal of the program call graph. Certain loops like while(true) may be marked as critical to the liveness requirement. Paths calling such critical code may be marked as well.

Vulnerable operations in such long running methods can be a) threading based on critical sections, polling etc., b) File IO, or c) Network IO based. OmegaGen identifies potential operations from the standard library or from custom annotations.

Next the main program is reduced by walking through the control flow and retaining instructions marked vulnerable. Original structure is preserved as much to help with localization. A reduced program is encapsulated with synthesized arguments for the vulnerable instructions.

Finally, the checks are added. Liveness related checks are based on timeouts wrapping the mimicked calls in Watchdog. Slowness may be measured by capturing a moving average of latencies. Safety violations are hard, the watchdogs capture explicit error signals like exceptions, asserts etc. and provide an API wd_assert for developers to manually add assertions.

Side effects

An instrinsic detector presents an unique challenge of not adding overheads, and not changing the runtime characteristics of original program with side effects. Authors use two interesting primitives to mitigate side effect risk.

Memory side effects may arise from errors in copying the context of a call from the original program into the Watchdog. E.g. copying entire state will increase memory cost and slowness. Watchdogs use immutability analysis to refer immutable objects by reference. Mutable objects are lazily copied, several heuristics like weak_ref, hash and versions are used to ensure high fidelity between the actual call and mimicked call in watchdog.

IO side effects are mitigated by using safe redirections to ensure the same resource doesn’t face contention. E.g. redirect calls to actual file to a test only file. Or, for socket IO, rewrite as a ping operation.

Idempotency is a requirement since the Watchdogs may end up calling a non-idempotent resource whose state may have changed by the actual call. This is solved by generating idempotent wrappers over the non-idempotent calls e.g. streams etc. and modifying the original program to use the wrapper. The wrapper guards the original call with a critical section and notes the context of the caller. Both the original program and watchdog calls the wrapper. If original program’s call hangs, watchdog will timeout due to the critical section.

Conclusion

Authors evaluate Watchdogs with the baseline detectors like heartbeats, signals, probes or clients. All the detectors are tested against 22 realworld failures. Watchdogs are able to detect 90% of the faults and can pinpoint an error to the faulty instruction in 55% of the cases.

Overall, I found the paper to be a great read. Articulation of the problem with real world data from complex opensource projects is informative. Also appreciate the breadth of tools considered for evaluation, and the special notes around the challenges in side effects and idempotency.

As a practitioner, I’ve seen most of the extrinsic detectors mentioned in the paper being used in large scale systems. In my experience, Heartbeats are often used as macro indicators of a service health. Seen them in a few other forms like test in production. I’ve used signal based analysis mostly for observing rollouts.

On the intrinsic side, I’ve had positive experiences with Client based mechanisms in one of the past projects. We used to measure a Quality of Service (qos) metric for each calls to downstream systems. Such metrics will have associated alarms to notify any errors. This is one mechanism that I continue to be a big fan of :) We will look at the Panorama paper sometime in future.

My only wishlist for the paper will be around making the OmegaGen tool available for the community to try out.