Inside Out

Notes on seeking wisdom and crafting software.

Paper - Crash-only Software

Today most large-scale distributed systems are built from commodity hardware components. It is extremely hard to model the multitude of ways these components can fail. Contrast this with the physical world, where the interaction of objects and their state can be reasoned about with Newtonian laws. Can we have an equally predictable way to reason about failures in distributed systems? What if there were an on-off switch, and we could simply turn a suspicious component off and on again to bring it back to its pre-failure state?

We will talk about ideas in Candea, George, and Armando Fox. “Crash-Only Software.” In HotOS, vol. 3, pp. 67-72. 2003.

Motivation

Clean shutdown procedures in components slow down the crash-reboot cycle. They are a trade-off that favors steady-state performance over shutdown performance. E.g. checks like fsck may not be necessary if a system is shut down cleanly.

The paper proposes simple semantics: crash = stop, start = recover. It argues that this restart/retry architecture leads to more reliable, predictable code and faster recovery.

Can we do safe crashes and fast recoveries?

Why crash-only systems?

With these semantics, a system can be coerced into one of two states, on or off, and flipping either switch is idempotent. Both switches are controlled externally.

The implications for design are interesting. First, every component must be ready for unplanned deactivation, as in a power-off. Second, the recovery path is always exercised at start-up.
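To make the two implications concrete, here is a minimal sketch of a component whose only start-up path is its recovery path. The class name `CrashOnlyCounter`, the JSON-lines log format and the file layout are our own assumptions for illustration; none of them come from the paper.

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """Hypothetical component: start == recover, and there is no shutdown method."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.total = 0
        self._recover()  # start-up always goes through recovery

    def _recover(self):
        # Rebuild in-memory state by replaying the append-only log.
        if not os.path.exists(self.log_path):
            return
        valid_bytes = 0
        with open(self.log_path, "rb") as log:
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # torn final record from a crash mid-write
                try:
                    delta = json.loads(raw)["delta"]
                except (ValueError, KeyError, TypeError):
                    break  # unreadable record: stop replaying here
                self.total += delta
                valid_bytes += len(raw)
        # Drop anything after the last good record so later appends
        # start on a clean record boundary.
        with open(self.log_path, "r+b") as log:
            log.truncate(valid_bytes)

    def add(self, delta):
        # Persist first, then update memory, so an acknowledged update
        # survives a power-off at any point after this method returns.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"delta": delta}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.total += delta


log_path = os.path.join(tempfile.mkdtemp(), "counter.log")
counter = CrashOnlyCounter(log_path)
counter.add(2)
counter.add(3)

# "Crash": abandon the object without any cleanup, then start again.
recovered = CrashOnlyCounter(log_path)
print(recovered.total)  # 5
```

Because there is no clean-shutdown code, there is no second, rarely tested path that can drift out of sync with recovery.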

We can also preemptively restart components before they fail. This helps with resource starvation, e.g. infinite delays. Further, if every failure can be recovered with a simple restart, we shorten fault detection and diagnosis time.
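A rough sketch of preemptive restarts, assuming a supervisor that replaces a worker after a fixed request budget. `RejuvenatingSupervisor`, `make_leaky_worker` and `max_requests` are hypothetical names, and a real system might trigger on age or memory use instead of a request count.

```python
class RejuvenatingSupervisor:
    """Hypothetical supervisor that restarts its worker before it degrades."""

    def __init__(self, make_worker, max_requests):
        self._make_worker = make_worker
        self._max_requests = max_requests
        self._worker = make_worker()
        self._served = 0
        self.restarts = 0

    def handle(self, request):
        if self._served >= self._max_requests:
            # Crash-only restart: drop the old worker without any
            # shutdown protocol and start a fresh one.
            self._worker = self._make_worker()
            self._served = 0
            self.restarts += 1
        self._served += 1
        return self._worker(request)


def make_leaky_worker():
    # Stand-in for a component that slowly leaks a resource.
    leaked = []

    def worker(request):
        leaked.append(request)
        return len(leaked)  # size of the leak after this request

    return worker


supervisor = RejuvenatingSupervisor(make_leaky_worker, max_requests=3)
leak_sizes = [supervisor.handle(i) for i in range(7)]
print(leak_sizes, supervisor.restarts)  # [1, 2, 3, 1, 2, 3, 1] 2
```

The leak never grows past the budget, because the restart discards it along with the worker.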

Solution

How do we create crash-only systems? The paper presents a few guidelines at two levels: within a component and for inter-component interactions.
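One inter-component guideline the paper describes is that a component which cannot serve a request right now (say, because it is recovering) can answer with a RetryAfter(n) exception, and the caller retries later. Below is a minimal sketch of the caller side. The Python shape of `RetryAfter`, the helper `call_with_retry` and its parameters are our assumptions, and retrying like this is only safe for idempotent requests.

```python
import time


class RetryAfter(Exception):
    """Hypothetical signal from a callee: 'busy or recovering, try again later'."""

    def __init__(self, seconds):
        super().__init__(f"retry after {seconds}s")
        self.seconds = seconds


def call_with_retry(operation, attempts=3, sleep=time.sleep):
    """Call an idempotent operation, honoring RetryAfter hints from the callee."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except RetryAfter as hint:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(hint.seconds)


calls = []


def flaky_lookup():
    # Stand-in for a remote call to a component that is restarting.
    calls.append(1)
    if len(calls) < 3:
        raise RetryAfter(0.5)
    return "ok"


waits = []  # record the waits instead of really sleeping
result = call_with_retry(flaky_lookup, attempts=3, sleep=waits.append)
print(result, waits)  # ok [0.5, 0.5]
```

From the caller's point of view a restarting component just looks like a slow one, which keeps failure handling uniform.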

Some thoughts on managing crashes.

Finally, crash recovery is complementary to redundancy and monitoring approaches for failure detection. Faster recovery may mean less redundancy is required, as it helps reintegrate membership changes quickly, e.g. reintegrating a failed component or adding, removing or upgrading components.

There is a gotcha. We may have to accept lower throughput (remember the stalling proxy above). Depending on the use case, this may be a secondary concern given the high availability and predictability.


A decade later, we seem to have settled on the intra-component semantics around statelessness and choosing the right storage abstractions. Crash-only components may be a powerful mental model for building for the worst case (or may simply be inevitable, given the scale and variables in a datacenter).

Read next: we covered a few more truisms about distributed systems in the “A Note on Distributed Computing” paper.