
Paper - Crash-only Software

Most large scale distributed systems today are composed of commodity hardware components. It is extremely hard to model the multitude of ways these components can fail. Contrast this with the physical world, where the interaction of objects and their state can be reasoned over with Newtonian laws. Can we have an equivalently predictable way to reason over failures in distributed systems? What if there were an on-off switch, and we could simply turn a suspicious component off and on to bring it back to its pre-failure state?

We will talk about ideas in Candea, George, and Armando Fox. “Crash-Only Software.” In HotOS, vol. 3, pp. 67-72. 2003.

Motivation

Clean shutdown procedures in components slow the crash-reboot cycle. It is a trade-off that favors steady state performance over shutdown performance. E.g. checks like fsck may not be necessary if a system is shut down cleanly.

The paper proposes a simple semantics: crash = stop, start = recover. This restart/retry architecture will lead to reliable, predictable code and faster recovery.

Can we do safe crashes and fast recoveries?

Why crash-only systems?

With these semantics, systems can be coerced into two states - on or off - driven by two idempotent switches. Both switches are externally controlled.

The implications for design are interesting. First, every component must be ready for unplanned deactivation with a power-off. Second, the recovery path is always exercised at start-up.

Components can also be preemptively restarted before they fail. This helps with resource starvation, e.g. infinite delays. Further, if every failure can be recovered with a simple restart, we shorten the fault detection and diagnosis time.
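
To make this concrete, here is a minimal supervisor sketch in Go, assuming a hypothetical `./worker` binary and an arbitrary rejuvenation interval (neither is from the paper): the only "stop" is a kill (power off), the only "start" is a fresh exec (power on, which always runs the recovery path), and the component is also restarted preemptively.

```go
// supervisor.go - a toy crash-only supervisor (illustrative sketch).
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	const rejuvenateEvery = 10 * time.Minute // hypothetical rejuvenation interval

	for {
		// "Power on": exec the component; its start-up path is its recovery path.
		cmd := exec.Command("./worker") // hypothetical crash-only component binary
		if err := cmd.Start(); err != nil {
			log.Fatalf("start: %v", err)
		}

		done := make(chan error, 1)
		go func() { done <- cmd.Wait() }()

		select {
		case err := <-done:
			// Unplanned crash: recovery is simply a restart on the next loop.
			log.Printf("worker exited (%v); restarting", err)
		case <-time.After(rejuvenateEvery):
			// Preemptive restart before latent faults accumulate.
			log.Printf("rejuvenating worker")
			_ = cmd.Process.Kill() // "power off": SIGKILL, no clean shutdown
			<-done
		}
	}
}
```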

Solution

How do we create crash-only systems? The paper presents a few guidelines at two levels - within a component and across component interactions.

  • Intra-component - persist all non-volatile state in state stores which are themselves crash-only
    • Design applications/components to be stateless
    • State stores must be crash-only, otherwise we have just pushed the problem one level down
    • Recommend choosing the right abstraction for state stores. E.g. don’t use ACID semantics for simple key-value requirements. Further, standardize on the interface and contracts of state stores
  • Inter-component - ensure components can tolerate crashes and temporary unavailability of other components
    • Enforce external boundaries like VMs, processes, etc. to isolate components, contain failures, and enable independent recovery
    • Use timeouts to turn non-Byzantine faults (cases where the components don't lie) into fail-stop events, i.e. components either stop or respond. Given crash recovery is fast, simply reboot upon suspicion (see the fail-stop sketch after this list)
    • Decouple resources from components by enforcing leases (a lease sketch also follows the list)
    • Self-describing requests enable statelessness as any recovered component can pick one up and execute it with all context. Idempotency is helpful
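
Here is a minimal sketch of the timeout idea, assuming hypothetical `callPeer` and `rebootPeer` stand-ins for a real RPC call and a restart hook; the deadline value is arbitrary.

```go
// failstop.go - coercing a non-Byzantine fault into a fail-stop event.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// callPeer stands in for a real RPC; here it simply takes longer than
// the caller's deadline, simulating a hung component.
func callPeer(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// rebootPeer stands in for whatever triggers a crash-reboot of the peer.
func rebootPeer() { log.Println("peer suspected dead; crash-rebooting it") }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	// A peer that neither stops nor responds within the deadline is
	// coerced into fail-stop: declare it crashed and reboot it.
	if err := callPeer(ctx); errors.Is(err, context.DeadlineExceeded) {
		rebootPeer()
	}
}
```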
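
And a toy lease manager; the `resourceManager` type and all names are illustrative, not from the paper. A holder that crashes simply stops renewing, and the resource frees itself when the lease expires - no cleanup code on the crash path.

```go
// lease.go - decoupling a resource from its holder with a lease.
package main

import (
	"log"
	"sync"
	"time"
)

type lease struct {
	holder  string
	expires time.Time
}

type resourceManager struct {
	mu     sync.Mutex
	leases map[string]lease // resource name -> current lease
}

// acquire grants the resource iff it is free, expired, or already held
// by the same holder (renewal). A crashed holder loses the resource
// automatically once its lease expires.
func (rm *resourceManager) acquire(resource, holder string, ttl time.Duration) bool {
	rm.mu.Lock()
	defer rm.mu.Unlock()
	if l, ok := rm.leases[resource]; ok && time.Now().Before(l.expires) && l.holder != holder {
		return false // still validly held by someone else
	}
	rm.leases[resource] = lease{holder: holder, expires: time.Now().Add(ttl)}
	return true
}

func main() {
	rm := &resourceManager{leases: map[string]lease{}}
	rm.acquire("shard-7", "worker-a", time.Second)
	time.Sleep(2 * time.Second) // worker-a crashes and never renews
	if rm.acquire("shard-7", "worker-b", time.Second) {
		log.Println("lease expired; shard-7 reassigned to worker-b")
	}
}
```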

Some thoughts on managing crashes.

  • Send a RetryAfter(n) response to clients in addition to eventual timeouts
  • Use policies for max number of retries along with timeouts and lease durations
  • Use stall proxies to hold incoming requests while the system recovers through a crash-reboot (see the proxy sketch after this list)
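
A toy proxy sketch under these assumptions (the backend URL, the `recovering` flag, and the Retry-After value are all made up): instead of queuing requests as a full stall proxy would, this variant deflects them with a Retry-After hint while the backend crash-reboots, so clients retry on schedule rather than hang until their own timeouts fire.

```go
// stallproxy.go - deflecting requests during a crash-reboot (sketch).
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// recovering is flipped by whatever drives the crash-reboot cycle.
var recovering atomic.Bool

func main() {
	backend, _ := url.Parse("http://localhost:9000") // hypothetical backend
	proxy := httputil.NewSingleHostReverseProxy(backend)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if recovering.Load() {
			// Tell clients when to come back instead of letting
			// them wait out their own timeouts.
			w.Header().Set("Retry-After", "2") // seconds
			http.Error(w, "recovering, retry shortly", http.StatusServiceUnavailable)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```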

Finally, crash recovery is complementary to the redundancy and monitoring approaches for failure detection. Faster recovery may mean less redundancy is required for a system, as it helps absorb membership changes quickly, e.g. reintegrating a failed component or adding/removing/upgrading components.

There is a gotcha. We may have to accept lower throughput (remember the stalling proxy above). Depending upon the use case, this may be a secondary concern given the high availability and predictability.


A decade later, we seem to have settled on the intra-component semantics around statelessness and choosing the right storage abstractions. Crash-only components may be a powerful mental model to build for the worst case (or maybe the inevitable, given the scale and number of variables in a datacenter).

Read next: we covered a few more truisms around distributed systems in the "A Note on Distributed Computing" paper.