Inside Out

Notes on seeking wisdom and crafting software.

Paper - Autopilot

This time we’re picking up an old paper on infrastructure management for some large-scale services at Microsoft. Our goal is to learn some of its core ideas and consider how things may have changed since.

Isard, Michael. “Autopilot: Automatic Data Center Management.” ACM SIGOPS Operating Systems Review 41, no. 2 (April 2007): 60–67. https://doi.org/10.1145/1243418.1243426.

Autopilot promises to automate software provisioning, deployment, monitoring and repair. The paper highlights a few benefits of automation: lower cost by replacing repetitive human actions with software, and higher reliability.

Principles

Architecture

Topology

Within a machine

Within an app

System Components

Autopilot System Diagram

Autopilot keeps the shared state small and strongly consistent, using a quorum-based mechanism across a small set of machines (5-10) that make up the Device Manager.
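
To make the quorum idea concrete, here is a minimal sketch of the arithmetic behind a majority quorum, which is the usual rule in Paxos-style consensus. The notes do not say which quorum rule the Device Manager uses, so the majority rule and the function names below are assumptions for illustration.

```python
def quorum_size(replicas: int) -> int:
    """Smallest group that is a strict majority of the replicas (assumed rule)."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return replicas - quorum_size(replicas)
```

Under this rule, 5 replicas need 3 acknowledgements and survive 2 failures. Going from 9 to 10 replicas raises the quorum from 5 to 6 without tolerating any more failures, which is why odd replica counts are common.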

  1. Device Manager (DM)
    • Uses a replicated state machine library to store the ground truth state of the system.
    • Uses Paxos for consensus across replicas.
    • Delegates the actions that keep the cluster synchronized with the ground truth to satellite services.
      • Satellites pull state from the Device Manager and update their own or their clients' state. Pulling keeps the Device Manager lightweight: unlike a push model, it does not need to track which satellites have received which state.
      • Satellites keep their state eventually consistent with the Device Manager through heartbeats.
  2. Provisioning Service (PS)
    • Replicated for redundancy; a leader runs the actions.
    • Probes the network for new machines that have joined.
    • Gets the desired state of the machine from DM, then installs, boots, and runs burn-in tests on it. Updates the machine's state in DM.
  3. Deployment Service (DS)
    • Replicated service that hosts the manifest directories with app files.
    • Machines run a periodic task to connect to DM and fetch any missing manifests from a replica of the DS.
  4. Repair Service (RS)
    • RS asks DM for machine status and runs the repair actions as instructed by DM.
    • Note that DM decides how many and which machines are repaired; this keeps the cluster functional as a whole.
  5. Monitoring Services (MS)
    • Collection Service (CS) aggregates counters and logs and writes them to a distributed file store.
    • Realtime counters are stored in SQL for low-latency queries.
    • Cockpit is a visualization tool for viewing counters aggregated per cluster.
    • Alert Service sends emails based on queries to Cockpit.
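
The pull model between satellites and the Device Manager can be sketched as below. This is an illustration, not Autopilot's actual protocol: the version counter, the class names, and the `pull`/`heartbeat` methods are all hypothetical. The point it shows is that the Device Manager keeps no per-satellite bookkeeping; each satellite remembers the last version it saw.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DeviceManager:
    """Holds desired state; deliberately tracks nothing about satellites."""
    version: int = 0
    desired: Dict[str, str] = field(default_factory=dict)

    def set_desired(self, machine: str, state: str) -> None:
        self.desired[machine] = state
        self.version += 1  # hypothetical change counter

    def pull(self, since_version: int) -> Optional[Tuple[int, Dict[str, str]]]:
        """Return (version, snapshot) if the caller is stale, else None."""
        if since_version == self.version:
            return None
        return self.version, dict(self.desired)


@dataclass
class Satellite:
    name: str
    seen_version: int = -1
    local: Dict[str, str] = field(default_factory=dict)

    def heartbeat(self, dm: DeviceManager) -> bool:
        """Pull on each heartbeat; report whether local state changed."""
        update = dm.pull(self.seen_version)
        if update is None:
            return False
        self.seen_version, self.local = update
        return True


dm = DeviceManager()
repair = Satellite("repair-service")
dm.set_desired("machine-1", "reimage")
repair.heartbeat(dm)  # first pull copies the desired state
repair.heartbeat(dm)  # nothing changed, so this is a no-op
```

A satellite that misses a heartbeat simply catches up on the next one, which is where the eventual consistency comes from.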

Workflows

  1. Deployment
    • A new version of the app is stored in DS.
    • DM is instructed to roll out the new version. It adds the manifest to the storage list of each machine with the required role, then kicks the machine to pull and deploy the new manifest.
      • If a machine doesn’t get kicked, it will be updated through its periodic task.
    • Rollout proceeds in scale units for safety. The number of concurrent scale units is app-defined. Within each scale unit, successful deployment is defined as a percentage of machines upgraded.
  2. Fault detection and repair
    • Machine is the unit of repair.
    • Fault detection
      • Watchdog probes are used. Each watchdog reports the result of its probe to DM using a specific protocol.
      • A machine is marked as being in error if any watchdog reports a failure.
    • Fault recovery
      • Machine states: Healthy (H), Probation (P) and Failure (F).
      • A machine moves from H to P on deployment. It moves back to H if there are no errors, or to F otherwise.
      • Recovery actions: DoNothing, Reboot, Reimage, Replace. The action is chosen based on history and severity. E.g. Reboot for a non-fatal error, DoNothing if the machine doesn’t have any recent F history.
  3. Metrics
    • Apps and Autopilot dump counters on the local machine.
    • CS collects these and writes them to a large-scale distributed file store.

Application

Brief notes on an example Windows Live Search app in the Autopilot (AP) ecosystem.

Lessons

AP has been in development since the first deployments of the Windows Live Search engine in 2004.