Inside Out | Paper - The Tail at Scale

We’ll cover a classic paper on designing large scale low latency systems.

The Tail at Scale, Communications of the ACM, 2013 by Jeffrey Dean, Luiz André Barroso. 1400 citations.

https://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext↗

Motivation

One of the key constraints of a large scale system is low latency. This is especially true for interactive services (ones that are live to the end user). Large scale data necessitates a partitioning scheme for both data and compute; this further requires queries to fan out to multiple systems.

Imagine a system with 10ms latency at p50 and 1000ms at p99. If we’re fanning out to 100 nodes to serve a request, the probability of a query hitting the p99 bottleneck is 63% (= 1-0.99¹⁰⁰).

Root causes

Why does tail latency occur?

Shared resources on the local node, or global resources at the cluster. Race between interactive and non interactive (i.e. batch or offline computations) to acquire and hold resources. Chattiness and other inherent network faults may slow down the communication. Finally, the language characteristics like stop the world garbage collection mechanisms or hardware limitations due to power/energy constraints.

Solutions

Authors provide a nice taxonomy of solutions to this problem.

Reduce component variability (i.e. attempt to fix tail latency)
- Differentiate the service classes: interactive vs non interactive and according divide the resources.
- Reduce head of line blocking i.e. break requests into smaller chunks that can execute in an interleaved fashion.
- Manage non interactive services with throttling, timed triggers based on load patterns, fair scheduling across all nodes
Develop tail-tolerant services (i.e. living with reality)
- Within request immediate response by using replicas for requests
  - When to use? Cause of variability does not affect all request replicas
  - Hedged request pattern - send request to multiple nodes, use the fastest response
  - Tied requests pattern - avoid duplicate processing with hedged request by sending cancellation once a node starts processing
  - Probe and queue pattern - find the least loaded server with a probe and then queue the request
- Cross request long term adaptation
  - When to use? Partitions are imbalanced. Static partitioning scheme doesn’t work.
  - Micro partition pattern - use more partitions than nodes. Use dynamic assignment and load balancing.
  - Selective replication pattern - detect and predict items that create imbalance, then selectively create replicas for these
  - Latency induced probation pattern - observe latency distribution, detect if the system performs better by excluding certain nodes
Techniques for Information Retrieval Systems
- Good enough results - trade-off between completeness or performance of a response. Prefer performance by ignoring the set of slow leaf nodes
- Canary requests pattern - send a request to few nodes, and then query the remaining nodes if earlier request succeeds. Useful if the queries are on a potentially untested path
Mutations or update requests
- Scale of latency critical writes are usually small
- Solve this by doing writes off the critical path, or using eventually consistent models, or using a quorum based consistency (since number of nodes is usually 3-5, they are tail tolerant by design)

This is an amazing paper; highly recommend reading it if you’re designing distributed systems. In my experience, the service class differentiation between interactive vs non interactive gives huge payoffs; but requires good observability and monitoring to decide on timed triggers or throttling. As usual, measure and then cut.

Read next: Achieving Rapid Response Times in Large Online Services↗ by Jeff Dean on more details of building low latency systems at scale.