We covered Bigtable, a distributed structured store, in the last paper. What shortcomings of Bigtable does Spanner try to address? What are the challenges of building a globally consistent database? We'll try to gather a few interesting patterns.
Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, et al. "Spanner: Google's Globally-Distributed Database." OSDI 2012.
- Scalable, globally distributed across data centers, multi-version and synchronously-replicated.
- Replication for global availability and geographic locality (key contribution)
- Shards data across Paxos state machines (spread across data centers); sharding is dynamic
- Auto migration of data across machines/data centers to balance load or handle failure
- Large scale: millions of machines, trillions of rows, hundreds of data centers
Why not Bigtable? Complex, evolving schemas are painful, and it lacks strong consistency for wide-area replication.
Why not Megastore? It supports a semi-relational data model and synchronous replication, but has relatively poor write throughput.
- Client side replication configurations
- Read latency: which data center should replicate, how far from user?
- Write latency: how far replicas are from each other?
- Durability, availability, performance: how many replicas?
- Externally consistent reads and writes
- Globally consistent reads across the database at a timestamp
Secret sauce: globally meaningful timestamps support linearizability. E.g. if `t1` commits before `t2`, the timestamps guarantee `ts1 < ts2`. Spanner is the first system to provide this guarantee at global scale, built on the TrueTime API.
- Schematized, semi-relational tables. Why? Popularity of Megastore.
- Synchronous replication: supported by Megastore, but Bigtable supports eventually consistent replication
- Query language: sql like. Why? Popularity of Dremel, interactive data analysis
- Transactions: cross row transactions were not supported in Bigtable. Provide support for transactions and let app developers decide to deal with performance bottlenecks (instead of working around transactions)
Application data model:
- Database created by the application
- DB contains unlimited schematized Tables
- Rows, Columns and Versioned values
- Required: one or more primary key columns, ordered
- Mapping from primary key columns to non primary key columns. Why? Control of data locality through keys (note: common pattern across KV stores)
- Query language: sql like with extensions for protobuf
- Tables are hierarchical in nature
- Top level is a `directory table` with row key K
- Each child table contributes its rows to one row in the parent
- E.g. `Users` is the directory table and `Albums` is a child table, so `/Users(1)/Albums(1, 2)` represents album with id 2 under user 1
Spanner calls this interleaving of tables and it is crucial for data locality of multiple tables across shards.
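As a toy illustration of why interleaving gives locality (this is a hypothetical key encoding, not Spanner's actual on-disk format), child rows can be modeled as sharing the parent row's key prefix, so a plain sort keeps a user's rows adjacent:

```python
# Hypothetical key encoding to illustrate interleaving; the real format differs.
# Child keys share the parent key as a prefix.
def user_key(user_id):
    return ("Users", user_id)

def album_key(user_id, album_id):
    return ("Users", user_id, "Albums", album_id)

rows = sorted([
    user_key(2),
    album_key(1, 2),
    user_key(1),
    album_key(1, 1),
    album_key(2, 1),
])
# /Users(1) sorts immediately before all /Users(1)/Albums(...) rows, so a
# range-based shard holding user 1 also holds all of user 1's albums.
assert rows[0] == ("Users", 1)
assert rows[1] == ("Users", 1, "Albums", 1)
assert rows[2] == ("Users", 1, "Albums", 2)
```

Because shards are contiguous key ranges, this prefix-sharing is exactly what lets an application co-locate related tables.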
- A Spanner deployment is called a `universe`
- Universe has a `universemaster`: a console displaying status information about all zones
- Universe has a `placement driver`: handles automated movement of data across zones (on the timescale of minutes). Periodically talks to spanservers to find data that needs to be moved
- Both are singletons
- Universe consists of `zones` (compare: a zone is the rough equivalent of a Bigtable deployment)
- Zone has a `zonemaster`, which assigns data to spanservers
- Zone has 100-1000 `spanservers` (compare: tabletserver in Bigtable), which serve data to clients
- Zone has `location proxies`, used by clients to locate the spanservers serving their data
- Each spanserver manages 100-1000 instances of `tablets`, similar to Bigtable's tablet
Paper describes spanservers in detail.
- Tablet is data structure, no compute
- Tablet implements a bag of mappings: `(key: string, timestamp: int64) -> string`
- Tablet state is stored in B-tree like files and a WAL on a distributed file system called Colossus (successor of GFS)
- Paxos statemachine log and metadata is stored on the tablet
- Paxos group: set of replicas for a tablet
- Supports long lived transactions (order of minutes), similar to Bigtable
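The bag of `(key, timestamp) -> value` mappings can be sketched as a tiny multi-version map. This is illustrative only: `Tablet`, `write`, and `read` are made-up names, and the real tablet persists its state to Colossus rather than holding it in memory.

```python
class Tablet:
    """Toy multi-version map: (key: string, timestamp: int64) -> string."""

    def __init__(self):
        self.bag = {}  # {(key, ts): value}

    def write(self, key, ts, value):
        self.bag[(key, ts)] = value

    def read(self, key, ts):
        # Return the newest version at or before ts, if any.
        versions = [t for (k, t) in self.bag if k == key and t <= ts]
        return self.bag[(key, max(versions))] if versions else None

t = Tablet()
t.write("a", 10, "v1")
t.write("a", 20, "v2")
assert t.read("a", 15) == "v1"  # a read at ts=15 sees only versions <= 15
assert t.read("a", 25) == "v2"
assert t.read("b", 25) is None
```

Keeping multiple timestamped versions per key is what makes lock-free snapshot reads at a past timestamp possible later in the paper.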
`Directory` is a bucketing abstraction on top of the key-value bag
- Helps client control data locality by choosing appropriate prefix
- Unit of data placement, replication configuration (also called placement)
- Data is moved across Paxos Groups directory by directory
- When? Load shedding, keep directory access together in same group, or near the client
- Move can happen during active client operations
`Movedir` background task implements these moves
- Runs in background until all but nominal data is left
- Then uses a transaction to move the last bit of data
- Why? Ongoing client operations must be allowed during moves
- One Paxos Group can contain multiple directories (compare: Bigtable is homogeneous). Why? Collocate directories accessed together
- Paxos state machine on top of each tablet to support replication
- State machine metadata and logs stored on the tablet
- Every write is recorded twice: a) paxos log, b) tablet log
- Initiate paxos protocol at Paxos Group leader
- Access state from any up to date replica
- A replica can play two roles: a) leader, b) follower
- Leader replica
- Long lived, with time based leader leases
- 2 key components: lock table, transaction manager
- Lock table
- Used for concurrency control
- Stores the state of two-phase locking: maps ranges of keys to lock states. Why pessimistic two-phase locking? Optimistic concurrency control performs poorly for long-lived transactions
- Transaction manager (TM)
- Support distributed transactions
- State is stored in underlying Paxos Group, so replicated
- In one Paxos Group
- TM implements a participant leader
- Other replicas are called participant slaves
- Across Paxos Groups
- One participant group is chosen as the coordinator; the participant leader of that group is called the coordinator leader
- The participant slaves of that group are called coordinator slaves
- Required for transactions that span Paxos Groups
Paper describes the TrueTime API and how it is used for distributed concurrency control.
`TTstamp` denotes a timestamp; `TT.now()` provides the current `TTstamp` with an error bound of epsilon.
- Implemented using GPS and atomic clocks. Why two sources? Different failure modes.
- API is provided by a set of masters per data center
- Masters compare time against each other
- Each machine in data center runs a daemon to synchronize from the masters
- Poll variety of masters from current data center and nearby data centers
- Use Marzullo's algorithm to detect and reject liars
- Between syncs the daemon advertises a slowly increasing epsilon
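Marzullo's algorithm finds the time interval consistent with the largest number of sources; sources whose intervals miss that overlap are the liars to reject. A minimal sketch (the function name and interval representation are mine):

```python
def marzullo(intervals):
    """Return the sub-interval agreed on by the most sources.

    intervals: list of (lo, hi) time estimates from different masters.
    Sources whose intervals miss the best overlap are treated as liars.
    """
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))   # an interval starts
        events.append((hi, -1))   # an interval ends
    events.sort()
    best, count, best_lo, best_hi = 0, 0, None, None
    for i, (t, delta) in enumerate(events):
        count += delta
        if count > best:
            best = count
            best_lo = t
            best_hi = events[i + 1][0]  # overlap holds until next endpoint
    return best_lo, best_hi

# Two honest sources agree on [10, 12]; the third (a liar) is rejected.
assert marzullo([(9, 12), (10, 14), (20, 25)]) == (10, 12)
```

The sweep-line form makes the "reject liars" step explicit: any source interval not containing the returned overlap disagrees with the majority.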
How is TT used to provide concurrency control?
3 types of transactions
- Read-only transactions execute lock free at a system-chosen timestamp. They comprise only reads, no writes
- Snapshot read is a read at a past timestamp (a client-provided timestamp, or an upper bound on staleness). Executes lock free
- Read-write transactions require a lock
Both read-only transactions and snapshot reads require that any write with a smaller timestamp has already committed before the read executes at the chosen timestamp.
System imposed invariants:
- For each Paxos Group, each Paxos leader’s lease interval is disjoint from every other leader’s
- For each Paxos Group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. How? Use (1) and let a leader assign timestamps within interval of leader lease
- External consistency invariant: if start of transaction T2 occurs after commit of transaction T1, then commit timestamp(T2) must be greater than commit timestamp(T1)
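The external consistency invariant is enforced by commit wait. A sketch of the idea against a hypothetical TrueTime interface: `tt_now()` returning an `(earliest, latest)` pair is my stand-in for `TT.now()`, and `EPSILON` is an illustrative error bound.

```python
import time

EPSILON = 0.007  # illustrative clock error bound (the paper reports ~1-7 ms)

def tt_now():
    """Stand-in for TT.now(): an (earliest, latest) interval that is
    guaranteed to contain absolute time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_write):
    """Pick commit timestamp s = TT.now().latest, then wait until
    TT.now().earliest > s before making the write visible. After the
    wait, s is definitely in the past at every clock, so any transaction
    that starts later observes a strictly larger timestamp."""
    s = tt_now()[1]
    while tt_now()[0] <= s:   # commit wait (roughly 2 * epsilon)
        time.sleep(EPSILON / 10)
    apply_write(s)
    return s

s = commit(lambda ts: None)
assert tt_now()[0] > s  # by now, s is guaranteed to be in the past
```

The cost of this guarantee is that every read-write commit stalls for about twice the clock uncertainty, which is why Google invests in GPS and atomic clocks to keep epsilon small.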
How to determine if a replica can satisfy a read at a timestamp? The read timestamp must satisfy `t <= t_safe = min(paxos safe ts, transaction mgr safe ts)`. Paxos safe ts is the timestamp of the highest write applied through Paxos. Transaction mgr safe ts is derived from the smallest prepared (not yet committed) timestamp across the Paxos Group. This ensures that any in-flight transaction that is already prepared will commit after the provided timestamp, so the read is safe.
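In code form, assuming a replica tracks the highest applied Paxos write and the prepare timestamps of in-flight transactions (the function and parameter names are illustrative):

```python
def t_safe(paxos_highest_applied_ts, prepared_ts_list):
    """A replica can serve a read at timestamp t iff t <= t_safe.

    paxos_highest_applied_ts: timestamp of the highest write this
        replica has applied through Paxos.
    prepared_ts_list: prepare timestamps of transactions that are
        prepared but not yet committed in this Paxos group.
    """
    tm_safe = min(prepared_ts_list) - 1 if prepared_ts_list else float("inf")
    return min(paxos_highest_applied_ts, tm_safe)

# With a transaction prepared at ts=50, reads above 49 must wait, because
# that transaction may still commit at some timestamp >= 50:
assert t_safe(100, [50, 70]) == 49
# With no in-flight transactions, the Paxos log bounds the safe timestamp:
assert t_safe(100, []) == 100
```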
For read-only transactions, Spanner chooses the oldest timestamp that preserves external consistency.
- Writes are buffered at the client until commit (compare: similar to Bigtable). So reads in a transaction do not see the effects of its own writes.
- Client drives the transaction. Why? Avoid sending data twice over WANs
- Reads within a RW transaction use wound-wait to avoid deadlocks.
- Client issues reads to leader replica, leader acquires read locks and reads data
- While a read request is open, the client keeps sending keepalives to participant leaders to prevent timeouts
- Client completes all reads and buffers all writes
- Client begins 2PC
- Client chooses a coordinator group and sends commit to each participant’s leader with coordinator identity and buffered writes
- Non coordinator participant leader acquires write locks
- It chooses a prepare timestamp preserving monotonicity
- Log a prepare record through Paxos
- Each participant notifies coordinator of prepare timestamp
- Coordinator leader also acquires write locks, but skips the prepare phase
- Waits for all participant leader’s prepare timestamps
- Choose a commit timestamp greater than the above timestamps (preserve monotonicity)
- Logs a commit record through paxos
- After commit wait, send commit timestamp to client and other participant leaders
- Each participant leader logs the transaction’s outcome through paxos
- All participants apply at the same timestamp and then release locks
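The coordinator's timestamp choice in the steps above can be sketched as follows (an illustrative function of mine; real participants log every step through Paxos):

```python
def choose_commit_ts(prepare_timestamps, tt_now_latest, last_assigned_ts):
    """Commit timestamp chosen by the coordinator leader. It must be:
      - >= every participant's prepare timestamp,
      - >= TT.now().latest when the coordinator received the commit request,
      - > any timestamp this leader previously assigned (monotonicity).
    """
    return max(max(prepare_timestamps), tt_now_latest, last_assigned_ts + 1)

# Participants prepared at 17 and 23, TT.now().latest was 20, and the
# leader last assigned 24, so the commit timestamp is 25.
assert choose_commit_ts([17, 23], 20, 24) == 25
```

The `max` over all three constraints is what lets invariants (1)-(3) compose: per-group monotonicity and external consistency both reduce to picking a large enough timestamp and then commit-waiting on it.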
- Assigning a timestamp for a read requires negotiation across all Paxos Groups involved in the read, so Spanner requires a scope expression
- Scope = summary of keys to be read
- If scope == single Paxos Group
- Client issues the query to the leader in the Group
- Leader assigns a read timestamp `<= last committed write timestamp` for the group
- Executes the read
- If scope spans multiple groups
- Avoid negotiation, instead wait for a safe timestamp at the coordinator leader
- Next reads can be sent to replicas which are sufficiently up to date
- Spanner supports non-blocking atomic schema changes (compare: Bigtable supports atomic schema changes, but they block other transactions)
- Assign the schema change transaction a future timestamp
- Reads and writes that depend on the schema synchronize with the registered schema change timestamp
- They proceed if their `timestamp < schema change timestamp`; else they wait
What to measure?
- Replication, Transactions and Availability
How to measure?
- Standard config for spanservers, clients on different machines
- Operations: read/write of 4KB
- Reads were done in-memory after a compaction to ensure only Spanner’s
callstack overhead is measured
- Warm read done to ensure warmed up location caches
- Availability evaluations
- Measure impact on throughput in the face of failure
- 5 zones, each with 25 spanservers. Test DB sharded into 1250 Paxos groups. 100 test clients issued non-snapshot reads at 50K reads/s
- 5s into each test, kill one zone
- Three scenarios: non-leader (kill Z2), leader-hard (kill Z1 abruptly), leader-soft (kill Z1 with notice, allowing leadership handoff)
The paper doesn’t explicitly mention the pains with Bigtable’s schema
management. Note that Bigtable provides a lexicographically sorted map of
(row, column-family:qualifier, timestamp) -> value. Spanner
improves upon this by providing locality through hierarchical tables. This is awesome
because multiple applications can store their data with a per-user or
a per-usecase sharding mechanism. E.g. both Liked Youtube videos and Starred
GMail entries for the user available on the same shard could enable cool data
mashups. Second, the benefits of data locality extend to low latency lookups.
It is interesting to see how the mental model has changed since GFS. Bigtable treated GFS as a simple file store for a cluster. Spanner treats the equivalent of Bigtable as a zone, while it operates at the scope of a universe (multiple zones). Additionally, the use cases have gradually changed from batch workloads to low latency systems and finally to globally replicated consistent systems.
SQL is too awesome to leave behind. Most data storage systems end up building an abstraction to support the query language even if they have dropped the ACID notion.
Externally consistent writes are a key contribution of this paper. One of the key insights for me was reducing the challenge of consistency to safe timestamps plus a small set of invariants that the system maintains. Second, the idea of the client driving the two-phase commit is quite interesting. It simplifies the worst case drastically: failure at the client implies retry, with no inconsistencies in server state.