Inside Out

Notes on seeking wisdom and crafting software.

Paper - Spanner: Google's Globally-Distributed Database

We covered Bigtable - a distributed structured store in the last paper. What were the shortcomings of Bigtable that Spanner tries to address? What are challenges of building a globally consistent database? We’ll try to gather a few interesting patterns.

Corbett, James C, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, et al. “Spanner: Google’s Globally-Distributed Database,” 2012, 14.

Motivation

Why not bigtable? Complex, evolving schemas are painful. Need for strong consistency in wide area replication.

Why not megastore? Supports semi-relational data model and sync replica. But relatively poor write throughput.

Features

Secret sauce: globally meaningful timestamps support linearizability. E.g. if transaction t1 commits before t2, timestamp guarantee ts1 < ts2. Spanner is first system to provide this globally. TrueTime api.

Data model

Application data model:

Spanner calls this interleaving of tables and it is crucial for data locality of multiple tables across shards.

Design

Key abstractions:

Paper describes spanservers in detail.

Spanserver (storage)

Spanserver (compute)

Paper describes the TrueTime API and how it is used for distributed concurrency control.

TrueTime

How is TT used to provide concurrency control?

Concurrency control

3 types of transactions

  1. Read-only transactions execute lock free at a system chosen time. Comprise of reads only, no writes
  2. Snapshot read is read at a past timestamp (client provided timestamp, or upperbound on staleness). Executes lock free
  3. Read-write transactions require a lock

Both (1) and (2) require a commit to be completed before the chosen timestamp.

System imposed invariants:

  1. For each Paxos Group, each Paxos leader’s lease interval is disjoint from every other leader’s
  2. For each Paxos Group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. How? Use (1) and let a leader assign timestamps within interval of leader lease
  3. External consistency invariant: if start of transaction T2 occurs after commit of transaction T1, then commit timestamp(T2) must be greater than commit timestamp(T1)

How to determine if a replica can satisfy a read at a timestamp? Compare the timestamp to < min(paxos safe ts, transaction mgr safe ts). Paxos Safe TS is the timestamp of highest write recorded in Paxos. Transaction Mgr Safe TS is the smallest prepared (not committed) timestamp across a Paxos Group. This simply ensures that any in-flight transactions which are prepared will be committed after provided timestamp, so our read is safe.

For read-only transactions, Spanner chooses the oldest timestamp that preserves external consistency.

Read-write transactions

Read-only transactions

Schema-change transactions

Evaluation

What to measure?

How to measure?

Conclusion

The paper doesn’t explicitly mention the pains with Bigtable’s schema management. Note that Bigtable provides a lexicographically sorted map of key-value like (row, column-family:qualifier, timestamp) -> value. Spanner improves upon this by providing locality hierarchical tables. This is awesome because multiple applications can store their data with a per-user or a per-usecase sharding mechanism. E.g. both Liked Youtube videos and Starred GMail entries for the user available on the same shard could enable cool data mashups. Second, the benefits of data locality extend to low latency lookups etc.

It is interesting to see how the mental model has changed since GFS. Bigtable treated GFS as a simple file store for a cluster. Spanner treats the equivalent of Bigtable as a zone while it operates at the scope of an universe (multiple zones). Additionally the use cases have gradually changes from batch workloads to low latency systems and finally globally replicated consistent systems.

SQL is too awesome to leave behind. Most data storage systems end up building an abstraction to support the query language even if they have dropped the ACID notion.

Externally consistent writes are a key contribution of this paper. One of the key insights for me was reducing the challenge of consistency to be based on safe timestamps and a set of invariants that the system maintains. Second, the idea of Client driving a two phase commit is quite interesting. It simplifies the worst case drastically, failure at client implies retry, no inconsistencies in the server state.