We covered Bigtable, a distributed structured store, in the last paper. What shortcomings of Bigtable does Spanner try to address? What are the challenges of building a globally consistent database? We'll try to gather a few interesting patterns.
Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, et al. "Spanner: Google's Globally-Distributed Database." In Proceedings of OSDI 2012.
Motivation
- Scalable, globally distributed across data centers, multi-version and synchronously-replicated.
- Replication for global availability and geographic locality (key contribution)
- Shards data across Paxos state machines (spanning different data centers); sharding is dynamic
- Auto migration of data across machines/data centers to balance load or handle failure
- Large scale: millions of machines, trillions of rows, hundreds of data centers
Why not Bigtable? Complex, evolving schemas are painful, and there is a need for strong consistency with wide-area replication.
Why not Megastore? It supports a semi-relational data model and synchronous replication, but has relatively poor write throughput.
Features
- Client side replication configurations
- Read latency: which data center should replicate, how far from user?
- Write latency: how far replicas are from each other?
- Durability, availability, performance: how many replicas?
- Externally consistent reads and writes
- Globally consistent reads across the database at a timestamp
Secret sauce: globally meaningful timestamps support linearizability. E.g. if transaction `t1` commits before `t2`, the timestamps are guaranteed to satisfy `ts1 < ts2`. Spanner is the first system to provide this globally, via the TrueTime API.
Data model
- Schematized, semi-relational tables. Why? The popularity of Megastore.
- Synchronous replication: supported by Megastore; Bigtable only supports eventually consistent replication
- Query language: SQL-like. Why? The popularity of Dremel for interactive data analysis
- Transactions: cross-row transactions were not supported in Bigtable. Spanner provides transactions and lets app developers deal with any performance bottlenecks (instead of working around missing transactions)
Application data model:
- Database created by the application
- DB contains unlimited schematized Tables
- Tables
- Rows, Columns and Versioned values
- Required: one or more primary key columns, ordered
- Mapping from primary-key columns to non-primary-key columns. Why? Control of data locality through keys (note: a common pattern across KV stores)
- Query language: SQL-like, with extensions for protobufs
- Tables are hierarchical in nature
- Top level is a `directory table` with row key K
- Each child table contributes its rows to one row in the parent
- E.g. `Users` is a directory table and `Albums` is a child table, so `Users(1)/Albums(1,2)` represents the album with id 2 under user 1
Spanner calls this interleaving of tables, and it is crucial for data locality of multiple tables across shards. A minimal sketch of the locality effect follows.
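A minimal sketch of the locality effect, assuming a simplified encoding where a child row's key is prefixed with its parent's key (the `Users`/`Albums` names follow the paper's example; the tuple encoding is purely illustrative):

```python
# Illustrative only: model interleaved rows as composite keys so that a
# lexicographic sort places each user's albums right after the user row.

def user_key(uid):
    return ("Users", uid)

def album_key(uid, aid):
    # A child row's key is prefixed with its parent's key.
    return ("Users", uid, "Albums", aid)

rows = {
    user_key(2): "user 2",
    album_key(1, 2): "album 2 of user 1",
    user_key(1): "user 1",
    album_key(1, 1): "album 1 of user 1",
    album_key(2, 1): "album 1 of user 2",
}

# Sorted key order keeps a user and all of its albums contiguous, so they
# share a key prefix and can live in the same directory (and thus shard).
for key in sorted(rows):
    print(key, "->", rows[key])
```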
Design
Key abstractions:
- A Spanner deployment is called a `universe`
- A universe has a `universemaster` and a `placementdriver`
- The `placementdriver` handles automated movement of data across zones in minutes; it talks to spanservers to find data that needs to be moved
- Both are singletons
- A universe consists of `zones` (compare: a zone is the equivalent of a Bigtable deployment)
- A zone has a zonemaster; the zonemaster assigns data to spanservers
- 100-1000 spanservers per zone; spanservers serve data to clients
- A zone has location proxies, used by clients to locate the spanservers for their data
- Zones consist of `spanservers` (compare: tabletserver in Bigtable)
- Each spanserver serves 100-1000 instances of tablets, similar to Bigtable's tablets
Paper describes spanservers in detail.
Spanserver (storage)
- A tablet is a data structure, no compute
- A tablet implements a bag of mappings `(key: string, timestamp: int64) -> string` (a toy model follows this list)
- Tablet state is stored in B-tree like files and a WAL on a distributed file system called Colossus (successor of GFS)
- The Paxos state machine log and metadata are stored on the tablet
- Paxos group: set of replicas for a tablet
- Supports long lived transactions (order of minutes), similar to Bigtable
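A toy model of the tablet's versioned mapping, assuming an in-memory dict in place of the B-tree-like files and WAL on Colossus:

```python
import bisect

class Tablet:
    """Toy model of the bag: (key: string, timestamp: int64) -> string."""

    def __init__(self):
        self.versions = {}  # key -> sorted list of (timestamp, value)

    def write(self, key, ts, value):
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def read(self, key, ts):
        # Return the newest value whose timestamp is <= ts, or None.
        vs = self.versions.get(key, [])
        i = bisect.bisect_right(vs, (ts, chr(0x10FFFF)))
        return vs[i - 1][1] if i else None

t = Tablet()
t.write("user/1", 10, "alice")
t.write("user/1", 20, "alice-v2")
assert t.read("user/1", 15) == "alice"     # sees the ts=10 version
assert t.read("user/1", 25) == "alice-v2"  # sees the ts=20 version
```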
- A `directory` is a bucketing abstraction on top of the key-value bag
- Helps clients control data locality by choosing an appropriate key prefix
- Unit of data placement, replication configuration (also called placement)
- Data is moved across Paxos Groups directory by directory
- When? For load shedding, to keep directories accessed together in the same group, or to move data near the client
- Move can happen during active client operations
- The `movedir` background task implements these moves (a rough sketch follows this list)
- Runs in the background until all but a nominal amount of data is left
- Then uses a transaction to move the last bit of data
- Why? Ongoing client operations must be allowed during moves
- One Paxos Group can contain multiple directories (compare: a Bigtable tablet is homogeneous). Why? To collocate directories that are accessed together
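A rough sketch of the movedir shape described above; the row lists, the nominal threshold, and the dict mutation standing in for a transaction are all assumptions for illustration:

```python
# Hypothetical sketch: background copy first, then one small transaction
# for the remainder, so clients are blocked only for the final sliver.

NOMINAL = 100  # assumed threshold for the "last bit" of data, in rows

def movedir(rows_by_group, src_group, dst_group):
    # Phase 1: copy in the background while clients keep operating on src.
    while len(rows_by_group[src_group]) > NOMINAL:
        rows_by_group[dst_group].append(rows_by_group[src_group].pop())

    # Phase 2: one short "transaction" moves what is left and flips
    # ownership atomically (a plain dict mutation stands in for it here).
    rows_by_group[dst_group].extend(rows_by_group.pop(src_group))

rows_by_group = {"group-A": [f"row-{i}" for i in range(1000)], "group-B": []}
movedir(rows_by_group, "group-A", "group-B")
assert len(rows_by_group["group-B"]) == 1000
```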
Spanserver (compute)
- Paxos state machine on top of each tablet to support replication
- State machine metadata and logs stored on the tablet
- Every write is recorded twice: a) in the Paxos log, b) in the tablet's log
- Writes
- Initiate the Paxos protocol at the Paxos Group leader
- Reads
- Access state from any sufficiently up-to-date replica
- A replica can play two roles: a) leader, b) follower
- Leader replica
- Long lived, with time based leader leases
- 2 key components: lock table, transaction manager
- Lock table
- Used for concurrency control
- Stores the state of two-phase locking: maps ranges of keys to lock states. Why pessimistic locking? Optimistic concurrency control works poorly for long-lived transactions
- Transaction manager (TM)
- Supports distributed transactions
- Its state is stored in the underlying Paxos Group, so it is replicated
- Within one Paxos Group
- The TM implements a participant leader
- The other replicas are called participant slaves
- Across Paxos Groups
- One group is chosen as the coordinator: its participant leader is called the coordinator leader, and that group's slaves are the coordinator slaves
- Required for transactions that span Paxos Groups
Paper describes the TrueTime API and how it is used for distributed concurrency control.
TrueTime
- `TTstamp` denotes a timestamp
- `TT.now()` returns the current `TTstamp` with an error bound of epsilon
- Implemented using GPS and atomic clocks. Why two sources? Different failure modes.
- The API is provided by a set of time masters per data center
- Masters compare time against each other
- Each machine in a data center runs a daemon to synchronize with the masters
- Polls a variety of masters from the current data center and nearby data centers
- Uses Marzullo's algorithm to detect and reject liars
- Between syncs, the daemon advertises a slowly increasing epsilon
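A sketch of the interface: `TT.now()` returns an uncertainty interval, and `TT.after(t)`/`TT.before(t)` derive from it (these three methods are from the paper; the clock source and fixed epsilon are stand-ins):

```python
import time

class TrueTime:
    """Sketch of the TrueTime API: now() returns an uncertainty interval."""

    def __init__(self, epsilon_s=0.007):
        # Stand-in constant; the real epsilon is derived from the time
        # masters and grows slowly between daemon synchronizations.
        self.epsilon = epsilon_s

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)  # (earliest, latest)

    def after(self, t):
        # True only if t has definitely passed on every clock.
        return self.now()[0] > t

    def before(self, t):
        # True only if t has definitely not arrived on any clock.
        return self.now()[1] < t

tt = TrueTime()
earliest, latest = tt.now()
print(f"uncertainty window: {latest - earliest:.3f}s")
```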
How is TT used to provide concurrency control?
Concurrency control
3 types of transactions
- Read-only transactions execute lock-free at a system-chosen timestamp. They comprise only reads, no writes
- A snapshot read is a read at a past timestamp (client-provided timestamp, or an upper bound on staleness). Executes lock-free
- Read-write transactions require a lock
Both read-only transactions and snapshot reads require commits up to the chosen timestamp to have completed before the read is served.
System imposed invariants:
- For each Paxos Group, each Paxos leader’s lease interval is disjoint from every other leader’s
- For each Paxos Group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. How? Use (1) and let a leader assign timestamps within interval of leader lease
- External consistency invariant: if start of transaction T2 occurs after commit of transaction T1, then commit timestamp(T2) must be greater than commit timestamp(T1)
How to determine if a replica can satisfy a read at a timestamp? Compare: the read is safe if `timestamp < min(paxos safe ts, transaction mgr safe ts)`. The Paxos safe timestamp is the timestamp of the highest write recorded through Paxos. The transaction manager safe timestamp is the smallest prepared (not yet committed) timestamp across the Paxos Group. This ensures that any in-flight transaction that has prepared will commit after the provided timestamp, so our read is safe.
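In code, the replica-side check is a single comparison (a sketch; the names are descriptive, not Spanner's, and `float("inf")` models the absence of prepared transactions):

```python
def can_serve_read(t_read, paxos_safe_ts, tm_safe_ts):
    # paxos_safe_ts: timestamp of the highest write applied through Paxos.
    # tm_safe_ts: smallest prepare timestamp among prepared-but-uncommitted
    # transactions; any such transaction commits at a timestamp >= its
    # prepare timestamp, so reading strictly below both bounds is safe.
    return t_read < min(paxos_safe_ts, tm_safe_ts)

# No prepared transactions in flight: only the Paxos bound matters.
assert can_serve_read(90, paxos_safe_ts=100, tm_safe_ts=float("inf"))
# A transaction prepared at 80 may still commit at >= 80, so a read at 90
# must wait.
assert not can_serve_read(90, paxos_safe_ts=100, tm_safe_ts=80)
```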
For read-only transactions, Spanner chooses the oldest timestamp that preserves external consistency.
Read-write transactions
- Buffered at client until commit (compare: similar to Bigtable). So reads in a transaction do not see the effects of its writes.
- Client drives the transaction. Why? Avoid sending data twice over WANs
- Reads within a read-write transaction use wound-wait to avoid deadlocks.
- Client issues reads to leader replica, leader acquires read locks and reads data
- While a read request is open, the client keeps sending keepalive messages to participant leaders to prevent timeouts
- Client completes all reads and buffers all writes
- Client begins 2PC
- Client chooses a coordinator group and sends commit to each participant’s leader with coordinator identity and buffered writes
- Each non-coordinator participant leader acquires write locks
- It chooses a prepare timestamp preserving monotonicity
- Logs a prepare record through Paxos
- Notifies the coordinator of its prepare timestamp
- The coordinator leader also acquires write locks, but skips the prepare phase
- Waits for all participant leaders' prepare timestamps
- Chooses a commit timestamp no less than all prepare timestamps and no less than `TT.now().latest` when the commit request was received (preserving monotonicity)
- Logs a commit record through Paxos
- After commit wait, sends the commit timestamp to the client and the other participant leaders
- Each participant leader logs the transaction's outcome through Paxos
- All participants apply at the same timestamp and then release locks (a sketch of the commit-wait step follows)
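The commit-wait step at the coordinator can be sketched directly in terms of TrueTime (reusing the `TrueTime` sketch from the TrueTime section; the timestamp rule follows the paper, the rest is scaffolding):

```python
import time

def coordinator_commit(tt, prepare_timestamps):
    # Choose commit timestamp s: no less than any participant's prepare
    # timestamp and no less than TT.now().latest when the commit message
    # was received (monotonicity against earlier timestamps also applies).
    s = max(max(prepare_timestamps), tt.now()[1])

    # Commit wait: do not expose the commit until s has definitely passed
    # on every clock; this is what makes commits externally consistent.
    while not tt.after(s):
        time.sleep(0.001)
    return s

# Usage (with the TrueTime sketch from above):
# tt = TrueTime(); s = coordinator_commit(tt, prepare_timestamps=[tt.now()[1]])
```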
Read-only transactions
- Assigning a timestamp for a read requires negotiation across all Paxos Groups involved, so reads require a scope expression
- Scope = summary of keys to be read
- If the scope is a single Paxos Group
- The client issues the query to the leader of that group
- The leader assigns a read timestamp `<=` the timestamp of the last committed write in the group
- And executes the read
- If the scope spans multiple groups
- Avoid negotiation and instead wait for a safe timestamp at the coordinator leader (see the sketch below)
- The reads can then be sent to any replicas that are sufficiently up to date
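A sketch of the timestamp choice by scope; `last_committed_ts()` is an assumed accessor rather than Spanner's API (the single-group shortcut and the multi-group `TT.now().latest` choice are from the paper):

```python
def choose_read_timestamp(tt, group_leaders):
    if len(group_leaders) == 1:
        # Single group: read at the last committed write's timestamp,
        # which requires no waiting at all.
        return group_leaders[0].last_committed_ts()
    # Multiple groups: skip negotiation and use TT.now().latest; the read
    # may then block until each replica's safe time passes this value.
    return tt.now()[1]
```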
Schema-change transactions
- Spanner supports non-blocking atomic schema changes (compare: Bigtable supports atomic schema changes, but they block other transactions)
- Assign the schema change transaction a future timestamp
- Reads and writes that depend on the schema synchronize with the registered schema-change timestamp
- They proceed if their `timestamp < schema change timestamp`
- Otherwise they wait
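The synchronization rule reduces to a timestamp comparison, as a minimal sketch:

```python
def may_proceed(op_ts, schema_change_ts):
    # Operations timestamped before the registered schema-change timestamp
    # run against the old schema; anything at or after it must wait until
    # the change has been applied.
    return op_ts < schema_change_ts

assert may_proceed(99, schema_change_ts=100)
assert not may_proceed(100, schema_change_ts=100)  # waits for the change
```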
Evaluation
What to measure?
- Replication, Transactions and Availability
How to measure?
- Standard config for spanservers, clients on different machines
- Operations: read/write of 4KB
- Reads were done in-memory after a compaction so that only Spanner's call-stack overhead is measured
- A warm read was done first to ensure warmed-up location caches
- Availability evaluations
- Measure impact on throughput in the face of failure
- 5 zones, each with 25 spanservers. The test DB was sharded into 1250 Paxos groups. 100 test clients issued non-snapshot reads at 50K reads/s
- 5s into each test, kill one zone
- Three variants: non-leader (kill Z2), leader-soft (kill Z1, letting leaders hand off leadership first), leader-hard (kill Z1 with no warning)
Conclusion
The paper doesn't explicitly mention the pains with Bigtable's schema management. Note that Bigtable provides a lexicographically sorted map of key-values: `(row, column-family:qualifier, timestamp) -> value`. Spanner improves on this by providing locality through hierarchical tables. This is awesome because multiple applications can store their data with a per-user or per-usecase sharding mechanism. E.g. having both liked YouTube videos and starred Gmail entries for a user available on the same shard could enable cool data mashups. Second, the benefits of data locality extend to low-latency lookups etc.
It is interesting to see how the mental model has changed since GFS. Bigtable treated GFS as a simple file store for a cluster. Spanner treats the equivalent of a Bigtable deployment as a zone while it operates at the scope of a universe (multiple zones). Additionally, the use cases have gradually changed from batch workloads to low-latency systems and finally to globally replicated, consistent systems.
SQL is too awesome to leave behind. Most data storage systems end up building an abstraction to support the query language even if they have dropped the ACID notion.
Externally consistent writes are a key contribution of this paper. One key insight for me was reducing the challenge of consistency to safe timestamps and a set of invariants that the system maintains. Second, the idea of the client driving the two-phase commit is quite interesting. It simplifies the worst case drastically: failure at the client implies a retry, with no inconsistencies in server state.