Inside Out | Article - ZippyDB at Facebook

ZippyDB is a distributed key-value store that uses RocksDB as its storage engine. Like the other distributed storage systems we have studied here, it offers many configuration options for distinct application use cases. Our primary motivation in this article is to understand the design and trade-offs in ZippyDB. A second goal is to map out the common building blocks of distributed KV stores.

Here’s the reference article: https://engineering.fb.com/2021/08/06/core-data/zippydb/

Historically, RocksDB predates ZippyDB. ZippyDB emerged later as a system to address the common problems that teams using RocksDB kept solving on their own. It started with the data (state) replication problem, in its initial avatar as a reusable library (Data Shuttle).

Our summary below looks at three different aspects. We start with the topology, then study storage (API, shard management and replication), and finally look at the fundamental guarantees of the system.

Topology

The ZippyDB service runs in deployment units called Tiers. Each Tier comprises Regions spread geographically for fault tolerance. Tiers are multi-tenant.

A Shard is the unit of data management. Each Shard is replicated across regions. The Data Shuttle library manages the replication. Replicas play one of two roles.

Global scope replicas use Multi-Paxos for synchronous replication. This provides the high availability and durability guarantees. The other replicas are followers, which receive data through asynchronous replication. The two roles offer different trade-offs between durability and write/read performance.

Applications can configure <replica, region> placement for latency reasons. Additionally, ZippyDB provides caching and pub/sub events for data mutation.

Storage

API model

Usual CRUD operations are supported through get, put, delete.
Both client side and server side transactions (conditional writes) are supported.
APIs support a batch mode with multiple keys.
Data lifetime can be managed on a per-object basis using TTLs. Expired data is cleaned up efficiently by piggybacking on RocksDB compaction.
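
The API surface above can be sketched with a toy in-memory store. This is only an illustration, not ZippyDB's client API: the class name `ToyKVStore`, the methods `multi_get` and `compact`, and the `ttl_seconds` parameter are all hypothetical. Expired keys are hidden on read and physically removed in a separate `compact()` pass, loosely mirroring the idea of TTL cleanup piggybacking on compaction.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


class ToyKVStore:
    """In-memory stand-in for a get/put/delete API with per-key TTLs (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (value, absolute expiry time or None for "never expires")
        self._data: Dict[str, Tuple[bytes, Optional[float]]] = {}

    def put(self, key: str, value: bytes, ttl_seconds: Optional[float] = None) -> None:
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            return None  # expired: hidden from readers, reclaimed later by compact()
        return value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def multi_get(self, keys: Iterable[str]) -> Dict[str, Optional[bytes]]:
        """Batch read: one call, many keys."""
        return {key: self.get(key) for key in keys}

    def compact(self) -> int:
        """Drop all expired entries in one pass and return how many were removed."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._data.items() if exp is not None and now >= exp]
        for key in expired:
            del self._data[key]
        return len(expired)
```

Passing the clock in as a parameter keeps the sketch deterministic: a caller can advance a fake clock instead of sleeping.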

Shard management

A shard is a unit of data management. Shard Manager is an external service responsible for managing shards.
Shard placement depends on load, failure domains and user constraints (e.g. region/replica configuration).

There are two levels of shards: p-shards represent the physical data units, and mu-shards the logical ones. p-shards are not exposed to the application, whereas mu-shards are specified by the app and control the data locality of keys. Each p-shard is 50-100 GB and comprises ~10K mu-shards. This separation enables data to be resharded transparently based on load and other characteristics.

mu-shards can be mapped to p-shards in two ways.

Compact mapping uses static assignment, and mapping changes only on shard splits.
Akkio uses dynamic assignment based on traffic patterns. It places mu-shards in the regions where they are frequently accessed. This provides low latency and reduces cost by avoiding data duplication across all regions.
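
To make the static (compact) mapping concrete, here is a small sketch in which contiguous ranges of mu-shard ids are owned by p-shards, and the table changes only when a p-shard is split. The range-based layout, the class name `CompactShardMap` and the `split` policy (halve the range) are assumptions made for illustration; the article does not describe the actual data structure.

```python
import bisect
from typing import List


class CompactShardMap:
    """Range-based mu-shard -> p-shard table (assumed layout, not ZippyDB's)."""

    def __init__(self, num_mu_shards: int) -> None:
        self.num_mu_shards = num_mu_shards
        self._starts: List[int] = [0]    # first mu-shard id of each range, kept sorted
        self._p_shards: List[int] = [0]  # p-shard id owning the range at the same index
        self._next_p_shard = 1

    def p_shard_for(self, mu_shard: int) -> int:
        if not 0 <= mu_shard < self.num_mu_shards:
            raise ValueError("unknown mu-shard")
        # Find the last range whose start is <= mu_shard.
        index = bisect.bisect_right(self._starts, mu_shard) - 1
        return self._p_shards[index]

    def split(self, p_shard: int) -> int:
        """Halve a p-shard's range; the upper half moves to a newly created p-shard."""
        index = self._p_shards.index(p_shard)
        start = self._starts[index]
        if index + 1 < len(self._starts):
            end = self._starts[index + 1]
        else:
            end = self.num_mu_shards
        if end - start < 2:
            raise ValueError("range too small to split")
        new_p_shard = self._next_p_shard
        self._next_p_shard += 1
        self._starts.insert(index + 1, (start + end) // 2)
        self._p_shards.insert(index + 1, new_p_shard)
        return new_p_shard
```

The application only ever names a mu-shard; a split changes the answer of `p_shard_for` without the application noticing, which is the transparency the two-level scheme is after.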

Replication

The Data Shuttle library manages replication. The replication process splits time into units called epochs, with the invariant that each epoch has a single leader.

Shard Manager, an external service, performs the leader election. Leases are maintained through heartbeats throughout the epoch.

Happy path: the leader replica generates a total order of writes by assigning monotonically increasing sequence numbers. These writes are persisted in the replicated durable log. Once consensus is reached, each follower replica drains the log.

Failure path: Shard Manager detects the failure, reassigns the leader and restores write availability.
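
The happy path can be sketched as a toy leader that stamps each write with the next sequence number, appends it to every reachable replica's log, and treats the write as committed once a majority has accepted it. This is a simplification under stated assumptions, not Data Shuttle's protocol: the `Replica` and `Leader` classes are hypothetical, there is no networking, and catch-up for replicas that missed entries is not modeled. When the leader cannot reach a majority it simply stops accepting writes, standing in for the point where a new epoch and leader would be needed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

LogEntry = Tuple[int, str, str]  # (sequence number, key, value)


@dataclass
class Replica:
    name: str
    reachable: bool = True
    log: List[LogEntry] = field(default_factory=list)
    state: Dict[str, str] = field(default_factory=dict)
    applied_seq: int = 0

    def append(self, entry: LogEntry) -> bool:
        """Persist an entry in the local log; returns False if the replica is down."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True

    def drain(self, committed_seq: int) -> None:
        """Apply logged entries, in order, up to the committed sequence number."""
        for seq, key, value in self.log:
            if self.applied_seq < seq <= committed_seq:
                self.state[key] = value
                self.applied_seq = seq


class Leader:
    def __init__(self, epoch: int, replicas: List[Replica]) -> None:
        self.epoch = epoch
        self.replicas = replicas  # includes the leader's own replica
        self.next_seq = 1
        self.committed_seq = 0
        self.stepped_down = False

    def write(self, key: str, value: str) -> int:
        if self.stepped_down:
            raise RuntimeError("leader lost its quorum; a new epoch is needed")
        seq = self.next_seq
        self.next_seq += 1
        acks = sum(1 for r in self.replicas if r.append((seq, key, value)))
        if 2 * acks <= len(self.replicas):  # no strict majority
            self.stepped_down = True
            raise RuntimeError("no majority for write")
        self.committed_seq = seq
        return seq
```

With three replicas and one of them down, writes still commit; with two down, the leader gives up rather than acknowledging a write that a majority never saw.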

Fundamentals

Consistency

Consistency levels are configurable per request.

Writes support two modes with different trade-offs. Default writes are acknowledged only after a majority of replicas (Paxos) have acked and the write is persisted on the primary. This provides strong consistency for reads served by the primary. Fast-ack writes, on the other hand, respond as soon as the write is queued on the primary. This helps in low-latency write scenarios, at the cost of weaker consistency and durability guarantees.

Reads support three consistency levels.

Eventual reads support bounded staleness, i.e., reads are not served by replicas lagging beyond a threshold.
Read-your-writes consistency is supported using sequence numbers. Clients cache the sequence number returned by their latest write and send it with later reads; a replica serves the read only if it has caught up to at least that sequence number.
Strong consistency is provided by routing reads to the primary replica. This avoids the cost of a read quorum by assuming the primary still owns the lease. In the worst case, a quorum check is performed.
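
The three read levels boil down to different eligibility rules for the replica that serves the read. The sketch below is an assumption-laden illustration: `ReplicaView`, `pick_replica`, the level names and the idea of measuring lag in sequence numbers (rather than, say, time) are all hypothetical, and the lease/quorum fallback for strong reads is not modeled. Replicas are tried in list order, as if sorted by proximity to the client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaView:
    name: str
    applied_seq: int  # highest sequence number this replica has applied
    is_primary: bool = False


def pick_replica(
    replicas: List[ReplicaView],
    level: str,
    max_lag: int = 0,
    min_seq: int = 0,
) -> Optional[ReplicaView]:
    """Return the first replica (in list order) allowed to serve a read at `level`."""
    if level not in ("eventual", "read_your_writes", "strong"):
        raise ValueError(f"unknown consistency level: {level}")
    primary = next(r for r in replicas if r.is_primary)
    if level == "strong":
        return primary  # assumes the primary still holds its lease
    for replica in replicas:
        if level == "eventual":
            # Bounded staleness: skip replicas lagging the primary by more than max_lag.
            if primary.applied_seq - replica.applied_seq <= max_lag:
                return replica
        elif replica.applied_seq >= min_seq:
            # Read-your-writes: the replica must have applied the client's last write.
            return replica
    return None
```

Here `min_seq` plays the role of the sequence number a client cached from its latest write.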

Transactions

Transactions provide atomic read-modify-write operations on a set of keys within a single shard.

Client side transactions

The client reads data from a replica, modifies it and sends both the read and write sets to the primary.
The primary checks whether conflicting transactions have been committed, comparing the read snapshot with a tracked minimum version.
- Read snapshot earlier than the tracked minimum version: reject.
- Transaction spans epochs: reject.
- No conflicts: commit.
Read-only transactions work the same way, with the client sending empty write data.
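
The commit check on the primary can be sketched as a small optimistic concurrency validator. Everything named here (`Txn`, `Primary`, `forget_up_to`, tracking the last write sequence per key) is an assumption for illustration; the article only states the rejection rules, not how the primary tracks recent writes.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Txn:
    epoch: int                # epoch in which the client took its read snapshot
    snapshot_seq: int         # sequence number of the read snapshot
    read_keys: Set[str]
    writes: Dict[str, str] = field(default_factory=dict)  # empty for read-only


class Primary:
    def __init__(self, epoch: int) -> None:
        self.epoch = epoch
        self.seq = 0
        self.data: Dict[str, str] = {}
        self.last_write_seq: Dict[str, int] = {}  # recent writes only
        self.min_tracked_seq = 0  # writes at or below this version are forgotten

    def forget_up_to(self, seq: int) -> None:
        """Trim write history; snapshots older than `seq` can no longer be validated."""
        self.min_tracked_seq = max(self.min_tracked_seq, seq)
        self.last_write_seq = {
            k: s for k, s in self.last_write_seq.items() if s > self.min_tracked_seq
        }

    def commit(self, txn: Txn) -> bool:
        if txn.epoch != self.epoch:
            return False  # transaction spans epochs
        if txn.snapshot_seq < self.min_tracked_seq:
            return False  # snapshot too old to check for conflicts
        if any(self.last_write_seq.get(k, 0) > txn.snapshot_seq for k in txn.read_keys):
            return False  # a key that was read has been overwritten since the snapshot
        if txn.writes:
            self.seq += 1
            for key, value in txn.writes.items():
                self.data[key] = value
                self.last_write_seq[key] = self.seq
        return True
```

A read-only transaction goes through the same validation and simply has nothing to apply.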

Server side transactions are similar, but with a better client API.

The client specifies a pre-condition along with the write data. There are three options: "key present", "key not present", and "value matches or key not present".
Server creates a transaction with pre-conditions.
More efficient if the client can compute pre-condition without an extra read.
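
A conditional write with these three pre-conditions can be sketched against a plain dictionary. The enum `Precondition` and the function `conditional_put` are hypothetical names, and the real server would evaluate the condition and apply the write atomically inside a transaction rather than on a local dict.

```python
from enum import Enum
from typing import Dict, Optional


class Precondition(Enum):
    KEY_PRESENT = "key_present"
    KEY_NOT_PRESENT = "key_not_present"
    VALUE_MATCHES_OR_KEY_NOT_PRESENT = "value_matches_or_key_not_present"


def conditional_put(
    store: Dict[str, str],
    key: str,
    value: str,
    precondition: Precondition,
    expected: Optional[str] = None,
) -> bool:
    """Apply the write only if the pre-condition holds; returns whether it was applied."""
    exists = key in store
    if precondition is Precondition.KEY_PRESENT:
        ok = exists
    elif precondition is Precondition.KEY_NOT_PRESENT:
        ok = not exists
    else:
        # Value matches the expected one, or the key does not exist yet.
        ok = (not exists) or store[key] == expected
    if ok:
        store[key] = value
    return ok
```

For example, "create if absent" needs no prior read at all, which is where the efficiency gain over a client side transaction comes from.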

Future work

Disaggregated storage (also featured in the paper on the evolution of RocksDB) is an interesting take on distributed key-value store architecture.

A few noteworthy patterns to learn from this paper.

Dividing the responsibilities: State management (Data Shuttle) and Orchestration (Shard manager) are the two key building blocks in addition to Core Storage (RocksDB).

State management ensures high availability and durability with global scope replicas built atop consensus (Paxos). However, scale is managed through asynchronous replication for the rest of the replicas (followers). This is a sweet spot.

Orchestration must ensure optimal placement of replicas to guarantee low latency. ZippyDB allows the app to configure replica affinity to a region. How does the system scale with such a constraint? Expose the logical shard (mu-shard) to the app and let the system control the physical data units (p-shards). The latter contains the former, so the system can change the physical location of data through resharding without impacting the app.

Shard manager also owns the truth of leader election and memberships. It also takes care of failure detection through heartbeats.

Bounded staleness is an interesting pattern for eventual reads. I’m curious how the system manages the metadata for it, i.e. how it knows that a replica is stale. Where is this truth stored (Shard Manager or Data Shuttle)? It could be based on the replication cursor on the primary.

Atomic read-modify-write within a single shard is an interesting trade-off that favors simplicity in both the client side API and the server implementation. Server side transactions make this even more usable, with better latency.