Inside Out | Paper - Evolution of Development Priorities in Key-Value Stores

Key value stores are a fundamental abstraction in distributed systems and serve as a backbone for many other services (blob stores, metadata etc.) or applications. What are the typical hotspots in a LSM tree based engine? What do you optimize? What works well? What are the axes of change? We will study a paper with a behind the scenes commentary on evolution of a key value store along with the rationale.

Dong, Siying, Andrew Kryczka, Yanqin Jin, and Michael Stumm. “Evolution of Development Priorities in Key-Value Stores Serving Large-Scale Applications: The RocksDB Experience,” n.d., 18.

Let’s set the stage with some background context.

Background

RocksDB is an embedded storage engine heavily optimized for SSDs. Doesn’t handle multiple host concerns like state replication etc.
Highly configurable to serve various workloads including stream processing, logging, caching etc.
SSDs are thousand times faster than hard disks. High write bandwidth limited by smaller program/erase cycles.
Shifts performance bottleneck from disks to network! Applications designed to store data locally.

Architecture notes

Writes go to memtable, an in-memory ordered Skip List with O(logN) insert and search. WAL is used for fault recovery.
Compaction process moves data from memory to various levels of SSTables (Sorted Sequence Tables). Bloom filters are used to eliminate unnecessary search and binary search is used on candidate levels.
Multiple types of compaction are supported - secret sauce for supporting various workloads.
- Leveled compaction uses exponentially increasing size targets.
- Tiered (universal) compaction uses multiple sorted runs are lazily compacted together.
- FIFO compaction discards old files after a size. Used for in-memory caching apps.

A lazier compaction algorithm improves write amplification and write throughput, but read performance suffers, while a more aggressive compaction sacrifices write performance but allows for faster reads.

The paper points out three axes that influenced the evolution of RocksDB. First, the desire to optimize resources - SSD lifetime, space or CPU. The large scale workloads powered by RocksDB were the second dimension of change. Finally, changes for improving failure handling.

Resource optimizations

Chronological order of optimizations in RocksDB.

Write amplification
- Goal: save erase cycles for write heavy workloads.
- Caused by SSDs themselves, a factor of 1.1-3. Or, by the storage engine, a factor of up to 100.
- Changes: added Tiered compaction. Brought down write amplification from 10-30 to 4-10.
Space amplification
- Observed that most applications are space constrained since true SSD IOPS are rarely utilized to the maximum, so erase cycles doesn’t matter a lot.
- Caused by leveled compaction, up to 25% space wastage. Worst case 90%.
- Changes: added Dynamic Leveled Compaction, where size of each level is automatically adjusted based on last level’s size.
CPU utilization
- Misnomer/fad: SSDs are so fast that bottleneck is now the CPU. Not seen in well balanced configurations. 42 deployments of ZippyDB were surveyed to validate this.
- Intensive write workloads may have high CPU. Mitigation: use a lightweight compression.
- Changes: prefix bloom filters, apply bloom filter before index lookups etc.
Adapt to newer technologies
- SSD innovations to improve latency and save erase cycles e.g. open-channel SSDs. Not relevant since apps are space constrained.
- In-storage computing may disrupt the unified RocksDB experience.
- Disaggregated (remote) storage where the drive and compute are separate, so both can be fully utilized at the same time. Both resources can scale independently which is not possible with locally attached drives. Ongoing work in RocksDB to parallelize I/Os, transient failure tolerance etc.
- Storage class memory is another interesting area of research. Can it be used as the primary persistent storage, or with DRAM for memtables?
LSM tree improvements
- Changes: separation of keys and values (WiscKey).

Network doesn’t feature in this list given the embedded nature of RocksDB. It is likely a concern of a higher level abstraction, e.g. ZippyDB or the application itself which manages the RocksDB instances in a cluster.

Large scale workloads

Downstream applications influenced following changes in RocksDB.

Resource management
- Distributed KV stores built on RocksDB partition data into shards. Each server node hosts multiple shards.
- Challenge: Multiple instances of RocksDB impact resource consumption on the node. Bounds are necessary in both node level or at each process level. Resources must be prioritized for the most critical instance.
- Use thread pools to cap size and resource usage.
- Node level resource management has two broad strategies: a) conservative resource allocation, and b) share resource information amongst instances to let each adapt.
Write ahead logs
- Distributed systems redundantly store data in replicas, or paxos logs etc.
- Challenge: WAL writes are not critical or irrelevant in some cases.
- Changes: allow WAL to be sync, buffered or turned off.
Throttle file deletions
- Challenge: file deletions trigger unnecessary data movements in SSD due to internal garbage collection in the drive. Increases latency.
- Changes: rate limit simultaneous file deletion after compaction.
Data format compatibility
- Challenge: continuous deployment or replication of RocksDB may require new versions to work with old data format (back compat), or older RocksDB version to work with new data format (forward compat) in case of rollbacks.
- Except new feature, ensure data is both backward/forward compatible.
Configuration management
- Challenge: LevelDB used config embedded in code, data unusable in diff RocksDB instances. Too many options, choice depends on workload and underlying system.
- Changes: automatically adapt configuration while supporting explicit tuning.
Replication and backup
- Replication is supported in two modes. Logical copying copies records from another replica through a scanner/iterator API. Physical copying copies the underlying tables/metadata directly to a replica.
  - Changes: allow logical caching to not trash the cache (for online queries). Allow physical copying by not coupling RocksDB with the SSD directly, e.g. using block device or FTLs instead of file system.
- Backups are similar (logical and physical modes), however apps need multiple backups. RocksDB provides a backup engine.

Failure handling

Detect data corruptions early to reduce probability of data loss.
Data integrity at multiple layers to prevent propagation of corruptions to replicas or clients.
Automated fault recovery for local faults e.g. from underlying storage system. Use periodic retries for transient faults. Reduced operational load on engineers.

RocksDB level corruptions introduced once every three months for 100PB data size. 40% of corruptions propagated to replicas. Corruptions happen due to data transfer software bugs.

Levels of protection: a) checksum for every SSTable block or WAL fragment, b) file level for every SSTable file, c) handoff checksum for underlying storage.

Challenge: data unprotected above file IO layer e.g. memtable.

Solution: per-record checksums.

Extensions to Key/Value API

Transaction support in RocksDB to enable to higher layer 2PC commits.
Versions and timestamps for each record (key-value pair).
- 56-bit seq numbers used to identify versions for a record.
- Client can take a snapshot, RocksDB ensures all versions in snapshot remain alive until snapshot is released.
- Challenge: cannot read past version without a snapshot. Seq number is local to every RocksDB instance, complex for replicated shards since each shard is another instance.
- Workaround: client can encode version/timestamp within key or value. Costs performance.
- Change: add feature to support client specified timestamps in records. Optimizations: use timestamp as a metadata in bloom filter or SSTable properties for filtering.

Future work

Paper concludes with a set of open questions for further research.

How can we use SSD/HDD hybrid storage to improve efficiency?

How can we mitigate the performance impact on readers when there are many consecutive deletion markers?

How should we improve our write throttling algorithms?

Can we develop an efficient way of comparing two replicas to ensure they contain the same data?

How can we best exploit SCM? Should we still use LSM tree and how to organize storage hierarchy?

Can there be a generic integrity API to handle data hand- off between RocksDB and the file system layer?

It is interesting to observe the journey from resources (e.g. SSD/CPU, a core layer concern) to space amplification (a workload dependent concern). Workloads are the king. We must begin with the end in mind. The challenge may be a tremendous push for the core layers to expand well beyond their logical boundaries. This leads to feature creep and a cycle of supporting custom-built features for one workload that the team ends up spending energy throughout.

How do we decide where a feature should be built? E.g. in the core storage vs a downstream consumer. This is specifically hard when both are platform pieces. Paper doesn’t explicitly call it out. I believe one of the priority call must be to keep the core as small as possible. Let the core be an enabler for others to build upon while preserving the right encapsulation. This may be evident with multitude of configuration options/compaction mechanisms at one end and supporting data compatibility/integrity in core instead of letting the app manage those.

Read next: ZippyDB builds on the RocksDB instances to provide a large scale fault tolerant/replicated storage service.