LSM trees are the backbone of modern key value stores. We’ll talk about a paper based on a simple (yet, powerful) idea of storing keys and values separately in LSM trees. The paper dives into the challenges with range query, garbage collection and consistency with this fundamental change. A side note: although this paper has been on my to-read list for long time now, a talk this week provided the final push to study this (lest I forget the context :)).
Lu, Lanyue, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. “WiscKey: Separating Keys from Values in SSD-Conscious Storage.” ACM Transactions on Storage 13, no. 1 (March 24, 2017): 1–28. https://doi.org/10.1145/3033273.
B-Trees are the goto data structures for databases and file systems. Small changes in B-Tree may lead to large number of random accesses. Random accesses are 100x slower on HDDs and impact performance. LSM trees solve this by batching writes and using sequential disk accesses. However, LSM trees use merge sort to arrange ingested data in various data tiers through the compaction process. This may lead to upto 50x IO amplification i.e., data accesses vs data retrieved.
Are the assumptions in LSM tree design valid today? Can we leverage the capabilities of modern SSDs (e.g. parallel access) and reduce IO amplification? Will it lead to a better performance?
Wisckey proposes a design that beats leveldb and rocksdb on all YCSB benchmarks.
LSM trees store records sorted by the key in multiple tiers (or levels) of exponentially increasing sizes. First tier is the smallest and is stored in-memory. Incoming writes are written to a disk log (for crash recovery) and are appended to the first tier. Upon crossing a threshold, data from tier 1 are merged with the tier 2 disk file through merge sort (this is called compaction). Compaction happens across any two adjacent tiers in the background and deletes the source tier file after a successful merge.
Writes are fast. Reads may need to search multiple tiers for a key. Compactions run in the background to rearrange data with merge sorts.
LevelDB uses two in-memory data structures (called
memtable) for tier 1 with
one of them being immutable. Both are sorted skiplists. Seven tiers of disk
sstable for Sorted String Table) are used. SSTables comprise of
data blocks followed by metadata and index blocks. Writes are stored on the
mutable memtable. Once mutable memtable is full, a new mutable memtable is
created and previous is marked immutable. Compaction pushes the immutable
memtable to disk and discards the log file.
LevelDB provides a few interesting optimizations.
SSTable data blocks store keys as prefix compressed i.e, common prefix are ignored and suffix is stored. Full keys are stored in restart point at end of the sequence, allowing a quick binary search of keys in a lookup. Second, index blocks at end of SSTable are pointers to the data blocks, used to identify which blocks to look. Third, filter block keeps bloom filters, used to eliminate seeks in case the SSTable doesn’t have a key.
Max lookups for a key is bounded by number of tiers (or levels). First disk tier (level 0) may contain overlapping keys. So each lookup includes all files in level 0, and one file each from subsequent levels (worst case). LevelDB throttles foreground writes to L0 if count of files cross a threshold.
L1-Lk levels on disk have manifest files indicating the min and max keys. This is used to eliminate seeks during lookups.
Amplification is the ratio between total size of data written/read and the data requested. E.g., 200 bytes seeked to lookup a 2 byte record implies 100x read amplification.
Paper observes that the read amplification in LevelDB is much higher than the write amplification. Compared to B-Tree, the writes are much better since the random accesses are 100x slower than sequential. Reads are at par.
LSM trees are suitable for SSDs for two reasons. First, random writes in SSDs show performance degradation with utilization. Second, aggregate throughput of parallel random reads in SSDs matches the throughput of sequential reads for some workloads.
Reduce the write amplification by storing keys separately from the values. Optimize reads by using parallel random accesses supported by modern SSDs.
- Values are no longer sorted. Range scans involve random accesses to value log.
- Garbage collection and crash recovery of a separate value log is hard since crash will lead to dangling keys or values.
- Wisckey doesn’t work well for workloads with smaller values ingested sequentially or scanning a large dataset sequentially. Paper calls out the design goal to support smaller keys and larger value sizes.
Wisckey is based on LevelDB. It provides the same set of rich APIs.
Writes have low amplification since the size of LSM tree is smaller. For a key-value pair, first the value and key are written to the log file, followed by inserting the key into the tree with value offset and size. Deletes remove the key leaving value to be garbage collected.
Reads first lookup the key in the LSM tree. This is faster because a smaller tree can be effectively cached in memory and will have fewer levels. Value lookups are a pointer access to the disk. Range queries first retrieve the keys and push them into a queue. Multiple threads concurrently pop the keys and retrieve values from the value log.
Note that Delete operations just remove the key. GC aims to reclaim space by removing the values. Wisckey follows a two pointer algorithm. Compare with LevelDB which merges GC with compaction process.
Value log is a singly linked list with new entries getting added at the
Log maintains an invariant that the list between
tail is always
valid and used for lookups. GC starts from the tail, for each value, it looks up
the key in LSM tree and if the key is found, the value is inserted again at the
head. GC skips the values without keys.
What if the GC crashes? Data is not lost because value is first inserted and flushed to disk before the key record (in LSM tree) and tail are updated.
We have two scenarios.
1. A crash may occur before values are flushed.
Modern file systems ensure that upon crash recovery only an sequential prefix
subset of bytes will be appended to the file, i.e., imagine a set of bytes
[b1, b2 ... bn] were supposed to be written, file system crash recovery
ensures that only
[b1 ... bx] will be appended where
x < n. Thus, if a value
x+1 is lost, then all subsequent bytes are lost too.
With every key lookup, Wisckey verifies if the value is within the valid head-tail range. If not, it assumes the value is lost and deletes the key.
Writes can be synchronous. In this case, Values are flushed to disk and then keys are inserted to LSM tree.
2. A crash may occur before keys are inserted to LSM tree.
Keys can be recovered by scanning the Values log and inserting them into LSM
tree in order. However, the amount of scan must be limited. Wisckey periodically
stores the most recent known Value log
head in the LSM tree. Thus, upon
database open, the Value log is scanned from the last known
head to the
Writes to the value log use a buffer. This ensures that the overhead of
write() syscalls are amortized by batching writes. Paper mentions
that for writes above 4KB, the overhead is minimal and device throughput is
LSM trees use a log for crash recovery. However, given Wisckey uses the mechanism in (2) above for crash consistency, the tree log is optional.
Wisckey uses a 32 threads pool for range queries. Keys are inserted to a queue and the threads are woken up to read values in parallel.
fallocate hole punching is used for allocating disk space for the Value log
file. Hole punching guarantees that subsequent writes into allocated range does
A summary of scenarios and metrics used for evaluation are outlined below.
- Load performance
- Order of insertion: a) sequential, b) random
- Size of key: 16B, size of value: [64B, 256KB]
- Metric: throughput (MB/s)
- Max device throughput: 400MB/s for sequential writes
- Metric: Write amplification
- Metric: Write latency
- Query performance
- Type of query: a) point lookup, b) range query
- Size of key: 16B, size of value: [64B, 256KB]
- Metric: throughput (MB/s)
- Max device throughput: 500MB/s for sequential reads
- Metric: Read latency
- Garbage collection
- Workload: random writes in foreground, delete X% key-value pairs (free space), random writes in foreground
- Metric: throughput, while GC runs in background
- Crash consistency
- Workload: crash the system just before memtable is flushed to disk
- Metric: worst case recovery time
- Metric: consistency vulnerability count
- Tool: ALICE to simulate crashes
- Space amplification
- Workload: a) sequential writes, b) random writes/overwrites
- Metric: ratio of actual size vs logical db size
- E.g. record size 1KB takes 4KB on disk implies a space amplification of 4
- CPU usage
- Workload: various mentioned above
Few observations from results are below. Please refer evaluation section in paper for additional details.
For both sequential and random load performance, LevelDB never reaches the max device throughput. Wisckey reaches the limits at 4KB and 16KB value sizes respectively.
In LevelDB load performance, for smaller key-value pairs most of the time is spent writing to the log file. Large key-value pairs spent time on waiting for the memtable flushes.
Query performance: for point lookups, higher throughput in Wisckey is more pronounced after the value size crosses 1KB. Better performance is attributed to low read amplification in Wisckey due to smaller size of LSM tree.
Similarly, for random range queries, Wisckey trumps over LevelDB post the 4KB value size. Low throughput of LevelDB is attributed to opening too many SSTables, reading through the indexes, bloom filters etc.
A note on the trade-offs of the LSM storage systems.
No key-value store can minimize read amplification, write amplification, and space amplification at the same time. Tradeoffs among these three factors are balanced differently in various systems. In LevelDB, the sorting and garbage collection are coupled together. LevelDB trades higher write amplification for lower space amplification; however, the workload performance can be significantly affected. WiscKey consumes more space to minimize I/O amplification when the workload is running; because sorting and garbage collection are decoupled in WiscKey, garbage collection can be performed later, thus minimizing its impact on foreground performance.
I thoroughly enjoyed studying this paper. It served as a great introduction to LSM trees, LevelDB, and the various trade-offs in designing a storage data structure based off LSM trees. RocksDB has already integrated the ideas in this paper into BlobDB. See https://rocksdb.org/blog/2021/05/26/integrated-blob-db.html. We’ll take up studying RocksDB and its improvements in near future.