Notes on seeking wisdom and crafting software

Paper - Bigtable: A Distributed Storage System for Structured Data

We talked about Google File System in the last paper. GFS provides internet-scale distributed file system APIs for batch-oriented workloads. Bigtable builds on top of GFS to provide structured storage. What is the data model for structured storage? What are the key design aspects? And how do they scale for Google's use cases? This paper will help us answer some of these questions.

Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. “Bigtable: A Distributed Storage System for Structured Data.” ACM Transactions on Computer Systems 26, no. 2 (June 1, 2008): 1–26.

Motivation

Raw data files are coarse by design. Storing smaller data items, e.g. a key-value pair, is a pain since the client must fetch the coarse file and organize it so that random point and range accesses are efficient. Inevitably a higher level abstraction is necessary to support structured access patterns.

  • Structured storage at massive scale
  • Simple data model where clients can dynamically control data layout and format. Non-relational
  • Support both low latency online applications and batch oriented workloads

The structure of the paper makes it a pleasure to read: data model, API, and then the implementation design, followed by tuning and performance evaluation.

Data model

  • Motivating example: Webtable
    • Use the URL as row key, various metadata in separate columns
    • contents is a column-family storing the page contents along with timestamps
    • anchor column-family contains referring URLs along with timestamps
  • Schema
    • String row and column names. Data is a schemaless array of bytes; clients choose the format
    • Client can control data locality, and storage media (memory or disk)
    • Index = row key, column key and timestamp
  • Storage
    • Sparse, distributed, persistent, multidimensional, sorted map
    • Rows
      • Atomic read/write for any columns under a row key. Why? Easy for clients to reason about behavior (compare: GFS client behavior)
      • Data sorted lexicographically by row key
      • Dynamic partition of rows into tablets. Client can choose locality by choosing appropriate row key (compare: GFS static partitions). E.g. Webtable uses reverse domain as key (in.codito.www/index.html) for effective analysis of a domain
    • Column families
      • Collection of column keys; a table has few column families (at most hundreds) but an unbounded number of column keys
      • Are part of schema definition, rarely change, unit of access control: read, write, create derived column families
      • Compressed as a whole
      • Column keys are named like family:qualifier. E.g. anchor is a column-family with one column key per referrer: anchor:com.google, anchor:com.cnn, etc.
      • Client settings for data cleanup: a) keep last n versions, b) keep data for last n days
    • Timestamps
      • 64-bit integers with microsecond granularity
      • Data is stored with the most recent timestamp first
    • A cell is represented by <row, column-family:qualifier, timestamp>
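To make the data model concrete, here is a minimal sketch in Python (illustrative names only; the paper shows its client API in C++) of the sparse, sorted, multidimensional map described above: (row, column, timestamp) maps to an uninterpreted byte string, and versions come back newest first.

```python
# Toy model of Bigtable's data model: a sparse, sorted, multidimensional map
#   (row: str, column: str, timestamp: int) -> value: bytes
# Illustrative only; class and method names are not the real Bigtable API.

class ToyTable:
    def __init__(self):
        self.cells = {}  # {(row, column): {timestamp: value}}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        """Return all versions of a cell, most recent timestamp first."""
        versions = self.cells.get((row, column), {})
        return sorted(versions.items(), reverse=True)

    def scan(self, start_row, end_row):
        """Rows are sorted lexicographically, so a range scan is a contiguous slice."""
        for (row, column) in sorted(self.cells):
            if start_row <= row < end_row:
                yield row, column, self.get(row, column)

# Webtable-style usage (example values from Figure 1 of the paper):
# the reversed domain as row key keeps all pages of a site adjacent.
t = ToyTable()
t.put("com.cnn.www", "contents:", 6, b"<html>...")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
t.put("com.cnn.www", "anchor:my.look.ca", 8, b"CNN.com")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))
```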

API

  • Management
    • Create and delete tables and column-families
    • Update metadata like access control lists, cluster information, etc.
  • Write
    • Represented as mutations. A series of mutations on a single row can be applied atomically, e.g. adding or deleting values for column keys
    • Batch write across row keys
    • Transactions on a single row key, i.e. atomic read-modify-write. No multi-row transaction support
    • Cells can be used as integer counters
  • Read
    • Iterate over multiple column-families in a query
    • Filters: regex match on column name, on timestamp
  • Compute
    • Capability to run user scripts in server address space
    • Scripts are written in the Sawzall language and allow data transformation, filtering and summarization (no writes back into Bigtable)
    • Bigtable can be used as an input source and output target for MapReduce jobs
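A hedged sketch of what client code could look like, loosely mirroring the paper's C++ RowMutation example but written here as Python pseudocode; none of these class or method names are the real Bigtable API:

```python
# Hypothetical client-side sketch: mutations are collected per row and the
# server applies them atomically. FakeTable stands in for an open table
# handle; in the real system apply() would become an RPC to a tablet server.

class FakeTable:
    def apply_atomically(self, row_key, ops):
        print(f"atomic commit on row {row_key!r}: {ops}")

class RowMutation:
    """Collects mutations for one row; the server applies them atomically."""
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

    def apply(self):
        self.table.apply_atomically(self.row_key, self.ops)

webtable = FakeTable()
m = RowMutation(webtable, "com.cnn.www")
m.set("anchor:www.c-span.org", b"CNN")   # add a referring anchor
m.delete("anchor:www.abc.com")           # remove a stale one
m.apply()                                # both mutations take effect atomically
```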

Design

(Note how the paper first covers building blocks and then moves on to implementation details)

  • Dependencies
    • GFS for storing logs and data
    • Cluster management system for job scheduling, resource management, failure detection and monitoring
    • Shared machine pool for running jobs
  • SSTable file format is used to store data
    • Persistent, ordered, immutable map from keys to values (both arbitrary byte strings)
    • Two structures: a) a sequence of blocks (64KB by default), b) a block index stored at the end of the file
    • The block index is loaded into memory when the file is opened
    • A lookup needs a single disk seek: a) binary search in the in-memory index, b) read the matching block (see the sketch after this list)
    • Optionally an entire SSTable can be mapped into memory so lookups avoid disk altogether
  • Chubby is a distributed persistent lock service
    • 5 active replicas with one acting as leader
    • Based on the Paxos protocol to keep replicas consistent
    • Provides namespaces with directories and files
    • Directories and files can be used as locks; reads and writes to a file are atomic
    • Chubby client maintains session with service
      • Consistent caching of files
      • Lease based acquisition, renewal of locks
      • Register for notification on files/directories for events
  • How does Bigtable use Chubby?
    • Ensure there is a single master at any time
    • Bootstrap location of table data
    • Discover tablet servers and finalize their deaths
    • Store table schema information and access control lists
    • Critical dependency for availability (on average 0.0326% of Bigtable server hours had data unavailable due to Chubby unavailability)
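A minimal sketch of the SSTable lookup path described in the list above (illustrative Python; the real SSTable is an internal Google library with its own on-disk format): binary search the in-memory block index to pick a block, then read and search just that block.

```python
import bisect

# Toy SSTable reader: an immutable, sorted key->value file made of fixed-size
# blocks plus an index of the first key of each block. Illustration only.

class ToySSTable:
    def __init__(self, blocks):
        # blocks: list of sorted lists of (key, value) pairs, globally sorted
        self.blocks = blocks
        self.index = [block[0][0] for block in blocks]  # kept in memory on open

    def lookup(self, key):
        # Step a: binary search the in-memory index for the candidate block.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        # Step b: one "disk read" for that block, then search inside it.
        block = self.blocks[i]
        keys = [k for k, _ in block]
        j = bisect.bisect_left(keys, key)
        if j < len(block) and keys[j] == key:
            return block[j][1]
        return None

sst = ToySSTable([[("a", b"1"), ("c", b"2")], [("f", b"3"), ("k", b"4")]])
assert sst.lookup("k") == b"4"
assert sst.lookup("d") is None   # falls in the first block but is absent
```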

Three components: a) client library (SDK), b) one master, c) N tablet servers.

Bigtable is yet another single master distributed storage system.

  • Master
    • Tablet management: add, expire, balance load
    • Garbage collect files in GFS
    • Schema changes for table, column-family
    • Lightly loaded as clients do not communicate with master in most cases (compare: GFS)
  • Tablet server
    • Can be dynamically added/removed from a cluster
    • Manage set of tablets (10-1000 tablets/server)
    • Read/write to tablets, split large tablets
    • Clients directly connect for data transfer
    • Each table starts with one tablet and is split as it grows; each tablet holds a contiguous range of rows (see the sketch below)
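A small sketch of the tablet abstraction (hypothetical names, illustrative only): a tablet owns a half-open, lexicographically ordered row range, and a split simply picks a row key and produces two adjacent ranges.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative only: a tablet modeled as a half-open row range of one table.
@dataclass
class Tablet:
    table: str
    start_row: Optional[str]   # None means unbounded at the start
    end_row: Optional[str]     # None means unbounded at the end

    def contains(self, row_key: str) -> bool:
        after_start = self.start_row is None or row_key >= self.start_row
        before_end = self.end_row is None or row_key < self.end_row
        return after_start and before_end

def split(tablet: Tablet, split_row: str) -> Tuple[Tablet, Tablet]:
    """Split one tablet into two adjacent tablets at split_row."""
    assert tablet.contains(split_row)
    left = Tablet(tablet.table, tablet.start_row, split_row)
    right = Tablet(tablet.table, split_row, tablet.end_row)
    return left, right

t = Tablet("webtable", None, None)      # a new table starts as a single tablet
left, right = split(t, "com.cnn.www")   # the tablet server splits it as it grows
assert left.contains("com.bbc.www") and right.contains("com.cnn.www")
```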

Implementation

  • Storage (metadata)
    • Hierarchical, similar to a B+ tree, at most 3 levels deep. The root and metadata tablets hold the METADATA table
    • Chubby file points to root tablet
    • Root tablet points to all metadata tablets; it is never split
    • Each metadata tablet points to M user tablets
    • METADATA table
      • Row key = encoding of the tablet's table identifier and its end row. Why the end row? A table is split across many tablets, so the end row identifies the row range each tablet covers
      • Columns store the location of that tablet, i.e. where it is being served
      • Each row is ~1KB in memory. A 128MB metadata tablet therefore holds ~2^17 rows, so the 3-level scheme can address 2^17 × 2^17 = 2^34 user tablets
      • Log of events for each tablet including operations; helps for diagnostics
  • Client SDK
    • Prefetch and cache tablet locations, resolve stale location by querying up in METADATA hierarchy
    • Cost: worst case 3 round-trips (RTs) with an empty cache, 6 RTs with a stale cache
  • Compute (Master)
    • Master and Tablet server machine life cycle is managed by the Cluster Management system
    • Tablet discovery
      • Master monitors the servers directory for new files (= new tablet servers)
      • Master asks each Tablet server for the tablets it manages
    • Tablet unassignment
      • Tablet server stops serving tablets if it loses its exclusive lock e.g. network partition
      • Tablet server tries to reacquire the lock if its Chubby file still exists; if the file is gone, it kills itself and releases its locks
      • Tablet server can be removed by Cluster Management infra for whatever reason
    • Tablet assignment
      • Each tablet is assigned to at most one tablet server at a time
      • Master keeps track of assignments, unassigned tablets, and live tablet servers. An unassigned tablet is assigned by sending a tablet load request to a tablet server with room
      • Master periodically asks each tablet server for status of locks
        • If tablet server lost its lock, or master is unable to reach tablet server: master acquires lock for the tablet server’s file and deletes it
        • Ensure that tablet server is dead
        • All tablets assigned to that tablet server are moved to unassigned pool
      • Master kills itself when its session with Chubby expires
    • Master startup
      • Take unique master lock in Chubby, prevent concurrent master initializations
      • Do tablet discovery (see above)
      • If root tablet is not assigned, add root to set of unassigned tablets, necessary for METADATA scan below
      • Scan the METADATA table to learn the full set of tablets. Any tablet not already discovered (i.e. not assigned to a tablet server) is kept in the unassigned tablets set
    • Table creation, deletion and tablet merge all happen via master. Tablet splits are done via tablet servers (handled separately, see below)
    • Garbage collection uses mark-and-sweep algorithm to delete obsolete SSTables by scanning METADATA table
  • Compute (Tablet Server)
    • Tablet servers upon boot take an exclusive lock on a unique file in Chubby servers directory
    • Tablet split
      • Tablet server commits the split to METADATA table
      • It notifies the master of the split. If the notification is lost (e.g. master or tablet server died), the master learns of it lazily: when it later asks a tablet server to load the now-split tablet, the server notices that the load request covers only part of the tablet and notifies the master
      • Split tablets can share the same SSTable to exploit immutability
    • Tablet storage
      • Memtable: sorted in-memory buffer to store recently committed updates
      • Tablet log or commit log: log file that contains all update operations. Stored in GFS
      • SSTables store older updates in a sorted append only structure. Stored in GFS
      • Locality groups
        • Client-defined group of multiple column families
        • Stored together in a separate SSTable, or in-memory (used for location column in METADATA table)
        • Optimized for efficient reads
      • Compression
        • Client-specified per locality group
        • A two-pass scheme is applied per SSTable block. Why per block? Portions of an SSTable can be read without decompressing the entire file
        • 10:1 size reduction with the Bentley-McIlroy scheme because pages from a single site share large amounts of boilerplate content
    • Tablet recovery
      • Tablet server reads the list of SSTables and redo points from METADATA table
      • Redo points are pointers into the commit logs that may contain data for the tablet
      • The server reads the SSTable indices into memory and reconstructs the memtable by applying all updates committed since the redo points
    • Write operation
      • Server validates the request and checks authorization: the list of permitted writers is read from a Chubby file (cached)
      • A valid mutation is recorded in the commit log; group commit is used for better throughput (micro batching?)
      • After the write is committed, its contents are inserted into the memtable
    • Read operation
      • Server validates request, checks for authorization
      • Execute the read on a merged view of the SSTables and the memtable. This is efficient since both are lexicographically sorted data structures (see the sketch after this list)
    • Concurrency control is cheap since only the memtable needs coordination between reads and writes; SSTables are immutable. Each memtable row is copy-on-write to reduce contention
    • Read/Write can continue when tablets are merged or split
    • Tablet compaction
      • Triggered when size of memtable reaches a threshold
      • Minor compaction
        • The old memtable is frozen and written out as an SSTable, and a new memtable is created
        • Reduce memory load of tablet server
        • Reduce amount of data to be read/reconstructed upon recovery
      • Merging compaction
        • Runs in the background to keep the number of SSTables bounded; merges a few SSTables and the memtable into one new SSTable
        • A major compaction merges all SSTables into a single SSTable
      • Deleted data markers are kept in SSTables produced by non-major compactions. A major compaction removes them and the deleted data to reclaim resources
    • Caching
      • Two levels
      • Scan cache is higher level. Caches key-value pairs returned by SSTables. Helps with repeated reads of same data
      • Block cache is lower level. Caches SSTable blocks read from GFS. Helps with reading data around previous read
    • Bloom filters
      • Client can opt for creating bloom filters per locality group
      • Helps answer whether an SSTable contains any data for a specified row/column pair
      • Drastic reduction in disk seeks required for read operations
    • Commit log implementation
      • A separate commit log file per tablet in GFS was a no-go
        • Too many files written concurrently to GFS, causing many disk seeks on the underlying GFS servers
        • Reduces the effectiveness of the group commit optimization
      • Chosen design: a single commit log file per tablet server, with mutations for different tablets co-mingled
        • Downside: recovery is complex
        • Problem: tablet servers recovering tablets from a crashed server each need only that tablet's log entries, but the entries for all tablets are co-mingled in one log
        • Solution: the master initiating recovery has the crashed server's commit log sorted in parallel by <table, row name, log sequence number>
        • In the sorted output, entries for a single tablet are contiguous and can be read efficiently with one disk seek followed by a sequential read
      • Two log writing threads, each with its own log file, to ride over GFS latency spikes; only one is active at a time and writing switches if the current one performs poorly. Sequence numbers let recovery deduplicate entries across the two logs
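To make the read path concrete, here is a minimal sketch (illustrative Python, not the real implementation) of serving a read from the merged view of the memtable and SSTables: every source is sorted by key, so a k-way merge suffices, with newer sources shadowing older ones.

```python
import heapq

# Illustrative only. Each source is a sorted list of (key, value) pairs,
# ordered newest first: [memtable, newest SSTable, ..., oldest SSTable].
# For a duplicate key the newest source wins, which is how fresh writes
# shadow older data during reads (and during merging compactions).

def tagged(source, age):
    """Attach the source's age to each key so ties break toward the newest."""
    for key, value in source:
        yield (key, age), value

def merged_read(sources):
    iters = [tagged(source, age) for age, source in enumerate(sources)]
    last_key = object()              # sentinel that never equals a real key
    for (key, _age), value in heapq.merge(*iters):
        if key != last_key:          # first hit per key is the newest version
            yield key, value
            last_key = key

memtable    = [("com.cnn.www#anchor:cnnsi.com", b"CNN v2")]
sstable_new = [("com.cnn.www#anchor:cnnsi.com", b"CNN v1"),
               ("com.cnn.www#contents:", b"<html>...")]
sstable_old = [("com.bbc.www#contents:", b"<html>...")]

for key, value in merged_read([memtable, sstable_new, sstable_old]):
    print(key, value)                # each key appears once, newest value wins
```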

Evaluation

What are the key elements to measure?

  • Number of 1000 byte values read/written per second
  • Benchmarks
    • Random read, writes
    • Sequential read, writes
    • Scan

How to measure?

  • Set up a Bigtable cluster with N tablet servers
  • Each tablet server uses 1GB memory
  • GFS cell of 1786 machines, each with two 400GB IDE hard disks
  • N client machines
  • Machines arranged in a two-level tree-shaped switch network with 100-200 Gbps of aggregate bandwidth at the root
  • The number of row keys R is chosen so that roughly 1GB of data is written per tablet server

Random reads have the lowest throughput because a whole 64KB SSTable block is transferred from GFS while only 1000 bytes of it are used, which wastes bandwidth and saturates the network links. Clients with such access patterns are advised to use a smaller SSTable block size, e.g. 8KB.
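A quick back-of-the-envelope calculation of that read amplification (block and value sizes are from the paper; the arithmetic is mine):

```python
# Every 1000-byte random read pulls a full SSTable block across the network.
value_size = 1000                      # bytes actually needed by the client
for block_size in (64 * 1024, 8 * 1024):
    amplification = block_size / value_size
    print(f"{block_size // 1024}KB blocks -> ~{amplification:.0f}x bytes moved per value read")
# 64KB blocks -> ~66x bytes moved per value read
# 8KB blocks -> ~8x bytes moved per value read
```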

Lessons

  • Exposure to all kinds of failures incl. clock skew, network partitions, bugs in dependencies
  • Delay adding new features until it is clear how the new features will be used
  • System level monitoring is necessary to detect and fix hard problems. Do this for a sampled set of RPCs
  • Code and design clarity are of immense help for maintenance and debugging

Conclusion

This paper demonstrates how multiple large-scale internet systems come together to solve structured storage. E.g. GFS for storing raw data, cluster management (Borg?) for machine operations, and Chubby for distributed locks. There are multiple interesting patterns to learn from this paper. A hierarchical storage model makes addressing simple (note the use of the age-old B-tree-like structure). Separation of concerns between the Master and Tablet Servers is a pattern we already saw in GFS. A key difference here is the use of Chubby for tablet life cycle management.

Tablet servers solve interesting challenges around automated splits as a tablet grows. Additionally, we learn about a log-structured design that achieves low latency and high throughput with cheap concurrency control (coordination is restricted to the memtable). I was surprised by the 10:1 compression ratios achieved by the Bentley-McIlroy scheme on the example Webtable, thanks to duplicate/shared content between versions of a webpage. The use of Bloom filters to reduce disk seeks into GFS is also interesting. The authors stress the importance of code and design clarity for maintenance and debugging of large scale systems.

Next, we will take up a multi master distributed store.