Inside Out

Notes on seeking wisdom and crafting software.

Paper - Bigtable: A Distributed Storage System for Structured Data

We talked about Google File System in the last paper. GFS provides internet scale distributed file system APIs for batch oriented workloads. Bigtable builds on top of GFS to provide structured storage. What is the data model for a structured storage? What are the key design aspects? And how do they scale for Google use cases? This paper will help us answer some of these questions.

Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. “Bigtable: A Distributed Storage System for Structured Data.” ACM Transactions on Computer Systems 26, no. 2 (June 1, 2008): 1–26.


Raw data files are coarse by design. Storing smaller data items e.g. a key-value pair is a pain since the client must fetch the coarse file, organize it such that random point and range access are effective etc.. Inevitably a higher level abstraction is necessary to support structured access patterns.

Structure of the paper makes reading a pleasure: data model, API and then implementation design. Finally tuning and performance evaluation.

Data model



(Note how the paper first covers building blocks and then moves on to implementation details)

3 components: a) client sdk, b) 1 master, c) N tablet server.

Bigtable is yet another single master distributed storage system.



What are the key elements to measure?

How to measure?

Random reads are slow throughput because 64KB block is transferred from GFS but only 1000 byte is read. Leads to bunch of waste and network link saturation. Recommended clients with such access pattern to use 8KB block size per SSTable.



This paper demonstrates how multiple large scale internet systems come together to solve structured storage. E.g. GFS for storing raw data, Cluster management (Borg?) for machine operations and Chubby for distributed locks. There are multiple interesting patterns to learn from this paper. A hierarchical storage model makes addressing simple (note the use of age old B-tree like structure). Separation of concerns between Master and Tablet Servers is a pattern we already saw in GFS. Key difference here is the use of Chubby for tablet life cycle management.

Tablet servers solve interesting challenges around automated splits when size of a tablet grows. Additionally we learn about a log structured data structure for low latency and high throughput by effective concurrency control (restricted only to memtables). I was surprised at the 10:1 compression ratios supported by Bentley-Mcllroy scheme in the example Webtable because of duplicate/shared content between versions of a webpage. An interesting use of Bloom filters help reduce the disk seeks in GFS. Author stress the importance of code and design clarity for maintenance and debugging of large scale systems.

Next, we will take up a multi master distributed store.