Inside Out

Notes on seeking wisdom and crafting software

Paper - GFS: Evolution on Fast-Forward

The initial design goals of the Google File System (GFS) covered handling partial failures, monitoring, error detection, fault tolerance, and automated recovery. It was designed for batch-oriented workloads and broke traditional assumptions about block sizes, given its massive files and immense scale. A single-master design was chosen for simplicity. We’ll cover a paper published several years after the original GFS paper to learn more about the design and how it changed over time.

McKusick, Marshall Kirk, and Sean Quinlan. “GFS: Evolution on Fast-Forward: A Discussion between Kirk McKusick and Sean Quinlan about the Origin and Evolution of the Google File System.” Queue 7, no. 7 (2009): 10–20.

Single master

  • Why? Simplicity in overall design. Central place to control replication and garbage collection. Faster time to market.
  • Challenge: size of metadata as the system scaled from hundreds of TB -> PBs -> tens of PBs
    • Bottleneck for computation at the master, e.g. scanning metadata for recoveries
    • Bottleneck for clients since they talk to the master on every file open. E.g. a MapReduce job could spike to thousands of tasks opening files at once
  • Solution: multiple cells per datacenter, multiple masters on a set of chunkservers
    • Masters in a cell sit on top of a shared pool of underlying chunkserver storage
    • Applications can use one or more masters
    • Use namespaces to statically partition data across cells (see the sketch after this list)
  • Usually you tune a binary only to a certain extent and then focus on scalability, gaining performance from scale. In this case, however, Google ended up tuning the master binary heavily because it was a single bottleneck
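
What static partitioning by namespace could look like from a client’s perspective, as a minimal sketch: the cell names, paths, and lookup table below are hypothetical, not GFS’s actual configuration or API.

```go
package main

import (
	"fmt"
	"strings"
)

// namespaceToCell maps a top-level namespace to a GFS cell. The table is
// hypothetical; real deployments configured this per datacenter, with each
// cell backed by its own master over a shared pool of chunkservers.
var namespaceToCell = map[string]string{
	"/gmail":     "cell-a",
	"/crawl":     "cell-b",
	"/mapreduce": "cell-c",
}

// cellForPath picks the cell (and hence the master) a client should talk to
// for a given file path.
func cellForPath(path string) (string, error) {
	for prefix, cell := range namespaceToCell {
		if strings.HasPrefix(path, prefix+"/") {
			return cell, nil
		}
	}
	return "", fmt.Errorf("no cell configured for %q", path)
}

func main() {
	for _, p := range []string{"/crawl/2009/part-00001", "/gmail/user42/mailbox"} {
		cell, err := cellForPath(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %s\n", p, cell)
	}
}
```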

Owning both the file system and the applications was a huge plus. GFS was improved over time, and the apps were designed with GFS’s strengths and weaknesses in mind, so problems could be solved on whichever side made more sense.

The nature of applications changed over time: they started storing much smaller files, e.g. GMail. A large number of small files places a significant demand on the master, since every file requires its own metadata there (a file identifier plus its chunk map). The original design assumed files made of 64MB chunks, but average file sizes shrank well below that, multiplying the number of objects the master has to track.
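
Some back-of-the-envelope arithmetic makes the pressure concrete. The ~64-byte per-chunk and per-file metadata figures come from the original GFS paper; the file sizes and totals below are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// Rough master memory needed for the metadata of 1 PB of data, comparing a
// batch-style large-file workload with a small-file workload.
func main() {
	const (
		totalData    = int64(1) << 50 // 1 PB
		metaPerChunk = int64(64)      // approx. chunk metadata at the master
		metaPerFile  = int64(64)      // approx. namespace metadata per file
		chunkSize    = int64(64) << 20
	)

	estimate := func(label string, fileSize int64) {
		files := totalData / fileSize
		chunksPerFile := (fileSize + chunkSize - 1) / chunkSize
		chunks := files * chunksPerFile
		metaBytes := files*metaPerFile + chunks*metaPerChunk
		fmt.Printf("%-22s %12d files %12d chunks  ~%d MB metadata\n",
			label, files, chunks, metaBytes>>20)
	}

	estimate("1 GB files (batch)", 1<<30)
	estimate("1 MB files (GMail-ish)", 1<<20)
}
```

For the same petabyte, the small-file workload needs roughly two orders of magnitude more master memory.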

How do you combine a large number of small objects into larger files? The real solution for small (~1MB) files is a whole new system, not GFS. Note: we learn in the Spanner paper that the successor to GFS is called Colossus; it is not clear whether the author points to that or to Spanner itself.
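
One common stopgap, sketched here with my own names and layout (not a Google API), is to pack many small objects into one large file with an in-file index, so the master tracks a single file instead of millions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// entry records where one packed object lives inside the large file.
type entry struct{ off, len uint32 }

// pack concatenates small blobs into one buffer, length-prefixing each blob
// and recording its offset so a reader can fetch one object without scanning
// the rest. A real packer would append to a GFS file and persist the index;
// this is an in-memory sketch.
func pack(objects map[string][]byte) ([]byte, map[string]entry) {
	var buf bytes.Buffer
	index := make(map[string]entry)
	for name, data := range objects {
		index[name] = entry{off: uint32(buf.Len()), len: uint32(len(data))}
		binary.Write(&buf, binary.LittleEndian, uint32(len(data))) // length prefix
		buf.Write(data)
	}
	return buf.Bytes(), index
}

func main() {
	packed, index := pack(map[string][]byte{
		"msg-001": []byte("hello"),
		"msg-002": []byte("world!"),
	})
	e := index["msg-002"]
	// Skip the 4-byte length prefix to read the payload back.
	fmt.Printf("%s\n", packed[e.off+4:e.off+4+e.len])
}
```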

Bigtable was built on GFS and adheres to most of its principles. Structured storage system with lots of key-value pairs and a schema. Any app with tons of small data items uses it.

Bigtable wasn’t created to deal with the file-count problem. It uses GFS to store two key data structures: logs and SSTables. It applies compaction at multiple levels, similar to a log-structured file system. Logs hold the mutable state and SSTables are sorted and immutable. Data in both structures is encoded using protocol buffers.
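
A toy version of that split, assuming nothing about Bigtable’s real types: writes go to an append-only log and an in-memory memtable, a flush freezes the memtable into a sorted immutable table, and reads check the memtable and then the tables newest-first (the compaction step that merges tables is omitted).

```go
package main

import (
	"fmt"
	"sort"
)

type kv struct{ key, val string }

type store struct {
	log      []kv              // stand-in for the GFS commit log; replay on recovery omitted
	memtable map[string]string // mutable buffer of recent writes
	sstables [][]kv            // each slice is sorted by key and never modified
}

func newStore() *store { return &store{memtable: map[string]string{}} }

func (s *store) put(k, v string) {
	s.log = append(s.log, kv{k, v}) // log first, then apply
	s.memtable[k] = v
}

// flush freezes the memtable into a sorted immutable table.
func (s *store) flush() {
	if len(s.memtable) == 0 {
		return
	}
	var t []kv
	for k, v := range s.memtable {
		t = append(t, kv{k, v})
	}
	sort.Slice(t, func(i, j int) bool { return t[i].key < t[j].key })
	s.sstables = append(s.sstables, t)
	s.memtable = map[string]string{}
}

// get checks the memtable first, then SSTables from newest to oldest.
func (s *store) get(k string) (string, bool) {
	if v, ok := s.memtable[k]; ok {
		return v, true
	}
	for i := len(s.sstables) - 1; i >= 0; i-- {
		t := s.sstables[i]
		j := sort.Search(len(t), func(n int) bool { return t[n].key >= k })
		if j < len(t) && t[j].key == k {
			return t[j].val, true
		}
	}
	return "", false
}

func main() {
	db := newStore()
	db.put("row1", "a")
	db.flush()
	db.put("row1", "b") // newer value shadows the flushed one
	fmt.Println(db.get("row1"))
}
```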

Latency sensitive applications

The single master was unacceptable for latency-sensitive applications, though it works well for batch apps optimized for throughput. Second, synchronous replication/recovery during writes was also a blocker for low latency (clients keep retrying until the write succeeds). Third, automated master failover used to take minutes and was later optimized down to seconds, which is still not acceptable for latency-sensitive scenarios. It is a pain to build an interactive DB on a file system designed for batch operations.

How to manage latency-sensitive apps on GFS? The application worries about it: use redundancy, e.g. write to multiple files (Bigtable) or fall back to another region (GMail), and reconcile later.
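
A sketch of that redundancy idea under assumed names (appendTo and hedgedAppend are mine, not Bigtable or GMail code): write the record to a fallback file when the first append is slow or fails, and let readers reconcile duplicates later.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// appendTo stands in for a GFS record append to one log file; it fails
// randomly to simulate a slow or unavailable replica chain.
func appendTo(file, record string) error {
	if rand.Intn(2) == 0 {
		return errors.New("append to " + file + " timed out")
	}
	fmt.Printf("appended %q to %s\n", record, file)
	return nil
}

// hedgedAppend tries the files in order and returns as soon as one append
// succeeds; readers later merge the files and drop duplicates by record ID.
func hedgedAppend(files []string, record string) error {
	for _, f := range files {
		if err := appendTo(f, record); err == nil {
			return nil
		}
	}
	return errors.New("all appends failed")
}

func main() {
	logs := []string{"/cell-a/log", "/cell-b/log"}
	fmt.Println(hedgedAppend(logs, "id=42 payload=..."))
}
```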

Handling these latency issues is a design goal for the new distributed-master system.

GFS accepted that there will be cases where clients read stale data, e.g. a client that already has a file open may miss data appended after it opened the file. Another case of complexity being pushed out to the app.

Consistency

Clients keep retrying until the data is written. What if the client crashes in between? The data is left in an undefined state. Mitigation: detect the inconsistent state and bound the window for which it remains inconsistent.
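
One way an application can detect that undefined region, sketched here as a generic framing technique rather than a documented GFS API: frame each record with a length and checksum so a reader can stop at (or skip) a torn record left by a crashed writer.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"io"
)

// writeRecord frames payload as [length][crc32][payload] so a reader can
// validate it independently of anything else in the file.
func writeRecord(w io.Writer, payload []byte) {
	binary.Write(w, binary.LittleEndian, uint32(len(payload)))
	binary.Write(w, binary.LittleEndian, crc32.ChecksumIEEE(payload))
	w.Write(payload)
}

// readRecords returns the payloads whose checksums verify, stopping at the
// first torn or partial record (e.g. from a writer that crashed mid-append).
func readRecords(r *bytes.Reader) [][]byte {
	var out [][]byte
	for {
		var length, sum uint32
		if binary.Read(r, binary.LittleEndian, &length) != nil ||
			binary.Read(r, binary.LittleEndian, &sum) != nil {
			return out
		}
		payload := make([]byte, length)
		if _, err := io.ReadFull(r, payload); err != nil || crc32.ChecksumIEEE(payload) != sum {
			return out // undefined region: stop (or skip and resync) here
		}
		out = append(out, payload)
	}
}

func main() {
	var buf bytes.Buffer
	writeRecord(&buf, []byte("record-1"))
	writeRecord(&buf, []byte("record-2"))
	buf.Write([]byte{9, 0, 0, 0}) // simulate a crash after a partial header

	for _, p := range readRecords(bytes.NewReader(buf.Bytes())) {
		fmt.Println(string(p))
	}
}
```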

The RecordAppend API allows multiple writers to append data to a file concurrently. The primary chooses the offset prior to the writes, but one of the chunkservers may fail to write, or the primary could change and a new offset could get chosen. Effect: the same data can end up multiple times in a file, or records may appear in a different order across replicas. This breaks the basic expectations of a client. In hindsight the authors suggest having a single writer that serializes appends to keep replicas consistent.
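
The original GFS paper describes applications filtering out duplicated appends using unique identifiers embedded in each record; the code below is my own minimal version of that reader-side cleanup, with hypothetical record IDs.

```go
package main

import "fmt"

// record is a record-append payload carrying a writer-assigned unique ID.
type record struct {
	id   string
	body string
}

// dedupe keeps the first occurrence of every ID and drops later duplicates,
// so retried or replicated appends do not change what the reader sees.
func dedupe(records []record) []record {
	seen := make(map[string]bool)
	var out []record
	for _, r := range records {
		if seen[r.id] {
			continue
		}
		seen[r.id] = true
		out = append(out, r)
	}
	return out
}

func main() {
	appended := []record{
		{"w1-0001", "alpha"},
		{"w1-0001", "alpha"}, // same record appended twice after a retry
		{"w2-0001", "beta"},
	}
	for _, r := range dedupe(appended) {
		fmt.Println(r.id, r.body)
	}
}
```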

The paper likely points to Spanner as a next step beyond GFS or Bigtable.

Read next: Bigtable builds on GFS to provide an internet scale structured storage system.