Inside Out

Notes on seeking wisdom and crafting software.

Paper - The Google File System

The Google File System (GFS) paper is one of the classics of the distributed systems world. It brought forth many novel ideas, including building large-scale, reliable internet services from unreliable commodity hardware. It is also probably one of the first papers from Google on internet-scale design. What were the key design choices then? I'm curious to follow their journey of building systems, learning from each design, and evolving to the next system.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” In Proceedings of the nineteenth ACM symposium on Operating systems principles, pp. 29-43. 2003.

Motivation

Offer a fault-tolerant, scalable distributed file system that runs on inexpensive commodity hardware.

The API surface includes the usual create, delete, open, close, read, and write. The snapshot operation copies a file or directory tree at low cost. Record append allows multiple clients to append data concurrently, with atomicity guarantees for each client's append.
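
To make the surface concrete, here is a minimal sketch of what a client-facing interface with these operations might look like; the names and signatures are illustrative, not taken from the paper or any real GFS client.

```python
# Hypothetical sketch of the client-facing API surface; method names and
# signatures are illustrative only.
from abc import ABC, abstractmethod


class GFSClient(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...

    @abstractmethod
    def open(self, path: str) -> int: ...  # returns a file handle

    @abstractmethod
    def close(self, handle: int) -> None: ...

    @abstractmethod
    def read(self, handle: int, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, handle: int, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Copy a file or directory tree at low cost (copy-on-write)."""

    @abstractmethod
    def record_append(self, handle: int, data: bytes) -> int:
        """Append data atomically at least once; returns the offset GFS chose."""
```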

Architecture

Design decisions

Single master

A single master simplifies the design. It uses global knowledge of the system for chunk placement and replication.

The master owns several responsibilities.

  1. Namespace management and locking. Locking is necessary to allow concurrent operations at the master. Both read and write locks are supported on a full path like /d1/d2/leaf or any of its prefixes. A read lock prevents delete, rename, and snapshot of the path; a write lock prevents any updates to it. File creation takes read locks on the parent directories and a write lock on the new object, which allows concurrent creates in the same directory (see the sketch after this list).
    • Locks are acquired in a consistent total order - first by level in the namespace tree, then lexicographically within a level.
  2. Replica placement. How do we maximize data reliability and availability while also maximizing network bandwidth utilization? Spread replicas across machines AND racks. Reads can then utilize the aggregate bandwidth of multiple racks.
  3. Replica creation, re-replication, and rebalancing.
    • Creation optimizes for disk usage and load balance (e.g. it limits recent creations on any single chunkserver).
    • Re-replication is triggered when the replica count falls below a threshold. Priority is determined by how far the chunk is from its replication goal and whether it is blocking client progress.
    • Cloning happens with the master instructing a chunkserver to copy the chunk directly from a valid replica. Active clone operations are limited so they do not jeopardise client operations.
    • Rebalancing gradually fills up new chunkservers instead of swarming them with new data and hence new traffic.
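
A minimal sketch of the lock-ordering rule from item 1, assuming a flat table of per-path reader-writer locks; the helper names are made up for illustration.

```python
# Sketch of the master's namespace lock ordering. The lock table itself is
# omitted; we only compute which locks to take and in what order.
def ancestors(path: str) -> list[str]:
    """'/d1/d2/leaf' -> ['/d1', '/d1/d2'] (every proper prefix)."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def lock_plan(path: str, write: bool) -> list[tuple[str, str]]:
    """Read locks on all ancestors, plus a read or write lock on the full path."""
    plan = [(p, "read") for p in ancestors(path)]
    plan.append((path, "write" if write else "read"))
    # Acquire in a consistent total order: first by level (depth in the tree),
    # then lexicographically, so concurrent operations never deadlock.
    return sorted(plan, key=lambda pl: (pl[0].count("/"), pl[0]))


# Creating /home/user/foo takes read locks on /home and /home/user and a
# write lock on /home/user/foo; two creates in the same directory only share
# read locks on the parent, so they can proceed concurrently.
print(lock_plan("/home/user/foo", write=True))
```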

Data storage

Chunk sizes in GFS are 64MB, much larger than the typical file system block size.

This reduces network overhead between client and chunkserver, since a few requests fetch all the necessary data. It also keeps the metadata small enough to store in memory at the master.
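
A quick back-of-the-envelope check of the in-memory metadata claim, assuming roughly 64 bytes of metadata per chunk (roughly the figure the paper cites):

```python
# Back-of-the-envelope: why chunk metadata fits in the master's memory.
# Assumes ~64 bytes of in-memory metadata per chunk.
CHUNK_SIZE = 64 * 2**20          # 64 MB per chunk
META_PER_CHUNK = 64              # ~64 bytes of metadata per chunk (assumption)

petabyte = 2**50
chunks = petabyte // CHUNK_SIZE          # ~16.8 million chunks for 1 PiB of data
metadata = chunks * META_PER_CHUNK       # ~1 GiB of master memory

print(f"{chunks:,} chunks, ~{metadata / 2**30:.1f} GiB of metadata for 1 PiB of data")
```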

Small files are at a disadvantage: many clients reading the same small file can create a hotspot at the chunkservers holding it. Solution: maintain more replicas of such files and stagger the concurrent requests.

Garbage collection reclaims deleted files and chunks. It is triggered by the master through regular scans of a) the file namespace, and b) the chunk namespace. Upon delete, a file is renamed to a hidden name carrying the deletion timestamp; it is garbage collected later. Similarly, the master and chunkservers exchange chunk metadata in heartbeats, and any chunk not known to the master is deleted by the chunkserver. Call out: distributed garbage collection is a deep problem in the PL community, yet the scheme presented here is simple.

Reclamation/replication policies are user configurable.

Stale chunk replicas are also garbage collected. Detection uses the chunk version number exchanged during the heartbeat between master and chunkserver. The master maintains the version number and increments it whenever it grants a new lease on the chunk; a replica that misses mutations will lag behind. The master ignores stale replicas in its responses to clients, and clients and chunkservers verify the version number to ensure they access up-to-date data.
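
A sketch of how the staleness check might look, assuming the master keeps a latest-version map and compares it against what each chunkserver reports in its heartbeat; the structures are illustrative only.

```python
# Sketch of stale-replica detection via chunk version numbers. The data
# structures are illustrative; the real master state is more involved.
from dataclasses import dataclass, field


@dataclass
class MasterChunkState:
    latest_version: dict[str, int] = field(default_factory=dict)  # chunk -> version

    def grant_lease(self, chunk: str) -> int:
        """Bump the version when a new lease (a new round of mutations) is granted."""
        self.latest_version[chunk] = self.latest_version.get(chunk, 0) + 1
        return self.latest_version[chunk]

    def check_heartbeat(self, server: str, reported: dict[str, int]) -> list[str]:
        """Return chunks whose replica on `server` lags the latest version."""
        stale = [c for c, v in reported.items() if v < self.latest_version.get(c, 0)]
        # Stale replicas are excluded from client replies and garbage collected later.
        return stale


master = MasterChunkState()
master.grant_lease("chunk-42")            # version 1
master.grant_lease("chunk-42")            # version 2; a down replica misses this
print(master.check_heartbeat("cs-7", {"chunk-42": 1}))  # ['chunk-42'] is stale
```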

Advantages of GC over eager deletion:
  • Simple and reliable in the face of component failures: chunk creation may leave orphaned replicas, and deletion messages may be lost; GC cleans all of these up uniformly.
  • Reclamation is merged into the master's regular background scans, so the cost is batched, amortized, and paid when the master is relatively free.
  • The delay acts as a safety net against accidental, irreversible deletion.

Disadvantage: storage is not immediately available under pressure.

Metadata storage

The operation log is maintained at the master and provides a consistent view of operations on a logical timeline, i.e. a global total order. It records files, chunks, and version info. The log is replicated to remote machines to ensure durability, and client operations are acknowledged only after the corresponding record is written to the operation log.

The master recovers its state by replaying the operation log. The replay is kept short with periodic checkpoints. Checkpoints use a compact B-tree-like form that can be mapped directly into memory and is therefore fast to load. Building a checkpoint happens in a separate thread, so it doesn't block incoming mutations. Older checkpoints and log files can be deleted.
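
A toy sketch of recovery from a checkpoint plus log replay; the record format and the in-memory state here are assumptions for illustration, not the paper's actual formats.

```python
# Toy sketch of master recovery: load the latest checkpoint, then replay only
# the operation-log records written after it. Record formats are made up.
import json


def recover(checkpoint: dict, log_records: list[str]) -> dict:
    state = dict(checkpoint["namespace"])           # start from checkpointed state
    for line in log_records:
        op = json.loads(line)
        if op["seq"] <= checkpoint["seq"]:
            continue                                # already captured by the checkpoint
        if op["type"] == "create":
            state[op["path"]] = []
        elif op["type"] == "add_chunk":
            state[op["path"]].append(op["chunk"])
    return state


checkpoint = {"seq": 2, "namespace": {"/a": ["c1"]}}
log = [
    '{"seq": 2, "type": "add_chunk", "path": "/a", "chunk": "c1"}',
    '{"seq": 3, "type": "create", "path": "/b"}',
    '{"seq": 4, "type": "add_chunk", "path": "/b", "chunk": "c2"}',
]
print(recover(checkpoint, log))   # {'/a': ['c1'], '/b': ['c2']}
```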

The master's state is replicated to shadow masters. They apply the operation log exactly as the master does.

Consistency model

Communications

Decouple the flow of data from the flow of control. The goal is to fully utilize each machine's network bandwidth and to avoid network bottlenecks and high-latency links.

Tricks to maximize data flow:
  • Push data linearly along a chain of chunkservers rather than in a tree, so each machine's full outbound bandwidth is used.
  • Each machine forwards to the "closest" machine (by network topology) that has not yet received the data.
  • Pipeline the transfer over TCP: a chunkserver starts forwarding as soon as it starts receiving (sketched below).
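
A toy simulation of the pipelined chain push, assuming the replicas are already ordered by network distance; real GFS streams over TCP with overlapping receive and forward, which this generator-based version only mimics.

```python
# Toy simulation of pipelined data push along a chain of chunkservers.
# Each piece is handed down the chain as soon as it arrives instead of
# waiting for the full payload. Purely illustrative, not real networking.
from typing import Iterator


def push_chain(pieces: Iterator[bytes], chain: list[str],
               received: dict[str, bytearray]) -> None:
    for piece in pieces:
        # In the real system every hop overlaps receive and forward over TCP;
        # here we simply propagate each piece along the distance-ordered chain.
        for server in chain:
            received.setdefault(server, bytearray()).extend(piece)


def split(data: bytes, size: int = 4) -> Iterator[bytes]:
    for i in range(0, len(data), size):
        yield data[i:i + size]


buffers: dict[str, bytearray] = {}
push_chain(split(b"hello gfs world"), ["cs-nearest", "cs-next", "cs-far"], buffers)
print({s: bytes(b) for s, b in buffers.items()})
```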

Atomic record appends are supported: the client provides only the data, and GFS appends it at least once at an offset of GFS's choosing, returning that offset to the client. The primary replica chooses the offset and asks the secondaries to write at the same offset.

Invariant: after a successful append, all replicas are at least as long as the end of the record, so the next record append will land at a higher offset.
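
A sketch of the primary choosing the append offset and fanning it out; padding across chunk boundaries, retries, and failure handling are omitted, and all names are illustrative.

```python
# Sketch of record append: the primary picks the offset, and every replica
# writes the record at that same offset.
class Replica:
    def __init__(self) -> None:
        self.data = bytearray()

    def write_at(self, offset: int, record: bytes) -> None:
        # Extend the replica if needed, then place the record at `offset`.
        if len(self.data) < offset + len(record):
            self.data.extend(b"\0" * (offset + len(record) - len(self.data)))
        self.data[offset:offset + len(record)] = record


def record_append(primary: Replica, secondaries: list[Replica], record: bytes) -> int:
    offset = len(primary.data)            # primary chooses the offset
    for replica in [primary] + secondaries:
        replica.write_at(offset, record)  # all replicas write at the same offset
    return offset                         # returned to the client


p, s1, s2 = Replica(), Replica(), Replica()
print(record_append(p, [s1, s2], b"event-1"))   # 0
print(record_append(p, [s1, s2], b"event-2"))   # 7, a strictly higher offset
```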

The snapshot operation creates a copy almost instantaneously using a copy-on-write primitive. The master first revokes any outstanding leases on the source chunks, then duplicates the metadata and records the operation in the operation log. Upon the first write to a snapshotted chunk, the master asks the chunkserver holding it to create a copy, and the write then proceeds on the copy. Copies stay local to the same chunkserver, so no chunk data moves over the network.
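
A sketch of the copy-on-write bookkeeping a master might keep for snapshots; the reference counts and chunk handles are illustrative, not the paper's data structures.

```python
# Sketch of copy-on-write snapshot bookkeeping at the master.
class SnapshotMaster:
    def __init__(self) -> None:
        self.files: dict[str, list[str]] = {}   # path -> chunk handles
        self.refcount: dict[str, int] = {}      # chunk handle -> references

    def snapshot(self, src: str, dst: str) -> None:
        # Cheap: copy only metadata; both files now point at the same chunks.
        self.files[dst] = list(self.files[src])
        for chunk in self.files[src]:
            self.refcount[chunk] = self.refcount.get(chunk, 1) + 1

    def write(self, path: str, index: int) -> str:
        chunk = self.files[path][index]
        if self.refcount.get(chunk, 1) > 1:
            # First write after a snapshot: ask the chunkserver to make a
            # local copy, then redirect this file to the new chunk handle.
            new_chunk = chunk + "'"
            self.refcount[chunk] -= 1
            self.refcount[new_chunk] = 1
            self.files[path][index] = new_chunk
            chunk = new_chunk
        return chunk    # the write proceeds against this handle


m = SnapshotMaster()
m.files["/logs/a"] = ["c1", "c2"]
m.snapshot("/logs/a", "/logs/a-snap")
print(m.write("/logs/a", 0))   # prints c1' (copy created; the snapshot keeps c1)
```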

Fault tolerance

Assume component failures and disk corruption are the norm rather than the exception.

High availability uses a two-fold strategy: fast recovery, and replication (of both chunks and master state).

Data integrity is managed by each chunkserver independently. A chunk is divided into 64KB blocks, each with a checksum kept in memory as metadata, separate from the user data. The chunkserver verifies the checksum before returning data and reports mismatches to the master, who then clones the chunk from another replica. Read overhead is low, and checksum updates are optimized for append workloads.
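
A sketch of per-block checksumming at a chunkserver, assuming 64 KB blocks and CRC32 (the paper does not specify the checksum function):

```python
# Sketch of per-block checksum verification on read at a chunkserver.
import zlib

BLOCK = 64 * 1024


class ChecksummedChunk:
    def __init__(self, data: bytes) -> None:
        self.data = data
        # One checksum per 64 KB block, kept separately from the data.
        self.sums = [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

    def read(self, offset: int, length: int) -> bytes:
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        for b in range(first, last + 1):
            block = self.data[b * BLOCK:(b + 1) * BLOCK]
            if zlib.crc32(block) != self.sums[b]:
                # A mismatch would be reported to the master, which then
                # clones the chunk from another replica.
                raise IOError(f"checksum mismatch in block {b}")
        return self.data[offset:offset + length]


chunk = ChecksummedChunk(b"x" * (3 * BLOCK))
print(len(chunk.read(BLOCK - 10, 100)))   # verifies only the blocks touched
```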

Diagnostics rely on extensive, detailed logging, which has helped with root-cause diagnosis and performance analysis. Servers log significant events like membership changes and RPC requests/responses. Logging is sequential and asynchronous, so the overhead is low.

Conclusion

This work is almost two decades old, yet the design choices still carry their weight. I loved the clarity of the separation between data and control flows. The master owns the administrative responsibilities, including replica management and garbage collection. The single master instance is debatable. Batch workloads are quite unpredictable, varying between a sawtooth pattern for scheduled work and sudden spikes for on-demand workloads. Will the decision to limit client-to-master interaction to metadata alone suffice?

We will see the ramifications of the single-master design in a paper on GFS's evolution. A single master definitely helps keep the design simple. The operation log provides a global total order, a must for consistency guarantees and recovery.

The storage design reflects a hierarchy of namespaces and chunks. The choice to keep chunkservers as the source of truth for chunk locations is nice; it relieves the master of yet another point of conflict.

The choice of delaying resource reclamation probably stems from the fact that storage is dispensable while throughput is at a premium. The consistency problem is reduced to choosing the right offset to write; that is the only aspect that needs to be in a critical section, since the actual data is already buffered in an LRU cache at the chunkservers. This is an important pattern.

Overall I enjoyed this paper. See you next in a discussion on GFS evolution.