Notes on seeking wisdom and crafting software

Paper - The Google File System

The Google File System (GFS) paper is one of the classics of the distributed systems world. It brought forth many novel ideas, including building large scale, reliable internet services from unreliable commodity hardware. It is also probably one of the first papers from Google on internet scale design. What were the key design choices then? I’m curious to follow their journey of building systems, learning from the design and evolving to the next system.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” In Proceedings of the nineteenth ACM symposium on Operating systems principles, pp. 29-43. 2003.

Motivation

Offer a fault tolerant, scalable distributed file system that runs on inexpensive commodity hardware.

  • Component failures are the norm, not the exception. Requirement: constant monitoring, error detection, fault tolerance and automatic recovery
  • Files are huge, e.g. multi-GB in size. Requirement: choose appropriate IO parameters, including block size
  • Files are mutated by appending new data rather than overwriting existing data. Performance optimization and atomicity guarantees focus on appends. Concurrent appends must be supported.
  • High sustained bandwidth is more important than low latency.
  • Co-design application and filesystem APIs to improve flexibility and reduce burden on clients.

The API surface includes the usual create, delete, open, close, read and write operations. The snapshot operation copies a file or directory tree at low cost. Record append allows multiple clients to append data concurrently, with atomicity guarantees for each client’s append.

Architecture

  • Each file is divided into fixed size chunks, which are stored on chunk servers and replicated (three copies by default)
  • Two types of roles in a cluster: master and chunk server
  • A cluster has a single master
    • Master stores the metadata i.e. namespace, access control and chunk mappings
    • Manages chunk id creation, leases, migrations and garbage collection
    • Communicates with chunk servers using periodic heartbeats
  • A cluster has many chunk servers
    • Store chunks in the Linux file system
    • No caching of file data, since the Linux buffer cache already keeps frequently accessed files in memory
    • Operations address data by chunk id and byte range
  • Client code is linked into each application
    • Provides API implementation
    • Application provides a filename and byte offset. The client translates this to a chunk index, retrieves the chunk id and chunk server locations from the master, then connects to a chunk server directly (sketched after this list)
    • Caches the chunk id and replica locations, keyed by filename and chunk index, for a limited time. No master communication until the cache expires or the file is reopened
    • No cache of file data (thus, no cache coherence issues)
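
To make the read path concrete, here is a minimal sketch in Python (my own illustration; names like lookup are hypothetical, the paper does not specify the client API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

class Client:
    """Minimal sketch of the client-side read path."""

    def __init__(self, master):
        self.master = master
        # (filename, chunk_index) -> (chunk_id, replica_locations)
        self.cache = {}

    def read(self, filename, offset, length):
        # Translate the byte offset into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.cache:
            # One round trip to the master; the reply is cached until it
            # expires or the file is reopened.
            self.cache[key] = self.master.lookup(filename, chunk_index)
        chunk_id, replicas = self.cache[key]
        # Data flows directly from a chunk server; the master is bypassed.
        return replicas[0].read(chunk_id, offset % CHUNK_SIZE, length)
```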

Design decisions

Single master

A single master simplifies the design: the master can use global knowledge for chunk placement and replication decisions.

  • Risk: single point of failure. Mitigated by minimizing client calls to the master and caching metadata at clients.
  • A client can ask for multiple chunks in the same request, and the master may also return info for chunks immediately following those requested

Master owns several responsibilities.

  1. Namespace management and locking. Necessary to allow concurrent operations at the master. Both read and write locks are supported on a path like /d1/d2/leaf or any of its prefixes. A read lock on a path prevents it from being deleted, renamed or snapshotted; a write lock serializes mutations to it. File creation takes read locks on the parent directories and a write lock on the created object, which allows concurrent creates in the same directory (see the sketch after this list).
    • Locks are acquired in a consistent total order: first by level in the namespace tree, then lexicographically
  2. Replica placement. How to maximize data reliability/availability and maximize network bandwidth? Spread replicas across machines AND racks. Reads can utilize aggregate bandwidth of multiple racks.
  3. Replica create/re-replicate/rebalance.
    • Creation optimizes disk usage, placement and load balance (limits new creations on a single chunk server).
    • Re-replication is triggered when the replica count falls below a threshold. Priority is determined by how far the count is from the replication goal, and by whether the chunk is blocking client progress.
    • Cloning happens with the master instructing a chunk server to copy from a source replica. Active clone operations are limited so they do not jeopardise client operations.
    • Rebalance gradually fills up free chunk servers instead of swarming them with new data and hence new traffic.
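
Here is the lock ordering sketched in Python (my illustration, not code from the paper; GFS uses read-write locks, plain locks stand in for brevity):

```python
import threading

locks = {}  # path -> lock; plain locks stand in for read-write locks

def lock_order(paths):
    # Sort first by depth in the namespace tree, then lexicographically,
    # so every operation acquires locks in the same total order and two
    # concurrent operations cannot deadlock.
    return sorted(set(paths), key=lambda p: (p.count("/"), p))

def acquire_all(paths):
    for path in lock_order(paths):
        locks.setdefault(path, threading.Lock()).acquire()

# Creating /home/user/foo locks /home and /home/user (read locks in GFS)
# and /home/user/foo (write lock), acquired in exactly that order.
acquire_all(["/home/user/foo", "/home", "/home/user"])
```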

Data storage

Chunk sizes in GFS are 64MB, much larger than the typical file system block size.

This reduces network overhead between client and chunk server, since a few requests fetch all the necessary data. Metadata stays small and can be stored in-memory at the master.
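
As a rough worked example (my numbers, not the paper’s): 1 PB of data in 64 MB chunks is about 16 million chunks; at 64 bytes of metadata per chunk, that is roughly 1 GB, which fits comfortably in the master’s memory.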

Small files are at a disadvantage: they occupy few chunks, so many clients reading the same file can create a hotspot at a chunk server. Solution: maintain more replicas for such files and stagger the concurrent requests.

Garbage collection reclaims deleted files and chunks. It is triggered by the master with regular scans of a) the file namespace, and b) the chunk namespace. Upon deletion, a file is renamed to a hidden name carrying the deletion timestamp; it is garbage collected later. Similarly, master and chunk server exchange chunk metadata in heartbeats, and any chunk not known to the master is deleted by the chunk server. Call out: distributed garbage collection is a deep problem in the PL community, yet the scheme presented here is simple.
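
A minimal sketch of the chunk-level reclamation during a heartbeat (my illustration of the scheme, not code from the paper):

```python
def heartbeat(master_chunks, chunkserver_chunks):
    """master_chunks: set of chunk ids the master's metadata knows about.
    chunkserver_chunks: set of ids this chunk server actually stores.
    Returns the orphaned ids the chunk server is free to delete."""
    return chunkserver_chunks - master_chunks

# A chunk created during a failed operation never made it into the
# master's metadata, so it shows up as an orphan and is reclaimed lazily.
orphans = heartbeat({"c1", "c2"}, {"c1", "c2", "c9"})
assert orphans == {"c9"}
```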

Reclamation/replication policies are user configurable.

Stale chunk replicas are also garbage collected. Detection uses the chunk version number exchanged during the heartbeat between master and chunk server. The master maintains the version number and increments it whenever it grants a new lease on the chunk; stale replicas lag behind. The master omits stale replicas in its responses to clients. Client and chunk server verify the version number to ensure they access up to date data.
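
Sketched in Python (the comparison logic is my illustration; the version-on-lease rule is from the paper):

```python
def check_version(master_ver, replica_ver):
    """Compare a replica's reported chunk version against the master's
    record. 'stale' replicas missed mutations while their server was
    down and get garbage collected; 'ahead' means the master's own
    record lags (e.g. it failed after granting a lease) and, per the
    paper, the master adopts the higher version."""
    if replica_ver < master_ver:
        return "stale"
    if replica_ver > master_ver:
        return "ahead"
    return "ok"

assert check_version(5, 4) == "stale"
assert check_version(5, 5) == "ok"
```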

Advantages of GC over eager deletion:

  • Simple and reliable in the face of component failures. Provides a uniform way to handle chunks the master does not know about, e.g. those created during a failed operation.
  • Storage reclamation is a background activity, cost is amortized.

Disadvantage: storage is not immediately available under pressure.

Metadata storage

  • 64B of metadata per 64MB chunk, entirely stored in-memory at the master
  • Namespaces and file-to-chunk mappings are persisted and replicated
    • Namespaces are lookup tables compressed with prefix compression
  • Chunk locations are pulled by the master at startup and are not persisted
    • Kept up to date with periodic heartbeat messages between master and chunk server
    • Chunk server is the source of truth for the chunks it owns!
  • Risk: number of chunks will expand beyond Master’s memory. Mitigation: easy to add new memory; prefer simplicity, performance and reliability of in-memory store

An Operation Log is maintained at the master, providing a consistent view of operations on a logical timeline, i.e. a global total order. It records files, chunks and version info. It is replicated to ensure durability. Client operations are acknowledged only after the record is written to the operation log and its replicas.

Master recovers its state by replaying the operation log. Replay is kept short with periodic checkpoints. Checkpoints are in a B-tree like form that can be mapped directly into memory and hence are fast to load. Building a checkpoint happens in a separate thread and doesn’t block ongoing mutations. Older checkpoints and log files can be deleted.
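
A minimal sketch of recovery, assuming a hypothetical log of (sequence number, mutation) records:

```python
def recover(checkpoint_state, checkpoint_seq, log):
    """checkpoint_state: dict snapshot of master metadata taken at log
    sequence number checkpoint_seq. log: ordered (seq, apply_fn) records.
    Only the log suffix written after the checkpoint is replayed."""
    state = dict(checkpoint_state)
    for seq, apply_fn in log:
        if seq > checkpoint_seq:
            apply_fn(state)  # mutations re-applied in log order
    return state

# Example: checkpoint taken at seq 2, so only record 3 is replayed.
log = [(3, lambda s: s.update({"/a": "chunk-7"}))]
assert recover({"/a": "chunk-5"}, 2, log) == {"/a": "chunk-7"}
```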

Master’s data is replicated to shadow masters, which apply the operation log actions exactly as the master does.

Consistency model

  • Namespace operations at master are atomic and correct, recorded in operation log
  • File regions mutation
    • Consistent = all clients see the same data irrespective of the replica they query
    • Defined = consistent, and clients see what the mutation has written in its entirety
    • A serial successful mutation leaves the region defined
    • Concurrent successful mutations leave the region consistent but undefined: all clients see the same data, but it may not reflect what any single mutation has written
    • A failed mutation leaves the region inconsistent, and hence undefined
  • Regions are guaranteed to be defined after a series of successful mutations. GFS applies the mutations to a chunk in the same order on all replicas, and uses version numbers to detect stale replicas
  • Record appends are guaranteed to be atomic and at-least-once, even with concurrent writers
  • Staleness can occur at the client side since it caches replica locations for a chunk. Because most files are append-only, a stale replica usually returns a premature end of file rather than outdated data. The cache is refreshed when it expires or the file is reopened
  • Identify failed servers by handshakes between master and chunkservers
  • Client primitives
    • Mutate files by appending instead of overwriting
    • 1 writer, N readers: the writer generates the file, writes data and atomically renames it, or writes periodic checkpoints. Readers only read from defined regions, up to the last checkpoint
    • M writers, N readers: each writer’s appended records are preserved and carry extra info like checksums for integrity. Retries create duplicates; readers filter the dupes (part of the client library, see the sketch after this list)
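
Since record append is at-least-once, readers must tolerate padding and duplicates. A sketch of the reader-side filtering (my illustration; the record layout is an assumption, the paper only says records carry checksums and unique identifiers):

```python
import zlib

def valid_records(records):
    """records: iterable of (record_id, checksum, payload) tuples.
    Skips corrupt regions (padding, torn writes) via the checksum and
    drops duplicates left behind by append retries via the record id."""
    seen = set()
    for record_id, checksum, payload in records:
        if zlib.crc32(payload) != checksum:
            continue  # padding or partially written record
        if record_id in seen:
            continue  # duplicate created by an at-least-once retry
        seen.add(record_id)
        yield payload

rec = (1, zlib.crc32(b"hello"), b"hello")
dup = rec                         # the same record appended twice
pad = (2, 0, b"\x00" * 8)         # filler region with a bogus checksum
assert list(valid_records([rec, dup, pad])) == [b"hello"]
```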

Communications

Decouple the flow of data from the flow of control to fully utilize network bandwidth and avoid network bottlenecks and high latency links.

  • Data flow: the client pushes data to all replicas (along a chain, see below). Each chunk server holds the data in an LRU buffer until it is used or aged out.
  • Control flow: the client asks the primary replica to commit the write. The primary assigns a serial order to all pending writes, applies them, and forwards the order to each secondary replica. All replicas apply writes in the same order.

Tricks to maximize data flow:

  • Data is pushed out to chunk servers linearly, closest server first. A tree or other topology is not used since it would split the outbound bandwidth across multiple recipients.
  • Follow chain replication, i.e. client to s1, s1 to s2, etc., where s1 .. sN are chunk servers.
  • Pipeline the data transfer over TCP: each server forwards bytes as soon as they arrive (quantified below).
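
The paper quantifies the benefit of pipelining: the ideal time to push B bytes to R replicas is B/T + R*L, where T is the network throughput and L is the per-hop latency, because each chunk server forwards bytes as soon as they arrive rather than waiting for the whole payload. With 100 Mbps links and sub-millisecond hops, 1 MB can ideally be distributed in about 80 ms.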

Atomic record appends: the client provides only the data; GFS appends it at least once at an offset of its own choosing and returns that offset to the client. The primary replica chooses the offset and asks secondaries to write at the same offset.

Invariant: after a successful append, all replicas are at least as long as the end of the record, so any future append will be assigned a higher offset.
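
A sketch of the primary’s decision (names are mine; the padding behavior is from the paper):

```python
CHUNK_SIZE = 64 * 1024 * 1024

def primary_append(chunk_used, record_len):
    """Returns ('write', offset) when the record fits in the current
    chunk; the same offset is forwarded to all secondaries. Returns
    ('pad', None) when it doesn't: the chunk remainder is padded and the
    client retries on a fresh chunk, so a record never straddles a
    chunk boundary."""
    if chunk_used + record_len <= CHUNK_SIZE:
        return ("write", chunk_used)
    return ("pad", None)

assert primary_append(0, 1024) == ("write", 0)
assert primary_append(CHUNK_SIZE - 10, 1024) == ("pad", None)
```

The paper also caps a record at one fourth of the chunk size to keep the worst-case padding acceptable.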

The snapshot operation creates a copy almost instantaneously using a copy-on-write scheme. The master first revokes any outstanding leases on the source chunks, then duplicates the metadata and logs the operation in the operation log. Upon the first write to a snapshotted chunk, the master notices its reference count is greater than one and asks each chunk server holding the chunk to create a copy before performing the write. Copies are created locally on the same chunk server.
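
The bookkeeping can be sketched with reference counts (my illustration of the mechanism, not the paper’s code):

```python
refcount = {"c1": 1}  # chunk id -> number of files referencing it

def snapshot(chunks):
    """Snapshot only duplicates metadata: it bumps reference counts."""
    for chunk in chunks:
        refcount[chunk] += 1

def on_write(chunk):
    """First write after a snapshot sees refcount > 1 and copies the
    chunk locally on each chunk server before mutating it."""
    if refcount[chunk] > 1:
        refcount[chunk] -= 1
        new_chunk = chunk + "'"  # hypothetical new chunk id
        refcount[new_chunk] = 1
        return new_chunk         # the write proceeds on the copy
    return chunk

snapshot(["c1"])
assert on_write("c1") == "c1'"   # copy-on-write kicks in
assert on_write("c1") == "c1"    # sole owner again: write in place
```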

Fault tolerance

Assume component failures and disk corruptions.

High availability uses a two-fold strategy: fast recovery and replication.

  • Servers are designed to restart in seconds. No distinction between normal and abnormal termination; state is restored upon boot.
  • Replication of chunks and of master data. Shadow masters provide read-only access when the primary master is down.

Data integrity is managed by each chunk server independently. A chunk is divided into 64KB blocks, each with a checksum kept in memory and stored persistently with logging, separate from user data. The chunk server verifies the checksum before returning data and reports mismatches to the master, which clones the chunk from another replica. Read overhead is low, and checksumming is optimized for append workloads.
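
A sketch of read-time verification (illustrative; the paper does not specify the checksum function, CRC32 stands in here):

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks within a chunk

def verify_read(chunk_data, checksums, offset, length):
    """Verify every 64 KB block overlapping the requested range before
    returning any data. A mismatch would be reported to the master,
    which re-clones the chunk from another replica."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK:(i + 1) * BLOCK]
        if zlib.crc32(block) != checksums[i]:
            raise IOError(f"checksum mismatch in block {i}")
    return chunk_data[offset:offset + length]

data = bytes(BLOCK)                    # one 64 KB block of zeros
sums = [zlib.crc32(data)]
assert verify_read(data, sums, 0, 16) == data[:16]
```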

Diagnostics are extensive, and detailed logging has helped with root cause analysis, diagnosis and performance analysis. Many significant events are stored, like membership changes and RPC requests/responses. Logging is sequential and asynchronous, with low overhead.

Conclusion

This work is almost two decades old, yet the design choices still carry their weight. I loved the clarity of the separation between data and control flows. The master owns the administrative responsibilities, including replica management and garbage collection. A single master instance is debatable: batch workloads are quite unpredictable, varying between a sawtooth pattern for scheduled work and sudden spikes for on-demand workloads. Will the decision to limit client-master interaction to metadata suffice?

We will see the ramifications of the single master design in a paper on GFS evolution. A single master definitely helps keep the design simple. The operation log provides a global total order, a must for consistency guarantees and recovery.

The storage design reflects a hierarchy of namespaces and chunks. The choice to keep chunk servers as the source of truth for chunk locations is nice: it relieves us of yet another point of conflict.

The choice to delay resource reclamation probably stems from the fact that storage is dispensable while throughput is at a premium. The consistency problem is reduced to choosing the right offset to write; that is the only part that needs a critical section, since the actual data writes are buffered in the LRU cache at chunk servers. This is an important pattern.

Overall I enjoyed this paper. See you next in a discussion on GFS evolution.