We’re picking up a classic talk today: a summary of the evolution, challenges, and motivations of cloud computing in practice. Given it’s only a 30-minute talk, please do watch it.
Dean, Jeff. “The rise of cloud computing systems.” In SOSP History Day 2015, pp. 1-40. 2015.
Here are my notes from the talk.
My expectation from this talk was primarily to learn the historical context, and to develop an appreciation for why certain cloud abstractions emerged.
The story of cloud computing starts with Multics (1960s), which was the first to come up with the idea of utility computing. High Performance Computing (HPC) and database systems later carried the cause forward; both pushed the limits of compute and storage on single systems.
How were they different? HPC cared about performance, not fault tolerance: a component failure was treated as a total failure, halting execution and rolling back to the last known checkpoint. DBMSs gave us clarity on transactions (multiple operations on multiple objects) and the definitions behind ACID, including consistency. Replication and redundancy emerged as a pattern.
We need scale and fault tolerance.
The emergence of search pushed the envelope on storage systems: store billions of pages and retrieve them at subsecond latency and high throughput. Cost effectiveness was a requirement, so commodity machines emerged as an alternative to powerful supercomputers.
Experiments on commodity clusters optimized cost with a trade-off in reliability. Hardware failures would occur at every level of the stack for various reasons, including human error and natural disasters. This led to a turning point: reliability must come from software.
How do we achieve reliability with software? We do the thing we claim to be good at: build abstractions. Try as hard as possible to hide the notion of multiple nodes. Let the higher layers not care about the implementation of the lower, closer-to-metal layers. This gives rise to platforms whose sole job is to manage and repair hardware failures transparently.
A series of abstractions emerged that catered to large-scale storage and compute while maintaining the fundamental tenets of fault tolerance, low latency, and high throughput.
- Storage (the file system) was the first one. Can we build a store that scales to infinity? Definition of success: a) scalable bandwidth, b) high availability, c) fault tolerance, and d) a transparent API. Thus, the Google File System (GFS) was born.
- What’s the point of infinite storage if we can’t run computation on it efficiently? Computation on clusters was the next problem. VMs and containers emerged as solutions, and cluster scheduling systems managed them, providing a mechanism to schedule tasks, again with the fundamental guarantees (e.g., Borg at Google, Autopilot at Microsoft).
- So, we can now schedule jobs over thousand-node clusters. A few challenges emerged in most use cases:
  - Effective resource utilization requires sharing the cluster across workloads, i.e., no dedicated cluster per workload. Trade-off: the wrong task can land on the wrong machine spec (e.g., building ML models on CPUs).
  - Isolation and limits are necessary so that no single workload can exploit the shared resources.
  - Tail latency is our (unfortunate) companion (a back-of-the-envelope sketch after this list shows why).
- We need a framework for running distributed computations. Suddenly we could write programs that can run on any distributed system, e.g., with the map-reduce pattern (a minimal word-count sketch follows this list). The same program can run on large disk-backed clusters with Hadoop or in memory with Spark. The cluster-scheduling concerns above are now handled by a consistent lower layer.
- Turns out distributed storage may be too low-level an abstraction. We need structured storage that can be tuned for specific use cases. Bigtable, Spanner, Dynamo, etc. emerged as solutions in this phase, each offering a different trade-off among the fundamental guarantees: some were externally consistent and some eventually consistent. Choose what suits you. (A tiny quorum sketch after this list illustrates one such trade-off.)
- Finally, it was revolutionary to see all of these exposed to the commons as the public cloud. Rent whatever you want, and pay as you go.
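The talk doesn’t dwell on numbers, so here’s a back-of-the-envelope sketch of my own (with made-up figures) of why tail latency becomes everyone’s companion: once a request fans out to many machines, a rare per-machine slowdown becomes the common case per request.

```python
# Back-of-the-envelope: fan-out amplifies tail latency.
# Illustrative (assumed) numbers: each backend is slow 1% of the time,
# and a single user request queries 100 backends in parallel.
p_slow = 0.01   # probability one backend lands in its latency tail
fanout = 100    # backends touched by a single request

# The request is fast only if *every* backend it touches is fast.
p_request_slow = 1 - (1 - p_slow) ** fanout
print(f"{p_request_slow:.0%} of requests hit at least one slow backend")
# ~63%: the request's latency is dominated by its slowest backend.
```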
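To make the map-reduce programming model mentioned above concrete, here is a minimal single-process word-count sketch. It is only an illustration of the pattern, not any framework’s API; the function names are mine, and a real framework (Hadoop, Spark) would run the map and reduce steps on many machines and handle the shuffle and failures for you.

```python
from collections import defaultdict

# Minimal single-process sketch of the map-reduce pattern (word count).
# In a real framework the map and reduce calls run on many machines and
# the "shuffle" moves data over the network; here everything is local.

def map_phase(doc):
    # Emit (key, value) pairs: one ("word", 1) per occurrence.
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

docs = ["the cloud hides failures", "failures are the norm in the cloud"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)
# -> {'the': 3, 'cloud': 2, 'hides': 1, 'failures': 2, 'are': 1, 'norm': 1, 'in': 1}
```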
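And to illustrate the kind of consistency trade-off the structured-storage bullet mentions, here is a toy sketch of one knob that Dynamo-style stores expose: quorum sizes. This is my own simplification (it ignores sloppy quorums and failures, and systems like Spanner achieve external consistency by entirely different means); it only shows why R + W > N makes read and write quorums overlap, while smaller quorums trade freshness for latency.

```python
# Toy sketch of quorum tuning in a Dynamo-style replicated store.
# N = replicas per key, W = acks needed for a write, R = replicas read.
# If R + W > N, any read overlaps the latest successful write; otherwise
# a read can miss it and return stale data (eventual consistency).

def read_overlaps_latest_write(n: int, w: int, r: int) -> bool:
    # Pigeonhole: the W written replicas and the R read replicas must
    # share at least one member whenever R + W > N.
    return r + w > n

for n, w, r in [(3, 2, 2), (3, 1, 1)]:
    fresh = read_overlaps_latest_write(n, w, r)
    mode = "reads see the latest write" if fresh else "reads may be stale"
    print(f"N={n}, W={w}, R={r}: {mode}")
# N=3, W=2, R=2: reads see the latest write (more replicas per operation)
# N=3, W=1, R=1: reads may be stale (faster, eventually consistent)
```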
What’s the next milestone? The talk calls out two interesting ones. First, we need an abstraction for running interactive services with many subsystems. Second, we need systems to handle heterogeneity, e.g., splitting computation between the device and the data center.
I think this talk gives a roadmap for walking the history. We should probably pick up a few fundamental papers in this order of abstractions and reason over their challenges. I cannot help but guess that the first future milestone was probably the emergence of services like Firebase combined with functions as a service. Edge and fog computing are definitely bringing computation closer to where it is needed. It’s so awesome to observe the progress in real time.
Read next: SOSP 2015 celebrated a History Day for the conference’s 25th edition. All of the day’s talks are graciously available online. Needless to say, they’re a gold mine summarizing the computing progress and challenges of the past several decades, straight from the greatest researchers. Please check them out: https://sigops.org/s/conferences/sosp/2015/history/