Paper - A Note on Distributed Computing
What is the right programming abstraction for Distributed Computing? Can it be as simple as creating an object and invoking a few operations on it? This paper starts with a definition of unified objects and their promises, examines those promises against the fundamental challenges of Distributed Computing, and makes a case against a grand abstraction.
We will discuss ideas from Waldo, Jim, Geoff Wyant, Ann Wollrath, and Sam Kendall. “A note on distributed computing.” In International Workshop on Mobile Object Systems, pp. 49-64. Springer, Berlin, Heidelberg, 1996.
Local objects share computing resources (address space etc.), whereas distributed objects are only aware of a contract and may comprise heterogeneous implementations. Unified objects posit that both belong to a single ontological class, which is inaccurate because of fundamental differences in speed, reliability, and security in distributed computing.
Imagine we can separate the components into interfaces and implementations. Interfaces define the protocol, i.e. the operations on the objects. Implementations take care of correctness and location transparency. In other words, as long as the interfaces are appropriately defined, the clients of an object will not see any difference whether the object is local or distributed (remote). For the old timers, this may remind you of CORBA and similar systems ;)
The vision is nothing short of a holy grail. An appropriate implementation will take care of the location constraints, i.e. local vs remote. Correctness is a function of the interface and thus remains consistent. We can use a single object oriented design regardless of context. Such a design is easy to maintain since changes happen at the object level instead of the systems level.
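The vision above can be sketched in a few lines. This is a minimal Python illustration with hypothetical names (`Counter`, `LocalCounter`, `RemoteCounter`), not code from the paper: the client is written against the interface alone and cannot tell which implementation it received.

```python
# A sketch of the "unified object" vision: one interface, two locations.
from abc import ABC, abstractmethod

class Counter(ABC):
    """The interface: defines the operations, says nothing about location."""
    @abstractmethod
    def increment(self) -> int: ...

class LocalCounter(Counter):
    def __init__(self) -> None:
        self.value = 0
    def increment(self) -> int:
        self.value += 1
        return self.value

class RemoteCounter(Counter):
    """Stand-in for a proxy that would marshal the call over the network."""
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint
        self._value = 0  # pretend server-side state, kept local for the demo
    def increment(self) -> int:
        # A real proxy would serialize the request, send it to
        # self.endpoint, and block for the reply.
        self._value += 1
        return self._value

def client(counter: Counter) -> int:
    # Location-transparent client code: identical for both implementations.
    return counter.increment()

print(client(LocalCounter()))                    # 1
print(client(RemoteCounter("counter.example")))  # 1
```

The rest of the paper explains why this symmetry is misleading: the two `increment` calls differ wildly in latency and failure behavior even though they are textually identical.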
So far so good.
Except that the paper calls all of the above false promises and principles. Why? Unified objects assume that communication is the hardest problem in Distributed Computing. It is not. The hard problems are partial failure, the lack of a central manager, performance, and concurrency.
It may not be possible to hide these in any abstraction. Let’s examine the challenges further.
- Latency is the most observable difference between local and remote execution. Even with best-in-class hardware and tooling it is impossible to beat physics. Latency is so pervasive that it masks other differences like memory access or concurrency.
- Local memory access is simple since the address space is shared. For remote access, the unified implementation may attempt to marshal and unmarshal on behalf of the client. The paper argues that building such a programming paradigm without address-relative pointer access will lead to mistakes by the unaware application programmer. Instead it may be prudent to be explicit and let the programmer know about remote access.
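The marshaling difference is easy to demonstrate. In this Python sketch (the function names are illustrative), a local call mutates the caller's object through the shared address space, while a simulated remote call works on a marshaled copy, so the caller never sees the mutation:

```python
import pickle

def local_call(config: dict) -> None:
    # Shared address space: the caller sees this mutation.
    config["retries"] = 3

def simulated_remote_call(config: dict) -> None:
    # Marshalling passes the argument by value; the "server" receives
    # a deep copy (here via pickle, standing in for a wire format).
    server_copy = pickle.loads(pickle.dumps(config))
    server_copy["retries"] = 3  # invisible to the caller

caller_cfg = {"host": "example"}
local_call(caller_cfg)
print("retries" in caller_cfg)  # True

caller_cfg = {"host": "example"}
simulated_remote_call(caller_cfg)
print("retries" in caller_cfg)  # False
```

An abstraction that hides this by-reference vs by-value distinction invites exactly the silent mistakes the paper warns about.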
- Partial failures and concurrency are probably the hardest class of problems.
Local failures are total, i.e. they affect all entities working together in the app, and such failures are detectable by a central resource allocator, e.g. the operating system. A local execution thus always reaches a known outcome: either the call completes, or the whole application fails with it.
Distributed failures are partial, i.e. the system can continue to function even when a few components have failed. It is impossible to detect failures centrally because we cannot distinguish whether a component has failed or the network link to it has broken.
E.g. we can react to local failures by catching an exception or servicing an interrupt. This is possible since the central resource allocator is like a god watching over all transactions. In a remote scenario, we may not even know that a failure has occurred; a component can magically disappear into thin air.
Unless we use certain mechanisms like timeouts, we cannot treat the disappearance as a failure. Using such mechanisms breaks the abstraction that unified objects promise, since the application programmer is now aware that they are programming against a distributed object.
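Here is a minimal Python sketch of that leak, with illustrative names. The "server" simply never replies within our patience window, and the only tool the client has is a timeout, which tells it nothing about *why* the reply is missing:

```python
import threading
import queue
import time

def remote_increment(out: queue.Queue) -> None:
    # Hypothetical remote call; we simulate a vanished server by
    # replying long after the client has given up.
    time.sleep(5)
    out.put(1)

replies: queue.Queue = queue.Queue()
threading.Thread(target=remote_increment, args=(replies,), daemon=True).start()

try:
    result = replies.get(timeout=0.2)  # the timeout IS the abstraction leak
except queue.Empty:
    # Dead server, slow server, or broken link? The client cannot tell.
    result = None

print(result)  # None
```

Note that nothing resembling a local method call appears here: the moment the programmer writes `timeout=0.2` and handles `queue.Empty`, they are visibly programming against a distributed object.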
Wait, you seem to be playing with words! Isn’t multithreading similar to distributed computing? Why won’t similar abstractions work for distributed computing?
No, for two reasons. First, in multithreading the application programmer controls the invocation order with critical sections and mutual-exclusion patterns. Second, multithreading is blessed by the presence of a godly central resource allocator: the OS can help synchronize and recover from failures. Such primitives are unavailable in distributed computing.
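The first reason is easy to see in code. In this Python sketch, an OS-backed lock gives the programmer a hard guarantee over interleaving that no equivalent primitive provides across machines:

```python
import threading

counter = 0
lock = threading.Lock()  # backed by the OS, the central resource allocator

def worker() -> None:
    global counter
    for _ in range(100_000):
        with lock:  # critical section: the programmer controls ordering
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000, guaranteed by OS-enforced mutual exclusion
```

A distributed system has no such shared arbiter: any "lock" spanning machines is itself a distributed protocol subject to partial failure.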
Thus, robustness is not simply an implementation detail; it gets exposed in the interface, contrary to our desires. Additionally, robustness of individual components doesn't guarantee a robust whole, since the network is variable.
The paper argues that the interface must define upper bounds on distributed operations, or bring in a central godly resource allocator.
NFS is one of the first large-scale distributed storage systems. The paper outlines a few observations.
NFS had two modes to expose faults.
Soft mounts exposed faults to the end user, e.g. error codes were returned when a network link broke. This mode also exposed several parameters for system admins to tune. Applications were not fault aware. This approach saw low adoption.
Hard mounts encapsulated the faults, e.g. the client may hang until the server responds. This approach led to brittle systems, since a failure somewhere would make other systems freeze. System administration policies helped in such scenarios, and this approach saw high adoption. The system admin acts as a central manager here: they are responsible for detecting failures and fixing them.
It is desirable to expose the intentions (local or remote) in the interface definition, and to provide parameters, e.g. timeouts, to deal with failure scenarios. It is up to the client code generation to skip such parameters for local objects.
Explicit awareness that an object has distributed operations will help the programmer add necessary checks to improve robustness.
The premises of this paper remain true after two decades. We no longer try to hide the distributed nature of computing; instead we use explicit APIs and policies around usage, e.g. throttling, back pressure, etc. (the paper calls these interfaces). The closest thing to a notion of distributed objects would probably be the client code generators for an interface definition language (IDL), e.g. gRPC, WCF, and the likes. The advice still remains true: do not pretend you're operating on local objects.
Read next: Fallacies of Distributed Computing is another reminder of things not to take for granted in the distributed world.