Tour de Platform
Recently I had an opportunity to work on the core storage platform of an internet-scale service. I thoroughly enjoyed the journey, and here are my notes on a few things I learned from the experience.
If it can fail, it will fail. And the obvious corollary: everything will fail. Thus it is extremely important to plan ahead of time and build in recovery mechanisms. A related lesson is to write the troubleshooting guides along with the features. When all hell breaks loose, the poor person handling it will need some mitigation guidance and is going to thank you for it. Alternatively, if you’re too busy writing code to write guides, you will get called to every Sev 2 incident. This is going to burn you out.
A reasonable success criterion is to detect incidents early and mitigate them fast. The largest blocker to detection is self-induced noise. A team adds too many alerts, and not all of them are worthy of an investigation. Real incidents get buried in the haystack. Add an alert only for user impact, not for short-term infrastructure blips. Do log the latter and mine them if you will. The enemy of fast mitigation is tacit knowledge. The more knowledge remains stuck in one developer’s head, the worse things will be. That poor person will become a bottleneck. Instead, put that knowledge on a searchable wiki with three things: Symptom, Hypothesis and Mitigations.
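To make "alert on user impact, not blips" concrete, here is a minimal sketch. It assumes you already have per-minute counts of total and failed user requests. The `Window` shape, the 5% threshold and the five-minute duration are all made-up values for illustration, not numbers from the system I worked on.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One minute of user-facing request counts (hypothetical shape)."""
    total: int
    failed: int


def should_page(windows, error_rate_threshold=0.05, sustained_minutes=5):
    """Return True only if the user-facing error rate stays above the
    threshold for `sustained_minutes` consecutive windows.

    Shorter spikes are treated as blips: log them, but do not page anyone.
    """
    streak = 0
    for w in windows:
        rate = w.failed / w.total if w.total else 0.0
        # Extend the streak on a bad minute, reset it on a healthy one.
        streak = streak + 1 if rate > error_rate_threshold else 0
        if streak >= sustained_minutes:
            return True
    return False


# A two-minute spike does not page; a five-minute outage does.
blip = [Window(1000, 0)] * 3 + [Window(1000, 200)] * 2 + [Window(1000, 0)] * 3
outage = [Window(1000, 0)] * 2 + [Window(1000, 100)] * 5
print(should_page(blip), should_page(outage))
```

The point of the design is that the alert is keyed on what users see (failed requests) and on duration, so a noisy disk or a single restarted node never wakes anyone up by itself.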
Infrastructure is king. I was surprised by how little of the academic literature talks about it. Back-of-the-napkin calculations are your friend. I struggled with this a lot because of tacit knowledge. The solution was to freaking reverse engineer some calculations on how much of each resource is used for what kind of load. With this mental model, I could then plan for the future scale of the features I would work on. Maybe do this as soon as you start understanding the data flows of the system.
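A napkin calculation of this kind can be as small as the sketch below. Every number in it is invented for illustration (request rate, payload size, replication factor, per-machine disk throughput, target utilization). Plug in whatever you reverse engineer from your own system.

```python
import math


def machines_needed(requests_per_sec, bytes_per_request, replication_factor,
                    machine_throughput_bytes_per_sec, target_utilization):
    """Estimate how many machines a write workload needs, by disk throughput.

    Total bytes written per second (including replicas) divided by the
    throughput we are willing to use on each machine, rounded up.
    """
    total_bytes_per_sec = requests_per_sec * bytes_per_request * replication_factor
    usable_per_machine = machine_throughput_bytes_per_sec * target_utilization
    return math.ceil(total_bytes_per_sec / usable_per_machine)


# Hypothetical load: 50,000 writes/s of 20 KB each, replicated 3 ways,
# on machines that sustain 200 MB/s and that we keep at 50% utilization.
print(machines_needed(50_000, 20_000, 3, 200_000_000, 0.5))  # -> 30
```

The same shape works for memory, network or CPU. Swap the per-request cost and the per-machine capacity, and take the maximum across resources.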
Configuration options provide an illusion of safety. This happens every time you write poor abstractions for a feature and convince yourself that you have a configuration flag to toggle it anyway. The reality is that configurations stack upon each other and get interleaved to create a combinatorial hell of data flows. One fine day you lose the ability to reason about the system. And thus begins a saga of fighting against legacy untouchable code. Please document the configurations you’re adding and, for the sake of sanity, clean them up.
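The combinatorial part is easy to underestimate, so here is a tiny sketch that enumerates the states a handful of boolean flags can put the system in. The flag names are hypothetical. The arithmetic is the point: n independent on/off flags give 2^n combinations.

```python
from itertools import product


def flag_combinations(flags):
    """Yield every on/off assignment for the given boolean flag names."""
    for values in product([False, True], repeat=len(flags)):
        yield dict(zip(flags, values))


# Hypothetical flags added "just in case" over a few releases.
flags = ["use_new_index", "compress_blocks", "async_replication", "read_from_cache"]

combos = list(flag_combinations(flags))
print(len(combos))  # 4 flags -> 16 possible data flows to reason about
print(2 ** 10)      # 10 flags -> 1024
```

Nobody tests all of these. Each flag you delete halves the space, which is the practical argument for cleaning them up.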
A fast local development loop is a blessing. Don’t be a hero and start debugging things on remote clusters during development. The slowness will drive you crazy. A reasonable recipe is this - always set up a local cluster with whatever fancy tool you use, then write a few integration tests against the local cluster, and only then start coding the feature.
Write a small proxy so that the integration tests written against the local cluster can also run against the larger remote clusters. Your outer loop for any code change is then to deploy and first run those integration tests on the newly upgraded bits. Start the load and other tests only after this.
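One way to build such a proxy is sketched below. The test talks to a thin `ClusterProxy`, and an environment variable decides whether it points at the local or the remote cluster. Everything here is an assumption made for illustration: the `CLUSTER_ENDPOINT` variable, the `/keys/<key>` URL layout, the local address and the in-memory transport that stands in for a real HTTP client.

```python
import os

LOCAL_ENDPOINT = "http://localhost:8080"  # assumed address of the local cluster


def resolve_endpoint(env=None):
    """Pick the cluster to test against; default to the local one."""
    env = os.environ if env is None else env
    return env.get("CLUSTER_ENDPOINT", LOCAL_ENDPOINT).rstrip("/")


class ClusterProxy:
    """The only thing integration tests talk to, local or remote."""

    def __init__(self, endpoint, transport):
        self.endpoint = endpoint
        self.transport = transport  # callable(method, url, body) -> response

    def put(self, key, value):
        return self.transport("PUT", f"{self.endpoint}/keys/{key}", value)

    def get(self, key):
        return self.transport("GET", f"{self.endpoint}/keys/{key}", None)


class InMemoryTransport:
    """Stand-in for a real HTTP client so this sketch runs anywhere."""

    def __init__(self):
        self.store = {}

    def __call__(self, method, url, body):
        if method == "PUT":
            self.store[url] = body
            return None
        return self.store.get(url)


def roundtrip_test(proxy):
    """The same test body runs unchanged against any cluster."""
    proxy.put("smoke-key", b"hello")
    assert proxy.get("smoke-key") == b"hello"


roundtrip_test(ClusterProxy(resolve_endpoint(), InMemoryTransport()))
```

In a real setup the transport would be your HTTP or RPC client, and the deployment pipeline would set the endpoint before running the suite against the freshly upgraded bits.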
Every reasonably complex feature on a large distributed system requires tons of experiments. This requires you to set up a remote cluster with a hundred-odd machines, deploy your patched bits and test the fundamentals - scale, performance, reliability and cost. Then those bits move to an int environment for some baking time with real-world workloads. Each subsequent ring exposes additional workloads. This is your core loop.
Distributed tracing will save your life. A good tool provides a single trace id for a request and shows you the flow across the multiple systems participating in serving that request. A typical diagnostic flow looks like this: start with a symptom (e.g., user requests are timing out), identify the machines which served those requests, trace through the logs to find any anomalies, root cause them and mitigate. Without distributed tracing, you will be stuck at step two.
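The core idea fits in a few lines: every service stamps its log records with the trace id it received, and the tool stitches the records back together per request. The sketch below is a toy version of that stitching, not any real tracing library. The record fields (`trace_id`, `ts`, `service`, `msg`) and the service names are assumptions for illustration.

```python
import uuid
from collections import defaultdict


def new_trace_id():
    """Generated once at the edge, then passed along with the request."""
    return uuid.uuid4().hex


def group_by_trace(log_records):
    """Collect records from all services per trace id, ordered by time."""
    traces = defaultdict(list)
    for record in log_records:
        traces[record["trace_id"]].append(record)
    for records in traces.values():
        records.sort(key=lambda r: r["ts"])
    return dict(traces)


def request_path(traces, trace_id):
    """Which services handled this request, in order."""
    return [r["service"] for r in traces[trace_id]]


tid = new_trace_id()
logs = [  # records as they might arrive from three different machines
    {"trace_id": tid, "ts": 3, "service": "storage", "msg": "write timed out"},
    {"trace_id": tid, "ts": 1, "service": "frontend", "msg": "request received"},
    {"trace_id": "other", "ts": 2, "service": "frontend", "msg": "request received"},
    {"trace_id": tid, "ts": 2, "service": "metadata", "msg": "lookup ok"},
]
print(request_path(group_by_trace(logs), tid))
# -> ['frontend', 'metadata', 'storage']
```

With the trace id in hand, "identify the machines which served those requests" becomes a lookup instead of a guessing game.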
Capture numerous metrics, but always group them according to your mental model of the architecture. Imagine what success and failure look like at each component boundary. Add a handful of metrics to measure both. Most of the slicing and dicing of metrics will be to find patterns and correlations. E.g., an increase in latency may correlate with threadpool block time. So a second pivot on metrics must be around the resources - cpu, memory, disk, network etc. Many times failures revolve around a shared resource.
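A first pass at "do these two metrics move together" can be a plain Pearson correlation over aligned time windows, as in the sketch below. The two series are invented numbers, and correlation only tells you where to look next, not what the root cause is.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical per-minute samples from the same time range.
p99_latency_ms = [120, 125, 118, 240, 410, 390, 130]
threadpool_block_ms = [5, 6, 5, 60, 150, 140, 8]
cpu_percent = [41, 39, 42, 40, 41, 40, 42]

print(round(pearson(p99_latency_ms, threadpool_block_ms), 2))
print(round(pearson(p99_latency_ms, cpu_percent), 2))
```

Running the same comparison against each resource metric (cpu, memory, disk, network) is a cheap way to find which shared resource the failure revolves around.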
What are the axes of evolution of the platform? The answer depends on where your team stands. The end goal of a business is never to keep creating platform unless you’re in the developer tooling/services industry. For consumer or enterprise businesses, the value add at the top of the funnel is an end user scenario. The apps that deliver those scenarios are your customers. They will drive your backlog. A second source is the focus on fundamentals. Inevitably every system wants to be highly performant at as little cost as possible. So from time to time you will focus on resource usage and attempt to trim it by adopting modern solutions.
It is extremely critical to have a holistic view of the future of your product. In the absence of this, each customer will pull you in their own direction and the product will be spread too thin. This is one of the root causes of death by configuration parameters. A multitude of customizations will hurt your support load.
Don’t be too generous about crossing the platform/application boundary. Unnecessary complexity creeps in when application responsibilities get pulled into platform territory with bogus explanations such as code reuse. E.g., if you write the alerts that your customer is supposed to write, you are eternally screwed. Days and weeks get spent on hand-holding customers and reverting their bad data pushes, and this will break your engineers. Instead, enable the applications by exposing control plane operations if needed. Make it self-serve.
In my limited experience, there is a large gap between the green pastures of academia, flourishing with exotic techniques, and the ground reality of day-to-day struggles with detecting, diagnosing and troubleshooting errors, or rolling up your sleeves and doing infra/sysadmin work to keep the systems alive. All of that said, a magical feeling dawns when you see your code scale up to hundreds of thousands of requests. It may be worth something!