Continuous feedback through continuous deployment and related practices is the norm in modern software development. With each deployment, there is an imminent risk of regression in the functional behavior (correctness) or in the non-functional aspects of the service (performance etc.). We mitigate such risks with rigorous testing at various levels, and augment it with change management.
This post will focus on change management through configuration to support experimentation, operational switches or feature toggles.
The Feature Toggles article on Martin Fowler’s site provides a nice taxonomy of the various kinds of configuration-based mechanisms. I’d like to group them into three categories. We’ll use the terms flag and toggle as synonyms.
- Release flags associated with a feature rollout
  - Experimentation toggles: the feature may be exposed gradually
  - Release toggles: static configuration that allows half-baked features to ship with limited exposure
- Operational flags that provide the ability to switch a feature on/off
- Configurations that map dynamic aspects of a feature, e.g. properties, limits
  - May be based on a user aspect (i.e. who the user is, such as premium)
  - May be service scoped, e.g. a configuration to manage timeouts or certain dynamic endpoints
  - Other scopes: machine instance, environment etc.
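The three categories above can be made explicit in code by giving every flag a kind and a scope. This is a minimal sketch only; the `Kind`, `Scope`, `Flag` and `REGISTRY` names and the sample entries are hypothetical and not taken from any particular toggle library.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    RELEASE = "release"              # experimentation or release toggles
    OPERATIONAL = "operational"      # on/off switches for operators
    CONFIGURATION = "configuration"  # properties, limits, endpoints


class Scope(Enum):
    USER = "user"
    SERVICE = "service"
    MACHINE = "machine"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Flag:
    name: str
    kind: Kind
    scope: Scope
    description: str


# Hypothetical entries, one per category.
REGISTRY = [
    Flag("new_checkout_flow", Kind.RELEASE, Scope.USER,
         "Gradual exposure of a new feature"),
    Flag("schema_migration_step", Kind.OPERATIONAL, Scope.SERVICE,
         "Switch a deployment step off"),
    Flag("request_timeout_ms", Kind.CONFIGURATION, Scope.SERVICE,
         "Timeout limit for outgoing calls"),
]


def flags_of_kind(kind: Kind) -> list[Flag]:
    """Return every registered flag of the given kind."""
    return [flag for flag in REGISTRY if flag.kind == kind]
```

Keeping the kind next to the name makes it possible to ask later which flags are meant to be short-lived and which are long-lived configuration.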
All of these flags and configurations provide tremendous value in terms of agility. When used with a good observability mechanism, the release and ops toggles provide an efficient experimentation platform where mistakes are permissible.
Goodness comes with a cost :) With toggles scattered through a code base, reasoning about the code flow becomes extremely hard. Slowly, the growing set of toggles becomes technical debt. And to make things worse, toggles also invite accidents caused by human error. Time for a story.
In one of my prior projects, we used to call operational flags kill switches. In the happy path, we would use kill switches to contain large-scale deployment risks. For example, imagine a service owned by a 600-member org that follows a weekly deployment cadence. Each small team contributes its DB schema changes; all changes are centrally managed and deployed every Tuesday. The risk of one team’s deployment step (a schema migration) breaking the others is extremely high. We would protect each step with a kill switch, so that in the worst case it could be safely turned off to unblock the entire deployment. Goodness.
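The idea of guarding each deployment step can be sketched roughly as follows. The step names, the `run_deployment` function and the `killed` set are hypothetical; a real deployment pipeline would read the switches from its configuration store.

```python
from typing import Callable


def run_deployment(steps: dict[str, Callable[[], None]],
                   killed: set[str]) -> list[str]:
    """Run each step in order, skipping any whose kill switch is engaged."""
    executed = []
    for name, step in steps.items():
        if name in killed:
            continue  # step turned off; the rest of the deployment proceeds
        step()
        executed.append(name)
    return executed


def broken_migration() -> None:
    # Stands in for a schema migration that would fail the whole deployment.
    raise RuntimeError("schema conflict")


steps = {
    "team_a_schema": lambda: None,
    "team_b_schema": broken_migration,
    "team_c_schema": lambda: None,
}

# With team B's kill switch engaged, the other teams are not blocked.
executed = run_deployment(steps, killed={"team_b_schema"})
```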
We would also use kill switches to enable verbose logs and to guard high-risk features in the application logic. To make things efficient, engineers would bundle several pieces of functionality under one kill switch. Now, imagine a live site incident comes in. The on-call engineer enables a kill switch to see verbose logs, and it makes things worse because the same kill switch also activates business logic that was never widely tested. True story. And finding such a toggle is a nightmare. Allow me to humor you more. At times, these kill switches would remain activated for months on end for specific scopes (tenant, environment etc.).
We have come full circle now: adding configurations to enhance agility, using them for fun and profit, and finally overusing them until agility suffers.
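To make the bundling problem concrete, here is a small sketch that contrasts one switch doing two jobs with two single-purpose switches. The switch names, the logger name and the discount logic are invented for illustration.

```python
import logging

logger = logging.getLogger("orders")


def total_bundled(total_cents: int, switches: dict[str, bool]) -> int:
    """Risky: one switch drives both diagnostics and business logic."""
    if switches.get("ks_orders_debug", False):
        logger.setLevel(logging.DEBUG)
        return total_cents * 90 // 100  # barely tested discount path
    return total_cents


def total_separated(total_cents: int, switches: dict[str, bool]) -> int:
    """Safer: each switch has exactly one responsibility."""
    if switches.get("verbose_logging", False):
        logger.setLevel(logging.DEBUG)
    if switches.get("experimental_discount", False):
        return total_cents * 90 // 100
    return total_cents
```

With the bundled version, an on-call engineer who only wants more logs also changes what customers are charged. With the separated version, turning on `verbose_logging` leaves the computed total untouched.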
Let me bring you back to today. Say you’re working on new business logic. Will you reason about the application behavior including or excluding the kill switches? Now grow the kill switches from a handful to a bunch. With the paths exploding combinatorially, how many will you test? Will you remember which kill switches are activated in which scope (tenant, environment, or machine)? Yes, I’ve seen specific configurations that isolate a machine, e.g. skip all requests that come to it (such requests are just forwarded to another machine).
Similar risks exist for feature toggles that remain activated for years for 25% of the customers. We used to have mechanisms such as sending reminders to engineers to remove feature toggles or kill switches from the code.
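The combinatorial growth is easy to quantify. Assuming the switches are independent booleans, n switches give 2^n possible combinations, before scopes are even considered. The helper names below are hypothetical.

```python
from itertools import product
from typing import Iterator


def configurations(switches: list[str]) -> Iterator[dict[str, bool]]:
    """Yield every on/off assignment for the given switch names."""
    for values in product([False, True], repeat=len(switches)):
        yield dict(zip(switches, values))


def path_count(n_switches: int) -> int:
    """Number of distinct on/off combinations for n independent switches."""
    return 2 ** n_switches


# Three switches are still enumerable by hand: 8 combinations.
small = list(configurations(["verbose_logs", "new_pricing", "skip_migration"]))

# Twenty switches already give 1,048,576 combinations.
large = path_count(20)
```

Each scope in which a switch may take a different value (tenant, environment, machine) multiplies the number of states further, which is why exhaustive testing stops being realistic very quickly.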
Is there a way out?
Sure thing. First, understand the complexity of these branches. Treat the feature toggles, ops switches and the dynamic (instance/stage-specific) configurations alike, as just configurations. All of them pollute the code base if left untended.
Second, understand when to use what. If a flag is short-lived, have mechanisms to keep its usage in check. Do not allow fire and forget. Create a note in your tech debt backlog to clean it up as soon as possible.
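One possible mechanism against fire and forget is to give every short-lived flag an owner and an expiry date, and to fail a build check once the date has passed. This is only a sketch under that assumption; the `ShortLivedFlag` type, the flag names and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ShortLivedFlag:
    name: str
    owner: str
    expires_on: date  # date by which the flag should be removed from the code


def expired_flags(flags: list[ShortLivedFlag],
                  today: date) -> list[ShortLivedFlag]:
    """Return the flags whose expiry date is strictly before today."""
    return [flag for flag in flags if flag.expires_on < today]


def check_flags(flags: list[ShortLivedFlag], today: date) -> None:
    """Raise if any flag has outlived its expiry date (e.g. as a CI step)."""
    stale = expired_flags(flags, today)
    if stale:
        names = ", ".join(f"{flag.name} (owner: {flag.owner})" for flag in stale)
        raise RuntimeError(f"Expired flags still registered: {names}")


# Hypothetical registry of short-lived flags.
FLAGS = [
    ShortLivedFlag("new_checkout_flow", "team-payments", date(2024, 3, 1)),
    ShortLivedFlag("verbose_import_logs", "team-ingest", date(2024, 9, 1)),
]
```

Calling `check_flags(FLAGS, date.today())` in a pipeline turns a forgotten flag into a visible failure instead of a reminder email that can be ignored.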
Long-lived configurations (static/dynamic) must be well tested for both positive and negative paths. The risk that comes with configurations is the difference between environments, e.g. Canary environments may end up with different configuration parameters than Prod, and this may lead to untested behavior entering Prod. Be careful about the nature of a configuration: is it a simple mapping (e.g. a Canary-specific endpoint) or a threshold? Or does it change the assumptions, like a flag?
Treat all kinds of short-lived configurations as potential technical debt. Please!
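A simple way to surface environment differences is to diff the two configurations and ignore only the keys that are expected to differ, such as endpoints. The sketch below assumes configurations are flat dictionaries; the keys, values and the `config_drift` function are hypothetical.

```python
from typing import Any


def config_drift(canary: dict[str, Any], prod: dict[str, Any],
                 expected_to_differ: set[str]) -> dict[str, tuple[Any, Any]]:
    """Map each unexpectedly differing key to its (canary, prod) values.

    A key missing from one environment is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(canary) | set(prod)):
        if key in expected_to_differ:
            continue  # simple mappings such as environment-specific endpoints
        canary_value, prod_value = canary.get(key), prod.get(key)
        if canary_value != prod_value:
            drift[key] = (canary_value, prod_value)
    return drift


canary_config = {
    "endpoint": "https://canary.example.internal",
    "timeout_ms": 500,
    "new_pricing": True,
}
prod_config = {
    "endpoint": "https://prod.example.internal",
    "timeout_ms": 2000,
    "new_pricing": False,
}

drift = config_drift(canary_config, prod_config, expected_to_differ={"endpoint"})
```

Here the endpoint difference is a plain mapping and is skipped, while the differing threshold (`timeout_ms`) and the differing flag (`new_pricing`) are reported, since those are the ones that mean Canary did not exercise the behavior Prod will see.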