No technology platform is incident-free, not even those run by software giants like Google and Amazon. What matters is having procedures and processes in place to minimise the impact when incidents do occur.
Over the past three years, one aspect of my job has been to establish an incident management framework that Felix operates within during times of crisis. It has been a journey involving a lot of reading, many practical lessons, and constant evolution.
First, let’s start with some definitions.
An incident in the context of using a tech platform is:
An event that could lead to a loss of, or disruption to, processes, services, or functions.
At Felix we divide incidents into three primary types:
We also have various severity levels to further classify incidents.
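To make the classification idea concrete, here’s a minimal sketch in Python of how incident types and severity levels could be modelled. The tiers, names, and thresholds below are purely illustrative assumptions, not our actual definitions.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Hypothetical severity tiers; your own definitions will differ."""
    SEV1 = "Critical - platform-wide outage or data loss"
    SEV2 = "Major - key functionality degraded for many customers"
    SEV3 = "Minor - limited impact, workaround available"
    SEV4 = "Low - cosmetic or negligible impact"


@dataclass
class Incident:
    title: str
    incident_type: str   # one of your organisation's primary types
    severity: Severity
    commander: str       # who is acting as Incident Commander


# Example: recording a new incident during triage
incident = Incident(
    title="Checkout API returning 500s",
    incident_type="service",
    severity=Severity.SEV2,
    commander="on-call engineering lead",
)
print(incident.severity.name, "-", incident.severity.value)
```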
It was a cosy afternoon a few years back when I first experienced a service incident that impacted our platform: an hour-long outage caused by a hiccup during a release. At the time, it was manageable for our Head of Product to communicate individually with our customers.
As we reflected on how we responded, and on the steps we’d need to take to respond at a larger scale in the future, we recognised the need for a better-documented and standardised approach.
Fast forward three years and incident management is a standard part of the day-to-day. We have a dedicated internal site for Incident Management and a cross-functional team responsible for responding to major incidents. I’ve written this article to share some of the lessons I’ve learned over the years and, hopefully, to provide a starting point for others going through the same process.
The first draft of our incident management framework was closer to a communications procedure than a response procedure, and was nowhere near an overall framework. Looking back, for the size of company we were, this was ultimately the right starting point, and as we built out the fuller framework over the past several years, it gave us the flexibility to retain what was working and remove what wasn’t.
Don’t feel pressured to have a sprawling set of resources, procedures, or a framework on day one.
Start with what feels right and build from there. Things to consider include communications, responsibilities, and how the response is coordinated (e.g., on Slack, Teams, etc.). Suggestions for a starting point:
Over the years, our DevOps team and I have drawn on information from many sources (PagerDuty, government crisis response frameworks, OpsGenie, other experts in the space) to improve our procedures, how we train team members, and how we administer the framework. This has greatly supplemented our own experience and has led us toward better ways of responding to incidents by building on the wisdom of others.
Some resources I’ve used and would recommend:
There’s no better way to improve your framework than by using it and reflecting on how things went. Even if you’re starting out with a simple checklist, make sure to run post-incident reviews (PIRs) and collate a brief post-mortem for as many incidents as possible.
Most of the learnings we take from our incidents, and the actions we take to improve how we respond in the future, come down to three key sections we review within our PIR:
Identifying the root cause and what led to the occurrence will help you tighten up any areas of day-to-day operations that aren’t running as well as they should.
In addition to this, we review a “what went wrong” section where we discuss the parts of the response that didn’t go so well.
Like all aspects of the PIR/post-mortem, this is blameless and is an exercise geared towards finding changes that can be made to procedures, training, or wider systems to improve the next response.
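For illustration only, a lightweight PIR record could capture these sections in a structure like the one below; the field names and example entries are hypothetical rather than a prescribed template.

```python
from dataclasses import dataclass, field


@dataclass
class PostIncidentReview:
    """Hypothetical, minimal PIR record; adapt the sections to your own framework."""
    incident_id: str
    root_cause: str                                              # what led to the occurrence
    what_went_wrong: list[str] = field(default_factory=list)    # response shortcomings (blameless)
    follow_up_actions: list[str] = field(default_factory=list)  # changes to procedures, training, or systems


pir = PostIncidentReview(
    incident_id="INC-042",
    root_cause="Config change shipped without a rollback plan",
    what_went_wrong=["Status page was updated 40 minutes after detection"],
    follow_up_actions=["Add 'update status page' to the first-response checklist"],
)
print(pir)
```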
When training team members and discussing how decisions get made inside an incident, a question that often surfaces from soon-to-be Incident Commanders (ICs) and responders is, “How do I know the decision I’m making is the right one?”
Incident Commander: a command and coordination role responsible for gathering the right responders, making decisions, and listening to the advice and information provided by internal subject matter experts (SMEs).
The greatest challenge for a response team, especially the IC, is becoming comfortable being confident but not certain. Incidents are often time-sensitive, and time spent chasing 100% certainty can quickly become counter-productive to the overall goals of the response.
When conducting training activities, I’ve found it worthwhile to include a component on this, so that people practise making confident decisions with the information they have in the moment, while still making reasonable calls to dig deeper when it’s required.
As part of our tabletop scenarios, we produce this effect by using situations people might not be familiar with and by spreading the information out across the team. This not only ensures the team is actively collaborating, but also creates a left-of-centre environment where the IC is unlikely to be able to act as the SME, even if they have strong knowledge in a given area.
Incident response activities can quickly become at odds with a company’s culture or standard management practice. People who are used to leading will naturally gravitate towards leading within an incident response, so if those people aren’t the Incident Commander, it’s important to train teams on ways to address this.
Standard workflows can also quickly bog down response activities. Imagine you need to authorise someone to upgrade a service - under normal circumstances, this might require a change ticket, purchase approvals, and management approvals. Expediting this in a time-sensitive environment can be tricky unless people are aligned ahead of time that some allowances will need to be made during an incident, and that some steps might even be skipped altogether.
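As a sketch of the kind of pre-agreed shortcut that alignment enables, the toy helper below (the rules and role names are assumptions, not our actual change process) shows routine approval gates being bypassed while a major incident is active.

```python
def required_approvals(major_incident_active: bool) -> list[str]:
    """Return the approval steps needed before a change can go ahead.

    Hypothetical rules: while a major incident is declared, routine gates are
    skipped and only the Incident Commander's go-ahead is required.
    """
    if major_incident_active:
        return ["incident_commander_approval"]  # expedited path agreed ahead of time
    return ["change_ticket", "purchase_approval", "management_approval"]


# Normal operations vs. mid-incident
print(required_approvals(major_incident_active=False))
print(required_approvals(major_incident_active=True))
```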
Just like the Scouts, the key to effectively managing incidents is to be prepared. This doesn’t mean you need a 100-page response procedure on standby, but you do want to make sure your key responders are aware of your procedures and know where to find information about them.
During incidents, I still run with a checklist to sanity-check that I haven’t missed anything critical while we’re in-flight and that we’re moving towards resolution. Empowering your team with something simple (or yes, a 30-page response procedure) will help them navigate the stormy seas and keep their focus on driving the incident to resolution, rather than searching for the way as they go.
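To show how simple that prompt sheet can be, here’s an illustrative checklist an IC could tick through mid-incident; the items are examples of the kinds of prompts I mean, not a copy of our internal one.

```python
# Illustrative in-flight checklist for an Incident Commander; the items are
# examples only, not a copy of any internal procedure.
CHECKLIST = [
    "Severity assessed and the incident declared at the right level",
    "Right responders and SMEs pulled into the response channel",
    "Customer-facing communications sent, with the next update scheduled",
    "Clear owner assigned to the current mitigation action",
    "Timeline of key events being captured for the PIR",
]


def remaining_items(done: set[int]) -> list[str]:
    """Return the checklist items that haven't been ticked off yet."""
    return [item for i, item in enumerate(CHECKLIST) if i not in done]


print(remaining_items(done={0, 1}))
```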
------------
If you have undergone a similar journey, or have resources to share on this topic, feel free to comment below.