No technology platform is incident-free, not even those of software giants like Google and Amazon. What matters is having the procedures and processes in place to minimise the impact when incidents do occur.
Over the past three years, one aspect of my job has been to establish an incident management framework that Felix operates within during times of crisis. It has been a journey involving a lot of reading, many practical lessons, and constant evolution.
First, let’s start with some definitions.
An incident in the context of using a tech platform is:
An event that could lead to a loss of, or disruption to, processes, services, or functions.
At Felix we divide incidents into three primary types:
We also have various severity levels to further classify incidents.
It was a cosy afternoon a few years back when I first experienced a service incident that impacted our platform: an hour-long outage caused by a hiccup during a release. At the time, it was manageable for our Head of Product to communicate individually with our customers.
As we reflected on how we responded, and what steps we’d need to take to respond at a grander scale in the future, we recognised the need for a better documented and standardised approach.
Fast forward three years and incident management is a standard part of the day-to-day. We have a dedicated internal site for Incident Management and a cross-functional team responsible for responding to major incidents. I’ve written this article to share some of the lessons I’ve learned over the years and, hopefully, to provide a starting point for others going through the same process.
The first draft of our incident management framework was closer to a communications procedure than a response procedure, and nowhere near an overall framework. Looking back, for the size of company we were at the time, this was ultimately the right starting point, and as we built out the fuller framework over the past several years, it gave us the flexibility to keep what was working and remove what wasn’t.
Don’t feel pressured to have a sprawling set of resources, procedures, or a framework on day one.
Start with what feels right and build from there. Things to consider include communications, responsibilities, and how the response is coordinated (e.g. on Slack, Teams, etc.). Suggestions for a starting point:
Over the years, our DevOps team and I have consumed information from many sources (PagerDuty, government crisis response frameworks, OpsGenie, other experts in the space) to improve our procedures, how we train team members, and how we administer the framework. This has greatly supplemented our own experience and, by building on the wisdom of others, has led us toward better ways of responding to incidents.
Some resources I’ve used and would recommend:
There’s no better way to improve your framework than by using it and reflecting on how things went. Even if you’re starting out with a simple checklist, make sure to run post-incident reviews (PIRs) and collate a brief post-mortem after as many incidents as possible.
Most of the learnings we take from our incidents, and the actions we take to improve how we respond in the future, come down to three key sections we review within our PIR:
Identifying the root cause and what led to the occurrence will help you clamp down on any areas of day-to-day operations that might not be quite up to scratch.
In addition to this, we review a “what went wrong” section where we talk about things that didn’t go so well as part of the response.
Like all aspects of the PIR/post-mortem, this is blameless and is an exercise geared towards finding changes that can be made to procedures, training, or wider systems to improve the next response.
While training team members and discussing decision-making inside an incident, one question often surfaces from would-be Incident Commanders (ICs) and responders: “How do I know the decision I’m making is the right one?”
Incident Commander: a command and coordination role responsible for gathering the right responders, making decisions, and listening to the advice and information provided by internal subject matter experts (SMEs).
The greatest challenge for a response team, especially the IC, is becoming comfortable with being confident but not certain. Incidents are often time-sensitive, and the time taken to be 100% certain can quickly become counter-productive to the overall goals of the response.
When conducting training activities, I’ve found it worthwhile to include a component on this, to ensure people make confident decisions with the information they have in the moment, while still making reasonable calls to dig deeper when it’s required.
As part of our tabletop scenarios, we replicate this environment by using situations people might not be familiar with, and by spreading the information out across the team. This not only ensures the team is actively collaborating, but also creates a left-of-centre environment where the IC is unlikely to be able to act as the SME, even if they have strong knowledge in a given area.
Incident response activities can quickly come into conflict with a company’s culture or standard management practice. People who are used to leading will naturally gravitate towards leading within an incident response, so if those people aren’t the Incident Commander, it’s important to train teams on ways to address this.
Standard workflows can also quickly bog down response activities. Imagine you need to authorise someone to upgrade a service: under normal circumstances, this might require a change ticket, purchase approvals, and management approvals. Expediting this in a time-sensitive environment can be tricky unless people are aligned ahead of time that some allowances will need to be made during an incident, and that some steps might even be skipped altogether.
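To make that pre-agreement concrete, here’s a minimal sketch of what an “incident fast path” for approvals could look like if you chose to write it down as code. The severity scale, approval names, and the `required_approvals` function are all invented for illustration; they’re not our actual policy.

```python
# Illustrative only: a hypothetical pre-agreed "fast path" for changes made
# during a declared incident. Names and thresholds are invented examples.
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    during_declared_incident: bool
    incident_severity: int  # assumed scale where 1 is the most severe


def required_approvals(change: Change) -> list[str]:
    """Return the approvals a change still needs before it can proceed."""
    if change.during_declared_incident and change.incident_severity <= 2:
        # Pre-agreed incident path: the Incident Commander's call stands in
        # for the usual paperwork, which is raised retrospectively.
        return ["Incident Commander sign-off (recorded in the incident channel)"]
    # Normal path outside of an incident.
    return ["change ticket", "purchase approval", "management approval"]


# Example: a severity-1 incident skips straight to IC sign-off.
print(required_approvals(Change("Upgrade search service", True, 1)))
```

The point isn’t the code itself; it’s that the decision about what can be skipped is made calmly ahead of time, not argued out mid-incident.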
Just like Scouts, the key factor in effectively managing incidents is to be prepared. This doesn’t mean you need to have a 100-page response procedure on stand-by, but you want to make sure that your key responders are aware of your procedures and where to find information about them.
During incidents, I still run with a checklist to sanity-check that I haven’t missed anything critical while we’re in flight and that we’re moving towards resolution. Empowering your team with something simple (or yes, a 30-page response procedure) will help them navigate the stormy seas and keep their focus on driving the incident to resolution, rather than searching for the road along the way.
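As an illustration of how lightweight that artefact can be, below is a rough sketch of an in-flight checklist kept as code. The step names are generic examples made up for this article, not our actual procedure.

```python
# Illustrative only: a lightweight in-flight checklist an Incident Commander
# might run through. Step names are generic examples, not an actual procedure.
IN_FLIGHT_CHECKLIST = [
    "Incident declared and severity assigned",
    "Incident Commander and responders identified",
    "Internal stakeholders notified",
    "Customer communications sent (if required)",
    "Workaround or fix in progress",
    "Resolution confirmed and being monitored",
    "Post-incident review scheduled",
]


def outstanding(completed: set[str]) -> list[str]:
    """Return the checklist items that haven't been marked complete yet."""
    return [step for step in IN_FLIGHT_CHECKLIST if step not in completed]


if __name__ == "__main__":
    done = {"Incident declared and severity assigned"}
    for step in outstanding(done):
        print(f"[ ] {step}")
```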
------------
If you have undergone a similar journey, or have resources to share on this topic, feel free to comment below.