Recently I attended the PagerDuty Incident Responder Masterclass as part of the PagerDuty Summit 2021.
Before this course, I already had some experience participating in incident war rooms as part of my day job, and incident response is a topic commonly discussed in fields such as cybersecurity. So I started the PagerDuty Incident Responder course assuming that I already knew most of the material.
However, when I got halfway through the course, I started to diligently take notes, because I realized there were many gems and insights here that should not be overlooked. The course is really that good! It covers a lot of ground from history and definitions of the framework, up to an introduction of cognitive bias, but my favorite parts are the best practices and the advice given. Those are very practical and sound.
The course description mentioned that the course is suitable for technical team leads, managers, directors, responders, and anyone seeking to better their incident response process. I agree that everyone on the team should take this course as a prerequisite before joining an incident call.
They seem to have the stats to back this up:
I'm sharing here my thoughts and key takeaways about the course. Thank you to Hayley Neal and Camden Louie from the PagerDuty University for delivering this workshop.
Introduction of Incident Response
- Incident: any unplanned disruption or event that is affecting customers' ability to use the product.
- Goal: to handle the situation in a way that limits damage and reduces recovery time and costs.
- Replace chaos with calm.
- Incident Response: an organized approach to addressing and managing an incident.
Incident Command System (ICS):
Major incident: requires a coordinated response between multiple teams.
- Timing is a surprise, typically little to no warning
- Time matters, need to respond quickly
- Situation is rarely perfectly understood at the start
- Requires coordination and mobilization, often cross-functional
- Anyone can trigger the Incident Response Process at any time.
The difference in emergency operations:
- Hierarchy, clear order, work fast
- Resolution team has the highest authority
- Team works together to resolve incident, document what happened, and keep stakeholders updated.
- Emergency mode until the incident is officially resolved.
Response Team Goals:
- Work to resolve the incident quickly and efficiently
- Document decisions and follow up items
- Keep stakeholders informed
- Stay in wartime until the incident is officially resolved
To accomplish this goal:
- Mobilize and inform only the right people at the right time
- Use systematic learning and improvement
- Work towards total automation.
Role of an Incident Commander:
- Single source of reference
- Is not a resolver; the job is to coordinate and delegate
- Gain consensus
- Make a decision
Size-up -> Stabilize -> Update -> Verify
- Becomes the highest authority (even higher than the CEO). Can only do that if you know what is being done and why.
- Work to resolve the incident quickly by building consensus. Practice the "two ears, one mouth" rule: listen twice as much as you speak.
- Don't panic. Breakdown in communication can hamper the entire process. The role of an IC is to try to make the line of communication clear and maintain discipline.
- Clear is better than concise. Avoid acronyms.
Start by introducing yourself.
"Hello, this is Camden. I'm the Incident Commander."
If there is no IC on the call and you are trained as an IC, you become the IC when you join. The on-call incident commander does not take over automatically when they join the call; you remain the IC until you perform a handover.
> "Is there an IC on the call?"
> "Hearing nothing. My name is Camden, I'm the incident commander."
Incident cycle is to stabilize the situation:
- Ask for status.
- Decide action, gain consensus.
- Assign task.
- Follow up on task completion.
"What actions can we take?" (ask the SMEs what they want to do)
"What are the risks involved?" (understand the impact, that may change the decision)
Make a decision:
- Avoid decision paralysis. It prolongs the incident much further.
- Making a wrong decision is better than making no decision.
- Flip a coin if you need to. You learn nothing new and make no progress otherwise. At least a wrong decision will give you new information.
- Rely on team's expertise to try out a solution for that incident.
- Distributed consensus is hard, so phrase the question differently to gain consensus implicitly: ask for objections rather than agreement.
> "I propose xxx. Are there any strong objections?"
> "Hearing none. Let's proceed."
- Avoid "bystander effect". Point to somebody.
"Hayley, I'd like you to investigate the increased latency, try to find the cause. I'll come back to you in 5 minutes. Understood?"
- Assign tasks to a specific person. It's fine to assign to a role (e.g. the on-call DBA) as long as the role points to a single individual. Don't assign to a group.
- Time-box all tasks to set expectations.
- Get acknowledgements.
"Hayley, it's been 5 minutes. Do you have any information on the latency issue?"
What if they need more time?
- Ask the experts how long they think they need.
> "How much time do you need?"
< "20 minutes should be enough"
> "OK, I'll come back to you in 20."
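The assign, time-box, and follow-up pattern above can be sketched as a small data structure. This is only an illustration, not something shown in the course: the `Task` class, its field names, and the example date are all assumptions of mine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    """A task assigned to exactly one person with an agreed time-box (hypothetical model)."""
    owner: str                  # a single individual, never a group
    description: str
    assigned_at: datetime
    timebox: timedelta
    acknowledged: bool = False  # set once the owner says "Understood"

    @property
    def follow_up_at(self) -> datetime:
        # The IC comes back to the owner when the time-box expires.
        return self.assigned_at + self.timebox

    def is_due(self, now: datetime) -> bool:
        return now >= self.follow_up_at

    def extend(self, now: datetime, extra: timedelta) -> None:
        # The expert asked for more time: restart the clock from now.
        self.assigned_at = now
        self.timebox = extra


# Example date and time are made up for illustration.
start = datetime(2021, 6, 22, 10, 0)
task = Task("Hayley", "Investigate the increased latency", start, timedelta(minutes=5))
task.acknowledged = True
```

Calling `task.extend(now, timedelta(minutes=20))` mirrors the "I'll come back to you in 20" exchange: the follow-up time moves to 20 minutes after the moment the extension was agreed.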
Map incident cycle with corresponding phrase:
- What's wrong?
- What action can we take? What are the risks? Are there any strong objections?
- Clear ownership with a timebox.
- What's the status?
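For anyone who wants the cycle as a cheat sheet, here is a minimal sketch that pairs each step with its phrase and loops until the incident is stabilized. The names `INCIDENT_CYCLE` and `prompts` are my own, and the phrase for the "assign task" step is a template I derived from the example quotes in these notes.

```python
from itertools import cycle, islice

# (step, phrase the IC can use), in the order the cycle runs.
INCIDENT_CYCLE = [
    ("Ask for status", "What's wrong?"),
    ("Decide action, gain consensus",
     "What actions can we take? What are the risks? Are there any strong objections?"),
    ("Assign task",
     "<name>, I'd like you to <task>. I'll come back to you in <N> minutes. Understood?"),
    ("Follow up on task completion", "What's the status?"),
]


def prompts(rounds: int) -> list:
    """Return the (step, phrase) pairs for the given number of trips around the cycle."""
    return list(islice(cycle(INCIDENT_CYCLE), rounds * len(INCIDENT_CYCLE)))


for step, phrase in prompts(1):
    print(f"{step}: {phrase}")
```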
Deputy role (IC's right hand):
- Keeps the IC focused.
- Takes on any and all additional tasks as necessary (timekeeping, paging people etc)
- Serves to follow up on reminders and ensure tasks aren't missed.
- Acts as a "hot standby" for the IC.
Scribe role (record keeper):
- Documents the incident timeline and important events as they occur.
- The incident log will be used during the post-mortem.
- Notes important actions as they are taken, along with follow-up items and status updates.
- Anyone can be a scribe. For small incidents, typically a deputy can also be a scribe.
Communications Liaison role:
- Can be all-in-one, or separate for external and internal stakeholders
- Notifies customers of current conditions.
- Informs the IC of relevant feedback from customers as incident progresses.
- Crafts status updates and notifications in appropriate language (e.g. uses words like "service disruption" instead of "outage").
- Typically a member of the Support team.
- Don't ask for too frequent status updates.
- Remind people that writing an update takes away time from solving the incident.
- Recommend updates no more frequently than every 20-30 minutes, and at the start of a phase.
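The update cadence can be written down as a simple check. This is a rough sketch under my own assumptions: I picked the lower end of the 20-30 minute range as the minimum interval, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: use the lower end of the 20-30 minute guidance as the minimum gap.
MIN_UPDATE_INTERVAL = timedelta(minutes=20)


def update_is_due(last_update: Optional[datetime], now: datetime,
                  phase_started: bool = False) -> bool:
    """Return True when it is reasonable to ask for (or send) a status update."""
    if last_update is None or phase_started:
        # No update sent yet, or a new phase has just begun.
        return True
    return now - last_update >= MIN_UPDATE_INTERVAL
```

Anything that returns False here is a request that takes time away from solving the incident.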
Minimum bases to cover:
- Make sure the chain of command is clear: every role reports to the IC. Only one leader directs the show.
- Incident Command System (ICS) is a framework and can be scaled up or scaled down based on your organization.
Being Prepared for Incidents
The 4 steps of an incident:
- Triage (assess)
- Mobilize (the right people)
- Resolve (work towards a common goal)
- Prevent (have the right teams and roles engaged)
How do I prepare to manage incident response teams?
- Ensure explicit processes and expectations exist.
- Set up runbooks and automated actions.
- Find ways to create more space for your teams to work.
- Make checklists (read "The Checklist Manifesto"). They reduce the strain of remembering things in a time of panic.
- Practice running major incidents as a team (do regular Failure Fridays so IC can always be calm and collected)
Incident response pitfalls (anti-patterns)
Executive hostile takeover:
< "Ignore the IC, do what I say!"
> "Do you wish to take command?"
> "We understand your concerns. We are working to resolve the incident quickly. Your instructions are slowing down the response. Please hold your comments for discussion after the incident has been resolved."
Motivate people to solve things faster:
< "Let's try and resolve this in 10 minutes please!"
> "We're in the middle of an incident, please keep your comments until the end."
Requesting time-consuming information:
< "Can I get a spreadsheet of all affected customers?"
> "This will take time away from the incident. We need that time to solve the problem; after that we can look at the list."
> "We can either get you that list, or fix the incident. Not both. The incident takes priority."
Arguing about severity:
- If you cannot decide on the severity, always assume it's a high severity and keep moving.
- Even if it turns out to be a SEV-4, it doesn't matter. That's something to discuss in the postmortem.
< "Is this really a SEV-1?"
> "We do not discuss incident severity during the call. We're treating this as a SEV-1."
Failure to notify stakeholders
Anti-pattern: Getting everyone on the call. -> Get the right people at the right time
Anti-pattern: Forcing everyone to stay on the call. -> If you don't need this person anymore, let them go.
Being overly focused on an issue. -> Keep the bigger picture in mind (as an SME and IC)
Requiring deeply technical Incident Commanders -> IC can be team agnostic. IC is the person expert in coordinating the response, not actually solving technical issues. That's what SMEs are for.
The Belligerent Responder (big ego)
"Hey, you're being obstructive to the team on the call. If you continue, I will have to remove you."
Handoffs are encouraged
- Responders are human. Encourage taking responders off after 90 mins or so.
- More importantly, ICs need to have handoffs. Replace with deputy, rotate a new deputy in. Keep the cycle going.
- This is the reason it's important to have as many trained ICs as we possibly can. Make sure everybody stays fresh, rested and ready to respond.
> "Everyone on the call, be advised I'm handing over command to Tatiana."
< "This is Tatiana, I'm now the Incident Commander."
PagerDuty Ops Guides:
Follow Up and Postmortems
What went wrong, and how do we learn from it?
Institutionalize the culture of continuous improvement.
Completing a postmortem should be prioritized over planned work.
- 3 business days for SEV-1
- 5 business days for SEV-2
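To make the deadlines concrete, here is a small sketch that turns them into a due date. It rests on assumptions the course did not spell out: I count business days starting the day after the incident is resolved, I treat only Saturday and Sunday as non-business days (no public holidays), and the function names are my own.

```python
from datetime import date, timedelta

# Deadlines from the notes above, in business days.
POSTMORTEM_BUSINESS_DAYS = {"SEV-1": 3, "SEV-2": 5}


def add_business_days(start: date, days: int) -> date:
    """Step forward one day at a time, counting only Monday to Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def postmortem_due(resolved_on: date, severity: str) -> date:
    """Due date for the postmortem (assumption: the clock starts at resolution)."""
    return add_business_days(resolved_on, POSTMORTEM_BUSINESS_DAYS[severity])


# A SEV-1 resolved on Thursday 24 June 2021 is due on Tuesday 29 June 2021.
print(postmortem_due(date(2021, 6, 24), "SEV-1"))
```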
IC will select and directly notify one responder to own completing the postmortem.
Postmortem owner is not the only person responsible for completing the postmortem itself. It is a collaborative effort and should include everyone involved in the incident response.
Postmortems are not a punishment. Effective postmortems are blameless.
We don't call postmortems RCAs, because in a complex system there are multiple root causes that lead to a failure.
Owner is the accountable individual who performs the administrative tasks, follows up the information needed to drive it home. Writing it is a collaborative effort, but the single owner is the person orchestrating the entire effort.
Finger-pointing, which belongs to the old view of human error, increases the time to acknowledge an incident, increases MTTR, and exacerbates the impact of the incident.
Becoming aware of our biases, we can identify when they occur and work to move past them.
Fundamental attribution error.
Tendency to believe that what people do reflects their character rather than their circumstances.
To combat: Intentionally focus the analysis on the situational causes rather than discrete actions that people took.
Confirmation bias.
Tendency to favor information that reinforces our existing beliefs.
When presented with ambiguous information, the human mind often interprets it in a way that supports its existing assumptions.
To combat: Appoint someone to play the devil's advocate. Their job is to take a contrarian viewpoint during the investigation. Be cautious of introducing negativity or combativeness with that devil's advocate.
Alternatively, invite someone from another team to ask any and all questions that come to mind. This helps to surface the things the team takes for granted.
Hindsight bias.
A memory distortion in how we recall events to form a judgement.
If we know the outcome, it's easy to see the event as having been predictable, even though there was little to no objective basis for predicting it.
People often recall events in a way that makes themselves look better, believing they knew what was going to happen as the event was unfolding. Acting on this bias can lead to defensiveness in the team.
To combat: Explain events in terms of foresight. Work the timeline forward instead of starting from the resolution and working backwards.
Negativity bias.
The notion that things of a negative nature have a greater effect on one's mental state than those of a neutral or positive nature.
Research on social judgement shows that negative information disproportionately impacts a person's impression of others. We tend to focus on and magnify negative events, and this can lead to demoralization, burnout, and chaos.
How to avoid blame:
- Ask "what" and "how" questions rather than "who" or "why"
- Ask why a reasonable, rational, and decent person may have taken a particular action.
- Consider multiple and diverse perspectives.
- Abstract to a nonspecific responder; anyone could have made the same mistake.
Introducing postmortem practices:
As leaders, you must help introduce blameless postmortems.
- Culture change is hard. Change does not have to be driven by management; bottom-up changes are often more successful than top-down mandates.
- Get buy-in from individual contributors first, then take the proposal up to your leadership team.
- Need commitment from leadership that no individual will be reprimanded after an incident.
- Sell the business value of blamelessness. Encourage collaborative learning.
Get buy-in from individual contributors, because the tendency to blame is not unique to managers; it can happen between teammates as well.
- Explain why blame is harmful to trust and collaboration.
- Agree to work together to become blame-aware, and hold each other accountable by kindly calling it out when blame is observed.
Acknowledge that practicing blamelessness is difficult for everyone.
- Avoid blaming the management for blaming others. Ask leadership whether they would be receptive to feedback if and when they accidentally suggest blame after an incident.
- Psychological safety is a sense of confidence that the team will not embarrass, reject, or punish someone for speaking up.
- People need to feel safe talking about failure before they will speak up about incidents.
- This becomes a key driver of high performing software delivery teams.
How do I even change culture?
- The #1 thing you can do for your teams is to build a culture of psychological safety with blameless postmortems.
The Postmortem Report best practices around the process
- Schedule the postmortem meeting for 30 minutes to 1 hour, depending on the complexity of the incident.
Create a timeline:
- Present only facts.
- Report any changes in the status and impact of the incident and any key actions taken by responders.
- For each item in the timeline, identify metrics that illustrate the point clearly, based on facts rather than opinion.
- Link to monitoring graphs or tweets in some cases. Anything showing a data point that you want to illustrate could be added to the timeline.
Document the impact:
Described from a few perspectives:
- How long was the impact visible? This is the length of time users, customers, or partners were affected. Often they were impacted before the incident was triggered.
- How many customers were affected, and what percentage of all customers is that? Support may need a list of the affected customers so they can reach out individually.
- How many customers wrote to or called support about the incident?
- What functionality was impacted and how severely impacted?
- Quantify impact with business metrics specific to your product.
Analyze the incident:
- An individual's action should never be considered a root cause, especially not a name.
- Check the data. Ask why the system was designed in a way that made this possible, and why that design decision seemed to be the best one at the time. Answering these questions will help uncover the contributing factors.
- Convert all to-do lists into tickets, but these do not need to be completed before the postmortem meeting.
All action items should be actionable, specific and bounded.
- Actionable: each action item is a sentence that should start with a verb.
- Specific: the action should result in a useful outcome.
- Bounded: it should be possible to tell when the item is actually finished, as opposed to continually ongoing.
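The three criteria can be turned into a quick lint for action items before they become tickets. This is a rough sketch, not a tool from the course: the `ActionItem` fields, the `problems` function, and the short verb list are all hypothetical, and a real check for "starts with a verb" would need a longer list or human review.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical starter list; extend it with the verbs your team actually uses.
ACTION_VERBS = {"add", "create", "document", "fix", "investigate",
                "remove", "update", "write"}


@dataclass
class ActionItem:
    text: str       # the action itself, written as a sentence
    outcome: str    # the useful result the item should produce
    done_when: str  # the condition that marks the item as finished


def problems(item: ActionItem) -> List[str]:
    """Return the criteria an action item fails; an empty list means it passes."""
    issues = []
    words = item.text.split()
    if not words or words[0].lower() not in ACTION_VERBS:
        issues.append("not actionable: start the sentence with a verb")
    if not item.outcome.strip():
        issues.append("not specific: state the useful outcome")
    if not item.done_when.strip():
        issues.append("not bounded: state when it is finished")
    return issues


good = ActionItem(
    text="Add an alert for connection pool exhaustion",
    outcome="On-call is paged before customers see errors",
    done_when="Alert is merged and has fired once in a test",
)
vague = ActionItem(text="Monitoring improvements", outcome="", done_when="")
```

Here `problems(good)` comes back empty, while `problems(vague)` flags all three criteria.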
Write external messaging:
- A summarized and sanitized version of the information used for the internal postmortem.
- An essential outcome is buy-in for the action plan.
- Opportunity to discuss proposed action items, brainstorm other options, and gain consensus among the team and leadership.
- Discuss what will and will not be done and the explicit implications of the choices.
- Written postmortem is intended to be shared widely within the organization, but the primary audience of the postmortem meeting is the team directly involved with the incident.
- The meeting gives the chance to the team to align with what happened, what to do about it and how to communicate the incident to internal and external stakeholders.
Develop good facilitators.
- Encourage participants to speak up and keep the discussion on track.
- Helpful to designate a facilitator who is not also trying to participate in the discussion.
- Adopt practices that promote sharing. People want to share their successes and they want to replicate that success.
- May seem counterintuitive to share incident reports because it seems that you're sharing failure rather than success.
- The truth is, practicing blameless postmortem leads to success because it enables teams to learn from failure, and improve systems to reduce the prevalence of failure.
- Being transparent about system failure reinforces a culture of blamelessness.
- Engage leaders that prioritize work.
- Email completed postmortems to all teams involved in incident response.
- Schedule postmortem meetings on a shared calendar, and make anyone welcome to join.
- Create a community of experienced postmortem writers to review drafts and spread good practices.
- Clarify policy and ownership of postmortem action items.
- Start small. Get feedback and continue to grow from there.
- Practice makes perfect. Find opportunities to practice: organize Failure Fridays in a non-production environment and conduct tabletop discussions.