An incident post-mortem is a process that enables an incident response team to learn from past downtime, outages and other incidents. During a post-mortem, the team determines what happened during an incident, identifies what was done right and what can be improved, learns from its mistakes and adjusts its approach accordingly.
A post-mortem generally involves the following steps:
- Review the Incident from All Angles
Set up a meeting to discuss the incident. That way, the incident response team can determine exactly what happened during the incident and brainstorm solutions to prevent recurring problems.
Encourage incident response team members to bring their incident notes to the post-mortem meeting. The notes help team members contribute pertinent incident information during the meeting. Perhaps most important, team members can work together to help the organization decide the best course of action to prevent future incidents.
- Ensure Team Members Know the Post-Mortem Is ‘Blameless’
Let’s face it – no one wants to be responsible for an incident. Yet it is important to remember that all humans are prone to error. And no matter how hard a person tries, he or she sometimes makes mistakes.
There is no reason to play the blame game during an incident post-mortem. In fact, every post-mortem, regardless of the incident’s size or severity, should be blameless. As an incident response team reviews an incident, team members should work together to analyze what happened and find solutions rather than assign fault.
Throughout the post-mortem, focus on what happened during the incident and the facts related to it, not on who was involved. With this approach, an incident response team can move past any mistakes and identify the best ways to resolve incidents both now and in the future.
- Create a Post-Mortem Report
A post-mortem report empowers an incident response team to review its efforts and drive meaningful improvements. It should focus on the following areas:
- Incident Details: Explains what happened, the number of customers and services impacted by an incident and other incident information.
- Root Cause Analysis: Details the initial source of failure.
- Incident Actions: Describes the steps taken to diagnose, analyze and resolve an incident.
- Timeline: Includes information about significant events throughout the incident response cycle.
- Key Takeaways and Next Steps: Outlines what worked and what did not work during an incident, along with next steps to ensure the same problems do not occur in the future.
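The five report areas above can be captured in a simple data structure so that every post-mortem has the same shape. The following is a minimal Python sketch, not a prescribed format: the class name `PostMortemReport`, its field names and the `missing_sections` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    at: datetime
    description: str


@dataclass
class PostMortemReport:
    # Incident Details
    summary: str
    customers_impacted: int
    services_impacted: List[str]
    # Root Cause Analysis
    root_cause: str = ""
    # Incident Actions
    actions_taken: List[str] = field(default_factory=list)
    # Timeline
    timeline: List[TimelineEvent] = field(default_factory=list)
    # Key Takeaways and Next Steps
    what_worked: List[str] = field(default_factory=list)
    what_did_not_work: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of report areas that are still empty."""
        takeaways_done = bool(
            (self.what_worked or self.what_did_not_work) and self.next_steps
        )
        checks = {
            "Incident Details": bool(self.summary.strip()),
            "Root Cause Analysis": bool(self.root_cause.strip()),
            "Incident Actions": bool(self.actions_taken),
            "Timeline": bool(self.timeline),
            "Key Takeaways and Next Steps": takeaways_done,
        }
        return [name for name, done in checks.items() if not done]


# Hypothetical draft report with only the incident details filled in.
report = PostMortemReport(
    summary="Checkout API returned errors",
    customers_impacted=120,
    services_impacted=["checkout-api"],
)
print(report.missing_sections())
```

A helper like `missing_sections` lets a team see at a glance which parts of a draft still need attention before the report is circulated.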
Crafting a post-mortem report may seem time-consuming and resource-intensive. Thankfully, alert tracking and monitoring systems are available that offer advanced analytics and reporting to help incident response teams quickly generate actionable insights.
An alert monitoring system with advanced analytics and reporting enables an incident response team to collect data over the course of an incident. Team members can then draw on that data to craft post-mortem reports quickly and gain the insights they need to find ways to improve incident response.
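To illustrate how data collected during an incident can feed a report, the sketch below turns a handful of alert records into a chronological timeline and derives two common figures: time to acknowledge and time to resolve. The record layout (`at`, `kind`, `note`) and the function names are assumptions made for this example; a real monitoring tool will export its own format.

```python
from datetime import datetime

# Hypothetical alert records, deliberately out of order as an export might be.
events = [
    {"at": "2024-01-01T10:12:00", "kind": "acknowledged", "note": "On-call engineer paged"},
    {"at": "2024-01-01T10:00:00", "kind": "triggered", "note": "Error rate above threshold"},
    {"at": "2024-01-01T10:47:00", "kind": "resolved", "note": "Rollback completed"},
]


def build_timeline(records):
    """Parse ISO 8601 timestamps and sort the records chronologically."""
    parsed = [{**r, "at": datetime.fromisoformat(r["at"])} for r in records]
    return sorted(parsed, key=lambda r: r["at"])


def minutes_between(timeline, start_kind, end_kind):
    """Minutes from the first start_kind event to the first end_kind event."""
    start = next(r["at"] for r in timeline if r["kind"] == start_kind)
    end = next(r["at"] for r in timeline if r["kind"] == end_kind)
    return (end - start).total_seconds() / 60


timeline = build_timeline(events)
time_to_ack = minutes_between(timeline, "triggered", "acknowledged")
time_to_resolve = minutes_between(timeline, "triggered", "resolved")

for entry in timeline:
    print(entry["at"].isoformat(), entry["kind"], "-", entry["note"])
print("Minutes to acknowledge:", time_to_ack)
print("Minutes to resolve:", time_to_resolve)
```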
- Identify Preventative Measures
In all likelihood, an incident generates a large amount of data that an incident response team can evaluate. After a comprehensive data assessment, incident responders can determine what can be done to prevent future incidents.
No one-size-fits-all solution works for every incident. But with a consistent incident management process in place, incident response teams can follow steps to resolve incidents as quickly as possible.
Additionally, incident response teams can use an incident alerting system to retrieve and analyze data. This system enables incident responders to simultaneously obtain incident data and speed up incident response.
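One simple form of this analysis is to look across past incidents for causes that keep coming back, since those are the strongest candidates for preventative work. The sketch below assumes a hypothetical list of incident records with `root_cause` and `minutes_down` fields; the sample data and function names are illustrative only.

```python
from collections import Counter

# Hypothetical records, as they might be pulled from past post-mortem reports.
past_incidents = [
    {"id": "INC-1", "root_cause": "expired certificate", "minutes_down": 30},
    {"id": "INC-2", "root_cause": "bad deploy", "minutes_down": 45},
    {"id": "INC-3", "root_cause": "expired certificate", "minutes_down": 20},
    {"id": "INC-4", "root_cause": "disk full", "minutes_down": 10},
]


def recurring_causes(incidents, min_count=2):
    """Return (cause, count) pairs for causes seen at least min_count times."""
    counts = Counter(i["root_cause"] for i in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


def downtime_by_cause(incidents):
    """Total minutes of downtime attributed to each root cause."""
    totals = Counter()
    for incident in incidents:
        totals[incident["root_cause"]] += incident["minutes_down"]
    return dict(totals)


print(recurring_causes(past_incidents))
print(downtime_by_cause(past_incidents))
```

Ranking by frequency and by total downtime can point to different priorities, so it may be worth reviewing both before choosing preventative measures.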
- Develop Incident Response Best Practices
Incident response best practices should be developed and shared across an organization. These practices can help incident responders reduce downtime and stop outages. Also, they can help incident response teams limit the impact of incidents.
Best practices for incident resolution should be integrated into an organization’s operations as well. According to ITIL, best practices for incident resolution include:
- Initial Diagnosis: Involves the first assessment of an incident once it has been identified.
- Escalation: Describes steps to escalate an incident to the appropriate parties.
- Investigation and Diagnosis: Defines how support staff responds to an incident; additional diagnosis also may be performed at this stage.
- Resolution and Recovery: Outlines the process used to notify stakeholders that an incident has been resolved and services have been restored to levels defined by a service-level agreement (SLA).
- Closure: Explains the process used to close an incident.
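The stages listed above can be modeled as an ordered sequence that an incident record moves through. The following Python sketch is a simplification under stated assumptions: it covers only the five stages named in this list rather than the full ITIL process, it treats the stages as strictly linear (in practice, escalation is not always required), and the `Stage` and `Incident` names are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    # Order mirrors the list above; values define the allowed progression.
    DIAGNOSIS = 1
    ESCALATION = 2
    INVESTIGATION_AND_DIAGNOSIS = 3
    RESOLUTION_AND_RECOVERY = 4
    CLOSURE = 5


class Incident:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.stage = Stage.DIAGNOSIS
        self.history = [Stage.DIAGNOSIS]

    def advance(self):
        """Move to the next stage; a closed incident cannot advance."""
        if self.stage is Stage.CLOSURE:
            raise ValueError("incident is already closed")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage


# Walk a hypothetical incident through every stage.
incident = Incident("INC-42")
while incident.stage is not Stage.CLOSURE:
    incident.advance()
print([stage.name for stage in incident.history])
```

Recording the stage history alongside timestamps would also give the post-mortem timeline a ready-made skeleton.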
Incidents should not be viewed as failures within an organization. Instead, each incident presents a valuable learning opportunity, particularly for an incident response team that wants to do everything possible to contribute to an organization’s success.
Thanks to an incident post-mortem, an incident response team can analyze an incident and move forward. A post-mortem enables incident responders to use data to review their failures and successes. Incident responders can then gain insights that help them turn weaknesses into strengths.
Incident response teams that are searching for ways to improve should not forget about incident post-mortems. By incorporating post-mortems into incident response, team members can explore innovative ways to drive ongoing improvements.