This guide provides best practices and practical guidelines for the management of network operations and information security incidents.
True Cost of an Incident
Downtime is a serious problem for organizations of all sizes and across all industries.
A recent survey conducted by software company Information Technology Intelligence Consulting (ITIC) found that one hour of downtime costs:
And it’s getting even worse…
The average cost of a single hour of unplanned downtime has increased from 25 percent in 2008 to 30 percent last year.
You get the idea. Outages are expensive.
When an incident affects public-facing core infrastructure (say an e-commerce platform), then you’re looking at EVEN MORE damage:
- Frustrated customers
- Poor reviews
- Social media venting
- Tons of new support tickets
- And negative publicity
Finding a resolution to these issues is important (but complicated).
Most enterprise systems are supported by internal employees working alongside external vendors, with support staff strategically located around the world to provide 24/7, year-round coverage. So, when it comes to solving an incident with complex layers, you need flexible and reliable systems and workflows in place to mobilize key decision makers and stakeholders.
Problem vs. Incident
Problem Management is not the same as Incident Management.
Incident management is often mistaken for problem management.
Problem management emphasizes finding a permanent solution to an issue. It’s a reactive process to fix errors or weaknesses in the system. The goal of problem management is to ensure an incident does not recur, and if it does, to limit its impact.
An incident refers to an unplanned or undesired event (caused by a problem) that interrupts how you do business. Incidents range from fairly minor to catastrophic and may result in brand reputation damage, downtime or even loss of revenue.
Incident management emphasizes agility and speed to resolve incidents as quickly as possible.
(Remember that 15-minute, $84,000 tradeoff from earlier?)
The incident management team creates a workaround solution to overcome the issue as quickly as possible so the organization can continue operating normally…
…while the problem management team works on a permanent fix for the weak point in the network.
What is Incident Management?
Incident management consists of all activities to identify, analyze and address technical disruptions to your business. Other common terms for this (depending on severity) are Major Incident Management and Critical Incident Management.
Organizations deploy incident management teams to provide immediate solutions to incidents that may result in the complete loss or disruption of your organization’s ability to conduct business.
Effective incident management involves the following best practices:
Planning
By planning ahead, your organization can identify potential problems before they lead to incidents. It may seem impossible to prepare for every possible incident, but companies that focus on industry-specific dangers can identify potential problems before they happen.
For example, your organization can’t necessarily stop cybercriminals from launching attacks, but you can plan ahead for potential dangers. By doing so, you can speed up incident response and resolution times while minimizing the risk of revenue losses and other collateral damage.
(Note: Our incident management software empowers businesses to take a proactive approach to major incidents. We also offer a mobile app to help your incident management teams identify and respond to major incidents from any location, on any device – at any time.)
Communication & Collaboration
Communication fosters collaboration between your key stakeholders and helps them address incidents (in real-time). Major incidents can quickly spiral out of control due to poor communication and collaboration, so having everyone on the same page is key.
(Another note: Our tool also provides live inbound call routing that ensures incoming calls are routed to the right incident management team members and stakeholders – without exception. Plus, we foster real-time collaboration via HipChat and Slack, enabling incident management teams to address major incidents in real-time.)
Reporting
Reporting processes and protocols empower your teams to learn from past incidents and reduce incident management costs, while driving improved incident management analysis, evaluation and decision-making.
How your company reports on incidents may play a key role in your long-term incident management strategy, since businesses that use advanced reports can learn from past incidents to avoid making the same mistake twice.
(Last note: Our enterprise reporting capabilities help incident management teams track every detail of major incidents. This allows teams to identify both challenges and opportunities to adjust budget and staffing levels as needed.)
Prioritization
When your team effectively prioritizes issues, they are free to focus on what matters.
Prioritization involves classifying incidents based on a number of key metrics like analysis, evaluation and routing. Your team should have different processes and workflows for incidents depending on priority level:
Low-priority Incidents (Everyday)
Everyday incidents that do not interrupt ‘flow’ are considered low-priority because a workaround is fairly easy to provide via an alternate tool or resource. A ‘flow’ could be internal or external–anything that blocks your employees’ or customers’ ability to accomplish a task. Example: Your printer is running slightly slower than normal.
Medium-priority Incidents (Critical)
Critical incidents are considered medium-priority because they directly interrupt everyday ‘flows’, and generally cause disruptions. Example: Your server shuts down, and the team is unable to access certain files.
High-priority Incidents (Major)
Major incidents impact a large number of users and service delivery; high-priority incidents generally have financial ramifications. Example: Internet connectivity goes down company-wide and your self-hosted ecommerce store goes offline, causing your business to lose significant revenue.
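To make the three priority levels concrete, here is a minimal Python sketch of how a service desk might sort incoming reports. The field names and the decision rule are assumptions made for illustration; they are not taken from any specific tool, and your own criteria will likely be richer.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    LOW = "low (everyday)"
    MEDIUM = "medium (critical)"
    HIGH = "high (major)"


@dataclass
class IncidentReport:
    # Hypothetical fields; adapt them to whatever your service desk records.
    interrupts_flow: bool       # blocks an employee or customer task
    affects_many_users: bool    # a large number of users or service delivery is hit
    has_financial_impact: bool  # e.g. lost sales while a store is offline


def classify(report: IncidentReport) -> Priority:
    """Map a report to one of the three priority levels described above."""
    if report.affects_many_users or report.has_financial_impact:
        return Priority.HIGH
    if report.interrupts_flow:
        return Priority.MEDIUM
    # Nothing is blocked, so a workaround is assumed to be easy to provide.
    return Priority.LOW


slow_printer = IncidentReport(False, False, False)
print(classify(slow_printer).value)
```

Each level can then be wired to its own workflow, so a slow printer never pages the same people as a company-wide outage.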
The Incident Lifecycle
Each incident your team encounters will go through what is called the “Incident Life Cycle.” This process includes a set of six stages:
New: For a new incident, your service desk may have received information about an incident but has not yet assigned it to a service desk agent.
Assigned: An incident has been assigned to a service desk agent.
In-Progress: A service desk agent is working to resolve their assigned incident.
On-Hold: An incident is temporarily suspended; this may occur if additional information is required from a user or third party, and it ensures service-level agreement (SLA) response requirements are not exceeded while the team waits for a reply.
Resolved: The service desk confirms the incident has been resolved and service has returned to SLA levels.
Closed: Incident is fully resolved, and no further actions are necessary.
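The six stages above can be sketched as a small state machine. The stage names come from the list; the allowed transitions are an assumption (the stages are named here, but not which moves between them are legal), so treat the table as a starting point rather than a rule.

```python
from enum import Enum


class Stage(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In-Progress"
    ON_HOLD = "On-Hold"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed transitions, chosen for illustration only.
ALLOWED = {
    Stage.NEW: {Stage.ASSIGNED},
    Stage.ASSIGNED: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.ON_HOLD, Stage.RESOLVED},
    Stage.ON_HOLD: {Stage.IN_PROGRESS},
    Stage.RESOLVED: {Stage.CLOSED, Stage.IN_PROGRESS},  # reopen if the fix fails
    Stage.CLOSED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.NEW
for step in (Stage.ASSIGNED, Stage.IN_PROGRESS, Stage.RESOLVED, Stage.CLOSED):
    stage = advance(stage, step)
print(stage.value)
```

Rejecting illegal moves (say, New straight to Closed) is what keeps time-to-resolution data trustworthy when you debrief later.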
The incident lifecycle is extremely important for tracking incident categories, time to resolution, and debriefing with accurate incident data. This helps organizations improve service quality (eventually reducing the overall volume of incidents).
The Incident Management (or NOC) Team
An Incident Management team (or Network Operations Center team) helps limit potential disruptions caused by an incident so your business can return to normal operations as quickly as possible.
Without an effective incident management team in place, your organization risks interruptions that directly affect business operations, customers, employees, information security, and IT systems.
(In many cases, this costs the business thousands of dollars in lost productivity and/or revenue.)
An incident management team ensures an incident is closed or resolved within a predefined time limit described in a Service Level Agreement (SLA).
They initially define the steps that must be taken to handle an incident, along with the sequence and responsibilities of all parties involved.
Then, when an incident occurs, the team assigns a category and priority level to the incident and provides status updates to stakeholders that describe the actions it is taking to close or resolve the incident.
When incidents become critical enough, an Incident Management team will usually invest in an Incident Management Platform to automate processes and reduce the time to resolve issues.
Identifying the Core Issue
To combat outages, organizations first must address the root causes of downtime.
Luke Stone, Google’s director of customer reliability engineering, outlined some of the leading causes of downtime during a breakout session at the 2017 Google Cloud Next conference. According to Stone, the primary causes of downtime include:
Overload: When service demand exceeds capacity, errors may occur, causing network, server or system overload.
Noisy Neighbor: If users overload a server with spam, they may create excess “noise” that leads to downtime.
Retry Spikes: If users are unable to access a service and repeatedly try to gain access, retry spikes may cause a service to shut down.
Bad Dependency: If an application’s input and output stop communicating with one another, user requests may accumulate quickly and overload backend systems.
Scaling Boundaries: An organization that tries to serve additional client requests may encounter problems if its backend systems lack the proper capacity boundaries.
Escalating an Incident
An incident management team responds differently based on how critical an issue is.
For instance, if your incident management team encounters a company-wide systems outage, they will need to act quickly and decisively to get critical systems back up and running.
On the other hand, a slow web server that affects only a small group of employees lacks the urgency of a company-wide systems outage and would not require the same level of resources or attention.
In fact, many incidents can be completely resolved by first-tier support staff. If more support is needed, then an incident can be escalated all the way up to third-tier support and management as necessary. Three key factors are used to gauge whether an incident should be escalated: Urgency, Impact, and Severity.
Urgency is the amount of time before an incident leads to a significant business impact.
An incident may have lower urgency if it affects your business later in the financial year. And it may have a much higher urgency if it will immediately result in brand reputation damage and/or revenue losses.
Impact measures the effect of an incident on business processes; it often is based on how service levels are affected.
A high-impact incident may force business processes to come to a complete halt, whereas a low-impact incident has little to no effect on business operations.
Severity describes the impact on users.
For example, an incident of critical or major severity that affects multiple users may require communicating with top executives to craft a public statement. On the other hand, an incident of minor severity may require action, but may not affect users.
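Here is one way the three factors could be combined into an escalation decision. This is a sketch under stated assumptions: the field names, the one-hour and ten-user thresholds, and the rule itself are all hypothetical, and your team should set its own.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Assessment:
    time_to_business_impact: timedelta  # urgency: shorter means more urgent
    processes_halted: bool              # impact: are business processes stopped?
    users_affected: int                 # severity: how many users feel it


def should_escalate(
    a: Assessment,
    urgent_within: timedelta = timedelta(hours=1),  # assumed threshold
    many_users: int = 10,                           # assumed threshold
) -> bool:
    """Escalate when processes are halted, or when the incident is both urgent and severe."""
    urgent = a.time_to_business_impact <= urgent_within
    severe = a.users_affected >= many_users
    return a.processes_halted or (urgent and severe)


outage = Assessment(timedelta(minutes=0), processes_halted=True, users_affected=500)
slow_server = Assessment(timedelta(days=2), processes_halted=False, users_affected=4)
print(should_escalate(outage), should_escalate(slow_server))
```

Keeping the thresholds as parameters makes it easy to tune them as you learn from past incidents.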
Using Incident Management Tools
Critical Incidents create stress. Incident Management tools reduce stress.
This goes not only for the incident management team, but also for executives and key stakeholders.
That’s one reason most incident management teams leverage an incident management platform. This gives the team automatic escalation capabilities, monitoring tools, and collaboration tools to resolve issues together quickly.
A flexible platform helps your team handle almost any scenario, faster than they otherwise could.
By setting up automatic and customizable notification retry and escalation rules, alerts can instantly be escalated to each team’s manager and/or key stakeholders until an incident is assigned or closed.
Or, an incident can be escalated to a manager or manager group after a set amount of time–even while on-call staff members are still being alerted.
Finally, when the incident management tool detects a problem is fixed, it automatically closes the alert.
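The time-based escalation described above can be sketched as a simple policy table. The recipient names and the 10- and 30-minute delays are made up for illustration, and this is not how any particular platform implements it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    after: timedelta  # how long the alert has gone unassigned
    notify: str       # hypothetical recipient group


# Assumed policy: each later step adds recipients while earlier ones keep being alerted.
POLICY = [
    EscalationStep(timedelta(minutes=0), "on-call engineer"),
    EscalationStep(timedelta(minutes=10), "team manager"),
    EscalationStep(timedelta(minutes=30), "key stakeholders"),
]


def recipients(elapsed: timedelta, assigned_or_closed: bool, policy=POLICY) -> list[str]:
    """Return everyone who should be alerted at this point in time."""
    if assigned_or_closed:
        return []  # stop alerting once someone owns the incident or it is closed
    return [step.notify for step in policy if elapsed >= step.after]


print(recipients(timedelta(minutes=12), assigned_or_closed=False))
```

Because the list is cumulative, the manager is pulled in after the set delay even while on-call staff are still being alerted.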
Handling Incident Management Stress
When you’re losing hundreds (if not thousands) of dollars every second, managing major incidents can be stressful.
Here are some key ways we recommend you take care of yourself to optimize your effectiveness–and sanity.
Take 3 very deep breaths. Research shows that after doing so, your lung capacity increases slightly. You’ll also purge carbon dioxide from your system–while simultaneously taking in more oxygen. Easy.
Be present. Forget about tomorrow, or repercussions that may or may not happen hours from now. Take it one step at a time. Your team should have a step-by-step process in place. What is the first step? Do that, and only that.
Take a walk. Taking a walk each day helps your body get moving. And things in motion tend to stay in motion. So when stuff happens, you’ll automatically be more likely to handle it without thinking about it–less stress.
Create flow. Flow helps you get things done efficiently and effectively. Getting into flow is different for each person. It could be stimulated by listening to music for example, or practicing a set of movements (like a ritual) prior to starting your work.
Get a massage–seriously. Go pay for a nice long massage, or use a foam roller. Massages help reduce cytokines (which cause inflammation) and help stimulate mitochondria, which are extremely important for cell function.
Stay hydrated. Whether it’s water, tea, coffee, or your secret stash of Red Bull, make sure you’re hydrated. According to world-class performance trainers, being dehydrated by just half a liter can increase your cortisol levels, which are directly linked to your stress levels. In fact, they call it the stress hormone.
The Incident Management Process (Key Steps To Resolving Incidents)
Here’s a quick look at what the incident management process looks like from start to finish, step-by-step.
Identification: Initial detection of an incident.
Logging: Tracking of incident information; incident logs include the name of the person reporting the incident, date and time of the incident and other incident details.
Categorization: Placement of an incident in an appropriate category and subcategory.
Prioritization: Assessment of an incident and its impact on your business and key stakeholders.
Diagnosis: Formulation of a hypothesis about an incident; in some instances, an incident management team can resolve an incident based solely on an initial diagnosis.
Escalation: Request for additional incident support; front-line support teams must gather and log incident information for prompt escalation.
Resolution: Deployment of necessary steps and processes to resolve an incident.
Closure: Return of incident to the service desk; at this point, the service desk will close the incident. After an incident is closed, your incident management team should debrief with stakeholders to ensure everyone is on the same page.
This means providing a brief summary of the scope of the issue, as well as how it was resolved. Oftentimes, customers and stockholders want to hear from top management regarding major issues–especially if they’re related to security.
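For the Logging and Categorization steps, a minimal record might look like the sketch below. The class and field names are hypothetical; the fields simply mirror what the steps above say a log should capture (who reported it, when it happened, the details, and a category and subcategory).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLogEntry:
    reported_by: str               # name of the person reporting the incident
    occurred_at: datetime          # date and time of the incident
    details: str                   # other incident details
    category: str = "uncategorized"
    subcategory: str = ""
    notes: list = field(default_factory=list)

    def add_note(self, text: str) -> None:
        """Append a timestamped note, e.g. when the incident is escalated or resolved."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.notes.append(f"{stamp} {text}")


entry = IncidentLogEntry(
    reported_by="J. Doe",
    occurred_at=datetime.now(timezone.utc),
    details="Checkout page returns errors",
)
entry.category, entry.subcategory = "network", "connectivity"
entry.add_note("Escalated for additional support")
print(entry.category, len(entry.notes))
```

A timestamped note trail like this is what makes the closing debrief quick to write.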
Using an Incident Management Model
Modeling is important for developing a complete view of incidents from multiple angles.
Models include time frames for incident resolution, insights into how to properly escalate an incident and best practices to preserve data and KPIs during an incident.
The development of an incident management model offers a valuable learning opportunity for companies of all sizes and across all industries.
This is especially true since new incidents may be similar to previously resolved incidents.
Incident management teams use models to identify risks faster, and understand the best ways to manage all aspects of an incident.
Essentially, having a model or ‘template’ helps incident management teams understand how to fully manage an incident’s impact on business operations and SLAs.
Just one incident can jeopardize your organization.
Make sure you have an incident management team that understands incidents and can fully contain and resolve them in a short period of time. (And make sure they have the tools they need to succeed.)
Regardless of your organization’s size, complexity or industry, the need for effective incident management remains the same: preventing, detecting and resolving incidents quickly to reduce stress on the rest of the organization and, in many cases, to protect the bottom line!
AlertOps is on a mission to continue creating the most flexible and collaborative major incident management platform for IT and DevOps teams.