This guide provides best practices and practical guidelines for the management of network operations and information security incidents. Incidents happen, and cost organizations thousands of dollars due to downtime.
Downtime is a serious problem for organizations of all sizes and across all industries.
A recent survey conducted by software company Information Technology Intelligence Consulting (ITIC) revealed how much one hour of downtime costs:
And it’s getting even worse...
The average cost of a single hour of unplanned downtime has risen by 25 to 30 percent since 2008.
You get the idea. Outages are expensive.
When an incident affects public-facing core infrastructures (say an e-commerce website), then you’re looking at EVEN MORE damage:
Social media venting
Tons of new support tickets
The cost goes far beyond the lost revenue of one hour of downtime.
So finding a resolution to these issues is important (but complicated).
Most enterprise systems are supported by internal employees, working alongside external vendors, with support staff strategically located around the world, to provide 24/7 coverage year-round. So, when it comes to resolving an incident with complex layers, you need flexible and reliable systems and workflows that mobilize key decision makers and stakeholders fast.
Problem vs. Incident
Problem Management is not the same as Incident Management.
Incident management is often mistaken for problem management.
Problem management is finding a permanent solution to an issue. It’s a reactive process to fix errors or weaknesses in the system. Successful problem management ensures an incident does not re-occur and, if it does, limits the impact.
An incident is an unplanned or undesired event (caused by a problem) that interrupts business operations. Incidents range from fairly minor to catastrophic and may result in brand reputation damage, downtime, or lost revenue.
Incident management emphasizes agility and speed to resolve incidents as quickly as possible.
(Remember that $84,000 for 15 minutes tradeoff from earlier?)
The incident management team creates a workaround solution to overcome the issue as quickly as possible so the organization can continue operating normally…
…While the problem management team works on the permanent solution and ensures the weak point in the network is fixed.
What is Incident Management?
Incident management consists of identifying, analyzing, and addressing technical disruptions to your business. Other common terms for this (depending on severity) are Major Incident Management and Critical Incident Management.
Organizations deploy incident management teams to provide immediate solutions to incidents that may result in partial or complete disruption of the organization’s business operations.
Effective incident management involves these best practices:
By planning ahead, your organization can identify potential problems before they cause incidents. It may seem impossible to prepare for every conceivable incident, but companies that focus on industry-specific dangers can identify potential problems before they happen.
For example, your organization can’t stop cybercriminals from launching attacks, but you can plan ahead for potential dangers. By doing so, you can shorten incident response and resolution times, and minimize the risk of revenue loss and other collateral damage.
(Note: Our incident management software empowers businesses to take a proactive approach to major incidents. We also offer a mobile app to help your incident management teams identify and respond to major incidents from any location, on any device, at any time for around-the-clock incident management.)
Communication & Collaboration
Communication fosters collaboration between your key stakeholders and helps them address incidents in real-time. Poor communication and collaboration can cause major incidents to quickly spiral out of control, so having everyone on the same page is key.
(Another note: Our tool also provides live inbound call routing that ensures incoming calls are routed to the right incident management team members and stakeholders without exception. Plus, we foster real-time collaboration via HipChat and Slack, enabling incident management teams to address major incidents in real-time.)
Reporting processes and protocols empower your teams to learn from past incidents and improve incident management analysis, evaluation, and decision-making for reduced incident management costs.
How your company reports on incidents plays a key role in your long-term incident management strategy. Businesses that use advanced reports can better examine past incidents and avoid repeated mistakes.
(Last note: Our enterprise reporting capabilities help incident management teams track every detail of major incidents. This allows teams to identify challenges and opportunities to adjust budget and staffing levels as needed.)
When your team effectively prioritizes issues, they can focus on what matters.
Prioritization involves classifying incidents based on a number of key metrics like analysis, evaluation, and routing. Your team should have different processes and workflows for incidents depending on priority level:
Low-priority Incidents (Everyday)
Everyday incidents that do not interrupt ‘flow’ are considered low-priority because a workaround is fairly easy to provide via an alternate tool or resource. A ‘flow’ interruption could be internal or external–anything that inhibits your employees’ or customers’ ability to accomplish tasks.
Medium-priority Incidents (Critical)
Critical incidents are considered medium-priority because they directly interrupt everyday ‘flows,’ and generally cause disruptions. Example: Your server shuts down, and teams are unable to access certain files.
High-priority Incidents (Major)
Major incidents impact a large number of users and service delivery; high-priority incidents generally have financial ramifications. Example: Internet connectivity goes down company-wide, and your self-hosted ecommerce store goes offline, causing your business to lose significant revenue.
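As a rough illustration, the three priority levels above could be encoded as a small classification rule. This is only a sketch: the `IncidentReport` fields and the `MAJOR_USER_THRESHOLD` cutoff are assumptions made for the example, and every organization will draw these lines differently.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    LOW = "low (everyday)"
    MEDIUM = "medium (critical)"
    HIGH = "high (major)"


@dataclass
class IncidentReport:
    # All fields are hypothetical inputs chosen for this sketch.
    interrupts_flow: bool   # does it block employees' or customers' tasks?
    users_affected: int     # rough count of impacted users
    revenue_at_risk: bool   # are financial ramifications likely?


# Hypothetical cutoff for "a large number of users"; tune per organization.
MAJOR_USER_THRESHOLD = 100


def classify(report: IncidentReport) -> Priority:
    """Map a report to one of the three priority levels described above."""
    if report.revenue_at_risk or report.users_affected >= MAJOR_USER_THRESHOLD:
        return Priority.HIGH
    if report.interrupts_flow:
        return Priority.MEDIUM
    return Priority.LOW
```

With these assumed thresholds, a server outage that blocks a handful of employees would come back as medium-priority, while a company-wide outage that takes the store offline would come back as high-priority.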
The Incident Lifecycle
Each incident your team encounters will go through what is called the “Incident Life Cycle.”
This process includes six stages:
New: For a new incident, your service desk may have received information about an incident but has not yet assigned it to a service desk agent.
Assigned: An incident has been assigned to a service desk agent.
In-Progress: A service desk agent is working to resolve their assigned incident.
On-Hold: An incident is temporarily suspended; this may occur if additional information is required from a user or third party and ensures service-level agreement (SLA) response requirements are not exceeded while waiting for a third party.
Resolved: The service desk confirms the incident has been resolved and service has returned to SLA levels.
Closed: Incident is fully resolved, and no further actions are necessary.
The incident lifecycle is extremely important for tracking incident categories, time to resolution, and debriefing with accurate incident data. This helps organizations improve service quality (and reduce the overall volume of incidents).
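One way to picture the six stages is as a small state machine. The stage names below come from the list above; the allowed transitions (for example, reopening a resolved incident) are assumptions for illustration, not a rule from any particular service desk tool.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In-Progress"
    ON_HOLD = "On-Hold"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Allowed moves between stages. Which moves are legal is an assumption
# made for this sketch; adjust to match your own workflow.
TRANSITIONS = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.ON_HOLD, Status.RESOLVED},
    Status.ON_HOLD: {Status.IN_PROGRESS},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Rejecting illegal jumps (say, from New straight to Closed) is what keeps the lifecycle data accurate enough to debrief with later.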
The Incident Management (or NOC) Team
An Incident Management team (or Network Operations Center team) limits potential disruptions caused by incidents so your business can return to normal operations as quickly as possible.
Without an effective incident management team in place, your organization risks interruptions that directly affect business operations, customers, employees, information security, and IT systems.
(In many cases, this costs the business thousands of dollars in lost productivity and/or revenue.)
An incident management team ensures an incident is closed or resolved within a predefined time limit described in a Service Level Agreement (SLA).
They initially define the steps that must be taken to handle an incident, along with the sequence and responsibilities of all parties involved.
Then, when an incident occurs, the team assigns a category and priority level to the incident and provides status updates to stakeholders that describe the actions it is taking to close or resolve an incident.
To prevent incidents from becoming critical, an Incident Management team will usually invest in an Incident Management Platform to automate processes, reduce the time to resolve issues, and improve ROI from other incident management resources.
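To make the SLA time limit concrete, here is a minimal sketch of a breach check. It assumes the SLA clock pauses while an incident is on hold, which matches the On-Hold stage described earlier; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def sla_breached(
    opened_at: datetime,
    now: datetime,
    limit: timedelta,
    time_on_hold: timedelta = timedelta(0),
) -> bool:
    """Return True if the incident has been open longer than the SLA allows.

    Time spent on hold is subtracted first, on the assumption that the
    SLA clock pauses while waiting on a user or third party.
    """
    elapsed = (now - opened_at) - time_on_hold
    return elapsed > limit
```

For example, an incident opened at 9:00 and still open at 13:30 has breached a four-hour limit, unless at least half an hour of that time was spent on hold.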
Identifying the Core Issue
To combat outages, organizations must first address the root causes of downtime.
Luke Stone, Google’s director of customer reliability engineering, outlined some of the leading causes of downtime during a breakout session at the 2017 Google Cloud Next conference. According to Stone, the primary causes of downtime include:
Overload: When service demand exceeds capacity, errors may occur, causing network, server, or system overload.
Noisy Neighbor: If users overload a server with spam, they may create excess “noise” that leads to downtime.
Retry Spikes: If users are unable to access a service and repeatedly try to gain access, retry spikes may cause a service to shut down.
Bad Dependency: If an application’s input and output stop communicating with one another, user requests may accumulate quickly and overload backend systems.
Scaling Boundaries: An organization that tries to serve additional client requests may encounter problems if its backend systems are not capable enough.
Escalating an Incident
An incident management team responds differently based on how critical an issue is.
For instance, if your incident management team encounters a company-wide systems outage, they will need to act quickly and decisively to get critical systems back up and running.
On the other hand, a slow web server that affects only a small group of employees lacks the urgency of a company-wide systems outage and would not require the same level of resources or attention.
In fact, many incidents can be completely resolved by first-tier support staff. If more support is needed, then an incident can be escalated all the way up to third-tier support or management as necessary. Three key factors are used to gauge whether an incident should be escalated. They are: Urgency, Impact, and Severity.
Urgency is the amount of time before an incident has a significant business impact.
An incident may have lower urgency if it affects your business later in the financial year. It may have a much higher urgency if it will immediately result in brand reputation damage and/or revenue losses.
Impact measures the effect of an incident on business processes; it is often based on how service levels are affected.
A high-impact incident may force business processes to come to a complete halt, whereas a low-impact incident has little to no effect on business operations.
Severity describes the impact on users.
For example, an incident of critical or major severity that affects multiple users may require communicating with top executives to craft a public statement. On the other hand, an incident of minor severity may require action, but may not affect users.
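The three factors can be combined into a simple escalation check. The rule below (escalate if any factor is high, or if at least two are medium or above) is purely an illustrative assumption; it is not a standard, and real teams often use a weighted matrix instead.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def should_escalate(urgency: Level, impact: Level, severity: Level) -> bool:
    """Escalate when any factor is HIGH or at least two are MEDIUM or above.

    The rule itself is a hypothetical example, not an industry standard.
    """
    factors = (urgency, impact, severity)
    if Level.HIGH in factors:
        return True
    return sum(f >= Level.MEDIUM for f in factors) >= 2
```

Under this rule a slow web server scored low on all three factors stays with front-line support, while anything scored high on urgency alone is escalated immediately.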
Using Incident Management Tools
Critical Incidents create stress. Incident Management tools reduce stress and make your incident management teams more efficient and effective.
This goes not only for the incident management team, but also for executives and key stakeholders.
That’s one reason most incident management teams leverage an incident management platform. This gives the team automatic escalation capabilities, monitoring tools, and collaboration tools to quickly resolve issues together.
A flexible platform helps your team handle almost any scenario, faster than they otherwise could.
By setting up automatic and customizable notification retry and escalation rules, alerts can instantly be escalated to each team’s manager and/or key stakeholders until an incident is assigned or closed.
Or, an incident can be escalated to a manager or manager group after a set amount of time–even while on-call staff members are still being alerted.
Finally, when the incident management tool detects a problem is fixed, it automatically closes the alert.
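The time-based escalation described above can be sketched as a list of steps, each with a delay. This is a generic illustration and not the configuration format of any specific platform; the recipient labels and the 15- and 30-minute delays are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # how long an alert may stay unassigned before this step fires
    notify: str         # hypothetical recipient label


# Hypothetical policy: on-call first, then the manager, then stakeholders.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "team manager"),
    EscalationStep(30, "key stakeholders"),
]


def recipients(minutes_open: int, assigned: bool, closed: bool) -> list[str]:
    """Return everyone who should be alerted at this point in time."""
    if assigned or closed:
        return []  # escalation stops once someone owns or closes the incident
    # Earlier steps stay in the list, so on-call staff keep being alerted
    # even after the incident has been escalated to a manager.
    return [step.notify for step in POLICY if minutes_open >= step.after_minutes]
```

Calling `recipients(20, assigned=False, closed=False)` with this policy would include both the on-call engineer and the team manager, which mirrors escalating to a manager while on-call staff are still being alerted.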
Handling Incident Management Stress
When you’re losing hundreds (if not thousands) of dollars every second, managing major incidents can be stressful.
Here are some ways we recommend for taking care of yourself to optimize your effectiveness–and sanity.
Take 3 very deep breaths. Research shows that after doing so, your lung capacity increases slightly. You’ll also purge carbon dioxide from your system–while simultaneously taking in more oxygen. Easy.
Be present. Forget about tomorrow, or repercussions that may or may not happen hours from now. Take it one step at a time. Your team should have a step-by-step process in place. What is the first step? Do that, and only that.
Take a walk. Taking a walk each day helps your body get moving. And things in motion tend to stay in motion. So when stuff happens, you’ll automatically be more likely to handle it without thinking about it–less stress.
Create flow. Flow helps you get things done efficiently and effectively. Getting into flow is different for each person. For example, listening to music, or practicing a set of movements (like a ritual) prior to starting work each day will help put you in a state of flow.
Get a massage–seriously. Go pay for a nice long massage, or use a foam roller. Massages help remove cytokines (that cause inflammation) which helps stimulate mitochondria–extremely important for cell function.
Stay hydrated. Whether it’s water, tea, coffee, or your secret stash of Red Bull, make sure you’re hydrated. According to world-class performance trainers, half a liter of dehydration can increase your cortisol levels, which is directly linked to your stress levels. In fact, they call it the stress hormone.
The Incident Management Process (Key Steps To Resolving Incidents)
Here’s a quick overview of what the incident management process looks like from start to finish, step-by-step.
Identification: Initial detection of an incident.
Logging: Tracking of incident information; incident logs include the name of the person reporting the incident, date and time of the incident, and other incident details.
Categorization: Placement of an incident in an appropriate category and subcategory.
Prioritization: Assessment of an incident, and its impact on your business and key stakeholders.
Diagnosis: Formulation of a hypothesis about an incident; in some instances, an incident management team can resolve an incident based solely on an initial diagnosis.
Escalation: Request additional incident support; front-line support teams must gather and log incident information for prompt escalation.
Resolution: Execution of necessary steps and processes to resolve an incident.
Closure: Return of incident to the service desk; at this point, the service desk will close the incident. After an incident is closed, your incident management team should debrief with stakeholders to ensure everyone is on the same page.
This means providing a brief summary of the scope of the issue, as well as how it was resolved. Oftentimes, customers and stockholders want to hear from top management regarding major issues–especially if they’re related to security.
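The information gathered during the logging, categorization, and prioritization steps can be held in a single record. The sketch below uses the fields named in those steps; the class name, field names, and the `minutes_to_resolution` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentLog:
    reported_by: str                    # name of the person reporting the incident
    details: str                        # free-text description of what happened
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    category: Optional[str] = None      # filled in during categorization
    subcategory: Optional[str] = None
    priority: Optional[str] = None      # filled in during prioritization
    resolved_at: Optional[datetime] = None

    def minutes_to_resolution(self) -> Optional[float]:
        """Time to resolution in minutes, or None while the incident is open."""
        if self.resolved_at is None:
            return None
        return (self.resolved_at - self.occurred_at).total_seconds() / 60
```

Keeping the timestamps on the record makes it straightforward to report time to resolution when debriefing with stakeholders after closure.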
Using an Incident Management Model
Modeling is important for developing a complete view of incidents from multiple angles.
Models include time frames for incident resolution, insights into how to properly escalate an incident, and best practices for preserving data and KPIs during an incident.
The development of an incident management model offers a valuable learning opportunity for companies of all sizes, across all industries.
This is especially true since new incidents may be similar to previously resolved incidents.
Incident management teams use models to identify risks faster and understand the best ways to manage all aspects of an incident.
Having a model or ‘template’ helps incident management teams understand how to fully manage an incident’s impact on business operations and SLAs.
Just one incident can jeopardize your organization.
Make sure you have an incident management team that understands incidents and can fully contain and resolve them in a short period of time (and make sure they have the tools they need to succeed).
Regardless of your organization’s size, complexity, or industry, the need for effective incident management remains the same: prevention, detection, and quick resolution of incidents reduce stress on the rest of the organization and, in many cases, protect the bottom line!
AlertOps is on a mission to continue providing the most flexible and collaborative major incident management platform for IT and DevOps teams.