Glossary

An agent is a program installed on a physical server. An agent executes various server processes.

Agile is an iterative set of software development best practices. The top priorities in agile software development include software quality assurance, user feedback integration, and the ability to “fail fast” and implement rapid changes as needed.

An alert is used to notify an organization of significant changes in its IT environment. Alerts can also indicate when a system has failed.

Alert aggregation refers to the connection of all IT monitoring tools to view all alerts and incident data in one place.

Alert fatigue occurs when many monitoring systems create an abundance of alerts that flood mailboxes. This causes alerts to become less meaningful, and often decreases the responsiveness of IT team members.

Alert noise is a high volume of false alarms that makes it difficult to detect and respond to actual, important alerts.

Alert rules are customized policies created by the user of an IT alerting system. They can be used to normalize the behavior of alerts based on time of day, type of alert and more.
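
As an illustration, an alert rule can be written as a small function that decides how an alert is handled based on its type and the time of day. This is a minimal sketch: the `Alert` fields, the severity names, the routing labels, and the 9:00 to 17:00 business-hours window are all assumptions made for the example, not features of any particular alerting product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    source: str
    severity: str        # assumed values: "critical", "warning", "info"
    raised_at: datetime


def route(alert: Alert) -> str:
    """Return a hypothetical routing decision for an alert."""
    in_business_hours = 9 <= alert.raised_at.hour < 17
    if alert.severity == "critical":
        return "page-on-call"      # critical alerts always notify immediately
    if alert.severity == "warning" and in_business_hours:
        return "email-team"        # warnings are emailed during working hours
    return "log-only"              # everything else is held for later review
```

Keeping each rule as plain data or a pure function like this makes it easy to test before it is applied to live alerts.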

Analytics is the application of statistics, research, and computing to gather insights and meaning from a set of data.

An API is used as an intermediary between different programs; an API ensures programs can share data with one another.

Application release is a practice in which a software release is deployed across multiple environments and configurations with little to no human interaction.

An artifact is a descriptive model used to create software; artifact examples include diagrams and UML models. 

An assigned or acknowledged incident is one that a specific team or individual has committed to taking accountability for and resolving.

In IT alerting, automation is the technique of enabling the alerting process or notification system to operate automatically using alert rules. 

Autonomy in DevOps is self-governance. An autonomous DevOps team empowers each member to act based on the situation and resources available without the need to defer to a superior.

Behavior-driven development is a form of software development that involves ongoing communication between developers, business analysts, quality assurance teams, and other team members. It promotes constant collaboration and helps key stakeholders work together to achieve common software development goals.

Branching refers to a programming technique in which a source code copy is used to create two versions of software. This enables the source code to be simultaneously modified by two developers. 

A capacity test enables a DevOps team to determine the maximum number of end users an application, computer, or server can handle before it crashes. 

Categorization of incidents makes their impact, urgency, and severity transparent and easy to understand.

A closed incident is considered fully resolved, and it is confirmed that no additional action by a network operations center (NOC) or incident management team is necessary.

Closure is the confirmation that no further action needs to be taken on a resolved incident.

In an incident response team, the communications lead facilitates communication about an incident to parties both inside and outside of the organization.

A complex-adaptive system consists of an IT platform or project that includes multiple components. In this system, each component interacts with others in ways that cannot be accurately predicted or controlled.

Configuration drift occurs when a hardware or software infrastructure configuration changes from a recovery or secondary configuration. It may occur due to inconsistent configurations across a set of computers or devices. 
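
Drift can be detected by comparing a machine's actual settings against a baseline. The sketch below assumes that both configurations are available as flat dictionaries; the function name `find_drift` and the setting names in the usage note are hypothetical.

```python
def find_drift(baseline: dict, actual: dict) -> dict:
    """Return {setting: (baseline_value, actual_value)} for every difference.

    A setting that is absent on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(baseline.keys() | actual.keys()):
        expected, found = baseline.get(key), actual.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift
```

For example, comparing a baseline of `{"port": 443, "tls": "1.3"}` with an actual configuration of `{"port": 443, "tls": "1.2"}` reports only the `tls` setting as drifted.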

Configuration management is a system engineering process for creating and maintaining product consistency. It involves management of a product’s performance, function, and physical attributes relative to its design and requirements. 

Containerization is the use of virtual software containers that include operating system resources, memory, and services to run an application or service. It often helps a developer test production flows for services deployed in the cloud.

Containment is the third step in an incident response plan. The goal of this step is to quickly patch up the cause of the incident.

Continuous delivery (CD) is a software engineering approach that utilizes short, frequent cycles to produce software.

Continuous deployment utilizes automated software code testing. If code passes the automated test, the software automatically moves into a production environment.

Continuous integration (CI) is a software engineering practice that merges developer code changes into a single repository. Once CI is complete, the merged code is used to automate software builds and tests.

Continuous quality is the integration of software quality reviews into the CD pipeline. It requires quality assurance team members to review software code as soon as it becomes available, and address any potential code issues during the software development cycle.

Continuous testing is the execution of automated tests as part of the software delivery pipeline. It enables a DevOps team to get feedback for identifying potential risks before software is publicly released.

A cyber-attack is an attempt by hackers to penetrate a business’ IT networks or systems. 

A data breach is the intentional or unintentional release of private information to an untrusted recipient. Data breaches can cause incidents.

Data enrichment is the process of merging data from a third-party source with an existing database. Brands implement data enrichment to enhance their data, improve data accuracy, and make more informed decisions.

Deduplication refers to the elimination of duplicate or redundant alerts received by monitoring systems.
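
A simple form of deduplication groups alerts by a key and keeps one alert per group. In the sketch below, the choice of `(source, check)` as the key and the dictionary layout of an alert are assumptions for illustration; real systems may use other fields or a time window.

```python
def deduplicate(alerts: list[dict]) -> list[dict]:
    """Keep the first alert for each (source, check) pair and count repeats."""
    unique = {}
    for alert in alerts:
        key = (alert["source"], alert["check"])
        if key in unique:
            unique[key]["count"] += 1               # duplicate: only bump the counter
        else:
            unique[key] = {**alert, "count": 1}     # first occurrence: keep a copy
    return list(unique.values())
```

Recording a count rather than silently dropping duplicates preserves the information that an issue is recurring.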

Deployment is all the activities performed before software is publicly released.

DevOps (development and operations) is a culture in the IT industry that fosters collaboration between developers and operations teams.

DevSecOps (development and security operations) is a “security as code” culture that fosters collaboration between software developers and information security teams.

Diagnosis of an incident is a formulation of a hypothesis as to what caused the incident. An incident management team may be able to resolve an incident solely based on an initial diagnosis.

In an incident response team, the documentation lead documents the timeline of events during the incident response process.

Downtime is the period during which a system is unable to perform its primary functions.

Eradication is the fourth step in an incident response plan. It aims to completely remove the threat causing an incident.

Incidents require escalation when more support is needed to resolve them. Teams gather and log incident information to prompt incident escalation to other team members or executives.

Escalation is bringing an issue to an individual or team in a higher department within an organization. For example, if a customer service representative finds an issue that can be resolved only by the IT team, the issue can be escalated to an IT manager.
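
When escalation is automated, it is often driven by how long an incident has gone unacknowledged. The following sketch assumes a hypothetical three-level chain; the time thresholds and contact names are invented for the example.

```python
# Hypothetical escalation chain: (minutes unacknowledged, who to notify).
ESCALATION_CHAIN = [
    (0, "on-call engineer"),
    (15, "team lead"),
    (30, "IT manager"),
]


def current_escalation_target(minutes_unacknowledged: int) -> str:
    """Return the highest level whose time threshold has been reached."""
    target = ESCALATION_CHAIN[0][1]
    for threshold, contact in ESCALATION_CHAIN:
        if minutes_unacknowledged >= threshold:
            target = contact
    return target
```

With this chain, an incident that has been unacknowledged for 20 minutes is escalated to the team lead.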

An event is any occurrence that causes a change in the IT environment.

Event-driven architecture is a form of software architecture that involves the creation of events by a system; the system then uses these events to identify or consume similar occurrences in the future.

An event-triggered email is an automated email alert sent when a predetermined event occurs. Similar automated alerts can also be delivered by text or phone call.

Exploratory testing is a strategy that provides human software testers with the ability to analyze different areas of a piece of software. It empowers human software testers with the flexibility to test potential software issues that may otherwise go undetected during automated tests.

Sometimes, incidents can cause an organization to be charged with a criminal offense. In an incident response team, the HR or legal representative must navigate any legal consequences of an incident.

Identification is the initial detection of an incident.

Identification, also known as detection and analysis, is the second step in an incident response plan. In this step, research is done to find the cause of a detected incident.

Impact measures the effect of an incident on business processes. A high-impact incident may force business processes to come to a halt, whereas a low-impact incident has little or no effect on operations.

An incident is an unplanned and undesired event that interrupts business operations. Incidents can cause downtime, revenue loss, compliance penalties, and brand reputation damage. Incidents can also affect employees and customers. 

An incident log includes the name of the person reporting an incident, the date and time of an incident and other incident details.

Incident management is the process of identifying, analyzing, and addressing the incidents or technical disruptions of a business.

An incident management model includes time frames for incident resolution, insight into how to escalate an incident and best practices for preserving data and key performance indicators (KPIs) during an incident.

An incident management tool is used by organizations to both facilitate and improve incident management. Incident management tools can automate escalations, monitoring, and collaboration.

An incident response phase is a stage of an incident response plan. There are generally six incident response phases: preparation, identification, containment, eradication, recovery and lessons learned. Each phase plays an important role in effective incident response.  

An IT incident response plan guides IT staff in detecting, understanding, and responding to incidents caused by issues like cybercrime, data loss, and outages.

Incident response or incident management teams, also known as NOC teams, are trained to provide immediate solutions for incidents that disrupt an organization’s operations. An incident management team ensures an incident is closed or resolved within a predefined time limit described in an SLA.

Incident volume refers to the number of incidents received in a given time period.

An information security incident is an adverse event, such as a cyber-attack or insider threat, that negatively impacts an information system or a network. This type of incident poses a threat to the availability, integrity and confidentiality of a system.

Infrastructure as a service (IaaS) is a form of cloud computing that utilizes virtualized computing resources over the internet.

Integration of alerting tools allows users to streamline their alerts into one space. Integration interconnects data and notifications.

Integration testing involves the evaluation of multiple software components together. During an integration test, software components are combined and analyzed as a single group.

An IT alerting system is a tool used by organizations to mitigate business risks and detect problems in the IT environment. It is a customizable tool that can automate and deduplicate alerts from various integrated monitoring sources, run analytics and create reports.

A key performance indicator (KPI) is a measure that clearly demonstrates how effectively and efficiently an organization is meeting its objectives. MTTR and MTTF are good KPI examples.

In an incident response team, the lead investigator analyzes an incident to find its root cause so that the team may start recovering from the incident and developing preventative measures as soon as possible.

Lessons learned, also known as post-incident activity, is the final step in an incident response plan. In this step, a resolved and closed incident is reviewed to identify steps that can be taken to improve a system, and aid in prevention of future incidents.

Mass notification, or manual paging, delivers information to a group of people by email, text, or phone call.

A mass notification system (MNS) is a platform that delivers information to a group of people. The system is flexible in its configuration of messages, controls, recipients, and methods of communication.

Mean time between failures (MTBF) is commonly used to measure hardware component or system reliability. It is calculated as an average of the time between hardware component or system failures. 
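
As a worked example, MTBF can be computed from a list of failure timestamps by averaging the gaps between consecutive failures. This sketch follows the definition above literally and, as a simplifying assumption, ignores the time spent on repairs; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times: list[datetime]) -> timedelta:
    """Average the gaps between consecutive failures."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("at least two failures are needed to measure a gap")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)
```

For failures on January 1, January 11, and January 31, the gaps are 10 and 20 days, so the MTBF is 15 days.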

Mean time to acknowledge (MTTA) is the average time between an incident’s detection and the beginning of assistance or “acknowledgement” to resolve the issue.

Mean time to detect (MTTD) is the average time it takes to identify an issue. It measures the time between the beginning of an outage and when the business identifies the issue.

Mean time to failure (MTTF), aka “uptime,” is the average amount of time elapsed between a DevOps team encountering a serious defect in a system and the complete failure of the system. 

Mean time to recovery (MTTR) is the average time it takes to return to production status after a hardware component or system fails.
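
The incident-timing metrics MTTD, MTTA, and MTTR can all be derived from the same incident records. The sketch below assumes each incident is a dictionary with four timestamps named `started`, `detected`, `acknowledged`, and `recovered`; those field names are hypothetical, and MTTR is measured here from the start of the failure to recovery.

```python
from datetime import datetime
from statistics import mean


def _mean_minutes(incidents: list[dict], start_field: str, end_field: str) -> float:
    """Average the elapsed minutes between two timestamps across incidents."""
    return mean(
        [(i[end_field] - i[start_field]).total_seconds() / 60 for i in incidents]
    )


def incident_metrics(incidents: list[dict]) -> dict:
    """Return MTTD, MTTA, and MTTR in minutes for a list of incidents."""
    return {
        "mttd": _mean_minutes(incidents, "started", "detected"),
        "mtta": _mean_minutes(incidents, "detected", "acknowledged"),
        "mttr": _mean_minutes(incidents, "started", "recovered"),
    }
```

Tracking these values over time shows whether detection, acknowledgement, or recovery is the slowest part of the response.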

A message template or topic is a template in an IT alerting system that eases the process of sending messages to stakeholders, employees, and customers.

Microservices, or microservices architecture, is a software development methodology that involves building single-function modules with clearly defined interfaces and operations.

Mobile incident management tools allow users to complete incident management processes and tasks on a mobile device such as a smartphone or tablet. 

Model-based testing requires the use of test cases derived from visual models that represent the desired behavior of a system or environment. It is commonly used to generate manual tests, test data, and automated tests.

Monitoring or “logging” refers to the tracking of incident information (such as time, duration, and severity) in an incident log.

A new incident is one that has been newly discovered by a team or individual and is yet to be assigned or acknowledged.

A notification is a message sent to an individual to alert them of any updates or issues.

A notification channel is the channel used to deliver a notification. Notification channel examples include text, email, and phone call.

On-call management is the management of an on-call team’s accountability, visibility, and responsibilities.

An on-call team is the team scheduled to respond to messages or incidents at unpredictable times.

An on-hold incident is one that has been assigned or acknowledged but is suspended. Incidents can be put on hold if more information is needed to resolve the issue.

The OODA loop is an incident response strategy developed by U.S. Air Force military strategist John Boyd. The steps of the OODA loop are: Observe, Orient, Decide, and Act. The OODA loop is designed to help businesses quickly identify and respond to incidents.

An open application programming interface (API) is a public API that is generally available to consumers and developers. 

Overload occurs when demand on a service exceeds its capacity. This can cause errors and can cascade into network, server, or system overloads, resulting in an incident.

Pair programming is a software development technique in which two developers simultaneously work on a single feature. It promotes collaboration, as both developers can analyze each other’s code to bolster overall code quality.

Organizations can use planning to shorten incident response and resolution times. Organizations plan for incident management by identifying potential events that may cause incidents before they happen.

Platform as a service (PaaS) is a form of cloud computing service that involves the use of a platform to develop, run, and control applications. With PaaS, a third-party provider delivers hardware and software tools to end users via the internet.

Preparation is the first step in an incident response plan. In this step, all assets are compiled and ranked in order of importance.

Prioritization is the assessment of an incident and its impact on business processes and stakeholders. Different processes and workflows can be implemented depending on the priority level of an incident.
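
One common way to assign a priority level is to combine an incident's impact and urgency, both defined elsewhere in this glossary, in a lookup table. The two-level scale and the P1 to P4 labels below are assumptions for the sketch; organizations define their own levels.

```python
# Hypothetical priority matrix keyed by (impact, urgency).
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P3",
    ("low", "low"): "P4",
}


def prioritize(impact: str, urgency: str) -> str:
    """Look up the priority level for an incident's impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

Each priority level can then be mapped to its own workflow, such as immediate paging for P1 and next-business-day handling for P4.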

Problem management is a process for fixing system errors or weaknesses. Successful problem management limits the impact of incidents and ensures that an incident does not re-occur.

Recovery is the fifth step in an incident response plan. In this step, the aim is to get any affected systems and processes to become operational again.

Release engineering is the technical process of building reliable and fast pipelines to quickly transform source code into a product.

Release management is the non-technical process of overseeing and scheduling software build stages such as testing and deployment.

Reporting allows businesses to understand previous incidents and improve future incident management analysis, evaluation, and decision-making for reduced incident management costs.

Resolution occurs after the necessary steps and processes to resolve an incident have been completed.

A resolved incident has been mitigated and all service has returned to SLA standards.

A retry spike occurs when users who are unable to access a service repeatedly try to gain access. Retry spikes can cause a service to shut down, resulting in an incident.
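
A widely used client-side mitigation for retry spikes is exponential backoff with jitter: each retry waits a random time drawn from a window that doubles on every attempt, so clients do not all retry at the same moment. This is a minimal sketch, not a description of any specific library; the function name and the default `base` and `cap` values are assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * 2 ** attempt)   # window doubles, up to the cap
    return rng.uniform(0, ceiling)            # random point inside the window
```

A client would sleep for `backoff_delay(n)` seconds before its n-th retry and give up after a fixed number of attempts.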

Rich alerting is a method of alerting that routes each alert to the correct and most relevant recipient, depending on the type of alert and the recipient’s schedule.

Role-based security ensures that users can view only the data and alerts intended for them. For example, a software developer will not receive or be able to view alerts meant for a C-level employee.
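
In code, this kind of restriction can be sketched as a filter applied before alerts are shown to a user. The `audience` field and the role names used here are hypothetical.

```python
def visible_alerts(alerts: list[dict], role: str) -> list[dict]:
    """Return only the alerts whose audience includes the given role."""
    return [alert for alert in alerts if role in alert["audience"]]


# Example data: each alert lists the roles allowed to see it.
alerts = [
    {"id": 1, "audience": {"developer"}},
    {"id": 2, "audience": {"executive"}},
    {"id": 3, "audience": {"developer", "executive"}},
]
developer_view = visible_alerts(alerts, "developer")
```

Here `developer_view` contains alerts 1 and 3 but not alert 2, which is reserved for executives.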

A service-level agreement (SLA) is a commitment made between a service provider and its client. It can include elements such as reliability, responsiveness, obligations, and penalties to be implemented when the SLA is not followed.

Severity describes the impact of an incident on a business’ users. For a severe incident, a business may need to craft a public statement to its users. An incident of minor severity may require action but may not immediately affect users.

A stakeholder is any person with an interest and concern in a business. Stakeholders can include investors, users, employees, and executives of an organization.

A standard operating procedure (SOP) in IT alerting is a set of instructions compiled by an organization to help teams carry out routine IT operations based on alerts they receive.

In IT alerting, streamlining alerts makes organizations more efficient and effective by employing simpler working methods for received alerts.

In an incident response team, the team leader’s role is to coordinate incident response activity to keep the team on track and minimize damage to the system and organization.

Test automation involves the use of software to perform tests and compare actual and predicted test outcomes.

In DevOps, a toolchain is a set of software and/or products used to create a new program or perform a complex software development task. For example, IT alerting tools and/or incident management tools can be part of a toolchain.

A trigger is any event that starts the automated response process in an IT alerting system.

Unit testing is a software testing methodology in which each part of an application (unit) is evaluated individually.
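
For example, a single function can be treated as a unit and tested in isolation with Python's built-in `unittest` module. The function under test, `is_high_impact`, and its threshold of 100 affected users are invented for this sketch.

```python
import unittest


def is_high_impact(affected_users: int, threshold: int = 100) -> bool:
    """Hypothetical unit under test: flag incidents affecting many users."""
    return affected_users >= threshold


class IsHighImpactTest(unittest.TestCase):
    def test_at_threshold_is_high_impact(self):
        self.assertTrue(is_high_impact(100))

    def test_below_threshold_is_not_high_impact(self):
        self.assertFalse(is_high_impact(99))

    def test_custom_threshold(self):
        self.assertTrue(is_high_impact(10, threshold=10))
```

Saved to a file, these tests can be run with `python -m unittest`, which checks the unit without starting the rest of the application.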

Urgency is the amount of time before an incident has a significant business impact. For example, an incident with high urgency may result in immediate brand reputation and/or revenue loss.

A virtual machine (VM) is a computer file that acts like a computer system. It runs like a typical computer program and replicates the system experience.