An agent is a program installed on a physical server. An agent executes various server processes.
Agile is an iterative set of software development best practices. The top priorities in agile software development include software quality assurance, user feedback integration, and the ability to “fail fast” and implement rapid changes as needed.
Alert fatigue occurs when many monitoring systems create an abundance of alerts that flood mailboxes. This causes alerts to become less meaningful, and often decreases the responsiveness of IT team members.
Behavior-driven development is a form of software development that involves ongoing communication between developers, business analysts, quality assurance teams, and other team members. It promotes constant collaboration and helps key stakeholders work together to achieve common software development goals.
Branching refers to a programming technique in which a source code copy is used to create two versions of software. This enables the source code to be simultaneously modified by two developers.
A complex-adaptive system consists of an IT platform or project that includes multiple components. In this system, each component interacts with others in ways that cannot be accurately predicted or controlled.
Configuration drift occurs when a hardware or software infrastructure configuration gradually diverges from its intended state, for example when a primary configuration no longer matches a recovery or secondary configuration. It is often caused by inconsistent changes across a set of computers or devices.
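A minimal sketch of how drift between two configurations might be detected, assuming each configuration can be represented as a flat dictionary of settings. The function name `find_drift` and the sample settings are hypothetical and chosen only for illustration.

```python
def find_drift(baseline: dict, actual: dict) -> dict:
    """Return {setting: (baseline_value, actual_value)} for every setting that differs.

    A setting that exists on only one side is reported with None on the other side
    (a simplification: a real value of None would look the same as a missing one).
    """
    missing = object()  # sentinel marking a setting absent from one configuration
    drift = {}
    for key in sorted(baseline.keys() | actual.keys()):
        expected = baseline.get(key, missing)
        found = actual.get(key, missing)
        if expected != found:
            drift[key] = (
                None if expected is missing else expected,
                None if found is missing else found,
            )
    return drift


# Hypothetical primary and secondary configurations.
primary = {"tls": "1.3", "max_conn": 500, "log_level": "info"}
secondary = {"tls": "1.2", "max_conn": 500}

print(find_drift(primary, secondary))
```

An empty result means the two configurations match; any entry in the result is a candidate for remediation.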
Configuration management is a system engineering process for creating and maintaining product consistency. It involves management of a product’s performance, function, and physical attributes relative to its design and requirements.
Containerization is the use of virtual software containers that include operating system resources, memory, and services to run an application or service. It often helps a developer test production flows for services deployed in the cloud.
Continuous integration (CI) is a software engineering practice in which developers frequently merge their code changes into a single shared repository. Each merge triggers automated software builds and tests.
Continuous quality is the integration of software quality reviews into the continuous delivery (CD) pipeline. It requires quality assurance team members to review software code as soon as it becomes available and to address any potential code issues during the software development cycle.
Continuous testing is the execution of automated tests as part of the software delivery pipeline. It enables a DevOps team to get feedback for identifying potential risks before software is publicly released.
Data enrichment is the process of merging data from a third-party source with an existing database. Brands implement data enrichment to enhance their data, improve data accuracy, and make more informed decisions.
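As an illustration of the merge step, the sketch below joins third-party rows onto existing records by a shared key. The function name `enrich`, the `email` key, and the sample rows are assumptions made for this example, as is the choice to let existing values win when both sources supply the same field.

```python
def enrich(records: list[dict], third_party: list[dict], key: str = "email") -> list[dict]:
    """Add third-party fields to each record that shares the same key value."""
    lookup = {row[key]: row for row in third_party}
    enriched = []
    for record in records:
        extra = lookup.get(record[key], {})
        # Existing values take precedence; third-party data only fills gaps.
        enriched.append({**extra, **record})
    return enriched


# Hypothetical existing database rows and third-party rows.
customers = [
    {"email": "a@example.com", "name": "Ana"},
    {"email": "b@example.com", "name": "Bo"},
]
vendor_rows = [{"email": "a@example.com", "name": "A.", "company": "Acme"}]

for row in enrich(customers, vendor_rows):
    print(row)
```

Whether the existing value or the third-party value should win on conflict is a policy decision; reversing the order of the two unpacked dictionaries reverses the precedence.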
Diagnosis of an incident is a formulation of a hypothesis as to what caused the incident. An incident management team may be able to resolve an incident solely based on an initial diagnosis.
Escalation is the act of bringing an issue to an individual or team at a higher level within an organization. For example, if a customer service representative finds an issue that can be resolved only by the IT team, the issue can be escalated to an IT manager.
Event-driven architecture is a form of software architecture in which components of a system produce events, and the system then detects, consumes, and reacts to those events as they occur.
Exploratory testing is a strategy that provides human software testers with the ability to analyze different areas of a piece of software. It empowers human software testers with the flexibility to test potential software issues that may otherwise go undetected during automated tests.
Sometimes, incidents can cause an organization to be charged with a criminal offense. In an incident response team, the HR or legal representative must navigate any legal consequences of an incident.
Identification, also known as detection and analysis, is the second step in an incident response plan. In this step, research is done to find the cause of a detected incident.
Impact measures the effect of an incident on business processes. A high-impact incident may force business processes to come to a halt, whereas a low-impact incident has little or no effect on operations.
An incident is an unplanned and undesired event that interrupts business operations. Incidents can cause downtime, revenue loss, compliance penalties, and brand reputation damage. Incidents can also affect employees and customers.
The incident lifecycle is a series of six stages that each incident goes through after being detected. These stages are: new, assigned/acknowledged, in-progress, on-hold, resolved, and closed.
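The six stages can be modeled as a small state machine. The sketch below is one possible reading: the stage names come from the definition above, but the allowed transitions (for example, reopening a resolved incident) are assumptions, since the definition does not specify them.

```python
from enum import Enum


class Stage(Enum):
    NEW = "new"
    ASSIGNED = "assigned/acknowledged"
    IN_PROGRESS = "in-progress"
    ON_HOLD = "on-hold"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Assumed transitions; real tools and processes may allow different moves.
ALLOWED = {
    Stage.NEW: {Stage.ASSIGNED},
    Stage.ASSIGNED: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.ON_HOLD, Stage.RESOLVED},
    Stage.ON_HOLD: {Stage.IN_PROGRESS},
    Stage.RESOLVED: {Stage.CLOSED, Stage.IN_PROGRESS},  # a fix may need rework
    Stage.CLOSED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one incident through the lifecycle, including a pause.
stage = Stage.NEW
for target in (Stage.ASSIGNED, Stage.IN_PROGRESS, Stage.ON_HOLD,
               Stage.IN_PROGRESS, Stage.RESOLVED, Stage.CLOSED):
    stage = advance(stage, target)
print(stage.value)
```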
An incident management model includes time frames for incident resolution, insight into how to escalate an incident, and best practices for preserving data and key performance indicators (KPIs) during an incident.
An incident management tool is used by organizations to both facilitate and improve incident management. Incident management tools can automate escalations, monitoring, and collaboration.
An incident response phase is a stage of an incident response plan. There are generally six incident response phases: preparation, identification, containment, eradication, recovery, and lessons learned. Each phase plays an important role in effective incident response.
Incident response or incident management teams, also known as network operations center (NOC) teams, are trained to provide immediate solutions for incidents that disrupt an organization’s operations. An incident management team ensures an incident is closed or resolved within a predefined time limit described in a service-level agreement (SLA).
An information security incident is an adverse event, such as a cyber-attack or insider threat, that negatively impacts an information system or a network. This type of incident poses a threat to the availability, integrity, and confidentiality of a system.
An IT alerting system is a tool used by organizations to mitigate business risks and detect problems in the IT environment. It is a customizable tool that can automate and deduplicate alerts from various integrated monitoring sources, run analytics, and create reports.
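Deduplication can be sketched as grouping alerts that describe the same problem, even when they arrive from different monitoring sources. The example below assumes each alert is a dictionary with hypothetical `host`, `check`, and `source` fields, and treats alerts with the same host and check as duplicates; real tools typically use richer matching rules.

```python
def deduplicate(alerts: list[dict]) -> list[dict]:
    """Collapse alerts that share a (host, check) pair, counting the repeats."""
    seen: dict[tuple, dict] = {}
    for alert in alerts:
        fingerprint = (alert["host"], alert["check"])
        if fingerprint in seen:
            seen[fingerprint]["count"] += 1
        else:
            # Keep the first alert seen for each fingerprint and start a counter.
            seen[fingerprint] = {**alert, "count": 1}
    return list(seen.values())


# Hypothetical alerts from two monitoring sources.
incoming = [
    {"host": "db-1", "check": "disk_full", "source": "monitor-a"},
    {"host": "db-1", "check": "disk_full", "source": "monitor-b"},
    {"host": "web-2", "check": "high_latency", "source": "monitor-a"},
]

for alert in deduplicate(incoming):
    print(alert["host"], alert["check"], alert["count"])
```

Collapsing repeats in this way is also one practical defense against alert fatigue, because recipients see one notification per problem rather than one per monitoring source.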
In an incident response team, the lead investigator analyzes an incident to find its root cause so that the team may start recovering from the incident and developing preventative measures as soon as possible.
Lessons learned, also known as post-incident activity, is the final step in an incident response plan. In this step, a resolved and closed incident is reviewed to identify steps that can be taken to improve a system and help prevent future incidents.
A mass notification system (MNS) is a platform that delivers information to a group of people. The system is flexible in its configuration of messages, controls, recipients, and methods of communication.
Mean time between failures (MTBF) is commonly used to measure hardware component or system reliability. It is calculated as an average of the time between hardware component or system failures.
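A short worked example of the averaging described above, assuming failures are recorded as cumulative operating hours and that repair time is negligible. The function name `mtbf` and the sample figures are hypothetical.

```python
def mtbf(failure_hours: list[float]) -> float:
    """Average gap, in hours, between consecutive recorded failures."""
    if len(failure_hours) < 2:
        raise ValueError("at least two failures are needed to measure a gap")
    gaps = [later - earlier for earlier, later in zip(failure_hours, failure_hours[1:])]
    return sum(gaps) / len(gaps)


# Failures observed at these cumulative operating hours (made-up data).
# Gaps are 240, 180, and 480 hours, so the average is 900 / 3.
print(mtbf([100.0, 340.0, 520.0, 1000.0]))  # 300.0
```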
Mean time to detect (MTTD) is the average time it takes to identify an issue. It measures the time between the beginning of an outage and when the business identifies the issue.
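The calculation can be sketched as follows, assuming each outage is recorded as a pair of timestamps: when it began and when it was identified. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(outages: list[tuple[datetime, datetime]]) -> timedelta:
    """Average delay between the start of an outage and its detection."""
    if not outages:
        raise ValueError("at least one outage is required")
    delays = [detected - started for started, detected in outages]
    return sum(delays, timedelta()) / len(delays)


# Two made-up outages, detected after 12 and 8 minutes respectively.
history = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 12)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 38)),
]

print(mean_time_to_detect(history))  # 0:10:00
```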
Mean time to failure (MTTF), sometimes loosely equated with “uptime,” is the average amount of time a system or component operates before it fails. It is typically applied to systems or components that are replaced rather than repaired after a failure.
Model-based testing requires the use of test cases derived from visual models that represent the desired behavior of a system or environment. It is commonly used to generate manual tests, test data, and automated tests.
The OODA loop is a decision-making framework developed by U.S. Air Force military strategist John Boyd and commonly applied to incident response. The steps of the OODA loop are: Observe, Orient, Decide, and Act. In incident response, the OODA loop is used to help businesses quickly identify and respond to incidents.
Pair programming is a software development technique in which two developers simultaneously work on a single feature. It promotes collaboration, as both developers can analyze each other’s code to bolster overall code quality.
Organizations can use planning to shorten incident response and resolution times. Organizations plan for incident management by identifying potential events that may cause incidents before they happen.
Platform as a service (PaaS) is a form of cloud computing service that involves the use of a platform to develop, run, and control applications. With PaaS, a third-party provider delivers hardware and software tools to end users via the internet.
Prioritization is the assessment of an incident and its impact on business processes and stakeholders. Different processes and workflows can be implemented depending on the priority level of an incident.
Rich alerting is a method of alerting that ensures each alert reaches the correct and most relevant recipient, based on the type of alert and the recipient’s schedule.
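One way to picture this routing is a lookup keyed on alert type and time of day. The sketch below assumes a hypothetical on-call table in which each alert type maps to shifts given as (start hour, end hour, recipient); the names and shift boundaries are invented for illustration.

```python
from typing import Optional

# Hypothetical on-call schedule: alert type -> list of (start_hour, end_hour, recipient).
ON_CALL = {
    "database": [(0, 12, "dana"), (12, 24, "omar")],
    "network": [(0, 24, "lee")],
}


def route(alert_type: str, hour: int) -> Optional[str]:
    """Return the recipient on call for this alert type at the given hour (0-23)."""
    for start, end, recipient in ON_CALL.get(alert_type, []):
        if start <= hour < end:
            return recipient
    return None  # no schedule covers this alert; a real system would escalate


print(route("database", 9))   # dana
print(route("database", 15))  # omar
print(route("storage", 3))    # None
```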
Role-based security ensures that users can view only the data and alerts intended for their role. For example, a software developer will not receive or be able to view alerts meant for a C-level employee.
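A minimal sketch of the filtering this implies, assuming each alert carries a hypothetical `audience` field naming the role it is meant for, and each user holds a set of roles. The role names and alert messages are invented for this example.

```python
def visible_alerts(alerts: list[dict], user_roles: set[str]) -> list[dict]:
    """Return only the alerts whose audience matches one of the user's roles."""
    return [alert for alert in alerts if alert["audience"] in user_roles]


# Hypothetical alerts tagged with the role they are intended for.
alerts = [
    {"message": "Build failed on main", "audience": "developer"},
    {"message": "Quarterly outage summary", "audience": "executive"},
]

for alert in visible_alerts(alerts, {"developer"}):
    print(alert["message"])  # Build failed on main
```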
A service-level agreement (SLA) is a commitment made between a service provider and its client. It can include elements such as reliability, responsiveness, obligations, and penalties to be implemented when the SLA is not followed.
Severity describes the impact of an incident on a business’s users. For a severe incident, a business may need to craft a public statement to its users. An incident of minor severity may require action but may not immediately affect users.
A standard operating procedure (SOP) in IT alerting is a set of instructions compiled by an organization to help teams carry out routine IT operations based on alerts they receive.
Technical debt is a programming concept related to the implied cost of extra development work. A DevOps team may accumulate technical debt if it implements a solution that delivers short-term results instead of a long-term solution that requires additional time to produce and implement.
In DevOps, a toolchain is a set of software and/or products used to create a new program or perform a complex software development task. For example, IT alerting tools and/or incident management tools can be part of a toolchain.