
Best Practices for Managing On-Call Rotation (in 2023)

Alerting has come a long way, from paging a single on-call administrator in the middle of the night to multiple on-call teams that manage incident response around the clock with automated on-call rotation software.

As organizations grow and scale, responding to incidents gets more complex, and resolving an incident often requires more than one team.

Making sure that the right people from the right teams are always available seems straightforward enough, but can be a lot harder than it sounds.

While escalation policies are used to make sure the right person or team is alerted in the event of an incident, escalation policies alone don’t guarantee availability.
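To make the idea concrete, an escalation policy can be thought of as an ordered list of steps, each with a delay. The sketch below is only an illustration: the class name, the target names and the delays are all hypothetical and not taken from any particular tool. Note that it only decides who should be paged, not whether that person is actually reachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str         # hypothetical person or team name
    after_minutes: int  # minutes without acknowledgement before this step fires


def notified_targets(policy, minutes_unacknowledged):
    """Return every target that should have been paged by now."""
    return [
        step.target
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]


# Hypothetical three-level policy.
policy = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 10),
    EscalationStep("engineering-manager", 25),
]

# Twelve minutes without acknowledgement: the first two levels have been paged.
print(notified_targets(policy, 12))
```

If the primary and secondary are both asleep or unavailable, the policy still fires correctly but nobody responds, which is why scheduling matters as much as escalation.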

Ensuring that one team is awake, “alert” and available, while the other team is off schedule, requires planning, collaboration, scheduling, and proper on-call rotation.

Additionally, having a precise roadmap of on-call demand and availability requirements is critical in managing multiple on-call teams and rotating them in an effective and sustainable fashion.

Scheduling and on-call rotation

Incident management is all about agility and speed. Unlike problem management, incident management deals with a live incident that typically requires a quick workaround.

When you have multiple on-call teams looking after a service or multiple services, schedules are your golden compass for ensuring everyone is pointed in the right direction.

Scheduling helps you manage which members of a specific team are available at a given time, which teams are overlapping each other and how the escalation policies affect the different members of the teams.
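As a rough illustration of the first of these questions (who is available at a given time), a simple round-robin rotation can be computed from a start time and a shift length. This is a minimal sketch with made-up names and dates; real scheduling tools also handle time zones, layers and exceptions.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(members, rotation_start, shift_length, moment):
    """Return who is on call at `moment` in a fixed round-robin rotation."""
    if moment < rotation_start:
        raise ValueError("moment is before the rotation starts")
    # Dividing one timedelta by another gives the number of whole shifts elapsed.
    shifts_elapsed = (moment - rotation_start) // shift_length
    return members[shifts_elapsed % len(members)]


# Hypothetical team with weekly shifts starting Monday 2 January 2023, 09:00 UTC.
members = ["ana", "bo", "chen"]
start = datetime(2023, 1, 2, 9, 0, tzinfo=timezone.utc)
week = timedelta(weeks=1)

# Two full weeks have elapsed by 17 January, so the third member is on call.
print(on_call_at(members, start, week, datetime(2023, 1, 17, 12, 0, tzinfo=timezone.utc)))
```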

On-call schedules used to be created on spreadsheets that couldn’t really account for availability. Modern on-call rotation software tools are interactive and have built-in scheduling features that make the process a lot easier.

These tools also feature escalation policies that, in addition to managing the hierarchy of an escalation, let you define rules for how alerts behave.

While scheduling multiple escalations is essential when you need to get multiple teams engaged at the same time during an outage, on-call rotations are essential to making sure everyone gets enough sleep.

It’s not just about planning for an outage, but also about making sure everyone shares the workload evenly, especially on weeknights, weekends and holidays.
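One way to check whether the workload is shared evenly is to count how many weekend days each person covers over a period. The sketch below assumes a hypothetical daily rotation among three made-up names; the same count could be extended to holidays or night shifts.

```python
from collections import Counter
from datetime import date, timedelta


def weekend_load(assignments):
    """Count the weekend days (Saturday or Sunday) covered by each person."""
    return Counter(
        person for day, person in assignments if day.weekday() >= 5
    )


# Hypothetical daily rotation over four weeks, starting Monday 2 January 2023.
members = ["ana", "bo", "chen"]
first_day = date(2023, 1, 2)
assignments = [
    (first_day + timedelta(days=i), members[i % len(members)])
    for i in range(28)
]

load = weekend_load(assignments)
spread = max(load.values()) - min(load.values())
print(dict(load), "spread:", spread)
```

A small spread suggests the rotation is reasonably fair; a large one is a signal to adjust the rotation frequency or the order of the team.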

In fact, good on-call scheduling tools let you customize a number of things like preferred contact methods based on predefined schedules, rotation frequencies and time of day restrictions.
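A time-of-day restriction on contact methods can be sketched as a list of time windows, each mapped to a method. Everything here is an assumption for illustration: the rule class, the method names and the windows are hypothetical and do not reflect any specific product's configuration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class ContactRule:
    start: time   # window start (inclusive)
    end: time     # window end (exclusive)
    method: str   # hypothetical contact method label


def contact_method(rules, at, default="sms"):
    """Return the contact method of the first rule whose window contains `at`."""
    for rule in rules:
        if rule.start <= rule.end:
            in_window = rule.start <= at < rule.end
        else:
            # The window wraps past midnight, e.g. 22:00 to 07:00.
            in_window = at >= rule.start or at < rule.end
        if in_window:
            return rule.method
    return default


rules = [
    ContactRule(time(9, 0), time(18, 0), "chat_message"),
    ContactRule(time(22, 0), time(7, 0), "phone_call"),
]

print(contact_method(rules, time(10, 30)))  # working hours
print(contact_method(rules, time(3, 0)))    # overnight window
print(contact_method(rules, time(20, 0)))   # no rule matches, falls back to default
```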

Additional features include the ability to build complex schedules in which users can trade or swap shifts for a set number of hours each day.
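A shift swap can be modelled as an override that takes precedence over the base schedule for a limited window. This is a minimal sketch under that assumption, with hypothetical names and times.

```python
from datetime import datetime, timezone


def resolve_on_call(base_person, overrides, moment):
    """Return the override substitute if one covers `moment`, else the base person."""
    for start, end, substitute in overrides:
        if start <= moment < end:
            return substitute
    return base_person


# Hypothetical swap: "bo" covers four hours of "ana"'s shift on 10 January 2023.
overrides = [
    (
        datetime(2023, 1, 10, 18, 0, tzinfo=timezone.utc),
        datetime(2023, 1, 10, 22, 0, tzinfo=timezone.utc),
        "bo",
    )
]

during_swap = datetime(2023, 1, 10, 19, 30, tzinfo=timezone.utc)
after_swap = datetime(2023, 1, 10, 23, 0, tzinfo=timezone.utc)
print(resolve_on_call("ana", overrides, during_swap))
print(resolve_on_call("ana", overrides, after_swap))
```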

What all this automation and “time-saving” actually boils down to is the reduction of steps involved in alerting the right person, when every second counts.

Collaboration and integration with monitoring apps

Modern enterprise organizations often have huge teams that are geographically distributed across the globe.

In addition to scheduling these teams so they complement and don’t clash with each other, collaboration is key to making sure teams have the right tools and information available to deal with any incidents.

What if critical logs and information are available to a team that’s off duty, but not to the team that’s on call? Incident management isn’t just about getting the right people involved, but also about equipping them with the right information and the ability to communicate freely with other members in real time.

This is where integrations with collaboration software like Slack and Microsoft Teams enable constant collaboration between teams.

This includes the ability to reply to alerts in real time, message or question other team members in real time and add additional teams or members to an ongoing alert.

Integration isn’t just about collaboration or chat tools; it also involves integration with monitoring systems like ServiceNow, logging tools like Sumo Logic, and other event monitoring tools. AlertOps is one example of an incident management platform that integrates with chat, collaboration, and monitoring tools.

It features pre-built integrations and open, no-code APIs that allow you to connect and configure integrations with a number of different monitoring, communications and performance-enhancing tools.

The Open API also lets you connect to any system by email, webhooks or APIs. In this way, one common platform can be used to seamlessly manage and optimize all alerts from various monitoring systems, while simultaneously managing communication and collaboration.
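The general idea of funnelling alerts from different systems into one common format can be sketched as follows. This is not the AlertOps API: the source names, payload fields and normalized shape are all hypothetical, chosen only to show how webhook bodies from two differently shaped tools might be mapped onto one structure.

```python
import json


def normalize_alert(source, raw_body):
    """Map a source-specific webhook body onto one common alert shape."""
    payload = json.loads(raw_body)
    if source == "metrics-tool":  # hypothetical source and field names
        return {
            "source": source,
            "service": payload["host"],
            "severity": payload["level"].lower(),
            "summary": payload["message"],
        }
    if source == "log-tool":  # hypothetical source and field names
        return {
            "source": source,
            "service": payload["app"],
            "severity": "critical" if payload["priority"] <= 1 else "warning",
            "summary": payload["title"],
        }
    raise ValueError(f"unknown alert source: {source}")


metrics_body = '{"host": "db-orders", "level": "CRITICAL", "message": "disk 95% full"}'
log_body = '{"app": "checkout", "priority": 3, "title": "error rate rising"}'

metrics_alert = normalize_alert("metrics-tool", metrics_body)
log_alert = normalize_alert("log-tool", log_body)
print(metrics_alert)
print(log_alert)
```

Once every alert has the same fields, scheduling, escalation and routing logic only has to be written once, regardless of which monitoring system raised the alert.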

This greatly reduces the chance of burnout and fatigue among engineers and lowers Mean Time To Resolution (MTTR).

Mature routing process for alerts

In conclusion, while alerting may seem like a rather small cog in the giant DevOps wheel, without alerts we would probably have to stare at graphs all day to make sure nothing goes wrong.

Unlike in the past, when monolithic applications didn’t really require complex on-call team structures, today’s services are usually pieces of a larger puzzle that fit together to form a large, interconnected platform.

Additionally, this platform, more often than not, is dependent on a complex system of resources that can include everything from public clouds to database management services and complex networking layers.

The good news here is that there is plenty of scope to implement rule-based or policy-based routing of alerts, so that very little time is lost between an incident occurring and the right people being informed.
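Rule-based routing can be as simple as an ordered list of rules where the first match decides which team is alerted. The sketch below assumes alerts carry a service name and a severity; the rule fields, team names and severity levels are hypothetical.

```python
from dataclasses import dataclass

# Assumed severity scale, from least to most urgent.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class RoutingRule:
    service_prefix: str  # matches services whose name starts with this prefix
    min_severity: str    # lowest severity that this rule routes
    team: str            # hypothetical on-call team name


def route(alert, rules, fallback_team="platform-on-call"):
    """Return the team of the first matching rule, or the fallback team."""
    for rule in rules:
        matches_service = alert["service"].startswith(rule.service_prefix)
        severe_enough = (
            SEVERITY_ORDER[alert["severity"]] >= SEVERITY_ORDER[rule.min_severity]
        )
        if matches_service and severe_enough:
            return rule.team
    return fallback_team


rules = [
    RoutingRule("db-", "warning", "database-on-call"),
    RoutingRule("net-", "critical", "network-on-call"),
]

print(route({"service": "db-orders", "severity": "critical"}, rules))
print(route({"service": "net-edge", "severity": "warning"}, rules))
```

The fallback team ensures that an alert matching no rule still reaches someone rather than being dropped.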

In today’s dynamic software market, where an organization’s on-call incident management abilities are considered a key aspect of quality and dependability, choosing the right incident management platform can be a critical decision.