Combatting downtime

Eduardo Crespo at PagerDuty argues that reducing downtime with AI means being operationally mature

Combatting downtime is essential. An always-on company has a strong advantage in customer service, but the trouble is that as a company grows, so does its technological complexity. AI is now widely proposed as a way to remove friction and many causes of downtime.

However, talking generally about AI obscures more than it reveals. The best route forward for organisations is to look more closely at what they are aiming to achieve with their AI implementations, and how to get there efficiently.

That means understanding digital operational maturity, removing the causes of downtime and ensuring AI projects really contribute, rather than becoming an extra burden on digital infrastructure.

The value of operational maturity

The first step when taking a strategic approach to improving digital operations maturity is to understand where the organisation stands. Business and technical decision makers must align to:

Benchmark against best practices and solutions to improve in identified areas
Visualise and map the journey to higher maturity, involving automation and AI use cases
Set tailored metrics for success, including around skills acquisition and cultural change

Why become more operationally mature? To perform better, to reduce downtime from addressable causes and to react more effectively to unexpected incidents. On average, organisations with a mature digital operations approach acknowledge incidents 7 minutes faster, mobilise responders 11 minutes faster, resolve incidents 2 hours faster, and have 14 fewer hours of downtime each month.
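For teams that want to track figures like these for themselves, the arithmetic is simple. The following is a minimal sketch in Python, assuming each incident is recorded with three timestamps. The Incident record and its field names are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def response_metrics(incidents):
    """Return mean time to acknowledge and mean time to resolve, in minutes."""
    return {
        "mean_time_to_acknowledge": mean_minutes(
            [i.acknowledged_at - i.triggered_at for i in incidents]
        ),
        "mean_time_to_resolve": mean_minutes(
            [i.resolved_at - i.triggered_at for i in incidents]
        ),
    }
```

Comparing these averages month on month gives a simple baseline against which maturity improvements can be judged.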

But the benefits are not just for customer service and maintaining business infrastructure. Higher levels of operational maturity improve the employee experience, reducing stress and churn while keeping skills within the business. In a PagerDuty-sponsored report, IDG outlines how operational maturity contributes to a virtuous cycle of business, customer and employee gains.

The digital operations maturity model defines operational maturity, allowing leaders to identify their current place and where to focus improvements. It’s a five-point progression from least capable and mature to the optimal state of readiness and resilience.

That optimal state is where organisations are on top of maintenance and preventative activity, ensuring all systems operate at the desired efficiency. The further along the scale an organisation is, the greater its ability to use automation and AI, and the more uptime it achieves and customer expectations it fulfils, as the foundations of agility and resilience are well laid.

Very briefly, the model starts at Manual, a legacy environment where humans lack automation and machine-learning-driven assistance, and moves to Reactive. At this second stage there are some initial tech investments in visibility and real-time mobilisation, and apps are maturing into more complex digital services.

At the third level, Responsive, organisations are exploring machine learning to identify issues and reduce the false positives and ‘noise’ tech teams face.

The fourth level is Proactive. Organisations have seamless, coordinated responses and an action sequence for urgent real-time work. Issues are detected and fixed before customers notice, while programmatic learning identifies opportunities for optimisation.

The fifth and final stage is Preventative. Predictive issue remediation happens via machine learning insights for seamless customer experiences. Organisations have highly automated processes eliminating manual engineering toil. Best practices and a culture of continuous learning, improvement, and prevention are in place. The business can predict the future impact of changes.

Figure 1. The five stages of operational maturity
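For teams that want to record their position on this scale in tooling, the five stages map naturally onto an ordered enumeration. This is a minimal sketch in Python: the stage names come from the model described above, while the MaturityStage class and the next_stage helper are hypothetical names introduced here for illustration.

```python
from enum import IntEnum


class MaturityStage(IntEnum):
    # Stage names follow the five-point model; the numbering is an assumption.
    MANUAL = 1
    REACTIVE = 2
    RESPONSIVE = 3
    PROACTIVE = 4
    PREVENTATIVE = 5


def next_stage(current):
    """Return the next stage to aim for, or None at the top of the scale."""
    if current is MaturityStage.PREVENTATIVE:
        return None
    return MaturityStage(current + 1)
```

Because the stages are ordered, a benchmark can compare teams directly, for example by checking whether a team's stage is below MaturityStage.PROACTIVE.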

The tools for the job

Organisations should bring digital telemetry data into an AIOps solution and create a comprehensive system where context can be applied, allowing talent to quickly determine the right course of action. Downtime will happen, so teams need to know what’s playing out in real time. An incident management platform manages and groups alerts from different monitoring systems and provides this context.
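To make the idea of grouping alerts concrete, here is a deliberately simplified sketch in Python. It assumes alerts for the same service that arrive within a few minutes of each other belong to one incident. The Alert and AlertGroup types, the group_alerts function and the five-minute window are all hypothetical; real platforms use far richer correlation than this, and the sketch does not describe any vendor's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # monitoring system that raised the alert
    service: str         # affected service
    raised_at: datetime
    summary: str


@dataclass
class AlertGroup:
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts per service when each arrives within `window` of the previous one."""
    groups = []
    latest = {}  # service -> most recent group opened for that service
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        group = latest.get(alert.service)
        # Start a new group if none exists or the service has been quiet too long.
        if group is None or alert.raised_at - group.alerts[-1].raised_at > window:
            group = AlertGroup(service=alert.service)
            groups.append(group)
            latest[alert.service] = group
        group.alerts.append(alert)
    return groups
```

Even this crude rule shows the benefit: responders see one grouped incident per affected service, with the contributing monitoring sources attached, rather than a stream of separate notifications.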

Event-driven automation at scale lies at the centre of a strong AIOps-led process that creates context from the signals coming from observability tools and customer cases. GenAI becomes a key enabler in AIOps because of the speed it offers in creating context and drafting stakeholder communications, among other benefits.

AIOps is foundational: it offers real-time insights, predictive analytics and proactive incident management, and it supports businesses as they onboard new technologies, improve operations and constantly enhance the customer experience.

The central nervous system maintains service

Mature levels of operational ability and proper digital infrastructure management pay off in a reduction of breakdowns and downtime. They save the costs of lost revenue as well as of emergency break-fix work.

When this is enhanced by automation that supports the engineering team and their workflows, and backed up by successful AI implementations such as predicting points of failure that would lead to service outages, the tech function will have created a real central nervous system. It can react to incidents and respond in such a manner that customer services rarely see any impact on quality and delivery.

However, reaching that ideal state of full operational maturity is often complicated by various cultural and practical roadblocks. To succeed, leaders need to embrace a broad strategy. Complexities that need to be managed to remain in this near-optimal state include:

Maintaining and improving predictive issue remediation based on machine learning insights so great customer experiences continue.
Investing in tested, resilient and agile automated processes and services that eliminate toil and escalations for your human talent.
Training, reviewing and tweaking where needed to develop a well-defined and coordinated response process for any digital issues, whether they arise internally or from external sources and partners.
Reviewing and adapting best practices while evaluating the culture of continuous learning and blameless investigation.
Involving everyone, including non-technical stakeholders, in planning so the business is able to predict the impact of any future changes, understand how to respond and best manage everyone’s expectations.

The bottom line

The bottom line is that tech issues are not just tech issues. Putting in place the right talent, tech, teams and processes is key to reducing downtime and keeping customers happy with their services, and the route there comes from honest engagement with a roadmap to real digital maturity.

Eduardo Crespo is VP EMEA at PagerDuty
