AI SRE and Observability - Amplify and Accelerate

A few years back I was involved in a project to improve the monitoring and security posture of the production environment at the company I worked at. Over the years we had generated a lot of data, mostly from logs, which we dumped at great volume into Splunk; this was very much common practice. There was a lot of extremely valuable data in those logs and Splunk did an excellent job of mining them. We used the data for everything from populating the Security Information and Event Management (SIEM) system, to fraud detection and systems monitoring. It was the prime monitoring tool for almost everything in the business and, for what we used it for, Splunk was largely very good.
However, I (we) fell into a trap that we often fall into with monitoring and telemetry collection: we collected everything for fear of missing something. Splunk, being the tool that it is, thrives on these large datasets and is excellent at extracting signal from noise, if you know what you are doing. But the firehose of data can very quickly become unmanageable and useless to its users, and you wake up one day with a small army of people who have become experts at Splunk rather than experts at getting value from the data that drives the business.
One response to this dilemma (and hello to Jason Isaacs… if you know, you know) was to layer on another tool that would use fancy machine learning models to do the heavy lifting of finding meaningful signal in the torrent of noise. So-called AIOps tools were touted as the answer to overloaded human operators. In my opinion, much of the AIOps tooling of a few years back was more hype than reality and just piled more complexity on top of existing complexity. But let's set that aside for now! Let's say that AIOps tooling worked perfectly. I would still argue that it was the wrong solution. Now that LLMs and AI tooling are vastly more capable, and this brand of AIOps is far more feasible, it's tempting to throw them at the problem and have the magic improve, accelerate and optimise your operations.
The other response to growing complexity and data overload was Site Reliability Engineering (SRE) and modern Observability.
Observability, SRE and AI - The Two Paths
There are two ways you might apply AI to telemetry data and build on SRE practices. These are:
Path One - Apply AI to existing telemetry stores and use it as an analysis engine for the mass of data. Then use the analysis output to create an AI-driven “virtual SRE”.
Path Two - Apply AI to the problem of instrumentation to drive down low-value data, build higher value telemetry at source, and amplify and enhance the value of an existing SRE practice.
Path One is obviously the modern cousin of the old AIOps approach, whereas Path Two is an extension of an existing SRE and modern Observability driven approach. In my opinion, Path Two is the path we should take. Here is why.
Regression - The Path One Trap
We have spent years building up best practice around DevOps and SRE. Out of them has come Observability, a fit-for-purpose approach to system health and monitoring of large-scale distributed systems.
DevOps encourages the ownership of development and operations as a shared responsibility rather than a “throw it over the fence” approach. SRE instills the mantra that developers and operations people should share the same view of application health and reliability, and be engaged from inception to production operation of the software they create. We have worked hard to build these disciplines. If we keep ourselves honest with continual learning, such as absorbing the yearly DORA report, then we are able to measure the effectiveness of these approaches and course correct where needed.
If we are tempted by the allure that Path One offers (just keep generating lots of data, feed it to the AI and let the virtual SRE make sense of it), then we start to break that chain of ownership of the software from creation to operation. We will regress back to the throw-it-over-the-fence model, except we are throwing it over the fence to an AI.
You might be tempted to say “Is that actually bad though? If the AI can do the job, then why not? We can just focus on producing software!”. But it is bad, for many reasons. The first of which is that it is a trap. A genuine trap.
Once you break that chain and rely on someone or something else on the other side of the fence to reason about the state of your software and infrastructure in operation, you are trapped with that person or thing. This is the “hero engineer” problem we have been trying for years to get away from. You, the creator of that software, will lose control. We shouldn’t worry about SkyNet taking over the world, but we should worry about being locked into it running our software and infrastructure. In doing so we lose agility and flexibility in the operational capability of our infrastructure and software.
The other obvious issue is cost. I think there is a not insignificant short-term gain to be had in taking Path One, which is part of the allure. But in the medium term, if left unchecked, that gain will be wiped out by the high cost of the AI and the ballooning size of the telemetry stores. Anyone who has done Observability over the last few years will know the struggle of managing the cost of data storage. To make matters worse, there will be very little you can do about the cost, because you have lost control of the data and won’t necessarily be able to reliably reduce its volume without adversely and unpredictably impacting operations.
Which leads to perhaps the less obvious cost: automated, scaled-up, AI-powered technical debt generation. In the near term it is invisible. Undoing it will be extremely hard and costly.
Taking this path is trading short-term gain for long-term pain. Not a good use of AI tech. Let’s not be blinded by the tech and blunder into a regression in capability and good practice disguised as AI innovation.
Enlightenment - Path Two
AI tech is revolutionising, and will continue to revolutionise, our industry, and I am super excited about the possibilities. I strongly believe it can, and should, be used to accelerate and amplify what are already good practices, not undermine them, even by accident. We need to avoid accidents and traps.
For the last few years, I have been working with various teams to establish an SRE-driven Observability approach to building and running software and infrastructure. I have advocated moving away from the “chuck all the data you can into the mix” model and instead taking a much more considered approach: making the instrumentation of code and infrastructure a first-class concern of the build process, rather than leaving it as a problem for the operations teams later.
Ensuring the signals coming out of the software and systems are high quality, and can be used to develop the user-centric reliability view that is the goal of observability, has multiple benefits. Firstly, it reduces the noise and volume of data going to the telemetry stores, giving cleaner and higher value data by default and tighter control over data volumes, which unlocks cost-effective management options without overly compromising fidelity. On top of this, high-quality Service Level Objectives (SLOs) can be built which allow teams to reason about highly scalable and dynamic distributed systems. Such systems are built to be tolerant and expectant of component failure, and cannot be monitored in the “traditional” way without falling prey to alert fatigue.
This is a mindset shift. It requires being more disciplined in both the creation of telemetry and the response to the signals it produces. It is essential to an SRE approach that the engineers responsible for building and running the software and systems they create understand the instrumentation used to generate the signals that describe the state of the system at runtime. The continuity provided by DevOps, SRE and Observability means that engineers looking at the state of the systems in operation have a baked-in understanding of the signals those systems are generating; an intrinsic understanding of how the systems function. This makes operating complex software and systems much easier and more deterministic.
Inserting AI into that continuity, rather than breaking it, is the key. Driving established Observability and SRE practices with AI in the right way will amplify best practice and accelerate high-performing capabilities.
Where observability implementations get bogged down, and often derailed, is the toil involved in instrumenting software. You have to do it properly and completely, or you dilute the value of the telemetry and end up with poor quality signals and a lot of work for SREs at the operational end. One solution to this has been auto-instrumentation: decorating methods and functions with tags for instrumentation, then having an agent inject the instrumentation on the fly. This can work well, but it has its limits: you are stuck with the instrumentation provided by the agent or framework, and there may be performance impacts if the agent is injecting instrumentation at runtime.
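The decorator style of instrumentation can be sketched in plain Python. This is a framework-free illustration of the pattern, not the API of any particular agent; the `TELEMETRY` list and the span names are hypothetical stand-ins for a real telemetry exporter.

```python
import functools
import time

# Hypothetical stand-in for a real telemetry exporter/backend.
TELEMETRY = []

def instrumented(span_name):
    """Record duration and status for every call to the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # One telemetry record per call, success or failure.
                TELEMETRY.append({
                    "span": span_name,
                    "status": status,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator

@instrumented("checkout.total")
def checkout_total(prices):
    return sum(prices)

checkout_total([5.0, 10.0])
print(TELEMETRY[0]["span"], TELEMETRY[0]["status"])  # checkout.total ok
```

An agent-based auto-instrumentation framework does essentially this at load time or runtime; doing it explicitly in the code, as above, is what gives you the flexibility and control that the agent takes away.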
This is where AI tooling comes into its own. It can provide the benefits of auto-instrumentation but with greater flexibility and more granular control. Using well-defined observability principles and requirements as guidance, you can use AI to inject instrumentation directly into the code, in the right context, at build time, and/or to actively suggest how to instrument effectively as you build the software. With some discipline over how the AI is used, and a human in the loop, you don’t break that chain of understanding and ownership that is at the heart of SRE.
By grounding the AI in established monitoring methodologies such as RED, USE and the Four Golden Signals, you can get it to assist with applying the right approach at the right time and ensuring the telemetry needed to make those approaches effective is available. In operation, the same grounding can be used to assist SREs in problem analysis and troubleshooting.
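As a reminder of what that grounding amounts to, here is a minimal sketch of the RED method (Rate, Errors, Duration) per service. The service name, request records and window size are made-up illustrative data; a real system would use a metrics library rather than this hand-rolled aggregator.

```python
import statistics
from collections import defaultdict

# Minimal RED-method sketch: Rate, Errors, Duration per service.
# All data below is hypothetical, for illustration only.
class RedMetrics:
    def __init__(self):
        self.durations = defaultdict(list)   # service -> request durations (ms)
        self.errors = defaultdict(int)       # service -> error count

    def record(self, service, duration_ms, ok=True):
        self.durations[service].append(duration_ms)
        if not ok:
            self.errors[service] += 1

    def summary(self, service, window_seconds):
        calls = self.durations[service]
        return {
            "rate_rps": len(calls) / window_seconds,           # Rate
            "error_ratio": self.errors[service] / len(calls),  # Errors
            "p50_ms": statistics.median(calls),                # Duration
        }

red = RedMetrics()
for ms, ok in [(12, True), (15, True), (230, False), (14, True)]:
    red.record("payments", ms, ok)

print(red.summary("payments", window_seconds=60))
```

Grounding an AI assistant in a methodology like this means it can check that these three signals actually exist for each service before proposing dashboards or alerts, rather than inventing views over whatever data happens to be lying around.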
It is my belief that this is where the big wins are in using AI in an SRE and Observability context. This path locks in long-term benefit without losing best practice and well-proven ways of working. Rather, it accelerates and amplifies both.