There is a growing problem in IT Operations – the actual problem is that IT Operations is growing – very, very fast. IT operational event data from logs, agents and network wire data has exploded from hundreds of events per day in the 1970s, through thousands in the 80s, hundreds of thousands in the 90s and millions in the 2000s, to billions of events per day today – some enterprises are now even talking about measuring trillions of events per day.
Conversely, as event volume grows, event value is dropping rapidly. The vast majority of operational alerts are false positives, and that share is only growing. There are proportionally fewer needles and many, many more haystacks.
Compounding the volume and value problems, we’ve seen the growth in the variety of specialized monitoring tools for network, logs, applications, storage, databases, cloud, web and system resources. While most large enterprises have attempted to collapse the growing variety of siloed solutions through “Single Pane of Glass” initiatives, typically the only lasting result is the implementation of an overlying event aggregator solution to pull together events from the various silos. But the variety of dozens of underlying specialized monitors remains.
This complex and dynamic operational miasma increases troubleshooting time, complicates decision making and slows restoration time … all while the business and customer have growing expectations for increased velocity in service restoration.
Humans can’t keep up with the volume, value, variety and velocity demands of today’s IT Ops environment. We are becoming the problem and not the solution.
So, what is AIOps? Artificial Intelligence for IT Operations (formerly Algorithmic IT Operations Analytics) is a term generally applied to software products or services that combine big data, data analytics, AI, and automation technologies to resolve the volume/value/variety/velocity problem. AIOps products are intended to build on, not replace, siloed monitoring tools and event aggregation platforms already in place. But AIOps provides the smarts and automation beyond what today’s platforms offer.
Looking under the hood, AIOps products:
- Ingest and aggregate event data from various monitoring silos or even event aggregation platforms
- Provide noise reduction and event correlation
- Detect operational anomalies and determine root cause (a.k.a. causal analytics)
- Integrate with the IT Ops workflow for ticketing, notification, event & incident management, and event data capture
- Provide decision support mechanisms for IT Ops staff during troubleshooting
- Automate remediation and restoration
Great news … except there is a lot of AIOps hype on the market today. As is seen in every gold rush, there are many new prospectors staking a claim but only a small proportion hit gold. There is a host of new startups claiming to offer fully refined AIOps right out of the chute. There are many incumbent event aggregation platforms that claim they have made the evolution from simple event aggregation to smart AIOps. And there is a bunch of siloed monitoring tools that claim that they not only offer AIOps but can span other silos as well. How do you tell the difference between pay dirt and fool’s gold?
6 Questions to tell if an AIOps offering is glitter or gold:
From which operational silos can your AIOps product ingest events?
An AIOps product must ingest a copious amount of event data from a wide spectrum of event sources. Today there are still operational barriers between agent, log, timeseries and network wire data. So, there isn’t “one ring to rule them all” … yet. Nonetheless, an AIOps solution must collect and store event data from a variety of specialized monitoring tools, possibly even other event aggregation tools.
How does your AIOps product provide noise reduction and event correlation?
AIOps tools must not only collect event data but must also filter out irrelevant events, correlate related events and determine the causal relationships between related events. If a system only collects and de-duplicates events it is a log aggregation product and not AIOps.
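To make the distinction above concrete, here is a minimal Python sketch, not any vendor's method. The `Event` fields, the 60-second window and the "same host" grouping key are all assumptions chosen for illustration. `deduplicate` is what a plain aggregator does; `correlate` is the further step of grouping related events. Determining causal relationships within a group is a harder problem that this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; real monitoring tools expose many more fields.
    ts: float      # timestamp in seconds
    source: str    # which monitoring silo produced the event
    host: str
    message: str


def deduplicate(events):
    """Keep only the earliest occurrence of each (source, host, message)."""
    seen = set()
    unique = []
    for ev in sorted(events, key=lambda e: e.ts):
        key = (ev.source, ev.host, ev.message)
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique


def correlate(events, window=60.0):
    """Group events on the same host that arrive within `window` seconds
    of the previous event in that host's current group."""
    groups = []
    open_group = {}  # host -> index of that host's most recent group
    for ev in sorted(events, key=lambda e: e.ts):
        idx = open_group.get(ev.host)
        if idx is not None and ev.ts - groups[idx][-1].ts <= window:
            groups[idx].append(ev)
        else:
            groups.append([ev])
            open_group[ev.host] = len(groups) - 1
    return groups


sample = [
    Event(0, "log", "web1", "disk full"),
    Event(5, "log", "web1", "disk full"),       # duplicate, dropped
    Event(10, "agent", "web1", "write failed"),  # same host, within window
    Event(20, "net", "db1", "link flap"),        # different host
    Event(500, "agent", "web1", "cpu high"),     # same host, outside window
]
incidents = correlate(deduplicate(sample))
for group in incidents:
    print([f"{e.host}: {e.message}" for e in group])
```

A product that stops after `deduplicate` would hand the operator four separate alerts; the correlation step reduces them to three candidate incidents, one of which spans two monitoring silos.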
What AI does your AIOps product use?
An AIOps product must have AI. This seems obvious, but many products touting AI only use traditional if/then/else logic in coded rule sets. If you hear phrases like thresholds, rules engine, ‘tuned to your environment’ or ‘you can code your own rules or triggers’, this should raise red flags. When you ask, ‘what AI model(s) are used’, ‘how are they trained’, ‘what data is used to train them’ and ‘how do they get retrained’, if you get blank stares or a smokescreen of trade secrets, move on. Conversely, a real AI vendor will get giddy at these questions and dive into a detailed explanation, eyes a-twinkle. Now you’re on the right track.
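The difference between a coded rule and a model that learns from data can be shown with a deliberately tiny Python sketch. Everything here is an assumption for illustration: the 90% CPU limit, the `BaselineModel` name and the three-standard-deviation cutoff. A mean-and-deviation baseline is simple statistics rather than real AI, but it stands in for the property to ask vendors about: its behavior comes from training data, and it changes when it is retrained.

```python
import statistics


def threshold_rule(cpu_percent, limit=90.0):
    """Hand-coded rule: fires only above a fixed, manually chosen limit."""
    return cpu_percent > limit


class BaselineModel:
    """Toy learned baseline (hypothetical): flags values far from what
    the training data says is normal for this particular system."""

    def fit(self, history):
        # "Training" here is just estimating mean and spread from history.
        self.mean = statistics.fmean(history)
        self.stdev = statistics.stdev(history)
        return self

    def is_anomaly(self, value, z=3.0):
        return abs(value - self.mean) > z * self.stdev


# A host that normally idles around 20% CPU.
history = [20, 22, 19, 21, 20, 23, 18, 21]
model = BaselineModel().fit(history)

reading = 60.0
print("rule fires:", threshold_rule(reading))
print("model flags:", model.is_anomaly(reading))

# Retraining is another call to fit() with fresher data.
model.fit(history[4:] + [55, 58, 61, 60])
```

For the 60% reading, the fixed rule stays silent because 60 is below 90, while the baseline flags it because it is far outside what this host has done before. After the second `fit`, what counts as normal has moved, which is the kind of answer a vendor should be able to give to the retraining question.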
How does your AIOps product integrate with my IT Operations workflow?
An AIOps tool must integrate with various IT Ops workflow tools for functions such as ticket creation and processing, event and incident creation and management, and recording event data for later analysis and learning. Before the advent of AIOps, these traditional ITIL-related tasks were carried out manually, greatly extending troubleshooting and restoration time. A true AIOps tool will take care of this grunt work, streamlining the process, reducing errors and freeing scarce Ops labor for more valuable tasks.
How does your AIOps product help with decision support?
An AIOps tool should not be a passive participant in the detection, diagnosis, remediation and restoration process. It should provide active tools allowing operators to isolate pertinent data, drill down to specific event clusters, identify similar past event patterns, enable cross-discipline collaboration and present concise visualizations to facilitate speedy and informed decisions.
How does your AIOps product provide automation for remediation and restoration?
This is a newer feature being deployed by AIOps solutions. Once a course of remediation has been decided, the AIOps tool should integrate with automation tools, run books and orchestration products to automatically and globally apply fixes and restore service. This can be a scary step for many large enterprises but is definitely where AIOps is heading.
As AIOps solutions mature, they will integrate with more event data sources, consolidate and merge with event aggregation platforms and expand their automation and orchestration capabilities. Couple these trends with the parallel evolution of AI-based event analytics tools underway on the security side of the house and it’s apparent that there will be convergence between operational and security platforms that will soon detect, diagnose and remediate both malfunction and malfeasance.
There are, of course, considerable non-technical challenges to be surmounted when implementing AIOps solutions. Sunk cost bias, cultural impedance, job security fears and author-pride in artisan hand-built manual solutions are common counterweights to AIOps initiatives. These concerns are certainly real and not to be ignored. But as the pain of operational volume, value, variety and velocity grows, a tipping point will be reached where the benefits of AIOps provide a compelling alternative.
Mark Campbell is the Chief Innovation Officer at Trace3 where his teams review over 1,000 tech start-ups each year. Based out of sunny Denver, Colorado, Mark is a researcher and industry watcher who leverages his 25 years of real-world IT experience to help enterprises adopt emerging technologies to tackle their toughest technical and business problems. Mark holds telecom patents, writes frequent articles for tech publications and works with the world’s largest IT venture capital firms.