IBM

IT failures are increasingly costly: as more business moves online, even brief outages can lead to millions in losses. Incident management has become more complex than ever due to a combination of technological advancements, infrastructure heterogeneity, and evolving business needs. Resolving IT incidents is as complex as, if not more complex than, fixing bugs in software code, and it is a tedious and expensive task. Several advances have been made, including IBM’s Intelligent Incident Remediation, which uses LLMs and generative AI to streamline incident resolution by identifying probable causes and providing AI-guided remediation steps. In this demo, we describe how we are advancing the state of the art in incident remediation using agentic generative AI approaches. We demonstrate SRE-Agent-101, a ReAct-style LLM-based agent, along with a benchmark to standardize how the effectiveness of analytical solutions for incident management is measured. SRE-Agent-101 uses several custom-built tools: anomaly detection, causal topology extraction, NL2Traces, NL2Metrics, NL2Logs, NL2TopologyTraversal, and NL2Kubectl. These tools take natural language as input and fetch target data gathered by the observability stack. Given the verbosity of such data, even powerful models can quickly exhaust their context length. We have implemented a methodology that uses domain knowledge to dynamically discover the more specific context. The underlying LLM then analyzes this target context to infer the root-cause entity and fault and to perform actions, and the process continues iteratively until the incident is resolved.
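The iterative loop described above (choose a tool, fetch data, narrow the context, repeat) can be illustrated with a minimal Python sketch. This is not the SRE-Agent-101 implementation: the tool functions are hypothetical stubs that return canned strings, `scripted_policy` is a fixed plan standing in for the LLM, and `narrow` is an assumed, very simple stand-in for the domain-knowledge-based context discovery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


# Hypothetical stubs: real tools would translate the natural-language query
# into calls against the observability stack or the cluster.
def nl2metrics(query: str) -> str:
    return "checkout: error_rate=0.42; cart: error_rate=0.01"


def nl2logs(query: str) -> str:
    return "checkout: OOMKilled, restarting container"


def nl2kubectl(query: str) -> str:
    return "deployment/checkout memory limit raised; pods healthy"


TOOLS: Dict[str, Callable[[str], str]] = {
    "NL2Metrics": nl2metrics,
    "NL2Logs": nl2logs,
    "NL2Kubectl": nl2kubectl,
}


def narrow(observation: str, focus: str, max_chars: int = 200) -> str:
    """Assumed context reduction: keep fragments mentioning the focus entity, then cap length."""
    parts = [p.strip() for p in observation.split(";") if focus in p]
    kept = "; ".join(parts) if parts else observation
    return kept[:max_chars]


@dataclass
class Step:
    thought: str
    tool: str
    tool_input: str


def scripted_policy(history: List[str]) -> Optional[Step]:
    """Stand-in for the LLM: returns the next step, or None when done."""
    plan = [
        Step("Find the service with anomalous metrics.", "NL2Metrics", "error rate per service"),
        Step("Inspect logs of the suspect service.", "NL2Logs", "recent errors in checkout"),
        Step("Apply a remediation.", "NL2Kubectl", "raise memory limit for checkout"),
    ]
    return plan[len(history)] if len(history) < len(plan) else None


def run_agent(policy: Callable[[List[str]], Optional[Step]], focus: str, max_steps: int = 5) -> List[str]:
    """Thought -> action -> observation loop with a step budget."""
    history: List[str] = []
    for _ in range(max_steps):
        step = policy(history)
        if step is None:  # the policy considers the incident resolved
            break
        observation = TOOLS[step.tool](step.tool_input)
        # Only the narrowed observation is carried forward, to save context.
        history.append(f"{step.tool}: {narrow(observation, focus)}")
    return history


if __name__ == "__main__":
    for line in run_agent(scripted_policy, focus="checkout"):
        print(line)
```

In a real agent, the policy would be an LLM call that receives the accumulated history as its prompt, and the focus entity would be updated between steps rather than fixed in advance.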
