Today, enterprises of all sizes operate in a very competitive market. To deliver on business expectations, IT environments are continuously becoming more flexible and dynamic.
Contemporary microservices architecture has simplified the scope of work for software developers, but the roles of IT Operations and System Reliability Engineers (SREs) have become even more complex. An IT environment can generate millions of transactions a day and can change every few seconds. The sheer scale and dynamic nature of these distributed hybrid environments are difficult to fully comprehend.
The gap between IT complexity and the human ability to manage it is widening, threatening resiliency and reliability. One solution to this problem, adopted by many organizations, is to employ Artificial Intelligence to assist IT Operations and SREs. In some cases, SREs analyze incoming events or symptoms before deciding whether to pursue investigative actions, so as not to spend time on benign variations. In interviews conducted with SREs, diagnosis was identified as the most difficult task, often considered to be an innate skill [1]. A great deal of effort has been spent on developing methodologies for reasoning about the symptoms provided by monitoring.
The PyRCA and Merlion libraries, for example, implement methods from recent research in metric-based anomaly detection and root cause analysis. These libraries can be quite helpful for researchers seeking to try these published algorithms. We, however, have developed novel methods that proved more powerful in these areas in our experiments. We present a demo of the methods we developed for IT data, followed by a detailed description and evaluation results in comparison to the methods in the PyRCA and Merlion libraries. Using the publicly available SMD dataset, we’ll show that the combination of unsupervised methods we use can perform as well as, and in some cases outperform, the semi-supervised methods in these libraries.