Search | VHL Regional Portal

Anomaly detection in microservice environments using distributed tracing data analysis and NLP.

Kohyarnejadfard, Iman; Aloise, Daniel; Azhari, Seyed Vahid; Dagenais, Michel R.

J Cloud Comput (Heidelb) ; 11(1): 25, 2022.

Article in English | MEDLINE | ID: mdl-35979413

ABSTRACT

In recent years DevOps and agile approaches like microservice architectures and Continuous Integration have become extremely popular given the increasing need for flexible and scalable solutions. However, several factors such as their distribution in the network, the use of different technologies, their short life, etc. make microservices prone to the occurrence of anomalous system behaviours. In addition, due to the high degree of complexity of small services, it is difficult to adequately monitor the security and behavior of microservice environments. In this work, we propose an NLP (natural language processing) based approach to detect performance anomalies in spans during a given trace, besides locating release-over-release regressions. Notably, the whole system needs no prior knowledge, which facilitates the collection of training data. Our proposed approach benefits from distributed tracing data to collect sequences of events that happened during spans. Extensive experiments on real datasets demonstrate that the proposed method achieved an F_score of 0.9759. The results also reveal that in addition to the ability to detect anomalies and release-over-release regressions, our proposed approach speeds up root cause analysis by means of implemented visualization tools in Trace Compass.

A Framework for Detecting System Performance Anomalies Using Tracing Data Analysis.

Kohyarnejadfard, Iman; Aloise, Daniel; Dagenais, Michel R; Shakeri, Mahsa.

Entropy (Basel) ; 23(8)2021 Aug 03.

Article in English | MEDLINE | ID: mdl-34441151

ABSTRACT

Advances in technology and computing power have led to the emergence of complex and large-scale software architectures in recent years. However, they are prone to performance anomalies due to various reasons, including software bugs, hardware failures, and resource contentions. Performance metrics represent the average load on the system and do not help discover the cause of the problem if abnormal behavior occurs during software execution. Consequently, system experts have to examine a massive amount of low-level tracing data to determine the cause of a performance issue. In this work, we propose an anomaly detection framework that reduces troubleshooting time, besides guiding developers to discover performance problems by highlighting anomalous parts in trace data. Our framework works by collecting streams of system calls during the execution of a process using the Linux Trace Toolkit Next Generation(LTTng), sending them to a machine learning module that reveals anomalous subsequences of system calls based on their execution times and frequency. Extensive experiments on real datasets from two different applications (e.g., MySQL and Chrome), for varying scenarios in terms of available labeled data, demonstrate the effectiveness of our approach to distinguish normal sequences from abnormal ones.

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL