Log message anomaly detection using machine learning

Farzad, Amir

dc.contributor.author	Farzad, Amir
dc.contributor.supervisor	Gulliver, T. Aaron
dc.date.accessioned	2021-07-05T17:06:21Z
dc.date.available	2021-07-05T17:06:21Z
dc.date.copyright	2021	en_US
dc.date.issued	2021-07-05
dc.degree.department	Department of Electrical and Computer Engineering
dc.degree.level	Doctor of Philosophy Ph.D.	en_US
dc.description.abstract	Log messages are one of the most valuable sources of information in cloud and other software systems. These logs can be used for audits and for ensuring system security. Many millions of log messages are produced each day, which makes anomaly detection challenging. Automating the detection of anomalies can save time and money as well as improve detection performance. In this dissertation, three Deep Learning (DL) methods, Auto-LSTM, Auto-BLSTM and Auto-GRU, are developed for log message anomaly detection. They are evaluated using four data sets, namely BGL, OpenStack, Thunderbird and IMDB. The first three are popular log data sets, while the fourth is a movie review data set used for sentiment classification. The results obtained show that Auto-LSTM, Auto-BLSTM and Auto-GRU perform better than other well-known algorithms.
Dealing with imbalanced data is one of the main challenges for Machine Learning (ML) and DL classification algorithms. The issue is acute for log message data, which are typically highly imbalanced because negative logs are rare. Hence, a model is proposed that generates text log messages using a Sequence Generative Adversarial Network (SeqGAN). Features are then extracted using an Autoencoder, and anomaly detection is done using a GRU network. The proposed model is evaluated with two imbalanced log data sets, namely BGL and OpenStack. The results show that oversampling to balance the data increases the accuracy of anomaly detection and classification.
Another challenge in anomaly detection is dealing with unlabeled data. Labeling even a small portion of logs for model training may not be possible due to the high volume of generated logs. To deal with unlabeled data, an unsupervised model for log message anomaly detection is proposed which employs Isolation Forest and two deep Autoencoder networks. The Autoencoder networks are used for training and feature extraction, and then for anomaly detection, while Isolation Forest is used for positive sample prediction. The proposed model is evaluated using the BGL, OpenStack and Thunderbird log message data sets. The results obtained show that the number of negative samples predicted to be positive is low, especially with Isolation Forest and one Autoencoder. Further, the results are better than with other well-known models.
A hybrid log message anomaly detection technique is also proposed which uses pruning of positive and negative logs. Reliable positive log messages are first identified using a Gaussian Mixture Model (GMM) algorithm. Reliable negative logs are then selected iteratively using the K-means, GMM and Dirichlet Process Gaussian Mixture Model (BGM) methods. It is shown that the precision for positive and negative logs with pruning is high. Anomaly detection is done using a Long Short-Term Memory (LSTM) network. The proposed model is evaluated using the BGL, OpenStack and Thunderbird data sets. The results obtained indicate that the proposed model performs better than several well-known algorithms.
Finally, an anomaly detection method is proposed which uses radius-based Fuzzy C-means (FCM), with more clusters than the number of data classes, and a Multilayer Perceptron (MLP) network. The cluster centers and a radius are used to select reliable positive and negative log messages. Moreover, class probabilities are used together with an expert to correct the network output for suspect logs. The proposed model is evaluated with three well-known data sets, namely BGL, OpenStack and Thunderbird. The results obtained show that this model provides better results than existing methods.	en_US
dc.description.scholarlevel	Graduate	en_US
dc.identifier.bibliographicCitation	Amir Farzad and T. Aaron Gulliver. Unsupervised Log Message Anomaly Detection. ICT Express, volume 6 (3), pp. 229-237, 2020.	en_US
dc.identifier.bibliographicCitation	Amir Farzad and T. Aaron Gulliver. Log Message Anomaly Detection with Oversampling. International Journal of Artificial Intelligence and Applications, 11 (4), pp. 53-65, 2020.	en_US
dc.identifier.bibliographicCitation	Amir Farzad and T. Aaron Gulliver. Oversampling Log Messages Using a Sequence Generative Adversarial Network for Anomaly Detection and Classification. In International Conference on Artificial Intelligence and Machine Learning, volume 10 (5), pp. 163-175, 2020.	en_US
dc.identifier.bibliographicCitation	Amir Farzad and T. Aaron Gulliver. Log Message Anomaly Detection and Classification Using Auto-B/LSTM and Auto-GRU. arXiv e-prints, page arXiv:1911.08744, 2019.	en_US
dc.identifier.uri	http://hdl.handle.net/1828/13085
dc.language	English	eng
dc.language.iso	en	en_US
dc.rights	Available to the World Wide Web	en_US
dc.subject	Machine Learning	en_US
dc.subject	Deep Learning	en_US
dc.subject	Supervised	en_US
dc.subject	Unsupervised	en_US
dc.subject	Anomaly Detection	en_US
dc.subject	Log Messages	en_US
dc.subject	Hybrid	en_US
dc.title	Log message anomaly detection using machine learning	en_US
dc.type	Thesis	en_US
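
The abstract reports that oversampling to balance the data improves anomaly detection accuracy. As a minimal illustration of class balancing only, the sketch below uses plain random oversampling with replacement. It is not the SeqGAN-based generation described in the dissertation, and the function name `random_oversample`, the toy log lines and the labels are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class.

    This is a simple stand-in for illustration; it does not generate
    new text the way a SeqGAN would.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: four positive log lines and two negative ones.
logs = [
    "instance spawned successfully",
    "image cached on host",
    "heartbeat received from node",
    "volume attached to instance",
    "kernel panic on node",
    "machine check interrupt detected",
]
labels = ["positive"] * 4 + ["negative"] * 2

balanced_logs, balanced_labels = random_oversample(logs, labels)
print(Counter(balanced_labels))
```

Duplicating existing messages adds no new information to the minority class, which is one reason a generative approach might be preferred; the sketch only shows what a balanced class distribution looks like.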

Files

Original bundle

Name: Farzad_Amir_PhD_2021.pdf
Size: 4.75 MB
Format: Adobe Portable Document Format
Description: Dissertation PDF

Collections

Electronic Theses and Dissertations (ETD)