Log message anomaly detection using machine learning

dc.contributor.author: Farzad, Amir
dc.contributor.supervisor: Gulliver, T. Aaron
dc.date.accessioned: 2021-07-05T17:06:21Z
dc.date.available: 2021-07-05T17:06:21Z
dc.date.copyright: 2021 [en_US]
dc.date.issued: 2021-07-05
dc.degree.department: Department of Electrical and Computer Engineering [en_US]
dc.degree.level: Doctor of Philosophy Ph.D. [en_US]
dc.description.abstract: Log messages are one of the most valuable sources of information in the cloud and other software systems. These logs can be used for audits and ensuring system security. Many millions of log messages are produced each day which makes anomaly detection challenging. Automating the detection of anomalies can save time and money as well as improve detection performance. In this dissertation, Deep Learning (DL) methods called Auto-LSTM, Auto-BLSTM and Auto-GRU are developed for log message anomaly detection. They are evaluated using four data sets, namely BGL, Openstack, Thunderbird and IMDB. The first three are popular log data sets while the fourth is a movie review data set which is used for sentiment classification. The results obtained show that Auto-LSTM, Auto-BLSTM and Auto-GRU perform better than other well-known algorithms. Dealing with imbalanced data is one of the main challenges in Machine Learning (ML)/DL algorithms for classification. This issue is more important with log message data as it is typically very imbalanced and negative logs are rare. Hence, a model is proposed to generate text log messages using a Sequence Generative Adversarial Network (SeqGAN). Then features are extracted using an Autoencoder and anomaly detection is done using a GRU network. The proposed model is evaluated with two imbalanced log data sets, namely BGL and Openstack. Results are presented which show that oversampling and balancing data increases the accuracy of anomaly detection and classification. Another challenge in anomaly detection is dealing with unlabeled data. Labeling even a small portion of logs for model training may not be possible due to the high volume of generated logs. To deal with this unlabeled data, an unsupervised model for log message anomaly detection is proposed which employs Isolation Forest and two deep Autoencoder networks. The Autoencoder networks are used for training and feature extraction, and then for anomaly detection, while Isolation Forest is used for positive sample prediction. The proposed model is evaluated using the BGL, Openstack and Thunderbird log message data sets. The results obtained show that the number of negative samples predicted to be positive is low, especially with Isolation Forest and one Autoencoder. Further, the results are better than with other well-known models. A hybrid log message anomaly detection technique is proposed which uses pruning of positive and negative logs. Reliable positive log messages are first identified using a Gaussian Mixture Model (GMM) algorithm. Then reliable negative logs are selected using the K-means, GMM and Dirichlet Process Gaussian Mixture Model (BGM) methods iteratively. It is shown that the precision for positive and negative logs with pruning is high. Anomaly detection is done using a Long Short-Term Memory (LSTM) network. The proposed model is evaluated using the BGL, Openstack, and Thunderbird data sets. The results obtained indicate that the proposed model performs better than several well-known algorithms. Last, an anomaly detection method is proposed using radius-based Fuzzy C-means (FCM) with more clusters than the number of data classes and a Multilayer Perceptron (MLP) network. The cluster centers and a radius are used to select reliable positive and negative log messages. Moreover, class probabilities are used with an expert to correct the network output for suspect logs. The proposed model is evaluated with three well-known data sets, namely BGL, Openstack and Thunderbird. The results obtained show that this model provides better results than existing methods. [en_US]
dc.description.scholarlevel: Graduate [en_US]
dc.identifier.bibliographicCitation: Amir Farzad and T. Aaron Gulliver. Unsupervised Log Message Anomaly Detection. ICT Express, volume 6 (3), pp. 229-237, 2020. [en_US]
dc.identifier.bibliographicCitation: Amir Farzad and T. Aaron Gulliver. Log Message Anomaly Detection with Oversampling. International Journal of Artificial Intelligence and Applications, 11 (4), pp. 53-65, 2020. [en_US]
dc.identifier.bibliographicCitation: Amir Farzad and T. Aaron Gulliver. Oversampling Log Messages Using a Sequence Generative Adversarial Network for Anomaly Detection and Classification. In International Conference on Artificial Intelligence and Machine Learning, volume 10 (5), pp. 163-175, 2020. [en_US]
dc.identifier.bibliographicCitation: Amir Farzad and T. Aaron Gulliver. Log Message Anomaly Detection and Classification Using Auto-B/LSTM and Auto-GRU. arXiv e-prints, page arXiv:1911.08744, 2019. [en_US]
dc.identifier.uri: http://hdl.handle.net/1828/13085
dc.language: English [eng]
dc.language.iso: en [en_US]
dc.rights: Available to the World Wide Web [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Deep Learning [en_US]
dc.subject: Supervised [en_US]
dc.subject: Unsupervised [en_US]
dc.subject: Anomaly Detection [en_US]
dc.subject: Log Messages [en_US]
dc.subject: Hybrid [en_US]
dc.title: Log message anomaly detection using machine learning [en_US]
dc.type: Thesis [en_US]

Files

Original bundle
Name: Farzad_Amir_PhD_2021.pdf
Size: 4.75 MB
Format: Adobe Portable Document Format
Description: Dissertation PDF
License bundle
Name: license.txt
Size: 2 KB
Format: Item-specific license agreed upon to submission