User-Centered Spam Detection Using Linear and Non-Linear Machine Learning Models

Date

2019-04-24

Authors

Singh, Manpreet

Abstract

The Enron dataset is one of the very few datasets in spam/ham detection that has helped the data science community understand the relationship between ham and spam mail for specific users and build powerful models around it. Because the Enron dataset is textual, it poses unique challenges in how information is extracted from the text and supplied to the models. The purpose of this MEng project is to replicate the results obtained by Metsis et al. [1] on spam detection using different variants of Naïve Bayes (NB) classification models and to identify areas for improvement. While Metsis et al. focused solely on linear models, we also explore the performance of non-linear models. We compare the existing NB models with the non-linear models and simulate the mail that a typical mailbox receives in real time, using incremental training. We also create new datasets from the raw Enron mail and use them to test the different models. The results show that the proposed approach classifies mail more accurately when it is personalized to a specific user than when it is generalist in nature.

Keywords

machine learning, spam, spam filter, xgboost, user centered
