Abstract:
The detection of malicious Uniform Resource Locators (URLs) is important for network and cyber security, as the Internet has long been a platform for online criminal activity. In this project, supervised Machine Learning (ML) is employed to detect malicious URLs. The ISCX-URL-2016 dataset from the Canadian Institute for Cybersecurity is used for evaluation. This dataset contains 79 features and four classes of URLs, namely spam, malware, phishing, and benign.
The Waikato Environment for Knowledge Analysis (WEKA) tool is used to train and test the ML classifiers. To compare the results, k-fold cross-validation is used with k = 5 and k = 10. Principal Component Analysis (PCA) is employed for dimensionality reduction of the dataset, and the most important features are selected based on the eigenvalues. The best 10 and 25 features are selected using PCA, and the classifiers are trained using 5-fold and 10-fold cross-validation. The classifiers are also trained using all 79 features. The ML classifiers evaluated are Random Forest (RF), Decision Tree, K-Nearest Neighbors (KNN), Bayesian Network (BayesNet), and Simple Logistic. The performance metrics considered are accuracy, precision, recall, F-measure, and execution time. The RF classifier achieved the highest accuracy, 98.7%, with 79 features. However, in terms of execution time, KNN outperformed RF, requiring 0.06 s with 79 features while reaching 98.3% accuracy, second only to RF. Overall, the results indicate that KNN provides the best balance between accuracy and execution time.
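
The evaluation protocol described above can be illustrated with a minimal sketch. It is not the WEKA workflow used in this project: it is a standard-library Python example, written under several assumptions, that runs 5-fold and 10-fold cross-validation for a small KNN classifier and reports accuracy, precision, recall, and F-measure. The synthetic two-feature data, the function names (k_fold_indices, knn_predict, macro_metrics, cross_validate), the choice of three neighbors, the pooling of predictions across folds, and the use of macro-averaging are all hypothetical choices made for illustration; PCA and the ISCX-URL-2016 data are omitted.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, k_neighbors=3):
    """Majority vote among the nearest training points (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k_neighbors])
    return votes.most_common(1)[0][0]


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F-measure."""
    labels = sorted(set(y_true))
    pairs = list(zip(y_true, y_pred))
    acc = sum(t == p for t, p in pairs) / len(pairs)
    precs, recs, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n


def cross_validate(X, y, k=5):
    """Hold out each fold once, train on the rest, pool all predictions."""
    y_true, y_pred = [], []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in held_out:
            y_true.append(y[i])
            y_pred.append(knn_predict(train_X, train_y, X[i]))
    return macro_metrics(y_true, y_pred)


# Synthetic stand-in data (assumption): two well-separated 2-D clusters.
rng = random.Random(1)
X, y = [], []
for label, centre in (("benign", 0.0), ("phishing", 5.0)):
    for _ in range(40):
        X.append((rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)))
        y.append(label)

for k in (5, 10):
    acc, prec, rec, f1 = cross_validate(X, y, k=k)
    print(f"k={k}: accuracy={acc:.3f} precision={prec:.3f} "
          f"recall={rec:.3f} F-measure={f1:.3f}")
```

Each sample is held out exactly once, so the pooled predictions cover the whole dataset for both values of k. The averaging scheme a given tool reports may differ from the macro-average used in this sketch.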