Breast Cancer Prediction Using Machine Learning Algorithms




Shahzad, Zeeshan Ali




Breast cancer has become a pressing global health issue, with its prevalence increasing worldwide. The rise in cases is a cause for concern because it affects the physical and emotional well-being of individuals and places a significant burden on healthcare systems. Early detection and timely intervention are critical to combating the disease: the ability to predict and diagnose breast cancer at its earliest stages can make a profound difference in patient outcomes and potentially save many lives. In recent years, Machine Learning (ML) has taken on an increasingly important role in healthcare. This study examines the utility of supervised ML models for breast cancer prediction using the publicly available Breast Cancer Wisconsin (Diagnostic) dataset from the University of California Irvine (UCI) ML repository. Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbors (KNN) classifiers are implemented in Python using Jupyter Notebook. The goal of the proposed methodology is accurate breast cancer prediction. First, the dataset is cleaned by removing duplicates and handling null or missing values. The Synthetic Minority Oversampling Technique (SMOTE) is then applied to balance the target labels. Next, Principal Component Analysis (PCA) is used to reduce the dimensionality of the dataset, with the number of components varied (n = 2, 5, 10, 15). To assess the impact of the train/test ratio on model performance, the models are trained and tested on five data splits: 80/20, 70/30, 50/50, 30/70, and 20/80. Performance is evaluated using accuracy, precision, recall, F1-score, and execution time.
The results show that SVM and Logistic Regression outperform the other models. SVM achieves an accuracy of 98.2% with an execution time of 9.99 ms on the 80/20 split using 10 features, while Logistic Regression achieves an accuracy of 97.9% with an execution time of 8.42 ms on the 50/50 split using 15 features.
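
The workflow described above (split, oversample, reduce with PCA, train, score) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual notebook: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic dataset, a hand-written simplified SMOTE for two classes rather than a library implementation, feature standardization before PCA, an RBF-kernel SVM with default settings, and a timing window that covers fitting plus prediction. None of these choices are confirmed by the abstract, and the helper names `smote_oversample` and `evaluate` are hypothetical. Oversampling and PCA are fitted on the training split only, which is also an assumption.

```python
import time

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def smote_oversample(X, y, k=5, seed=0):
    """Simplified two-class SMOTE (illustrative, not a library implementation).

    Each synthetic sample is a random point on the segment between a
    minority sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_new = counts.max() - counts.min()
    if n_new == 0:
        return X, y
    minority = classes[np.argmin(counts)]
    X_min = X[y == minority]
    # Ask for k + 1 neighbours: the closest one is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


def evaluate(test_size=0.2, n_components=10, seed=42):
    """Run one split / component-count configuration and return its metrics."""
    X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Assumption: standardize features, fitting the scaler on training data only.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    # Assumption: balance only the training split so the test split stays untouched.
    X_tr, y_tr = smote_oversample(X_tr, y_tr, seed=seed)
    pca = PCA(n_components=n_components).fit(X_tr)
    X_tr, X_te = pca.transform(X_tr), pca.transform(X_te)

    start = time.perf_counter()
    model = SVC(kernel="rbf").fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # Malignant (label 0) is treated as the positive class here.
    return {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision_score(y_te, y_pred, pos_label=0),
        "recall": recall_score(y_te, y_pred, pos_label=0),
        "f1": f1_score(y_te, y_pred, pos_label=0),
        "time_ms": elapsed_ms,
    }


if __name__ == "__main__":
    # The five train/test ratios named in the abstract, as test fractions.
    for test_size in (0.2, 0.3, 0.5, 0.7, 0.8):
        print(test_size, evaluate(test_size=test_size, n_components=10))
```

Swapping `SVC` for another scikit-learn classifier and varying `n_components` over 2, 5, 10, and 15 would cover the remaining configurations. Scores and timings from this sketch depend on the random seed, the hardware, and the assumptions above, so they should not be expected to match the reported figures.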