Sign language recognition using SVM, CNN, RF, and Xception models
Date
2025
Authors
Adil, Mohammad Abbas
Abstract
Sign language is an essential means of communication for individuals with hearing and speech impairments, enabling them to express thoughts and emotions through hand gestures. With advances in computer vision and machine learning, recognizing these gestures automatically has become an active research area. The goal of this study is to develop an efficient system for static hand gesture recognition using supervised machine learning models and to compare their performance.
This study uses two distinct datasets of hand gesture images that are openly accessible on Kaggle. The first dataset, "gestures (hand)", contains 16,000 preprocessed grayscale images across eight gesture classes: fist, five, okay, peace, rad, straight, thumbs, and none. The second dataset, "hand gesture recognition", contains an additional 4,000 preprocessed grayscale images for the same gesture classes. Together, the datasets provide 20,000 images for this study. The first dataset is used for training and validation, and the second is reserved for testing. Image augmentation techniques are applied to increase the diversity of training samples and improve generalization.
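The thesis does not list the exact augmentation operations, so the sketch below is illustrative only: two simple label-preserving transforms (random horizontal flip and a small translation) applied to a grayscale image, using plain NumPy.

```python
import numpy as np

def augment(img, rng):
    """Apply simple random augmentations to one grayscale image of shape (H, W).

    The specific transforms here (horizontal flip, small translation) are
    assumptions for illustration; the study only states that augmentation
    was applied, not which operations were used.
    """
    out = img
    if rng.random() < 0.5:                 # random horizontal flip
        out = np.fliplr(out)
    dy, dx = rng.integers(-3, 4, size=2)   # random shift of up to 3 pixels
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
img = rng.random((64, 64))                 # stand-in for a preprocessed gesture image
aug = augment(img, rng)
print(aug.shape)                           # shape is preserved by both transforms
```

Applying such transforms on the fly during training effectively enlarges the 16,000-image training set without storing extra files.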
Four models are implemented: Support Vector Machine (SVM), Convolutional Neural Network (CNN), Random Forest (RF), and Xception. The SVM model is trained with an RBF kernel at several regularization values (C = 2, 4, 6, 8, 10, 12, 14). The CNN and Xception models are evaluated with early-stopping patience values ranging from 1 to 7. All models are implemented and tested in Python in the Kaggle notebook environment.
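The SVM sweep described above can be sketched as follows with scikit-learn. The synthetic 8-class data stands in for flattened gesture images; everything except the kernel choice and the list of C values is an assumption for illustration.

```python
# Sketch of the RBF-kernel SVM sweep over C values, on toy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for flattened grayscale gesture images (8 classes).
X, y = make_classification(n_samples=400, n_features=64, n_classes=8,
                           n_informative=16, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for C in (2, 4, 6, 8, 10, 12, 14):          # C values evaluated in the study
    clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
    scores[C] = clf.score(X_va, y_va)       # validation accuracy per C

best_C = max(scores, key=scores.get)
print(best_C, round(scores[best_C], 3))
```

For the CNN and Xception models, the analogous hyperparameter is the early-stopping patience (e.g. Keras `EarlyStopping(patience=p)` for p in 1..7), which controls how many epochs training continues without validation improvement.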
The performance of each model is evaluated using accuracy, precision, recall, F1-score, and training time. The results show that the CNN model achieves the best overall performance, with 99.20% overall accuracy and a training time of 5.89 min at a patience value of 5. The Xception model reaches 99.08% overall accuracy at the same patience value, but with a longer training time of 11.02 min. The SVM classifier achieves a maximum overall accuracy of 90.83% at C = 10 with a training time of 20.35 min. The RF model achieves a maximum overall accuracy of 83.68% at n_estimators = 200 with a training time of 0.46 min. These results highlight the effectiveness of deep learning approaches, especially CNN, for real-time gesture recognition.
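The four quality metrics above can be computed with scikit-learn as sketched below. The labels are made-up toy values, and macro averaging is one common choice for multi-class problems; the thesis does not state which averaging it used.

```python
# Illustration of the evaluation metrics (accuracy, precision, recall, F1)
# on toy multi-class predictions; the labels below are invented.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 3, 3]
y_pred = [0, 1, 2, 1, 1, 0, 3, 2]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Training time, the fifth criterion, is typically measured by wrapping the `fit` call with `time.perf_counter()`.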