Reducing Training Time in Text Visual Question Answering

dc.contributor.author: Behboud, Ghazale
dc.contributor.supervisor: Gulliver, T. Aaron
dc.date.accessioned: 2022-07-15T19:35:24Z
dc.date.available: 2022-07-15T19:35:24Z
dc.date.copyright: 2022 (en_US)
dc.date.issued: 2022-07-15
dc.degree.department: Department of Electrical and Computer Engineering
dc.degree.level: Master of Applied Science M.A.Sc. (en_US)
dc.description.abstract: Artificial Intelligence (AI) and Computer Vision (CV) have brought the promise of many applications along with many challenges to solve. Most current AI research is dedicated to single-modal data processing, which uses only one modality, as in visual recognition or text recognition. However, real-world challenges often combine different modalities of data such as text, audio and images. This thesis focuses on the Visual Question Answering (VQA) problem, which is a significant multi-modal challenge. VQA is defined as a computer vision system that, when given a question about an image, answers based on an understanding of both the question and the image. The goal is to reduce the training time of VQA models. In this thesis, Look, Read, Reason and Answer (LoRRA), a state-of-the-art architecture, is used as the base model. Then, Reduce Uni-modal Biases (RUBi) is applied to this model to reduce the importance of uni-modal biases in training. Finally, an early stopping strategy is employed to stop the training process once the model accuracy has converged, which prevents the model from overfitting. Numerical results are presented which show that training LoRRA with RUBi and early stopping can converge in less than 5 hours. The impact of the batch size, learning rate and warm-up hyperparameters is also investigated and experimental results are presented. (en_US)
dc.description.scholarlevel: Graduate (en_US)
dc.identifier.uri: http://hdl.handle.net/1828/14062
dc.language: English (eng)
dc.language.iso: en (en_US)
dc.rights: Available to the World Wide Web (en_US)
dc.subject: AI (en_US)
dc.subject: ML (en_US)
dc.subject: Deep Learning (en_US)
dc.subject: Machine Learning (en_US)
dc.subject: Visual Question Answering (en_US)
dc.subject: Convolutional Neural Network (en_US)
dc.subject: Recurrent Neural Network (en_US)
dc.subject: Long Short Term Memory (en_US)
dc.subject: Early Stopping (en_US)
dc.title: Reducing Training Time in Text Visual Question Answering (en_US)
dc.type: Thesis (en_US)
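
The abstract describes stopping training once the model accuracy has converged. This record does not state the exact criterion used in the thesis, so the following is only a minimal sketch of a common patience-based rule. The names `EarlyStopper`, `patience`, `min_delta` and `run` are hypothetical and chosen for illustration; they are not taken from the thesis or from any particular library.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Patience-based early stopping on validation accuracy (illustrative only)."""

    patience: int = 3         # assumed: evaluations to wait without improvement
    min_delta: float = 1e-3   # assumed: smallest gain that counts as improvement
    best: float = float("-inf")
    stale: int = 0            # consecutive evaluations without improvement

    def should_stop(self, val_accuracy: float) -> bool:
        """Record one validation result and report whether training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def run(accuracies, patience=3, min_delta=1e-3):
    """Replay a sequence of validation accuracies.

    Returns (number of evaluations consumed, best accuracy seen).
    """
    stopper = EarlyStopper(patience=patience, min_delta=min_delta)
    for step, acc in enumerate(accuracies, start=1):
        if stopper.should_stop(acc):
            return step, stopper.best
    return len(accuracies), stopper.best
```

In a real training loop, `should_stop` would be called after each validation pass, and the checkpoint with the best accuracy would typically be kept. Larger `patience` values trade longer training for less risk of stopping on a temporary plateau.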

Files

Original bundle
Name: Behboud_Ghazale_MASc_2022.pdf
Size: 2.47 MB
Format: Adobe Portable Document Format