Deep Learning Methods Cannot Outperform Other Machine Learning Methods on Analyzing Genome-wide Association Studies

dc.contributor.author: Zhou, Shaoze
dc.contributor.supervisor: Zhang, Xuekui
dc.contributor.supervisor: Tsao, Min
dc.date.accessioned: 2022-08-31T23:56:49Z
dc.date.available: 2022-08-31T23:56:49Z
dc.date.copyright: 2022 [en_US]
dc.date.issued: 2022-08-31
dc.degree.department: Department of Mathematics and Statistics
dc.degree.level: Master of Science M.Sc. [en_US]
dc.description.abstract: Deep Learning (DL) has been broadly applied to big data problems in biomedical fields, with its greatest success in image processing. Recently, many DL methods have been applied to genomic studies. However, genomic data usually has too small a sample size to fit a complex network. It also lacks the common structural patterns found in images, so it cannot readily exploit pre-trained networks or convolution layers. Concern about the overuse of DL methods motivates us to evaluate their performance against popular non-deep Machine Learning (ML) methods for analyzing genomic data across a wide range of sample sizes. In this paper, we conduct a benchmark study using the UK Biobank data and many of its random subsets with different sample sizes. The original UK Biobank data has about 500k participants. Each participant has comprehensive patient characteristics, disease histories, and genomic information, i.e., the genotypes of millions of Single-Nucleotide Polymorphisms (SNPs). We are interested in predicting the risk of three lung diseases: asthma, COPD, and lung cancer. A total of 205,238 participants have recorded disease outcomes for these three diseases. Five prediction models are investigated in this benchmark study: three non-deep machine learning methods (Elastic Net, XGBoost, and SVM) and two deep learning methods (DNN and LSTM). Besides the most popular performance metrics, such as the F1-score, we promote the hit curve, a visual tool for describing performance in predicting rare events. We found that DL methods frequently fail to outperform non-deep ML methods in analyzing genomic data, even in large datasets with over 200k samples. The experimental results suggest that DL methods should not be overused in genomic studies, even with biobank-level sample sizes. The performance differences between DL and non-deep ML methods decrease as the sample size increases. This suggests that when the sample size is large, further increasing it yields a larger performance gain for DL methods. Hence, DL methods could perform better on genomic data sets larger than the one in this study. [en_US]
dc.description.scholarlevel: Graduate [en_US]
dc.identifier.uri: http://hdl.handle.net/1828/14167
dc.language: English [eng]
dc.language.iso: en [en_US]
dc.rights: Available to the World Wide Web [en_US]
dc.subject: deep learning [en_US]
dc.subject: machine learning [en_US]
dc.subject: genomic analysis [en_US]
dc.subject: disease prediction [en_US]
dc.subject: imbalance data [en_US]
dc.subject: hit curve [en_US]
dc.title: Deep Learning Methods Cannot Outperform Other Machine Learning Methods on Analyzing Genome-wide Association Studies [en_US]
dc.type: Thesis [en_US]
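
The abstract names the F1-score and the hit curve as evaluation tools, but this record does not define either. The following is a minimal sketch only, assuming the common definition in which a hit curve reports how many true cases fall among the k highest-risk predictions, for k = 1..n. The function names, the 0.5 threshold, and the toy data are hypothetical illustrations and are not taken from the thesis.

```python
from typing import List, Sequence


def hit_curve(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """Cumulative number of true cases among the top-k scored samples, k = 1..n.

    Assumed definition: samples are ranked by predicted risk, highest first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    curve = []
    for i in order:
        hits += labels[i]  # labels are 0 (no disease) or 1 (disease)
        curve.append(hits)
    return curve


def f1_score(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data invented for illustration: 2 cases among 8 samples (a rare event).
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4]
labels = [1, 0, 0, 0, 0, 1, 0, 0]

curve = hit_curve(scores, labels)
predicted = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 cutoff
f1 = f1_score(predicted, labels)

print(curve)
print(round(f1, 3))
```

Under this assumed definition, a curve that rises early means most true cases are captured by screening only the highest-ranked individuals. The F1-score, by contrast, depends on a single classification threshold.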

Files

Original bundle
Name: Zhou_Shaoze_MSc_2022.pdf
Size: 2.82 MB
Format: Adobe Portable Document Format