Predicting the programming language of questions and snippets of stack overflow using natural language processing

dc.contributor.authorAlrashedy, Kamel
dc.contributor.supervisorSrinivasan, Venkatesh
dc.contributor.supervisorGulliver, T. Aaron
dc.date.accessioned2018-09-11T18:10:19Z
dc.date.available2018-09-11T18:10:19Z
dc.date.copyright2018en_US
dc.date.issued2018-09-11
dc.degree.departmentDepartment of Computer Scienceen_US
dc.degree.levelMaster of Science M.Sc.en_US
dc.description.abstractStack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Over- flow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question and assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this the- sis, a classifier is proposed to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier of code snip- pets only that achieves an accuracy of 77.7%.Thus, deploying ML techniques on the combination of text and code snippets of a question provides the best performance. These results demonstrate that it is possible to identify the programming language of a snippet of only a few lines of source code. We visualize the feature space of two programming languages Java and SQL in order to identify some properties of the information inside the questions corresponding to these languages.en_US
dc.description.scholarlevelGraduateen_US
dc.identifier.urihttp://hdl.handle.net/1828/10054
dc.languageEnglisheng
dc.language.isoenen_US
dc.rightsAvailable to the World Wide Weben_US
dc.subjectStack overflowen_US
dc.subjectknowledge sharingen_US
dc.subjectNatural Language Processingen_US
dc.subjectMachine Learningen_US
dc.titlePredicting the programming language of questions and snippets of stack overflow using natural language processingen_US
dc.typeThesisen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Kamel_Alrashedy_MSC_2018.pdf
Size:
6.17 MB
Format:
Adobe Portable Document Format
Description:
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description: