Determine the Origin of Python Software Source Code using the Distinctiveness of Identifiers

dc.contributor.author: Sun, Yiming
dc.contributor.supervisor: German, Daniel
dc.date.accessioned: 2022-12-19T18:25:11Z
dc.date.available: 2022-12-19T18:25:11Z
dc.date.copyright: 2022 (en_US)
dc.date.issued: 2022-12-19
dc.degree.department: Department of Computer Science (en_US)
dc.degree.level: Master of Science M.Sc. (en_US)
dc.description.abstract: More than 75% of organizations rely on open source as the foundation of their applications. The prevalence of code reuse in modern software development creates security and legal risks for developers and companies, particularly when the origin of a copied software artifact cannot be identified. Conventional methods, such as code clone detection, might not address this problem effectively, mainly because they are too exhaustive to be practical. Hence, I propose a lightweight, scalable, and robust method to narrow down the origin of a Python source code entity to a few possible candidates within a reference corpus of the Python Package Index (PyPI) open-source ecosystem, using only a few extracted class and function name identifiers. More exhaustive methods can then be applied to this small set of candidates to identify the exact origin. Analyzing the PyPI ecosystem, I found that identifiers are very distinctive. Among the 11.2 M different identifiers found within PyPI's 244 K packages, 76% are defined in only one package, and 93% in at most 3. Generally, randomly selecting 3 non-frequent identifiers from one or multiple files of an input package is enough to narrow down the origin to at most 3 packages 89% of the time. I evaluated the proposed method by mapping Debian Python packages to their corresponding PyPI packages, extracting only 3 identifiers from each Debian package and matching them against PyPI packages. I used a popularity index to rank the returned candidate packages so that the top one is the most likely to be the origin. In empirical experiments, this method was effective at finding the correct origin of a package that was not directly collected from the corpus, with a recall of 87%. (en_US)
dc.description.scholarlevel: Graduate (en_US)
dc.identifier.uri: http://hdl.handle.net/1828/14567
dc.language: English (eng)
dc.language.iso: en (en_US)
dc.rights: Available to the World Wide Web (en_US)
dc.subject: software provenance (en_US)
dc.subject: source code tracking (en_US)
dc.subject: identifiers (en_US)
dc.subject: open source software (en_US)
dc.subject: python (en_US)
dc.subject: software origin (en_US)
dc.title: Determine the Origin of Python Software Source Code using the Distinctiveness of Identifiers (en_US)
dc.type: Thesis (en_US)
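
The abstract describes narrowing down a package's origin by matching a few class and function name identifiers against a corpus and ranking the candidates by popularity. The following is a minimal sketch of that idea, not the thesis's implementation: the function names, the `max_packages` cutoff used to decide what counts as a "non-frequent" identifier, the vote-counting rule, and the popularity scores are all assumptions made for illustration.

```python
import ast
from collections import defaultdict


def extract_identifiers(source: str) -> set[str]:
    """Collect the names of classes and functions defined in a source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def build_index(corpus: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each identifier to the set of packages that define it.

    `corpus` maps a package name to the source text of its files
    (a hypothetical in-memory stand-in for a PyPI reference corpus).
    """
    index: dict[str, set[str]] = defaultdict(set)
    for package, sources in corpus.items():
        for source in sources:
            for name in extract_identifiers(source):
                index[name].add(package)
    return index


def candidate_origins(identifiers, index, popularity, max_packages=3):
    """Rank candidate packages for a handful of query identifiers.

    Identifiers defined in more than `max_packages` packages are skipped
    as too frequent (assumed cutoff). Candidates are ordered by number of
    matching identifiers, then by popularity score, then by name.
    """
    votes: dict[str, int] = defaultdict(int)
    for name in identifiers:
        packages = index.get(name, set())
        if 0 < len(packages) <= max_packages:
            for package in packages:
                votes[package] += 1
    return sorted(votes, key=lambda p: (-votes[p], -popularity.get(p, 0), p))


# Toy corpus with made-up package names and popularity scores.
corpus = {
    "pkg_a": ["class FrobnicatorEngine:\n    def spin_widget(self):\n        pass\n\ndef main():\n    pass\n"],
    "pkg_b": ["def main():\n    pass\n\ndef parse_quux_header(data):\n    return data\n"],
}
popularity = {"pkg_a": 10, "pkg_b": 50}
index = build_index(corpus)

query = "class FrobnicatorEngine:\n    def spin_widget(self):\n        pass\n"
ranked = candidate_origins(extract_identifiers(query), index, popularity)
print(ranked)
```

In this toy corpus the two distinctive identifiers point to a single package, while a common name such as `main` matches both packages and is separated only by the popularity ordering. A more exhaustive comparison would then be run on the few candidates returned.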

Files

Original bundle
Name: Sun_Yiming_MSc_2022.pdf
Size: 1.94 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2 KB
Format: Item-specific license agreed upon to submission