Determine the Origin of Python Software Source Code using the Distinctiveness of Identifiers




Sun, Yiming




More than 75% of organizations rely on open source as the foundation of their applications. The prevalence of code reuse in modern software development exposes developers and companies to security and legal risks, particularly when the origin of a copied software artifact cannot be identified. Conventional methods, such as code clone detection, may not address this problem effectively, mainly because they are too exhaustive to be practical. Hence, I propose a lightweight, scalable, and robust method that narrows down the origin of a Python source code entity to a few candidates within a reference corpus drawn from the Python Package Index (PyPI) open-source ecosystem, using only a few extracted class and function name identifiers. More exhaustive methods can then be applied to this small set of candidates to identify the exact origin. Analyzing the PyPI ecosystem, I found that identifiers are highly distinctive: among the 11.2 M different identifiers found within PyPI's 244 K packages, 76% are defined in only one package, and 93% in at most 3. In general, randomly selecting 3 non-frequent identifiers from one or more files of an input package is enough to narrow down the origin to at most 3 packages 89% of the time. I evaluated the proposed method by mapping Debian Python packages to their corresponding PyPI packages, extracting only 3 identifiers from each Debian package and matching them against PyPI packages. I used a popularity index to rank the returned candidate packages so that the top one is the most likely origin. In empirical experiments, the method proved effective at finding the correct origin of a package that was not directly collected from the corpus, with a recall of 87%.
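The matching step described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work. It assumes a toy in-memory corpus, a hypothetical `max_packages` cutoff standing in for the "non-frequent" filter, and a hypothetical `popularity` mapping standing in for the popularity index. Ranking by number of matched identifiers first and popularity second is also an assumption made for illustration.

```python
from __future__ import annotations

import ast
from collections import Counter, defaultdict


def extract_identifiers(source: str) -> set[str]:
    """Return the names of all classes and functions defined in `source`."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def build_index(corpus: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build an inverted index: identifier -> packages that define it."""
    index = defaultdict(set)
    for package, sources in corpus.items():
        for source in sources:
            for name in extract_identifiers(source):
                index[name].add(package)
    return dict(index)


def candidate_origins(identifiers, index, popularity, max_packages=50):
    """Rank candidate packages by matched identifiers, then by popularity.

    `max_packages` is a hypothetical cutoff: identifiers defined in more
    packages than this are treated as too frequent to be informative.
    """
    hits = Counter()
    for name in identifiers:
        packages = index.get(name, set())
        if 0 < len(packages) <= max_packages:
            hits.update(packages)
    # More matches first, then higher popularity, then name for stability.
    return sorted(hits, key=lambda p: (-hits[p], -popularity.get(p, 0), p))


# Toy reference corpus (hypothetical package names and sources).
corpus = {
    "alpha": ["class AlphaParser:\n    def parse_alpha(self):\n        pass\n"],
    "beta": ["def parse_alpha():\n    pass\n\ndef beta_helper():\n    pass\n"],
}
popularity = {"alpha": 10, "beta": 500}  # hypothetical popularity scores

index = build_index(corpus)
query = extract_identifiers(
    "class AlphaParser:\n    pass\n\ndef parse_alpha():\n    pass\n"
)
ranked = candidate_origins(query, index, popularity)
print(ranked)  # ['alpha', 'beta']
```

In this toy example, `alpha` ranks first because it matches both query identifiers, even though `beta` has the higher popularity score. Popularity only breaks ties between packages with the same number of matches.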



software provenance, source code tracking, identifiers, open source software, python, software origin