GitHub Issue Label Clustering by Weighted Overlap Coefficient




Li, Yunlong




GitHub labels are designed to help people classify and recognize different issues. When naming a label, people may use different word forms (e.g., bug, Bug, bugs) to express the same meaning, which makes managing issue labels on GitHub a challenging task. Clustering these morphologically synonymous labels makes issues easier to manage and completes some of the data preprocessing needed for research on automatic labeling. String similarity calculation is the key part of the clustering algorithm. In this project, a weighted overlap coefficient is proposed as the string similarity measure for clustering the labels. The 200 most frequently used labels are selected as the experimental data. Preliminary results show that the new method improves on the original overlap coefficient, producing a 4.43% higher F-measure, and that 92.42% of the experimental labels are correctly clustered.
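For orientation, the unweighted overlap coefficient that serves as the baseline here is commonly defined as |A ∩ B| / min(|A|, |B|) over two sets A and B. The sketch below applies it to labels under assumptions that are not taken from this abstract: labels are lowercased and split into character bigrams, and the helper names `char_ngrams` and `overlap_coefficient` are hypothetical. The weighting scheme proposed in the project is not described here, so it is not reproduced.

```python
def char_ngrams(label, n=2):
    """Return the set of character n-grams of a lowercased label.

    Assumption: character bigrams and lowercasing are illustrative choices,
    not necessarily the tokenization used in the project.
    """
    s = label.lower()
    if len(s) < n:
        # Labels shorter than n are treated as a single token.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def overlap_coefficient(a, b, n=2):
    """Standard (unweighted) overlap coefficient: |A & B| / min(|A|, |B|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


if __name__ == "__main__":
    for pair in [("bug", "Bug"), ("bug", "bugs"), ("bug", "feature")]:
        print(pair, overlap_coefficient(*pair))
```

Because the denominator is the size of the smaller set, a label whose n-grams are fully contained in another label's (such as bug within bugs) scores 1.0, which is what makes this family of measures attractive for grouping morphological variants.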



GitHub issue label, string similarity metric, overlap coefficient, clustering