GitHub Issue Label Clustering by Weighted Overlap Coefficient

Date

2017-05-01

Authors

Li, Yunlong

Abstract

GitHub labels help people classify and recognize issues. When naming a label, people may use different word forms (e.g., bug, Bug, bugs) to express the same meaning, which makes managing issue labels on GitHub a challenging task. Clustering labels that are morphological synonyms makes the issues easier to manage and completes some of the data preprocessing needed for research on automatic labeling. String similarity calculation is the key part of the clustering algorithm. In this project, a weighted overlap coefficient is proposed as the string similarity measure for clustering the labels. The 200 most frequently used labels are selected as the experimental data. Preliminary results show that the new method improves on the original overlap coefficient, producing a 4.43% higher F-Measure, and that 92.42% of the experimental labels are correctly clustered.
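
For readers unfamiliar with the baseline measure, the following is a minimal sketch of the original (unweighted) overlap coefficient, |X ∩ Y| / min(|X|, |Y|), applied to labels. It is not the weighted method proposed in this work, since the abstract does not describe the weights. Representing each label as a set of lowercased character bigrams, the greedy single-pass grouping, the 0.8 threshold, and all function names are assumptions made only for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased label (assumed representation)."""
    s = text.lower()
    if len(s) < n:
        # Labels shorter than n are kept whole so they are not silently dropped.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def overlap_coefficient(a, b, n=2):
    """Unweighted overlap coefficient: |X & Y| / min(|X|, |Y|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def cluster_labels(labels, threshold=0.8):
    """Greedy grouping (hypothetical): join the first cluster whose
    first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for label in labels:
        for cluster in clusters:
            if overlap_coefficient(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters


if __name__ == "__main__":
    print(cluster_labels(["bug", "Bug", "bugs", "feature"]))
```

With these assumptions, bug, Bug, and bugs fall into one cluster and feature into another. Because the denominator is the size of the smaller set, the plain coefficient also gives bug and debug a score of 1.0 under this bigram representation.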

Keywords

GitHub issue label, string similarity metric, overlap coefficient, clustering
