Exploring Design Discussions With Semi-Supervised Topic Modelling

Date

2022-08-11

Authors

Lasrado, Roshan N.

Abstract

Stack Overflow is a rich source of questions and answers (discussions) about software development. One topic of discussion is software design, such as the correct use of design patterns or best practices in data access. Because design is a relatively abstract topic in software engineering, researchers have long sought to characterize and model design knowledge. However, these approaches typically require significant expert input to contextualize the abstract design information. In this study, we explore how combining expert input with Stack Overflow data might serve as an effective way to identify design topics. Being able to identify and classify this design knowledge would enable its discovery and sharing, helping developers to better leverage Stack Overflow for crowd-sourcing their design decisions. We first perform inductive coding of design-tagged Stack Overflow questions and answers to identify the design concepts that developers discuss. We report on areas where inter-rater agreement was a challenge, including abstraction levels. Since inductive coding is expensive, we then apply a semi-supervised topic modelling approach, Anchored CorEx. We find that it outperforms LDA and offers superior interpretability and the ability to incorporate expert domain knowledge. We use Anchored CorEx to identify how design is discussed on Stack Overflow and leveraged in GitHub projects. We conclude by describing how our experience with Anchored CorEx leads us to believe that approaches combining domain knowledge and scalability are key for analyzing large software engineering text repositories.

Keywords

design discussions, semi-supervised topic modelling, design mining
