Miroslav Tushev

Linguistic Documentation of Software History
M. Tushev and A. Mahmoud, International Conference on Program Comprehension (ICPC), 2020




Open Source Software (OSS) projects start with an initial vocabulary, often determined by the first generation of developers. This vocabulary, embedded in code identifier names and internal code comments, goes through multiple rounds of change, influenced by interrelated patterns of human (e.g., developers joining and departing) and system (e.g., maintenance activities) interactions. Capturing the dynamics of this change is crucial for understanding and synthesizing code changes over time. However, the code evolution analysis tools available on modern code hosting platforms, such as GitHub and SourceForge, often overlook the linguistic aspects of code evolution.

To bridge this gap, we propose to study code evolution in OSS projects through the lens of developers’ language, also known as the code lexicon. Our analysis is conducted using 32 OSS projects sampled from a broad range of application domains.
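To make the notion of a code lexicon concrete, the following is a minimal Python sketch, not the procedure used in the paper. It assumes the lexicon is the set of lowercase terms obtained by splitting identifiers and comment words on underscores and camelCase boundaries. The helper names (split_identifier, lexicon, lexicon_change) are hypothetical, and language keywords are deliberately not filtered out.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split an identifier into lowercase terms (snake_case and camelCase)."""
    terms = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (e.g. "HTTP") or capitalized/lowercase words; digits are dropped.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [t.lower() for t in terms]


def lexicon(source):
    """Count the terms found in all word-like tokens of a source text.

    Simplification (assumption): code and comments are treated alike,
    and keywords such as "int" are kept.
    """
    counts = Counter()
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        counts.update(split_identifier(token))
    return counts


def lexicon_change(old_source, new_source):
    """Return the terms added to and removed from the lexicon between two versions."""
    old_terms = set(lexicon(old_source))
    new_terms = set(lexicon(new_source))
    return {"added": new_terms - old_terms, "removed": old_terms - new_terms}


# Two hypothetical versions of the same line of code.
old_version = "int userCount = 0; // number of users"
new_version = "int accountCount = 0; // number of accounts"

change = lexicon_change(old_version, new_version)
print(sorted(change["added"]))    # ['account', 'accounts']
print(sorted(change["removed"]))  # ['user', 'users']
```

In this toy example a rename from userCount to accountCount shows up as two terms leaving the lexicon and two entering it; tracking such sets across a project's commits is one simple way to observe lexical change over time.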

Our results show that different maintenance activities impact the code lexicon differently. These insights lay a preliminary foundation for modeling the linguistic history of OSS projects. In the long run, this foundation will be used to support basic program comprehension tasks and to help researchers gain new insights into the complex interplay between linguistic change and various system and human aspects of OSS development.