Contribution Details

Type Conference or Workshop Paper
Scope Discipline-based scholarship
Published in Proceedings Yes
Title Mining file histories: should we consider branches?
Organization Unit
Authors
  • Vladimir Kovalenko
  • Fabio Palomba
  • Alberto Bacchelli
Presentation Type paper
Item Subtype Original Work
Refereed Yes
Status Published in final form
Language
  • English
ISBN 9781450359375
Page Range 202 - 213
Event Title ASE '18: 33rd ACM/IEEE International Conference on Automated Software Engineering
Event Type conference
Event Location Montpellier, France
Event Start Date October 3, 2018
Event End Date October 7, 2018
Place of Publication New York, NY, USA
Publisher ACM
Abstract Text Modern distributed version control systems, such as Git, offer support for branching - the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history - reviewer recommendation, change recommendation, and defect prediction - for two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques' performance that is rather modest.
Digital Object Identifier 10.1145/3238147.3238169
Other Identification Number merlin-id:20232
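
The two traversal strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy commit graph, the `Commit` type, and both function names are hypothetical, and the merge commit is modeled as changing no files, which simplifies how a real version control system reports merges.

```python
from typing import Dict, List, NamedTuple, Tuple


class Commit(NamedTuple):
    parents: Tuple[str, ...]  # parents[0] is the first parent (the trunk side)
    files: frozenset          # paths modified by this commit (assumed known)


# Hypothetical toy history: c2 and b1 both branch off c1; m merges b1 into the trunk.
REPO: Dict[str, Commit] = {
    "c1": Commit((), frozenset({"a.py"})),
    "c2": Commit(("c1",), frozenset({"b.py"})),
    "b1": Commit(("c1",), frozenset({"a.py"})),
    "m": Commit(("c2", "b1"), frozenset()),
}


def first_parent_history(repo: Dict[str, Commit], head: str, path: str) -> List[str]:
    """Collect commits touching `path`, following only the first parent of each commit."""
    history = []
    current = head
    while current is not None:
        commit = repo[current]
        if path in commit.files:
            history.append(current)
        current = commit.parents[0] if commit.parents else None
    return history


def full_history(repo: Dict[str, Commit], head: str, path: str) -> List[str]:
    """Collect commits touching `path` across all parents, i.e. across all merged branches."""
    history, seen, stack = [], set(), [head]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        commit = repo[current]
        if path in commit.files:
            history.append(current)
        stack.extend(commit.parents)
    return history


if __name__ == "__main__":
    print(first_parent_history(REPO, "m", "a.py"))
    print(sorted(full_history(REPO, "m", "a.py")))
```

In this toy graph the first-parent walk misses the change to `a.py` made on the side branch (`b1`), while the full traversal finds it. That is the kind of discrepancy between the two algorithms the study quantifies.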