Contribution Details

Type Conference or Workshop Paper
Scope Discipline-based scholarship
Published in Proceedings Yes
Title Fair and balanced? Bias in bug-fix datasets
Organization Unit
Authors
  • C Bird
  • A Bachmann
  • E Aune
  • J Duffy
  • Abraham Bernstein
  • V Filkov
  • P Devanbu
Presentation Type paper
Item Subtype Original Work
Refereed Yes
Status Published in final form
Language
  • English
Page Range 121 - 130
Event Title ESEC/FSE '09: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering
Event Type conference
Event Location Amsterdam, The Netherlands
Event Start Date 1 August 2009
Event End Date 1 August 2009
Abstract Text Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems, and code version histories, record when, how and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
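
Note: the dataset extraction the abstract refers to is commonly done by scanning commit messages for bug-tracker identifiers and crediting the files changed in those commits as bug fixes. The Python sketch below illustrates that general heuristic only; it is not the authors' tooling, and the repository path and ID pattern are illustrative assumptions.

    import re
    import subprocess
    from collections import defaultdict

    # Pattern for bug-tracker references in commit messages (illustrative assumption).
    BUG_ID = re.compile(r"(?:bug|issue|fix(?:es|ed)?)\s*#?(\d+)", re.IGNORECASE)

    def linked_bug_fixes(repo_path="."):
        """Map bug IDs mentioned in commit messages to the files changed by those commits."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:%H%x00%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        fixes = defaultdict(set)
        current_ids = []
        for line in log.splitlines():
            if "\x00" in line:                  # commit header: hash NUL subject
                current_ids = BUG_ID.findall(line.partition("\x00")[2])
            elif line.strip():                  # a file touched by the current commit
                for bug_id in current_ids:
                    fixes[bug_id].add(line.strip())
        return fixes

    if __name__ == "__main__":
        for bug, files in sorted(linked_bug_fixes().items()):
            print(bug, sorted(files))

Commits whose messages carry no recognizable identifier are silently dropped by heuristics of this kind; that incomplete linking is exactly the source of the systematic bias the paper examines.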
Digital Object Identifier 10.1145/1595696.1595716
Other Identification Number merlin-id:171