Marco di Biase, Magiel Bruntink, Alberto Bacchelli, A Security Perspective on Code Review: The Case of Chromium, In: 2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation, IEEE, USA, 2016-11-02. (Conference or Workshop Paper published in Proceedings)
Modern Code Review (MCR) is an established software development process that aims to improve software quality. Although evidence has shown that higher levels of review coverage relate to fewer post-release bugs, the effectiveness of MCR at specifically finding security issues remains unknown. We present work that aims to fill that gap by exploring the MCR process in the Chromium open source project. We manually analyzed large sets of registered (114 cases) and missed (71 cases) security issues by backtracking in the project's issue, review, and code histories. This enabled us to qualify MCR in Chromium from the security perspective from several angles: Are security issues being discussed frequently? What categories of security issues are often missed or found? What characteristics of code reviews appear relevant to the discovery rate? Within the cases we analyzed, MCR in Chromium addresses security issues at a rate of 1% of reviewers' comments. Chromium code reviews mostly tend to miss language-specific issues (e.g., C++ issues and buffer overflows) and domain-specific ones (such as Cross-Site Scripting); when code reviews do address issues, they mostly address those of the latter type. Initial evidence points to reviews conducted by more than 2 reviewers being more successful at finding security issues. |
|
Anand Ashok Sawant, Romain Robbes, Alberto Bacchelli, On the Reaction to Deprecation of 25,357 Clients of 4+1 Popular Java APIs, In: ICSME 2016: IEEE International Conference on Software Maintenance and Evolution, IEEE, USA, 2016-11-02. (Conference or Workshop Paper published in Proceedings)
Application Programming Interfaces (APIs) are a tremendous resource, that is, when they are stable. Several studies have shown that this is unfortunately not the case. Of those, a large-scale study of API changes in the Pharo Smalltalk ecosystem documented several findings about API deprecations and their impact on API clients. We conduct a partial replication of this study, considering more than 25,000 clients of five popular Java APIs on GitHub. This work addresses several shortcomings of the previous study by studying several distinct API clients in a popular, statically-typed language, with more accurate version information. We compare and contrast our findings with the previous study and highlight new ones, particularly on the API client update practices and the startling similarities between reaction behavior in Smalltalk and Java. |
|
Joop Aue, Michiel Haisma, Kristín Fjola Tomasdottir, Alberto Bacchelli, Social Diversity and Growth Levels of Open Source Software Projects on GitHub, In: the 10th ACM/IEEE International Symposium, ACM Press, New York, New York, USA, 2016-10-08. (Conference or Workshop Paper published in Proceedings)
Background: Projects of all sizes and impact are leveraging the services of the social coding platform GitHub to collaborate. Since users' information and actions are recorded, GitHub has been mined for over 6 years now to investigate aspects of the collaborative open source software (OSS) development paradigm. Aim: In this research, we use this data to investigate the relation between project growth as a proxy for success, and social diversity. Method: We first categorize active OSS projects into a five-star rating using a benchmarking system we based on various project growth metrics; then we study the relation between this rating and the reported social diversities for the team members of those projects. Results: Our findings highlight a statistically significant relation; however, the effect is small. Conclusions: Our findings suggest the need for further research on this topic; moreover, the proposed benchmarking method may be used in future work to determine OSS project success on collaboration platforms such as GitHub. |
|
Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, Premkumar Devanbu, On the "naturalness" of buggy code, In: the 38th International Conference, ACM Press, New York, New York, USA, 2016-06-14. (Conference or Workshop Paper published in Proceedings)
Real software, the kind working programmers produce by the kLOC to solve real-world problems, tends to be "natural", like speech or natural language; it tends to be highly repetitive and predictable. Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is "unnatural" in some sense, and thus possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 7,139), from 10 different Java projects, and focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic (i.e. unnatural), becoming less so as bugs are fixed. Ordering files for inspection by their average entropy yields cost-effectiveness scores comparable to popular defect prediction methods. At a finer granularity, focusing on highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs) and ordering warnings from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid, simple way to complement the effectiveness of PMD or FindBugs, and that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes. |
|
Georgios Gousios, Margaret-Anne Storey, Alberto Bacchelli, Work practices and challenges in pull-based development: the contributor's perspective, In: 38th International Conference on Software Engineering (ICSE), Institute of Electrical and Electronics Engineers, New York, New York, USA, 2016-06-14. (Conference or Workshop Paper published in Proceedings)
The pull-based development model is an emerging way of contributing to distributed software projects that is gaining enormous popularity within the open source software (OSS) world. Previous work has examined this model by focusing on projects and their owners; we complement it by examining the work practices of project contributors and the challenges they face.
We conducted a survey with 645 top contributors to active OSS projects using the pull-based model on GitHub, the prevalent social coding site. We also analyzed traces extracted from corresponding GitHub repositories. Our research shows that: contributors have a strong interest in maintaining awareness of project status to get inspiration and avoid duplicating work, but they do not actively propagate information; communication within pull requests is reportedly limited to low-level concerns and contributors often use communication channels external to pull requests; challenges are mostly social in nature, with most reporting poor responsiveness from integrators; and the increased transparency of this setting is a confirmed motivation to contribute. Based on these findings, we present recommendations for practitioners to streamline the contribution process and discuss potential future research directions. |
|
A. Bacchelli, Structure your unstructured data first!, In: Perspectives on Data Science for Software Engineering, Elsevier, USA, p. 161 - 168, 2016. (Book Chapter)
Unstructured software data, such as emails and discussions in technical forums, are a rich form of information about software systems. Nevertheless, mining this form of data is hard as it comprises different languages that cannot be processed with the same techniques.
In this chapter, we show how we can summarize unstructured software data by first giving it the structure it needs. |
|
Alberto Bacchelli, Christian Bird, Expectations, outcomes, and challenges of modern code review, In: 35th IEEE/ACM International Conference on Software Engineering, IEEE, USA, 2013-06-18. (Conference or Workshop Paper published in Proceedings)
Code review is a common software engineering practice employed both in open source and industrial contexts. Review today is less formal and more “lightweight” than the code inspections performed and studied in the 70s and 80s. We empirically explore the motivations, challenges, and outcomes of tool-based code reviews. We observed, interviewed, and surveyed developers and managers and manually classified hundreds of review comments across diverse teams at Microsoft. Our study reveals that while finding defects remains the main motivation for review, reviews are less about defects than expected and instead provide additional benefits such as knowledge transfer, increased team awareness, and creation of alternative solutions to problems. Moreover, we find that code and change understanding is the key aspect of code reviewing and that developers employ a wide range of mechanisms to meet their understanding needs, most of which are not met by current tools. We provide recommendations for practitioners and researchers. |
|
Alberto Bacchelli, Marco D'Ambros, Michele Lanza, Romain Robbes, Benchmarking lightweight techniques to link e-mails and source code, In: 16th Working Conference on Reverse Engineering (WCRE), Los Alamitos, CA, USA, 2009. (Conference or Workshop Paper published in Proceedings)
During the evolution of a software system, a large amount of information, which is not always directly related to the source code, is produced. Several researchers have provided evidence that the contents of mailing lists represent a valuable source of information: Through e-mails, developers discuss design decisions, ideas, known problems and bugs, etc. which are otherwise not to be found in the system. A technical challenge in this context is how to establish the missing link between free-form e-mails and the system artifacts they refer to. Although the range of approaches is vast, establishing their accuracy remains a problem, as there is no benchmark against which to compare their performance. To overcome this issue, we manually inspected a statistically significant number of e-mails pertaining to the ArgoUML system. Based on this benchmark, we present a variety of lightweight techniques to assign e-mails to software artifacts and measure their effectiveness in terms of precision and recall. |
|
Alberto Bacchelli, Paolo Ciancarini, Davide Rossi, On the effectiveness of manual and automatic unit test generation, In: The Third International Conference on Software Engineering Advances (ICSEA), IEEE, Los Alamitos, CA, US, 2008. (Conference or Workshop Paper published in Proceedings)
The importance of testing has recently grown significantly, thanks to its benefits to software design (e.g., test-driven development), implementation, and maintenance support. As a consequence, it is now quite common to introduce a test suite into an existing system that was not designed for it. The software engineer must then decide whether, and how, to use tools that automatically generate unit tests (the necessary foundation of a test suite). This paper addresses the issue of choosing the best approach. We describe how different generation techniques (both manual and automatic) have been applied to a real case study. We compare the achieved results using several metrics in order to identify the benefits and shortcomings of the different approaches. We conclude by showing the extent to which the adoption of tools for automatic test creation can shift the trade-off between time and quality. |
|