David Ackermann, Curiosity Guided Fuzz Testing, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
Fuzz testing is a widely used automated testing technique and has recently enjoyed great success in finding new vulnerabilities. The most successful fuzzing approach, coverage-guided fuzzing, randomly mutates a test seed in the hope of uncovering further program coverage. However, random mutation rarely reveals new insights, making coverage-guided fuzzing highly inefficient. Additionally, using coverage alone as a proxy for finding bugs is inadequate, as it might not correlate strongly with bugs and lacks information context.
In this work, we combine coverage-guided fuzzing with a state-of-the-art reinforcement learning technique. We treat the input space of a program as an exploration problem, in which the fuzzer explores by curiosity. This not only adds information context, but also provides a dense reward structure for the fuzzer to consider. We built a prototype, called CuriousAFL, that combines a widely recognized coverage-guided fuzzer with exploration. Our evaluation shows that CuriousAFL significantly outperforms several state-of-the-art fuzzers on real-world programs in terms of coverage and finding new bugs.
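The idea of rewarding surprise can be sketched in miniature. The following is a rough, self-contained illustration, not CuriousAFL's actual algorithm: the toy target, the single-byte mutator, and the count-based novelty reward are all invented for this example. A curiosity-driven fuzzer prefers inputs whose coverage signature has rarely been observed, giving it a dense signal even when no new branch is hit:

```python
import random
from collections import Counter

def run_target(data: bytes) -> frozenset:
    """Stand-in for an instrumented target: returns the set of covered
    'branches'. A real fuzzer would read a coverage bitmap instead."""
    cov = set()
    if data[:1] == b'F':
        cov.add('b1')
        if data[1:2] == b'U':
            cov.add('b2')
            if data[2:3] == b'Z':
                cov.add('b3')
    return frozenset(cov)

def mutate(seed: bytes) -> bytes:
    """Randomly overwrite one byte of the seed."""
    data = bytearray(seed)
    data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

def curiosity_fuzz(iterations=5000, seed_input=b'AAA'):
    visit_counts = Counter()  # how often each coverage signature was seen
    corpus = [seed_input]
    global_cov = set()
    for _ in range(iterations):
        child = mutate(random.choice(corpus))
        cov = run_target(child)
        visit_counts[cov] += 1
        # Curiosity reward: rarely seen coverage signatures are "surprising".
        reward = 1.0 / visit_counts[cov]
        if cov - global_cov or reward > 0.5:
            corpus.append(child)  # keep inputs that look novel
        global_cov |= cov
    return global_cov, corpus

random.seed(0)
cov, corpus = curiosity_fuzz()
```

Unlike plain coverage-guided fuzzing, which only rewards strictly new branches, the decaying `reward` lets the fuzzer keep inputs that reach familiar code in an unfamiliar combination.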
Andrea Capobianco, Software Vulnerability Prediction on Commit Level, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
Software vulnerabilities are security weaknesses in system design or implementation that, if exploited, may cause great damage to software developers and vendors and, most importantly, to software users. Although efforts are made to prevent, detect, and remove vulnerabilities before software releases, the number of publicly disclosed software vulnerabilities is increasing. Software vulnerabilities are difficult to detect during development, even by experienced developers, if they lack security expertise and even with the use of supporting tools. In this thesis, we aim at helping developers avoid introducing vulnerabilities during the development process. We propose a machine learning approach to predict vulnerabilities at commit level, in which, given a commit, we report whether a change in a touched file might induce a vulnerability. We analyse a set of 19 metrics, such as code churn, ownership, code metrics, and code smells, for three open source C projects (Tcpdump, FFmpeg, ImageMagick). The dataset consists of 149 vulnerability-inducing commits and 168 non-vulnerability-inducing commits, covering 12 different types of vulnerabilities, most prominently buffer-related vulnerabilities. We first analyse how the metrics relate to vulnerable code changes and then create different models to predict vulnerability-inducing commits. As a result, we found 7 out of 19 metrics to have a consistently strong relationship to vulnerability-inducing commits. Further, our best models reached precision, recall, and F1-score greater than 0.8, indicating that such a prediction of vulnerability-inducing commits may be a useful additional tool for software developers.
Moritz Eck, Fabio Palomba, Marco Castelluccio, Alberto Bacchelli, Understanding flaky tests: the developer’s perspective, In: ESEC/FSE '19: 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, New York, NY, USA, 2019-09-26. (Conference or Workshop Paper published in Proceedings)
Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) despite exercising unchanged code. In this work, we examine the perceptions of software developers about the nature, relevance, and challenges of flaky tests.
We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature and the origin of the flakiness, as well as the fixing effort. We also examined the developers' fixing strategies. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: the flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report facing mostly regard the reproduction of the flaky behavior and the identification of the cause of the flakiness. Public preprint [http://arxiv.org/abs/1907.01466], data and materials [https://doi.org/10.5281/zenodo.3265785].
Pavlína Wurzelová, Fabio Palomba, Alberto Bacchelli, Characterizing Women (Not) Contributing to Open-Source, In: 2019 IEEE/ACM 2nd International Workshop on Gender Equality in Software Engineering (GE), IEEE, USA, 2019-06-27. (Conference or Workshop Paper published in Proceedings)
Women are under-represented not only in software development, but even more so in the Open-Source Software (OSS) community. In this study we examine whether there are differences between women in the OSS community and outside of it. Identifying these differences may help to attract other women to contribute to OSS. Furthermore, it might uncover potential biases in data about female developers gathered through mining software repositories.
Using the data from the Stack Overflow Developer Survey 2018, counting 100,000+ respondents (6.9% female), we compare the characteristics of women who report contributing to OSS and those who report not contributing. Surprisingly, we did not find the expected differences to be present, thus suggesting that open-source software data represents well the closed-source population of female developers. However, our results also did not identify potential correlates of the higher under-representation of women in OSS than in closed-source settings.
Vladimir Kovalenko, Egor Bogomolov, Timofey Bryksin, Alberto Bacchelli, PathMiner: A Library for Mining of Path-Based Representations of Code, In: 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), IEEE, USA, 2019-06-25. (Conference or Workshop Paper published in Proceedings)
One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of the path-based representation, an approach that represents a snippet of code as a collection of paths from its syntax tree. Such a representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps add up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code. In this paper, we present PathMiner, an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257].
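The path-based representation can be illustrated in miniature with Python's built-in `ast` module. PathMiner itself supports multiple languages and a richer path format; the leaf definition (Name and Constant nodes) and the path rendering below are simplifying assumptions made for this sketch:

```python
import ast

def node_chain(root, target):
    """Return the chain of AST nodes from root down to target."""
    def dfs(node, path):
        if node is target:
            return path + [node]
        for child in ast.iter_child_nodes(node):
            found = dfs(child, path + [node])
            if found:
                return found
        return None
    return dfs(root, [])

def leaf_tokens(tree):
    """Treat Name and Constant nodes as the leaves of the syntax tree."""
    return [n for n in ast.walk(tree)
            if isinstance(n, (ast.Name, ast.Constant))]

def path_between(tree, a, b):
    """Path from leaf a up to the lowest common ancestor, then down to
    leaf b, rendered as a sequence of node-type names."""
    ca, cb = node_chain(tree, a), node_chain(tree, b)
    i = 0
    while i < min(len(ca), len(cb)) and ca[i] is cb[i]:
        i += 1
    up = [type(n).__name__ for n in reversed(ca[i:])]
    down = [type(n).__name__ for n in cb[i:]]
    return up + [type(ca[i - 1]).__name__] + down

tree = ast.parse("x = y + 1")
a, b = leaf_tokens(tree)[:2]          # Name 'x' and Name 'y'
print(path_between(tree, a, b))       # ['Name', 'Assign', 'BinOp', 'Name']
```

A snippet's representation is then the bag of such paths over all leaf pairs, which a downstream model can consume as features.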
Vincent J. Hellendoorn, Sebastian Proksch, Harald C. Gall, Alberto Bacchelli, When Code Completion Fails: A Case Study on Real-World Completions, In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, USA, 2019. (Conference or Workshop Paper published in Proceedings)
Code completion is commonly used by software developers and is integrated into all major IDEs. Good completion tools can not only save time and effort but may also help avoid incorrect API usage. Many proposed completion tools have shown promising results on synthetic benchmarks, but these benchmarks make no claims about the realism of the completions they test. This lack of grounding in real-world data could hinder our scientific understanding of developer needs and of the efficacy of completion models. This paper presents a case study on 15,000 code completions that were applied by 66 real developers, which we study and contrast with artificial completions to inform future research and tools in this area. We find that synthetic benchmarks misrepresent many aspects of real-world completions; tested completion tools were far less accurate on real-world data. Worse, on the few completions that consumed most of the developers' time, prediction accuracy was less than 20% -- an effect that is invisible in synthetic benchmarks. Our findings have ramifications for future benchmarks, tool design, and real-world efficacy: benchmarks must account for the completions that developers use most, such as intra-project APIs; models should be designed to be amenable to intra-project data; and real-world developer trials are essential to quantifying performance on the least predictable completions, which are both the most time-consuming and far more typical than artificial data suggests. We publicly release our preprint [https://doi.org/10.5281/zenodo.2565673] and replication data and materials [https://doi.org/10.5281/zenodo.2562249].
Davide Spadini, Fabio Palomba, Tobias Baum, Stefan Hanenberg, Magiel Bruntink, Alberto Bacchelli, Test-Driven Code Review: An Empirical Study, In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, USA, 2019-06-25. (Conference or Workshop Paper published in Proceedings)
Test-Driven Code Review (TDR) is a code review practice in which a reviewer inspects a patch by examining the changed test code before the changed production code. Although this practice has been mentioned positively by practitioners in informal literature and interviews, there is no systematic knowledge of its effects, prevalence, problems, and advantages. In this paper, we aim at empirically understanding whether this practice has an effect on code review effectiveness and how developers perceive TDR. We conduct (i) a controlled experiment with 93 developers who perform more than 150 reviews, and (ii) 9 semi-structured interviews and a survey with 103 respondents to gather information on how TDR is perceived. Key results from the experiment show that developers adopting TDR find the same proportion of defects in production code, but more in test code, at the expense of fewer maintainability issues in production code. Furthermore, we found that most developers prefer to review production code, as they deem it more critical and believe that tests should follow from it. Moreover, generally poor test code quality and the lack of tool support hinder the adoption of TDR. Public preprint: [https://doi.org/10.5281/zenodo.2551217], data and materials: [https://doi.org/10.5281/zenodo.2553139].
Jonas Klass, A Machine Learning Approach to Predicting Developers' Behaviour and Build Results in Continuous Integration, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
Continuous Integration, extended by Continuous Code Quality, is a popular software development practice for providing Software Quality Assurance. One of the main shortcomings of this approach is that developers only learn about insufficient code quality after their changes have been built and analyzed. Therefore, researchers have examined different approaches to give just-in-time quality predictions. As no systematic overview of the topic is available, in this thesis a Systematic Literature Review on the subject is performed. The review shows that these approaches work well and are usually based on Machine Learning classifiers trained with the data of projects' change histories.
To learn more about developers' behaviour in Continuous Integration, a study utilizing the change histories of projects using Continuous Integration is conducted. For this purpose, different Machine Learning classifiers are trained with the data from the change histories. The study shows that prediction models for the behaviour of developers regarding continuous quality control and for the build status on the build server work well. Further, the results highlight the need for suggestion methods when code quality checks need to be performed.
Davide Spadini, Maurício Aniche, Magiel Bruntink, Alberto Bacchelli, Mock objects for testing Java systems: Why and how developers use them, and how they evolve, Empirical Software Engineering, Vol. 24 (3), 2019. (Journal Article)
When testing software artifacts that have several dependencies, one has the possibility of either instantiating these dependencies or using mock objects to simulate the dependencies' expected behavior. Even though recent quantitative studies have shown that mock objects are widely used in both open source and proprietary projects, scientific knowledge is still lacking on how and why practitioners use mocks. An empirical understanding of the situations where developers have (and have not) been applying mocks, as well as the impact of such decisions in terms of coupling and software evolution, can be used to help practitioners adapt and improve their future usage. To this aim, we study the usage of mock objects in three OSS projects and one industrial system. More specifically, we manually analyze more than 2,000 mock usages. We then discuss our findings with developers from these systems, and identify practices, rationales, and challenges. These results are supported by a structured survey with more than 100 professionals. Finally, we manually analyze how the usage of mock objects in test code evolves over time, as well as the impact of their usage on the coupling between test and production code. Our study reveals that the usage of mocks is highly dependent on the responsibility and the architectural concern of the class. Developers report frequently mocking dependencies that make testing difficult (e.g., infrastructure-related dependencies) and not mocking classes that encapsulate domain concepts/rules of the system. Among the key challenges, developers report that keeping the behavior of the mock compatible with the behavior of the original class is hard and that mocking increases the coupling between the test and the production code. Their perceptions are confirmed by our data: we observed that mocks mostly exist from the very first version of the test class and tend to stay there for its whole lifetime, and that changes in production code often force the test code to change as well.
Luca Pascarella, Fabio Palomba, Alberto Bacchelli, Fine-grained just-in-time defect prediction, The Journal of Systems and Software, Vol. 150, 2019. (Journal Article)
Defect prediction models focus on identifying defect-prone code elements, for example to allow practitioners to allocate testing resources to specific subsystems and to provide assistance during code reviews. While the research community has been highly active in proposing metrics and methods to predict defects over long-term periods (i.e., at release time), a recent trend is represented by so-called short-term defect prediction (i.e., at commit level). Indeed, this strategy represents an effective alternative in terms of the effort required to inspect files likely affected by defects. Nevertheless, the granularity considered by such models might still be too coarse. Indeed, existing commit-level models flag an entire commit as defective even in cases where only specific files actually contain defects.
In this paper, we first investigate to what extent commits are partially defective; then, we propose a novel fine-grained just-in-time defect prediction model to predict the specific files, contained in a commit, that are defective. Finally, we evaluate our model in terms of (i) performance and (ii) the extent to which it decreases the effort required to diagnose a defect. Our study highlights that: (1) defective commits are frequently composed of a mixture of defective and non-defective files, (2) our fine-grained model can accurately predict defective files with an AUC-ROC up to 82% and (3) our model would allow practitioners to save inspection efforts with respect to standard just-in-time techniques.
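The shape of such a file-level prediction setup can be sketched with an off-the-shelf classifier. The feature names and labels below are invented stand-ins on synthetic data, not the paper's actual metrics or dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 600  # one row per file touched by some commit

# Hypothetical per-file features: lines added, lines deleted,
# past churn of the file, author experience.
X = rng.normal(size=(n, 4))

# Synthetic label standing in for "this file in the commit is defective",
# loosely tied to churn-related columns plus noise.
y = (X[:, 0] + X[:, 2] + rng.normal(scale=0.8, size=n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Rank the files of an incoming commit by predicted defect probability,
# so reviewers can inspect the riskiest files first.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Ranking files by `predict_proba` rather than flagging whole commits is what yields the effort savings the paper reports.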
Enrico Fregnan, Tobias Baum, Fabio Palomba, Alberto Bacchelli, A survey on software coupling relations and tools, Information and Software Technology, Vol. 107, 2019. (Journal Article)
Context
Coupling relations reflect the dependencies between software entities and can be used to assess the quality of a program. For this reason, a vast number of them have been developed, together with tools to compute the related metrics. However, this makes it challenging to find the coupling measures suitable for a given application.
Goals
The first objective of this work is to provide a classification of the different kinds of coupling relations, together with the metrics to measure them. The second is to present an overview of the tools proposed so far by the software engineering academic community to extract these metrics.
Method
This work constitutes a systematic literature review in software engineering. To retrieve the referenced publications, publicly available scientific research databases were used. These sources were queried using keywords inherent to software coupling. We included publications from the period 2002 to 2017 and highly cited earlier publications. A snowballing technique was used to retrieve further related material.
Results
Four groups of coupling relations were found: structural, dynamic, semantic and logical. A fifth set of coupling relations includes approaches too recent to be considered an independent group and measures developed for specific environments. The investigation also retrieved tools that extract the metrics belonging to each coupling group.
Conclusion
This study shows the directions followed by research on software coupling, e.g., developing metrics for specific environments. Concerning the metric tools, three trends have emerged in recent years: the use of visualization techniques, extensibility, and scalability. Finally, some applications of coupling metrics were presented (e.g., code smell detection), indicating possible future research directions. Public preprint [https://doi.org/10.5281/zenodo.2002001].
Tobias Baum, Kurt Schneider, Alberto Bacchelli, Associating working memory capacity and code change ordering with code review performance, Empirical Software Engineering, Vol. 24 (4), 2019. (Journal Article)
Change-based code review is a software quality assurance technique that is widely used in practice. Therefore, better understanding what influences performance in code reviews and finding ways to improve it can have a large impact. In this study, we examine the association of working memory capacity and cognitive load with code review performance, and we test the predictions of a recent theory regarding improved code review efficiency with certain orders of code change parts. We perform a confirmatory experiment with 50 participants, mostly professional software developers. The participants performed code reviews on one small and two larger code changes from an open source software system to which we had seeded additional defects. We measured their efficiency and effectiveness in defect detection, their working memory capacity, and several potential confounding factors. We find that there is a moderate association between working memory capacity and the effectiveness of finding delocalized defects, influenced by other factors, whereas the association with other defect types is almost non-existent. We also confirm that the effectiveness of reviews is significantly larger for small code changes. We cannot conclude reliably whether the order of presenting the code change parts influences the efficiency of code review.
Luca Pascarella, Magiel Bruntink, Alberto Bacchelli, Classifying code comments in Java software systems, Empirical Software Engineering, Vol. 24 (3), 2019. (Journal Article)
Code comments are a key software component containing information about the underlying implementation. Several studies have shown that code comments enhance the readability of the code. Nevertheless, not all comments have the same goal and target audience. In this paper, we investigate how 14 diverse Java open and closed source software projects use code comments, with the aim of understanding their purpose. Through our analysis, we produce a taxonomy of source code comments; subsequently, we investigate how often each category occurs by manually classifying more than 40,000 lines of code comments from the aforementioned projects. In addition, we investigate how to automatically classify code comments at line level into our taxonomy using machine learning; initial results are promising and suggest that an accurate classification is within reach, even when training the machine learner on projects different from the target one. Data and Materials [https://doi.org/10.5281/zenodo.2628361].
Davide Spadini, Maurício Aniche, Alberto Bacchelli, PyDriller: Python framework for mining software repositories, In: 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM Press, New York, New York, USA, 2018-12-04. (Conference or Workshop Paper published in Proceedings)
Software repositories contain historical and valuable information about the overall development of software systems. Mining software repositories (MSR) is nowadays considered one of the most interesting growing fields within software engineering. MSR focuses on extracting and analyzing data available in software repositories to uncover interesting, useful, and actionable information about the system. Even though MSR plays an important role in software engineering research, few tools have been created and made public to support developers in extracting information from Git repositories. In this paper, we present PyDriller, a Python framework that eases the process of mining Git. We compare our tool against the state-of-the-art Python framework GitPython, demonstrating that PyDriller can achieve the same results with, on average, 50% fewer LOC and significantly lower complexity.
URL: https://github.com/ishepard/pydriller
Materials: https://doi.org/10.5281/zenodo.1327363
Pre-print: https://doi.org/10.5281/zenodo.1327411
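A minimal usage sketch of PyDriller, following its documented API (attribute names have varied across releases, e.g. `modified_files` was earlier `modifications`, so check the documentation for the version you install):

```python
from pydriller import Repository  # pip install pydriller

# Iterate over all commits of a repository (local path or remote URL)
for commit in Repository("path/to/repo").traverse_commits():
    print(commit.hash[:8], commit.author.name)
    for mf in commit.modified_files:  # files touched by this commit
        print("  ", mf.filename, f"+{mf.added_lines} -{mf.deleted_lines}")
```

The same `traverse_commits()` loop accepts filters (date ranges, branches, file types) as keyword arguments to `Repository`, which is where most of the LOC savings over raw GitPython come from.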
Achyudh Ram, Anand Ashok Sawant, Marco Castelluccio, Alberto Bacchelli, What Makes a Code Change Easier To Review? An Empirical Investigation on Code Change Reviewability, In: 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM Press, New York, NY, 2018. (Conference or Workshop Paper published in Proceedings)
Peer code review is a practice widely adopted in software projects to improve the quality of code. In current code review practices, code changes are manually inspected by developers other than the author before these changes are integrated into a project or put into production. We conducted a study to obtain an empirical understanding of what makes a code change easier to review. To this end, we surveyed published academic literature and sources from gray literature (e.g., blogs and white papers), we interviewed ten professional developers, and we designed and deployed a reviewability evaluation tool that professional developers used to rate the reviewability of 98 changes. We find that reviewability is defined through several factors, such as the change description, size, and coherent commit history. We provide recommendations for practitioners and researchers. Preprint [https://pure.tudelft.nl/portal/files/45941832/reviewability.pdf]. Data and Materials [https://doi.org/10.5281/zenodo.1323659].
Anand Ashok Sawant, Guangzhe Huang, Gabriel Vilen, Stefan Stojkovski, Alberto Bacchelli, Why are Features Deprecated? An Investigation Into the Motivation Behind Deprecation, In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, USA, 2018-10-23. (Conference or Workshop Paper published in Proceedings)
In this study, we investigate why API producers deprecate features. Previous work has shown that knowing the rationale behind the deprecation of an API aids a consumer in deciding how to react, hinting at a diversity of deprecation reasons. We manually analyze the Javadoc of 374 deprecated methods pertaining to four mainstream Java APIs to see whether the reason behind the deprecation is mentioned. We find that understanding the rationale from the Javadoc alone is insufficient; hence we add other data sources, such as the source code, issue tracker data, and commit history. We observe 12 reasons that trigger API producers to deprecate a feature. We evaluate an automated approach to classify these motivations.
Davide Spadini, Fabio Palomba, Andy Zaidman, Magiel Bruntink, Alberto Bacchelli, On the Relation of Test Smells to Software Code Quality, In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, USA, 2018-10-23. (Conference or Workshop Paper published in Proceedings)
Test smells are sub-optimal design choices in the implementation of test code. As reported by recent studies, their presence might not only negatively affect the comprehension of test suites but can also lead to test cases being less effective in finding bugs in production code. Although significant steps have been taken toward understanding test smells, there is still a notable absence of studies assessing their association with software quality. In this paper, we investigate the relationship between the presence of test smells and the change- and defect-proneness of test code, as well as the defect-proneness of the tested production code. To this aim, we collect data on 221 releases of ten software systems and analyze more than a million test cases to investigate the association of six test smells and their co-occurrence with software quality. Key results of our study include: (i) tests with smells are more change- and defect-prone, (ii) "Indirect Testing", "Eager Test", and "Assertion Roulette" are the most significant smells for change-proneness, and (iii) production code is more defect-prone when tested by smelly tests.
Carmine Vassallo, Fabio Palomba, Alberto Bacchelli, Harald C Gall, Continuous code quality: are we (really) doing that?, In: ASE '18: 33rd ACM/IEEE International Conference on Automated Software Engineering, ACM, New York, NY, USA, 2018-10-03. (Conference or Workshop Paper published in Proceedings)
Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Indeed, previous studies provided empirical evidence of a positive association between properly following CI principles and source code quality. A core principle behind CI, Continuous Code Quality (also known as CCQ, which includes automated testing and automated code inspection), may appear simple and effective, yet we know little about its practical adoption. In this paper, we propose a preliminary empirical investigation aimed at understanding how rigorously practitioners follow CCQ. Our study reveals a strong dichotomy between theory and practice: developers do not perform continuous inspection but rather control for quality only at the end of a sprint, and most of the time only on the release branch. Preprint [https://doi.org/10.5281/zenodo.1341036]. Data and Materials [http://doi.org/10.5281/zenodo.1341015].
Vladimir Kovalenko, Fabio Palomba, Alberto Bacchelli, Mining file histories: should we consider branches?, In: ASE '18: 33rd ACM/IEEE International Conference on Automated Software Engineering, ACM, New York, NY, USA, 2018-10-03. (Conference or Workshop Paper published in Proceedings)
Modern distributed version control systems, such as Git, offer support for branching - the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history - reviewer recommendation, change recommendation, and defect prediction - for the two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques' performance that is rather modest.
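The difference between the two history-retrieval algorithms can be shown on a toy commit graph. The DAG, the set of commits touching the file, and the traversal below are illustrative constructions, not the paper's implementation:

```python
# Toy commit DAG: child -> list of parents (first parent listed first),
# mimicking Git. 'M' merges a feature branch ('f1', 'f2') into the
# mainline ('c1', 'c2').
PARENTS = {
    'M':  ['c2', 'f2'],
    'c2': ['c1'],
    'f2': ['f1'],
    'f1': ['c1'],
    'c1': [],
}
# Commits that modified the file of interest:
TOUCHES = {'c1', 'c2', 'f2'}

def history(head, first_parent_only):
    """File modification history reachable from head, either following
    only each commit's first parent or all parents (all branches)."""
    seen, stack, hist = set(), [head], []
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        if c in TOUCHES:
            hist.append(c)
        parents = PARENTS[c]
        stack.extend(parents[:1] if first_parent_only else parents)
    return sorted(hist)

full = history('M', first_parent_only=False)  # ['c1', 'c2', 'f2']
main = history('M', first_parent_only=True)   # ['c1', 'c2'] -- misses 'f2'
```

The first-parent traversal silently drops the branch-side modification `f2`, which is exactly the kind of discrepancy the study quantifies at scale.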
Moritz Eck, Understanding Flaky Tests: Relevance, Nature, and Challenges, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Regression testing allows developers to check that newly committed code changes do not introduce new defects. Unfortunately, even tests might be defective. One of the issues reported by practitioners and researchers concerning tests is flakiness, which refers to tests that exhibit a seemingly random passing and failing outcome when run against the same code. While the research community has investigated automated solutions to locate and fix flaky tests, there is still limited knowledge on (i) how relevant the problem is in practice, (ii) what causes the flakiness, (iii) what challenges developers perceive when dealing with flaky tests, and (iv) to what extent the cause of flakiness can be classified automatically.
With the aim of increasing our scientific knowledge on these aspects, we conduct an empirical investigation that relies on (1) software repository data (pertaining to 391 software systems), (2) a novel dataset of 200 flaky tests classified by the practitioners who fixed those tests, (3) the opinions of 120 developers collected in an online questionnaire, and (4) a machine learning approach automatically classifying the cause of the flakiness. The results of our study highlight that (i) the problem of test flakiness is relevant and (ii) characterized by several causes, of which four have never been reported before, despite being costly to fix. Furthermore, we find that, among the challenges faced by developers, (iii) test reproduction and the classification of the cause of flakiness are the most pressing. Finally, (iv) we propose a two-stage machine learning approach to automatically classify the cause of flakiness, achieving an F-Measure of 75% and AUC-ROC of 91% when considering the seven most frequent types of flakiness.
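The shape of a two-stage classifier can be sketched as follows. The features, labels, and cause ids below are synthetic stand-ins, not the thesis's actual data or model; the point is only the structure: a first stage deciding flaky vs. stable, and a second stage, trained only on flaky examples, assigning a cause:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 5))             # e.g. code/test metrics per test case
is_flaky = (X[:, 0] > 0.3).astype(int)  # synthetic stage-1 label
cause = np.where(X[:, 1] > 0, 1, 2)     # synthetic cause ids (e.g. async, concurrency)

# Stage 1: flaky vs. stable, trained on all tests.
stage1 = RandomForestClassifier(random_state=0).fit(X, is_flaky)

# Stage 2: cause of flakiness, trained only on the flaky subset.
flaky_idx = np.where(is_flaky == 1)[0]
stage2 = RandomForestClassifier(random_state=0).fit(X[flaky_idx], cause[flaky_idx])

def classify(x):
    """Run the two stages in sequence on one feature vector."""
    x = x.reshape(1, -1)
    if stage1.predict(x)[0] == 0:
        return 'not flaky'
    return f'flaky, cause {stage2.predict(x)[0]}'
```

Splitting the problem this way lets the cause model train on a cleaner, flaky-only distribution instead of being dominated by the stable majority class.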