Csanad Erdei-Griff, CodeRAT: An Open Source Chromium-Based Browser Extension for GitHub Code Review Analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
CodeRAT (Code Review Analysis Tool) is an open-source, Chromium-based web browser extension for GitHub code review analysis. Requirements and features for the tool came from my own ideas, from suggestions by my supervising professor and supervisor, and from interviews with seven professional developers who work with code review. As of CodeRAT 1.0, the main features of the extension include collecting, logging, and analyzing data about file accesses, tab activity, various timers and timestamps, as well as the ultimate outcome of code review sessions. Future development of the tool may involve porting it to other browsers and version control platforms, as well as introducing additional features.
Lorenzo Gasparini, Enrico Fregnan, Larissa Braz Brasileiro Barbosa, Tobias Baum, Alberto Bacchelli, ChangeViz: Enhancing the GitHub Pull Request Interface with Method Call Information, In: 2021 Working Conference on Software Visualization (VISSOFT), IEEE, 2021-10-27. (Conference or Workshop Paper published in Proceedings)
Fernando Petrulio, Anand Ashok Sawant, Alberto Bacchelli, The indolent lambdification of Java: Understanding the support for lambda expressions in the Java ecosystem, Empirical Software Engineering, Vol. 26 (6), 2021. (Journal Article)
As Java 8 introduced functional interfaces and lambda expressions to the Java programming language, the JDK API was changed to support lambda expressions, thus allowing consumers to define lambda functions when using Java’s collections. While the JDK API allows for a functional paradigm, for API consumers to completely embrace Java’s new functional features, third-party APIs must also support lambda expressions. To understand the current state of the Java ecosystem, we investigate (i) the extent to which third-party Java APIs have changed their interfaces, (ii) why they do or do not introduce functional interface support, and (iii) in case an API has changed its interface, how it does so. We also investigate the consumers’ perspective, particularly their ease in using lambda expressions in Java with APIs. We perform our investigation by manually analyzing the top 50 popular Java APIs, conducting in-person and email interviews with 23 API producers, and surveying 110 developers. We find that only a minority of the top 50 APIs support functional interfaces; the rest do not, predominantly in the interest of backward compatibility. Java 7 support is still greatly desired, as enterprise projects have not migrated to newer versions of Java. This suggests that the Java ecosystem is stagnant and that the introduction of new language features will not be enough to save it from the advent of new languages such as Kotlin (JVM based) and Rust (non-JVM based).
Egor Bogomolov, Vladimir Kovalenko, Yurii Rebryk, Alberto Bacchelli, Timofey Bryksin, Authorship attribution of source code: a language-agnostic approach and applicability in software engineering, In: ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, New York, NY, USA, 2021-09-23. (Conference or Workshop Paper published in Proceedings)
Oliver Kamer, Fuzzing Playground: Easy-to-Use Web-Based Tool to Demonstrate Fuzzing, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
Fuzzing is the fully automatic testing of software for bugs. While fuzzing has become more popular in recent times, the process of setting up a fuzzer for learning purposes is complicated, and the output of a fuzzer is hard to understand. This report presents a Fuzzing Playground that lets the user easily and quickly start a fuzzing process and see what is happening under the hood. This is done by running the fuzzing process in-browser and by providing precompiled fuzzing targets, ready for the user to pick. The output visualizes the processes of the fuzzer and presents the user with the actual fuzzed data.
Christian Aeberhard, Checklists in code review, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
Code review is an important tool for reducing software defects and detecting vulnerabilities in earlier phases of software development. However, reviewing source code from a security perspective is a difficult task and requires extensive knowledge and experience. To make code reviews less dependent on the reviewers' security skills, one can consider incorporating security checklists. Checklist-based reading has been shown to be less dependent on the skill, knowledge, and experience of the reviewer. In this paper, we investigate the extent to which a security checklist assists the reviewer in detecting software vulnerabilities during code review activities. Furthermore, we investigate how the length of the checklist affects the reviewers' performance in this regard. To do so, we devised an online code review experiment with 106 experienced developers. Our results indicate that checklist support has no significant effect on detecting vulnerabilities during code review. Moreover, the length of the checklist does not affect the performance of the reviewer.
Stefano Anzolut, Fuzzing - An Automation Pipeline With Harness Generation in Java, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
Fuzzing is an automatic vulnerability discovery technique that makes software more resilient and secure. Fuzzing is becoming more relevant than ever; nonetheless, the lack of intuitive tools makes it less accessible to newcomers. To ease access to fuzzing, we create initial fuzz harnesses that explicitly specify what and how to fuzz a target function inside a program, as fuzzing harnesses are usually written by a domain expert and require in-depth knowledge of a project. This thesis proposes an automation pipeline that leverages existing Java test suites to generate fuzzing harnesses automatically and fuzzes them with Jazzer.
Larissa Braz Brasileiro Barbosa, Enrico Fregnan, Gül Çalikli, Alberto Bacchelli, Why Don’t Developers Detect Improper Input Validation?, In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), IEEE, Washington, DC, United States, 2021-05-22. (Conference or Workshop Paper published in Proceedings)
Improper Input Validation (IIV) is a software vulnerability that occurs when a system does not safely handle input data. Even though IIV is easy to detect and fix, it still commonly happens in practice. In this paper, we study to what extent developers can detect IIV and investigate the underlying reasons. This knowledge is essential to better understand how to support developers in creating secure software systems. We conduct an online experiment with 146 participants, of which 105 report at least three years of professional software development experience. Our results show that the existence of a visible attack scenario facilitates the detection of IIV vulnerabilities and that a significant portion of developers who did not find the vulnerability initially could identify it when warned about its existence. Yet, a total of 60 participants could not detect the vulnerability even after the warning. Other factors, such as the frequency with which the participants perform code reviews, influence the detection of IIV. Preprint: https://arxiv.org/abs/2102.06251. Data and materials: https://doi.org/10.5281/zenodo.3996696.
Pavlína Wurzel Gonçalves, Enrico Fregnan, Tobias Baum, Kurt Schneider, Alberto Bacchelli, Do Explicit Review Strategies Improve Code Review Performance?, In: MSR '20: 17th International Conference on Mining Software Repositories, ACM, New York, NY, USA, 2020-07-29. (Conference or Workshop Paper published in Proceedings)
Context: Code review is a fundamental, yet expensive part of software engineering. Therefore, research on understanding code review and its efficiency and performance is paramount.
Objective: We aim to test the effect of a guidance approach on review effectiveness and efficiency. This effect is expected to work by lowering the cognitive load of the task; thus, we analyze the mediation relationship as well.
Method: To investigate this effect, we employ an experimental design where professional developers have to perform three code reviews. We use three conditions: no guidance, a checklist, and a checklist-based review strategy. Furthermore, we measure the reviewers' cognitive load.
Limitations: The main limitations of this study concern the specific cohort of participants, the mono-operation bias for the guidance conditions, and the generalizability to other changes and defects. Full registered report: https://doi.org/10.17605/OSF.IO/5FPTJ; Materials: https://doi.org/10.6084/m9.figshare.11806656
Davide Spadini, Martin Schvarcbacher, Ana-Maria Oprescu, Magiel Bruntink, Alberto Bacchelli, Investigating Severity Thresholds for Test Smells, In: MSR '20: 17th International Conference on Mining Software Repositories, ACM, New York, NY, USA, 2020-07-29. (Conference or Workshop Paper published in Proceedings)
Test smells are poor design decisions implemented in test code, which can have an impact on the effectiveness and maintainability of unit tests. Even though test smell detection tools exist, how to rank the severity of the detected smells is an open research topic. In this work, we investigate the severity rating of four test smells and their impact on test suite maintainability as perceived by developers. To accomplish this, we first analyzed some 1,500 open-source projects to elicit severity thresholds for commonly found test smells. Then, we conducted a study with developers to evaluate our thresholds. We found that (1) current detection rules for certain test smells are considered too strict by developers and (2) our newly defined severity thresholds are in line with the participants' perception of how test smells impact the maintainability of a test suite. Preprint [https://doi.org/10.5281/zenodo.3744281], data and material [https://doi.org/10.5281/zenodo.3611111].
Davide Spadini, Gül Çalikli, Alberto Bacchelli, Primers or reminders? The effects of existing review comments on code review, In: ICSE '20: 42nd International Conference on Software Engineering, ACM, New York, NY, USA, 2020-07-27. (Conference or Workshop Paper published in Proceedings)
In contemporary code review, the comments put by reviewers on a specific code change are immediately visible to the other reviewers involved. Could this visibility prime new reviewers' attention (due to the human's proneness to availability bias), thus biasing the code review outcome? In this study, we investigate this topic by conducting a controlled experiment with 85 developers who perform a code review and a psychological experiment. With the psychological experiment, we find that ≈70% of participants are prone to availability bias. However, when it comes to the code review, our experiment results show that participants are primed only when the existing code review comment is about a type of bug that is not normally considered; when this comment is visible, participants are more likely to find another occurrence of this type of bug. Moreover, this priming effect does not influence reviewers' likelihood of detecting other types of bugs. Our findings suggest that the current code review practice is effective because existing review comments about bugs in code changes are not negative primers, rather positive reminders for bugs that would otherwise be overlooked during code review. Data and materials: https://doi.org/10.5281/zenodo.3653856
Tobias Famos, The Coupling Between Test and Production Code: A Multivocal Literature Review, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
This thesis investigates the definitions, problems, identification methods, and solutions surrounding the coupling between test and production code (called test coupling) to present the current state of knowledge and identify gaps to be closed by future research.
A literature review of white and grey literature was performed and three definition categories, two identification methods, three problems, and five solution categories have been identified. The connections between the definitions, problems, identification methods, and solutions were shown, and a structuring of the solutions was proposed to identify gaps in the current research and knowledge.
We concluded that most of the proposed solutions for mitigating the effects of test coupling, as well as the problem descriptions, are based on anecdotal evidence, and that further empirical evidence for those solutions and for the consequences of test coupling on a project must be provided.
Josua Fröhlich, Code Review Visualizations With CodeDiVis for Java, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
Code review is a part of the software development cycle aimed at improving code quality, bug detection, knowledge transfer, and team awareness. Research in this field has focused on minimizing developers' effort while increasing their performance. An example is the creation of a changes ordering theory to reduce reviewers’ cognitive load. However, code review can easily become a complex task when the number of changes to review increases. We hypothesized that providing supportive figures, such as a call or dependency graph, would help a reviewer during code review. Therefore, we developed a tool to visualize the relationships among entities in the code to be reviewed. To evaluate our tool, we conducted two qualitative studies: a user study with nine professional developers in a software development company and an online survey with 29 participants. In both studies, participants generally responded positively about the tool. Participants reported they were or would be able to understand the code changes more quickly. Moreover, they found it easier to navigate through the changes and to orient themselves in the merge requests.
Linda Di Geronimo, Larissa Braz Brasileiro Barbosa, Enrico Fregnan, Fabio Palomba, Alberto Bacchelli, UI Dark Patterns and Where to Find Them: A Study on Mobile Applications and User Perception, In: CHI '20: CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 2020-05-25. (Conference or Workshop Paper published in Proceedings)
A Dark Pattern (DP) is an interface maliciously crafted to deceive users into performing actions they did not mean to do. In this work, we analyze Dark Patterns in 240 popular mobile apps and conduct an online experiment with 589 users on how they perceive Dark Patterns in such apps. The results of the analysis show that 95% of the analyzed apps contain one or more forms of Dark Patterns and, on average, popular applications include at least seven different types of deceiving interfaces. The online experiment shows that most users do not recognize Dark Patterns, but can perform better in recognizing malicious designs if informed on the issue. We discuss the impact of our work and what measures could be applied to alleviate the issue.
Cristian De Iaco, Developers’ Perception of Unit Test Quality: An Empirical Investigation of Unit Test Quality from a Developer’s Point of View, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
"How good are my tests?", "did I test enough?" These are common questions raised by developers and thus extensively investigated by the research community. Code coverage is most commonly consulted as a readily available metric to assess unit test quality. However, code coverage is not a sufficient indicator of unit test quality, as it does not account for the behavior of the production code. Therefore, novel approaches need to go beyond code coverage to assess the quality of a unit test based on different aspects such as understandability, maintainability and the ability to find faults. This thesis aims to bridge this gap by investigating unit test quality from a developer’s point of view to understand the areas of interest when dealing with unit tests. To that end, a taxonomy of most important characteristics of unit tests was elaborated based on the feedback gathered from 5 developers during interviews and from 70 developers in the survey. Additionally, I investigated the most prevalent unit testing practices applied by developers of this study, to provide an insight on how developers currently assess unit test quality. The results of this study show the scope of a unit test as the main area of interest to judge its quality named by 67% of developers of this study and judged to be at least very important by 82% of them. Readability is regarded as the second most important characteristic of a high-quality unit test, being reported by57% of developers and judged to be at least very important in 81% of those cases. The third most stated feature of unit test quality is the general structure of the test (33% of developers, rated as at least very important in 86% of the reported features attributed to the test structure). 
Moreover, 40% of the features named in the survey are found to be assessed in practice, where manual control in the form of peer review and observation of test results prevail as the predominant practices applied by developers of this study to assess unit test quality. |
Luca Pascarella, Fabio Palomba, Alberto Bacchelli, On the performance of method-level bug prediction: A negative result, The Journal of Systems and Software, Vol. 161, 2020. (Journal Article)
Bug prediction is aimed at identifying software artifacts that are more likely to be defective in the future. Most approaches defined so far target the prediction of bugs at class/file level. Nevertheless, past research has provided evidence that this granularity is too coarse-grained for its use in practice. As a consequence, researchers have started proposing defect prediction models targeting a finer granularity (particularly method-level granularity), providing promising evidence that it is possible to operate at this level. Particularly, models mixing product and process metrics provided the best results.
We present a study in which we first replicate previous research on method-level bug prediction, using different systems and timespans. Afterwards, based on the limitations of existing research, we (1) re-evaluate method-level bug prediction models more realistically and (2) analyze whether alternative features based on textual aspects, code smells, and developer-related factors can be exploited to improve method-level bug prediction abilities. Key results of our study include that (1) the performance of the previously proposed models, tested using the same strategy but on different systems/timespans, is confirmed; but (2) when evaluated with a more practical strategy, all the models show a dramatic drop in performance, with results close to those of a random classifier. Finally, we find that (3) the contribution of alternative features within such models is limited and unable to improve the prediction capabilities significantly. As a consequence, our replication and negative results indicate that method-level bug prediction is still an open challenge.
Vladimir Kovalenko, Nava Tintarev, Evgeny Pasynkov, Christian Bird, Alberto Bacchelli, Does Reviewer Recommendation Help Developers?, IEEE Transactions on Software Engineering, Vol. 46 (7), 2020. (Journal Article)
Selecting reviewers for code changes is a critical step for an efficient code review process. Recent studies propose automated reviewer recommendation algorithms to support developers in this task. However, the evaluation of recommendation algorithms, when done apart from their target systems and users (i.e., code review tools and change authors), leaves out important aspects: perception of recommendations, influence of recommendations on human choices, and their effect on user experience. This study is the first to evaluate a reviewer recommender in vivo. We compare historical reviewers and recommendations for over 21,000 code reviews performed with a deployed recommender in a company environment and set out to measure the influence of recommendations on users' choices, along with other performance metrics. Having found no evidence of influence, we turn to the users of the recommender. Through interviews and a survey we find that, though perceived as relevant, reviewer recommendations rarely provide additional value for the respondents. We confirm this finding with a larger study at another company. The confirmation of this finding brings up a case for more user-centric approaches to designing and evaluating the recommenders. Finally, we investigate information needs of developers during reviewer selection and discuss promising directions for the next generation of reviewer recommendation tools. Preprint: https://doi.org/10.5281/zenodo.1404814.
Anand Ashok Sawant, Romain Robbes, Alberto Bacchelli, To react, or not to react: Patterns of reaction to API deprecation, Empirical Software Engineering, Vol. 24 (6), 2020. (Journal Article)
Application Programming Interfaces (APIs) provide reusable functionality to aid developers in the development process. The features provided by these APIs might change over time as the API evolves. To allow API consumers to peacefully transition from older obsolete features to new features, API producers make use of the deprecation mechanism, which allows them to indicate to the consumer that a feature should no longer be used. The Java language designers noticed that no one was taking these deprecation warnings seriously and continued using outdated features. Due to this, they decided to change the implementation of this feature in Java 9. We question to what extent this issue exists and whether the Java language designers have a case. We start by identifying the various ways in which an API consumer can react to deprecation. Following this, we benchmark the frequency of the reaction patterns by creating a dataset consisting of data mined from 50 API consumers, totalling 297,254 GitHub-based projects and 1,322,612,567 type-checked method invocations. We see that consumers predominantly do not react to deprecation, and we try to explain this behavior by surveying API consumers and by analyzing whether the API’s deprecation policy has an impact on the consumers’ decision to react.
Marco di Biase, Magiel Bruntink, Arie van Deursen, Alberto Bacchelli, The effects of change decomposition on code review—a controlled experiment, PeerJ. Computer science, Vol. 5, 2020. (Journal Article)
Background
Code review is a cognitively demanding and time-consuming process. Previous qualitative studies hinted at how decomposing change sets into multiple yet internally coherent ones would improve the reviewing process. So far, the literature has provided no quantitative analysis of this hypothesis.
Aims
(1) Quantitatively measure the effects of change decomposition on the outcome of code review (in terms of number of found defects, wrongly reported issues, suggested improvements, time, and understanding); (2) Qualitatively analyze how subjects approach the review and navigate the code, building knowledge and addressing existing issues, in large vs. decomposed changes.
Method
Controlled experiment using the pull-based development model involving 28 software developers among professionals and graduate students.
Results
Change decomposition leads to fewer wrongly reported issues, influences how subjects approach and conduct the review activity (by increasing context-seeking), yet impacts neither understanding the change rationale nor the number of found defects.
Conclusions
Change decomposition not only reduces the noise for subsequent data analyses but also significantly supports the tasks of the developers in charge of reviewing the changes. As such, commits belonging to different concepts should be separated, adopting this as a best practice in software engineering.