Pasquale Salza, Fabio Palomba, Dario Di Nucci, Andrea De Lucia, Filomena Ferrucci, Third-Party Libraries in Mobile Apps: When, How, and Why Developers Update Them, Empirical Software Engineering (EMSE), Vol. 25 (3), 2020. (Journal Article)
Fiorella Zampetti, Carmine Vassallo, Sebastiano Panichella, Gerardo Canfora, Harald Gall, Massimiliano Di Penta, An Empirical Characterization of Bad Practices in Continuous Integration, Empirical Software Engineering, Vol. 25 (2), 2020. (Journal Article)
Continuous Integration (CI) has been claimed to introduce several benefits in software development, including high software quality and reliability. However, recent work has pointed out challenges, barriers, and bad practices characterizing its adoption. This paper empirically investigates the bad practices experienced by developers applying CI. The investigation was conducted by leveraging semi-structured interviews with 13 experts and mining more than 2,300 Stack Overflow posts. As a result, we compiled a catalog of 79 CI bad smells belonging to 7 categories related to different dimensions of CI pipeline management and process. We also investigated the perceived importance of the identified bad smells through a survey involving 26 professional developers, and discussed how the results of our study relate to existing knowledge about CI bad practices. While some results, such as the poor usage of branches, confirm existing literature, the study also highlights bad practices not previously reported, e.g., related to static analysis tools or the abuse of shell scripts, and contradicts existing knowledge, e.g., about avoiding nightly builds. We discuss the implications of our catalog of CI bad smells for (i) practitioners, e.g., favor specific, portable tools over hacking, and do not ignore or hide build failures; (ii) educators, e.g., teach CI culture, not just technology, and teach CI by providing examples of what not to do; and (iii) researchers, e.g., develop support for failure analysis, as well as automated CI bad smell detectors.
Carmine Vassallo, Sebastiano Panichella, Fabio Palomba, Sebastian Proksch, Harald C Gall, Andy Zaidman, How Developers Engage with Static Analysis Tools in Different Contexts, Empirical Software Engineering, Vol. 25 (2), 2020. (Journal Article)
Automatic static analysis tools (ASATs) are instruments that support code quality assessment by automatically detecting defects and design issues. Despite their popularity, they are characterized by (i) a high false positive rate and (ii) the low comprehensibility of the generated warnings. However, no prior studies have investigated the usage of ASATs in different development contexts (e.g., code reviews, regular development), nor how open source projects integrate ASATs into their workflows. These perspectives are paramount to improve the prioritization of the identified warnings. To shed light on actual ASAT usage practices, in this paper we first survey 56 developers (66% from industry and 34% from open source projects) and interview 11 industrial experts who leverage ASATs in their workflow, with the aim of understanding how they use ASATs in different contexts. Furthermore, to investigate how ASATs are being used in the workflows of open source projects, we manually inspect the contribution guidelines of 176 open-source systems and extract the ASAT configuration and build files from their corresponding GitHub repositories. Our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context; (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming; and (iii) 66% of the projects define how to use specific ASATs, but only 37% enforce their usage for new contributions. The perceived relevance of ASATs varies between projects and domains, a sign that ASAT use is still not a common practice. In conclusion, this study confirms previous findings on the unwillingness of developers to configure ASATs and emphasizes the necessity to improve existing strategies for the selection and prioritization of the ASAT warnings that are shown to developers.
Carmine Vassallo, Sebastian Proksch, Timothy Zemp, Harald C Gall, Every Build You Break: Developer-Oriented Assistance for Build Failure Resolution, Empirical Software Engineering, Vol. 25 (3), 2020. (Journal Article)
Continuous integration is an agile software development practice. Instead of integrating features right before a release, they are constantly being integrated into an automated build process. This shortens the release cycle, improves software quality, and reduces time to market. However, the whole process comes to a halt when a commit breaks the build, which can happen for several reasons, e.g., compilation errors or test failures, and fixing the build suddenly becomes a top priority. Developers not only have to find the cause of the build break and fix it, they also have to do so quickly to avoid delaying others. Unfortunately, these steps require deep knowledge and are often time-consuming. To support developers in fixing a build break, we propose Bart, a tool that summarizes the reasons for Maven build failures and suggests possible solutions found on the internet. We show in a case study with 17 participants that developers find Bart useful to understand build breaks and that using Bart substantially reduces the time to fix a build break, on average by 37%. We have also conducted a qualitative study to better understand the workflows and information needs when fixing builds. We found that typical workflows differ substantially between the various error categories, and that several uncommon build errors are both very hard to investigate and to fix. These findings will be useful to inform future research in this area.
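The following minimal Python sketch illustrates the general idea behind this kind of developer-oriented assistance: extract the error summary from a Maven build log and derive a web search query for candidate solutions. All names and filtering rules are illustrative assumptions, not Bart's actual implementation.

```python
# A minimal sketch (not Bart's actual implementation) of the assistance
# idea: summarize the [ERROR] section of a Maven build log and derive a
# web search query for candidate solutions. All names are illustrative.
import re
import urllib.parse

def summarize_maven_failure(log_text, max_terms=10):
    errors = [line[len("[ERROR]"):].strip()
              for line in log_text.splitlines()
              if line.startswith("[ERROR]")]
    # Drop file paths, which rarely make useful search terms.
    cleaned = " ".join(re.sub(r"/\S+", "", e) for e in errors)
    terms = cleaned.split()[:max_terms]
    query = urllib.parse.quote_plus(" ".join(terms))
    return cleaned, "https://stackoverflow.com/search?q=" + query
```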
Andrea Aquino, Pietro Braione, Giovanni Denaro, Pasquale Salza, Facilitating Program Performance Profiling via Evolutionary Symbolic Execution, Software Testing, Verification & Reliability (STVR), Vol. 30 (2), 2020. (Journal Article)
Stefan Würsten, Software Microbenchmark Reconfiguration: Reducing Execution Time without Sacrificing Quality, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
In recent years, performance testing has gained increasing popularity. Such tests measure a software component's execution time. Unlike traditional functional tests, it is not sufficient to execute a performance test once; instead, performance should be measured multiple times to obtain a representative distribution. However, this makes execution far more time-intensive.
First, we investigate how developers currently configure software microbenchmarks, including how the proposed default values are modified and how this affects the execution time. Our analysis reveals that software microbenchmarks are rarely modified after being written. Many projects reuse the default values for certain parameters; however, when user-defined values are set, they often result in a shorter execution time.
Second, we investigate the consequences of dynamically determined execution configurations. At regular intervals, we check the characteristics of the performance distribution and decide whether more data points are required. We compare this novel approach with the standard execution. For the vast majority of software microbenchmarks, the novel approach produces a similar performance distribution, for which an A/A test cannot detect significant differences. Depending on the stopping criteria, up to 82% of the execution time can be saved. Our novel approach should help developers shorten the time-consuming execution while still producing sound results.
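The dynamic stopping idea can be sketched as follows in Python, assuming a hypothetical run_benchmark_once() measurement function. The batch size and the relative confidence-interval-width rule are illustrative; the thesis compares several stopping criteria, and this is only one plausible instance.

```python
# A sketch of the dynamic stopping idea, assuming a hypothetical
# run_benchmark_once() measurement function. The batch size and the
# relative CI-width rule are illustrative assumptions.
import random
import statistics

def run_benchmark_once():
    """Stand-in for a single microbenchmark invocation (hypothetical)."""
    return 100.0 + random.gauss(0, 2.0)  # simulated execution time

def measure_until_stable(batch=30, max_batches=100, max_rel_width=0.01):
    samples = []
    for _ in range(max_batches):
        samples.extend(run_benchmark_once() for _ in range(batch))
        mean = statistics.fmean(samples)
        # 99% normal-approximation CI of the mean; a bootstrap CI would
        # be more robust for skewed benchmark distributions.
        half = 2.576 * statistics.stdev(samples) / len(samples) ** 0.5
        if 2 * half / mean <= max_rel_width:
            break  # distribution deemed stable: stop early
    return samples

print(f"stopped after {len(measure_until_stable())} measurements")
```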
Carmine Vassallo, Enabling Continuous Improvement of a Continuous Integration Process, In: 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, IEEE, Washington, DC, United States, 2019-11-11. (Conference or Workshop Paper published in Proceedings)
Continuous Integration (CI) is a widely-adopted software engineering practice. Despite its undisputed benefits, such as higher software quality and improved developer productivity, mastering CI is not easy. Among the several barriers when transitioning to CI, developers have to face a new type of software failure (i.e., build failures) that requires them to understand complex build logs. Even when a team has successfully introduced a CI culture, living up to its principles and improving the CI practice are also challenging. In my research, I want to provide developers with the right support for establishing CI and the proper recommendations for continuously improving their CI process.
Carol Alexandru, Sebastian Proksch, Pooyan Behnamghader, Harald C Gall, Evo-Clocks: Software Evolution at a Glance, In: 7th IEEE Working Conference on Software Visualization, IEEE, Cleveland, OH, USA, 2019-09-30. (Conference or Workshop Paper published in Proceedings)
Understanding the evolution of a project is crucial in reverse-engineering, auditing and otherwise understanding existing software. Visualizing how software evolves can be challenging, as it typically abstracts a multi-dimensional graph structure where individual components undergo frequent but localized changes. Existing approaches typically consider either only a small number of revisions or they focus on one particular aspect, such as the evolution of code metrics or architecture. Approaches using a static view with a time axis (such as line charts) are limited in their expressiveness regarding structure, and approaches visualizing structure quickly become cluttered with an increasing number of revisions and components. We propose a novel trade-off between displaying global structure over a large time period with reduced accuracy and visualizing fine-grained changes of individual components with absolute accuracy. We demonstrate how our approach displays changes by blending redundant visual features (such as scales or repeating data points) where they are not expressive. We show how using this approach to explore software evolution can reveal ephemeral information when familiarizing oneself with a new project. We provide a working implementation as an extension to our open-source library for fine-grained evolution analysis, LISA.
Matej Jakovljevic, Investigation of Python Documentation Comments in Open-Source Projects, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
Source code comments are ubiquitous in the field of software engineering. Developers use them to describe their intentions and improve the comprehensibility of the associated source code. The reusability of existing source code increases the importance of documentation comments, which provide information about the reused source code entity to end users. However, the benefits of comments heavily depend on their content. To help developers write comments, and thus add value to them, it is necessary to understand what content they usually contain and how they are structured. Since this has not yet been explored in depth for documentation comments in Python, this thesis provides a basis for future researchers by investigating whether Python documentation comments in open-source projects follow a particular syntax, and what content they contain. Our investigation revealed that most Python documentation comments in open-source projects contain only one line of text, and that more than every third comment with more than one line of text follows a certain syntax. Our empirical study of the contents of documentation comments showed that a general description of the documented source code element is dominant, but descriptions of parameters and return values are also common.
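As a rough illustration of the kind of syntax analysis involved, the Python sketch below classifies function docstrings by common documentation conventions (reST/Sphinx, Google, NumPy). The regular expressions are simplified assumptions, not the heuristics used in the thesis.

```python
# Simplified sketch: classify the syntax style of Python docstrings in a
# source file. The three patterns cover common conventions (reST/Sphinx,
# Google, NumPy); they are assumptions, not the thesis' exact heuristics.
import ast
import re

STYLES = {
    "reST": re.compile(r"^\s*:(param|returns?|raises?|rtype)\b", re.M),
    "Google": re.compile(r"^\s*(Args|Returns|Raises):\s*$", re.M),
    "NumPy": re.compile(r"^\s*(Parameters|Returns|Raises)\s*\n\s*-{3,}", re.M),
}

def docstring_styles(source: str):
    """Yield (function name, detected style) for documented functions."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                style = next((s for s, p in STYLES.items() if p.search(doc)),
                             "free-form")
                yield node.name, style
```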
Mikael Basmaci, A Correlation Study Between Source Code Features and Benchmark Stability, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
Software microbenchmarks are used in Software Performance Engineering (SPE) to assess the performance of code fragments of a software system. Because they give non-deterministic results upon execution, there is a certain variability in the results of microbenchmarks. Since variability is an indicator of stability, the more variable the results, the less stable a microbenchmark is said to be. Studies show that the variability of microbenchmarks depends on different factors such as the hardware or platform the benchmarks are run on, the source code exercised by the benchmark, or even the programming language itself. One factor that may affect the variability, and hence the stability, of a microbenchmark is the source code features of the benchmark and the code it calls. Therefore, studying the correlation between the stability and the source code features of benchmarks may provide insight into this aspect. In this thesis, I analyze the variability of 4,589 microbenchmarks from 223 open-source projects written in Go to find out about their stability, collect source code features of the benchmarks by Abstract Syntax Tree (AST) parsing and call-graph analysis, and finally perform a correlation analysis based on the stability and source code features of the benchmarks. Results show that 98.17% of the benchmarks have a variability below 10% as measured by the 99% relative confidence interval width. Moreover, 87.70% of these benchmarks have a variability below 1%, meaning that most benchmarks are very stable. The correlation analysis shows that 14 out of 59 collected source code features have a correlation coefficient higher than 0.25, where the usage of the sync library is the most strongly correlating feature, with a value of 0.36, followed by the usage of the go keyword with a value of 0.29. Generally, concurrency- and control-flow-related features correlate with stability at values above 0.22, while the sync, sync/atomic, and math/rand libraries correlate even more strongly, with values above 0.25. An interesting finding is that pointers and defer statements are also relevant to stability. The findings from this study can be used by developers or tool designers to assess the stability of benchmarks, and can serve as a basis for predicting the causes of variability in microbenchmarks.
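A hedged sketch of the variability measure, assuming it is the width of a bootstrapped 99% confidence interval of the mean expressed relative to the mean; the exact computation in the thesis may differ.

```python
# Sketch of the variability measure, assuming it is the width of a
# bootstrapped 99% confidence interval of the mean, relative to the mean.
# The thesis' exact computation may differ.
import random
import statistics

def rel_ci_width_99(measurements, resamples=1000):
    means = sorted(
        statistics.fmean(random.choices(measurements, k=len(measurements)))
        for _ in range(resamples))
    lower = means[int(0.005 * resamples)]
    upper = means[int(0.995 * resamples) - 1]
    return (upper - lower) / statistics.fmean(measurements)

# A benchmark with rel_ci_width_99(times) < 0.01 would fall into the
# "variability below 1%" group reported above.
```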
Carol Alexandru, Efficient software evolution analysis: algorithmic and visual tools for investigating fine-grained software histories, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Dissertation)
Software analysis and its diachronic sibling, software evolution analysis, rely heavily on data computed by processing existing software.
Countless tools have been created for the analysis of source code, binaries and other artifacts.
The majority of these tools are written for one particular programming language and their modus operandi typically comprises the analysis of artifacts contained in file system directories representing the current version of a software system.
Researchers repurpose these tools for investigating software evolution by analyzing multiple revisions over the lifetime of a project.
But even though changes between revisions are usually tiny compared to the size of the affected artifacts, existing software evolution analysis techniques usually rely on redundantly re-analyzing entire files at best, or entire projects at worst, for every additional revision analyzed.
These limitations of being tied to a single ecosystem and of treating software as a static, timeless construct affect how we do software evolution research: it often restricts itself, rather arbitrarily, to the analysis of only a subset of revisions, instead of the full, high-resolution history of a project.
Thus, there exist both a need and the potential for representing and analyzing software artifacts more efficiently.
In this thesis, we identify several processes in existing software evolution analysis pipelines that suffer from redundancies and inefficiencies.
We then develop purpose-agnostic solutions for improving these processes and combine them in a generic, reusable, and extensible analysis library, called LISA.
We evaluate our approach extensively by computing (and publishing) code metrics for millions of program revisions, testing its generalizability by supporting multiple types of artifacts, analyses and programming languages, and by applying our tool to conduct concrete code studies.
Our findings indicate that analyzing software evolution using traditional tools incurs significant redundancies.
We demonstrate that the individual techniques we present are generalizable to multiple programming languages and artifact types and that they can accelerate the processing of evolving software by multiple orders of magnitude.
Alongside these core findings, our research has resulted in a state-of-the-art, open-source software analysis library, a large public dataset of historical code metrics, and incremental advancements in understanding the pace of software evolution, developer behavior and the visualization of software evolution.
Giovanni Grano, Timofey V Titov, Sebastiano Panichella, Harald C Gall, Branch Coverage Prediction in Automated Testing, Journal of Software: Evolution and Process, Vol. 31 (9), 2019. (Journal Article)
Software testing is crucial in continuous integration (CI). Ideally, at every commit, all the test cases should be executed and, moreover, new test cases should be generated for the new source code. This is especially true in a Continuous Test Generation (CTG) environment, where the automatic generation of test cases is integrated into the continuous integration pipeline. In this context, developers want to achieve a certain minimum level of coverage for every software build. However, executing all the test cases and, moreover, generating new ones for all the classes at every commit is not feasible. As a consequence, developers have to select which subset of classes has to be tested and/or targeted by test-case generation. We argue that knowing a priori the branch coverage that can be achieved with test-data generation tools can help developers make informed decisions about those issues. In this paper, we investigate the possibility of using source-code metrics to predict the coverage achieved by test-data generation tools.
We use four different categories of source-code features and assess the prediction on a large dataset involving more than 3,000 Java classes.
We compare different machine learning algorithms and conduct a fine-grained feature analysis aimed at investigating the factors that most impact the prediction accuracy.
Moreover, we extend our investigation to four different search-budgets.
Our evaluation shows that the best model achieves an average MAE of 0.15 and 0.21 on nested cross-validation over the different budgets, on EvoSuite and Randoop respectively. Finally, the discussion of the results demonstrates the relevance of coupling-related features for the prediction accuracy.
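The following Python sketch, assuming scikit-learn, shows the shape of such an evaluation: regress achieved branch coverage on per-class source-code metrics and estimate the MAE with nested cross-validation. The model, hyperparameter grid, and fold counts are illustrative, not the paper's exact configuration.

```python
# Sketch of the evaluation setup, assuming scikit-learn: predict achieved
# branch coverage from per-class source-code metrics and estimate the MAE
# via nested cross-validation. Model, grid, and fold counts are
# illustrative, not the paper's exact configuration.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

def nested_cv_mae(X, y):
    """X: class-level metric matrix; y: branch coverage in [0, 1]."""
    inner = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 300]},
        scoring="neg_mean_absolute_error", cv=5)
    # The outer loop around the tuned model gives the nested-CV estimate.
    scores = cross_val_score(inner, X, y,
                             scoring="neg_mean_absolute_error", cv=10)
    return -scores.mean()
```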
Christoph Laaber, Continuous Software Performance Assessment: Detecting Performance Problems of Software Libraries on Every Build, In: The 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ACM Press, New York, New York, USA, 2019-08-15. (Conference or Workshop Paper published in Proceedings)
Degradation of software performance can become costly for companies and developers, yet it is hardly assessed continuously. A strategy that would allow continuous performance assessment of software libraries is software microbenchmarking, which faces problems such as excessive execution times and unreliable results that hinder widespread adoption in continuous integration. In my research, I want to develop techniques that allow including software microbenchmarks in continuous integration by utilizing cloud infrastructure and execution time reduction techniques. These will allow assessing performance on every build and therefore catching performance problems before they are released into the wild.
Marc Zwimpfer, Investigating Plugin Usage in Open Source Maven Projects, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
Continuous Integration (CI) has become a widely-used software engineering practice, in which the automated software build plays a central role. Software build systems, like Maven, execute this task by automatically generating executable software from source code and hence play a crucial role in the whole CI process. Studies have shown that the configuration of such systems grows with the increasing age and size of the underlying project. However, little research has been conducted on the actual content of these configurations.
In this thesis, we examine in depth how Maven plugins are configured in practice by analysing the configurations of open-source projects using Maven. In Maven, all functionality is provided by plugins, and thus they form the core of every project. Analysing how plugins are used in Maven is important to gain further understanding of the usage of Maven and of build systems in general.
Despite the importance of plugins, we find that plugin management only makes up a small portion of the complete Maven configuration.
However, we show that plugins and their configurations are strongly influenced by inheritance in Maven projects.
With this thesis, we provide further insight into how developers actually use build systems. We show that Maven's standard plugin configuration suffices in most cases and, thus, that Maven's concept of "Convention over Configuration" is also successfully realized in its plugin configuration.
Moreover, we propose a method that encodes Maven configurations into vectors, which can be used for various analyses without information loss.
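As an illustration of what a lossless encoding might look like (the thesis' actual method may differ), the sketch below flattens each plugin's <configuration> subtree of a pom.xml into path/value features.

```python
# Illustration of one possible lossless encoding (the thesis' actual
# method may differ): flatten each plugin's <configuration> subtree of a
# pom.xml into path/value features.
import xml.etree.ElementTree as ET

NS = "{http://maven.apache.org/POM/4.0.0}"

def pom_features(pom_path):
    features = {}
    root = ET.parse(pom_path).getroot()
    for plugin in root.iter(NS + "plugin"):
        artifact = plugin.findtext(NS + "artifactId", default="unknown")
        conf = plugin.find(NS + "configuration")
        if conf is not None:
            _flatten(conf, artifact, features)
    return features

def _flatten(element, prefix, out):
    for child in element:
        key = prefix + "." + child.tag.removeprefix(NS)
        if len(child):            # nested element: recurse
            _flatten(child, key, out)
        else:                     # leaf: record its text value
            out[key] = (child.text or "").strip()
```

The resulting feature dictionaries could then be turned into numeric vectors, e.g., with scikit-learn's DictVectorizer.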
Noah Chavannes, Build Log Differencing, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
Continuous integration (CI) is transforming from a trend into an industry standard in software development [23]. A goal of CI is to develop software productively and with high quality. Building the software is a sub-process of CI and unfortunately does not always succeed on the first try. Finding the failure cause in a broken build can be a tedious process and delay the work of other developers [25]. Comparing two successive build logs could reveal the failure cause by highlighting the differences, but current text differencing tools fail to do so in an efficient manner. We introduce the concept of filtering build logs for irrelevant artifacts, which allows a meaningful comparison. Furthermore, we developed BLogDiff, a tool that applies this concept of noise filtering and allows developers to compute the difference between two build logs. We evaluated the tool by conducting a survey and could show that it supports inexperienced developers in finding the cause of failing builds. Lastly, we analyzed the build log domain quantitatively and found different characteristics of successful and failing builds.
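A minimal sketch of the noise-filtering idea: mask volatile artifacts such as timestamps, durations, and hashes before diffing two logs. The patterns are assumptions, not BLogDiff's actual filtering rules.

```python
# A minimal sketch of the noise-filtering idea (the patterns are
# assumptions, not BLogDiff's actual rules): mask volatile artifacts such
# as timestamps, durations, and hashes before diffing two build logs.
import difflib
import re

NOISE = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TIMESTAMP>"),
    (re.compile(r"\b\d+(\.\d+)?\s?(ms|s|min)\b"), "<DURATION>"),
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "<HASH>"),
]

def normalize(log_text):
    lines = []
    for line in log_text.splitlines():
        for pattern, placeholder in NOISE:
            line = pattern.sub(placeholder, line)
        lines.append(line)
    return lines

def log_diff(passing_log, failing_log):
    """Unified diff of two normalized build logs."""
    return "\n".join(difflib.unified_diff(
        normalize(passing_log), normalize(failing_log),
        fromfile="passing", tofile="failing", lineterm=""))
```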
Yves Rutishauser, Suggesting Meaningful Method Names, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
Good identifier names provide a high-level summary of source code and are therefore beneficial during software maintenance. Hence, automatically suggesting descriptive and accurate method names reduces the time spent maintaining code and improves the understandability and readability of software. Since source code shares many properties with natural language, many models originally developed for Natural Language Processing (NLP) have been successfully applied to code. In this thesis, I propose 2 different approaches and experiment with a total of 6 models that are specifically adjusted to solve the method naming problem. These models learn to map tokens to locations in a vector space (embeddings) such that tokens with similar meanings have similar embeddings. Based on the combination of these embeddings, I can suggest accurate method names. I demonstrate that models partially trained on the current project are more effective than models that predict on projects completely unobserved during training. Furthermore, I show the effectiveness of splitting a method name into sub-tokens, which allows these models to predict neologisms (names that are not in the vocabulary). In a quantitative analysis, I compare the different models and approaches using different metrics; I furthermore adapt a metric that is specifically designed for this task and has been used in the past. Additionally, I evaluate the models with different input parameters and show the effectiveness of using the type, the parameters, and the method body to suggest a method's name. In a qualitative analysis, I discuss 8 different use cases, demonstrate visualizations, and show the limits of the proposed models.
The code and data are available on GitHub [12] and Zenodo [56].
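The sub-token splitting mentioned above can be sketched as follows; the regular expression is a common camelCase/snake_case heuristic and not necessarily the one used in the thesis.

```python
# Sketch of sub-token splitting; the regex is a common camelCase /
# snake_case heuristic, not necessarily the one used in the thesis.
import re

SUBTOKEN = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def split_identifier(name):
    """'parseHTTPResponse' -> ['parse', 'http', 'response']"""
    return [tok.lower() for tok in SUBTOKEN.findall(name)]
```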
Christoph Laaber, Joel Scheuner, Philipp Leitner, Software Microbenchmarking in the Cloud. How Bad is it Really?, Empirical Software Engineering, Vol. 24 (4), 2019. (Journal Article)
Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to > 100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.
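The statistical test used for slowdown detection can be sketched as follows, assuming SciPy; the significance level is an illustrative choice.

```python
# Sketch of the slowdown check, assuming SciPy: compare test and control
# measurements from the same instance with a one-sided Wilcoxon rank-sum
# (Mann-Whitney U) test. The significance level is an illustrative choice;
# the paper additionally uses overlapping bootstrapped confidence intervals.
from scipy.stats import mannwhitneyu

def slowdown_detected(control_times, test_times, alpha=0.01):
    """True if test_times are statistically larger (slower) than control."""
    _, p_value = mannwhitneyu(test_times, control_times,
                              alternative="greater")
    return p_value < alpha
```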
Giovanni Grano, A New Dimension of Test Quality: Assessing and Generating Higher Quality Unit Test Cases, In: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ACM, New York, NY, USA, 2019-07-17. (Conference or Workshop Paper published in Proceedings)
Unit tests form the first defensive line against the introduction of bugs in software systems. Therefore, their quality is of paramount importance for producing robust and reliable software. To assess test quality, many organizations rely on metrics like code and mutation coverage. However, these are not always optimal for this purpose. In my research, I want to make mutation testing scalable by devising a lightweight approach to estimate test effectiveness. Moreover, I plan to introduce a new metric measuring test focus, as a proxy for the effort needed by developers to understand and maintain a test, that both complements code coverage in assessing test quality and can be used to drive automated generation of higher-quality test cases.
Carmine Vassallo, Giovanni Grano, Fabio Palomba, Harald C Gall, Alberto Bacchelli, A large-scale empirical exploration on refactoring activities in open source software projects, Science of Computer Programming, Vol. 180, 2019. (Journal Article)
Refactoring is a well-established practice that aims at improving the internal structure of a software system without changing its external behavior. Existing literature provides evidence of how and why developers perform refactoring in practice. In this paper, we continue this line of research by performing a large-scale empirical analysis of refactoring practices in 200 open source systems. Specifically, we analyze the change history of these systems at the commit level to investigate: (i) whether developers perform refactoring operations and, if so, which are most common; (ii) when refactoring operations are applied; and (iii) which are the main developer-oriented factors leading to refactoring. Based on our results, future research can focus on enabling automatic support for less frequent refactorings and on recommending refactorings based on the developer's workload, the project's maturity, and the developer's commitment to the project.
Carmine Vassallo, Sebastian Proksch, Harald C Gall, Massimiliano Di Penta, Automated Reporting of Anti-Patterns and Decay in Continuous Integration, In: 41st International Conference on Software Engineering, ICSE 2019, IEEE / ACM, New York, United States, 2019-05-25. (Conference or Workshop Paper published in Proceedings)
Continuous Integration (CI) is a widely-used software engineering practice. The software is continuously built so that changes can be easily integrated and issues such as unmet quality goals or style inconsistencies are detected early. Unfortunately, it is not only hard to introduce CI into an existing project, but it is also challenging to live up to the CI principles when facing tough deadlines or business decisions. Previous work has identified common anti-patterns that reduce the promised benefits of CI. Typically, these anti-patterns slowly creep into a project over time before they are identified. We argue that automated detection can help with early identification and prevent such process decay. In this work, we further analyze this assumption and survey 124 developers about CI anti-patterns. From the results, we build CI-Odor, a reporting tool for CI processes that detects the existence of four relevant anti-patterns by analyzing regular build logs and repository information. In a study on the 18,474 build logs of 36 popular Java projects, we reveal the presence of 3,823 high-severity warnings spread across the projects. We validate our reports in a survey among 13 original developers of these projects and through general feedback from 42 developers, which confirm the relevance of our reports.
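To make the reporting idea concrete, here is a hedged sketch of one hypothetical detector in the spirit of CI-Odor, flagging a slow-build smell from build durations; the rule and threshold are assumptions, not CI-Odor's actual detectors.

```python
# Hypothetical detector in the spirit of CI-Odor (rule and threshold are
# assumptions, not CI-Odor's actual detectors): flag a slow-build smell
# from chronological build durations extracted from CI logs.
import statistics

def slow_build_warning(durations_sec, window=10, threshold_sec=600):
    recent = durations_sec[-window:]
    if len(recent) == window and statistics.median(recent) > threshold_sec:
        return (f"median duration of the last {window} builds is "
                f"{statistics.median(recent):.0f}s (> {threshold_sec}s); "
                "consider restructuring or splitting the build")
    return None
```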