Giovanni Grano, The multiple facets of test case quality: analyzing effectiveness and going beyond, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Dissertation)
Nowadays, software pervades our life. Because software is so deeply rooted in our society, software failures can have enormous consequences. Unit test cases represent the first line of defense against the introduction of software bugs and a pillar of any software development pipeline.
The higher their quality, the better they can fulfill their role. This research aims at supporting developers in measuring and optimizing test suite quality. To fulfill this goal, we first characterized the test code quality aspects deemed important by practitioners. We learned that test quality does not have an exact definition and includes a variety of different facets. We also discovered that, while developers value test effectiveness, they believe it is not sufficient to achieve test quality since non-functional aspects also play a crucial role in it.
These insights motivated us to devise novel approaches to measure and optimize test effectiveness and non-functional quality aspects both in the context of manually written and automatically generated tests.
While mutation testing is widely used to measure effectiveness, its computational cost hinders its practical usage. We tackled the problem by exploiting machine learning (ML) models trained on source code features to estimate test effectiveness.
We relied on similar features to tackle the problem of code coverage prediction in the context of test case generation (TCG).
The ML models we proposed can tell developers whether TCG is able to produce satisfactory results for their software projects.
To optimize non-functional aspects along with code coverage in TCG, we proposed an adaptive search-based algorithm suitable for arbitrary secondary objectives. We instantiated it to focus on test resource demands, obtaining more parsimonious tests at equal levels of code coverage. |
|
Giovanni Grano, Fabio Palomba, Harald Gall, Lightweight Assessment of Test-Case Effectiveness using Source-Code-Quality Indicators, IEEE Transactions on Software Engineering, Vol. 47 (4), 2021. (Journal Article)
Test cases are crucial to help developers prevent the introduction of software faults. Unfortunately, not all tests are properly designed or can effectively capture faults in production code. Some measures have been defined to assess test-case effectiveness: the most relevant one is the mutation score, which highlights the quality of a test by generating so-called mutants, i.e., variations of the production code that make it faulty and that the test is supposed to identify. However, previous studies revealed that mutation analysis is extremely costly and hard to use in practice. The approaches proposed by researchers so far have not been able to provide practical gains in terms of mutation testing efficiency. This leaves the problem of efficiently assessing test-case effectiveness still open. In this paper, we investigate a novel, orthogonal, and lightweight methodology to assess test-case effectiveness: in particular, we study the feasibility of exploiting production and test-code-quality indicators to estimate the mutation score of a test case. We first select a set of 67 factors and study their relation with test-case effectiveness. Then, we devise a mutation score estimation model exploiting such factors and investigate its performance as well as its most relevant features. The key results of the study reveal that our estimation model, based only on static features, reaches 86% for both F-Measure and AUC-ROC. This means that we can estimate test-case effectiveness, using source-code-quality indicators, with high accuracy and without executing the tests. As a consequence, we can provide a practical approach that goes beyond the typical limitations of current mutation testing techniques. |
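As a point of reference for the metric being estimated, the following minimal sketch computes a mutation score the conventional way, as the fraction of generated mutants that a test kills. The `MutantResult` type and the sample data are hypothetical and are not taken from the paper; the paper's contribution is to estimate this value from static features without running such an analysis.

```python
from dataclasses import dataclass


@dataclass
class MutantResult:
    """Outcome of running a test against one mutant (hypothetical type)."""
    mutant_id: str
    killed: bool  # True if the test failed on the mutant, i.e. detected it


def mutation_score(results):
    """Return the fraction of mutants killed; 0.0 if there are no mutants."""
    if not results:
        return 0.0
    killed = sum(1 for r in results if r.killed)
    return killed / len(results)


# Illustrative data only: three of four mutants are detected by the test.
results = [
    MutantResult("m1", True),
    MutantResult("m2", False),
    MutantResult("m3", True),
    MutantResult("m4", True),
]
print(mutation_score(results))  # 0.75
```

The cost that the abstract refers to comes from producing the `killed` flags: every mutant has to be compiled and tested, which a static estimation model avoids.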
|
Luka Lapanashvili, Elemental UI: Portable and performant solution for modern GUI rendering, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis)
Interacting with software is commonplace in modern society. Music, video, images, and text are routinely consumed on smartphones, laptops, smart wearables, stationary workstations, and embedded devices. Simultaneously, the number of network-enabled devices per person rises, which increases the demand for the preferred media playback software or the social media application to be available on any device in such a way that a podcast paused on a laptop can be resumed on a smartphone.
However, it is rarely possible to use the same frontend code or even the same programming language to create an application that can be run on different devices. Application developers often have to adapt implementations for every single platform or even write bespoke implementations for individual operating systems. Naturally, such a fragmented codebase is difficult to maintain, and feature parity across all devices is hard to uphold. Having one unified solution, where one codebase can target a large set of devices, would be highly beneficial.
In this thesis, we discuss the difficulties of developing a cross-platform application and evaluate available solutions. Finally, we introduce Elemental UI, our cross-platform solution for modern GUI application development, and discuss its strengths and shortcomings compared to other established frameworks. |
|
Andrea Di Sorbo, Giovanni Grano, Corrado Aaron Visaggio, Sebastiano Panichella, Investigating the criticality of user‐reported issues through their relations with app rating, Journal of Software: Evolution and Process, Vol. 33 (3), 2021. (Journal Article)
App quality impacts user experience and satisfaction. As a consequence, both app ratings and user feedback reported in app reviews are directly influenced by the user-perceived app quality. Through an empirical study involving 210,517 reviews related to 317 Android apps, in this paper, we experiment with the combined usage of app rating and user review analysis (i) to investigate the most important factors influencing the perceived app quality and (ii) to focus on the topics discussed in user reviews that relate most to app rating. Besides, we investigate whether specific code quality metrics could be monitored to prevent the rise of negative user feedback (i.e., types of user review comments) connected with low ratings. Our study demonstrates that user comments reporting bugs are negatively correlated with the rating, while reviews reporting feature requests are not. Interestingly, depending on the app category, we observed that different kinds of issues have rather different relationships with the rating and the user-perceived quality of the app. In particular, we observe that for specific app categories (e.g., communication), some code quality factors have significant relationships with the rise of certain types of feedback, which, in turn, are negatively connected with app ratings. |
|
Timothy Zemp, CINDER: Find the matching project for your next CI Study, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis)
Continuous Integration (CI) is a software development practice introduced by the Agile movement with the aim of delivering reliable software releases quickly by regularly integrating changes to the software. The spread and success of CI has led to a spike in empirical software engineering research examining the benefits and the impact of this new practice. Implementing Continuous Integration is relatively simple because it only requires adding a configuration file to the repository and registering with a CI cloud provider. Unfortunately, because CI is so easy to adopt, the process is poorly implemented in many software repositories. This is a substantial risk that threatens the validity of CI-based studies unless care is taken in the selection of repositories. To overcome this risk, we present CInder, a tool that detects genuine CI configuration files. The tool works by using a random forest classifier trained on a labeled ground truth data set and various features describing the characteristics of configuration files. With CInder, we show that significant action within the pipeline and its regular adaptation is a strong indicator of the genuineness of a configuration file. By replicating a study, we show that the selection of projects has a significant impact on the results of CI-based studies. With CInder, we provide researchers with a tool to enhance the process of selecting applicable software repositories, consequently improving the quality and validity of their studies. |
|
Christoph Laaber, Stefan Würsten, Harald C Gall, Philipp Leitner, Dynamically reconfiguring software microbenchmarks: reducing execution time without sacrificing result quality, In: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, New York, NY, USA, 2020-12-08. (Conference or Workshop Paper published in Proceedings)
Executing software microbenchmarks, a form of small-scale performance tests predominantly used for libraries and frameworks, is a costly endeavor. Full benchmark suites take up to multiple hours or days to execute, rendering frequent checks, e.g., as part of continuous integration (CI), infeasible. However, altering benchmark configurations to reduce execution time without considering the impact on result quality can lead to benchmark results that are not representative of the software’s true performance.
We propose the first technique to dynamically stop software microbenchmark executions when their results are sufficiently stable. Our approach implements three statistical stoppage criteria and is capable of reducing Java Microbenchmark Harness (JMH) suite execution times by 48.4% to 86.0%. At the same time it retains the same result quality for 78.8% to 87.6% of the benchmarks, compared to executing the suite for the default duration.
The proposed approach does not require developers to manually craft custom benchmark configurations; instead, it provides automated mechanisms for dynamic reconfiguration. This makes dynamic reconfiguration highly effective and efficient, potentially paving the way to the inclusion of JMH microbenchmarks in CI. |
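The idea of stopping a benchmark once its results are sufficiently stable can be sketched as follows. This is only an illustration, assuming a coefficient-of-variation check over a sliding window; the abstract does not specify the three criteria, and the window size, threshold, and function names here are hypothetical. The sketch is written in Python rather than against the JMH API.

```python
import statistics


def is_stable(measurements, window=5, cv_threshold=0.01):
    """Return True when the last `window` measurements vary little.

    Stability is judged by the coefficient of variation (stdev / mean),
    an assumed criterion chosen for illustration.
    """
    if len(measurements) < window:
        return False
    recent = measurements[-window:]
    mean = statistics.mean(recent)
    if mean == 0:
        return False
    return statistics.stdev(recent) / mean <= cv_threshold


def run_until_stable(measure, max_iterations=50, window=5, cv_threshold=0.01):
    """Call `measure()` repeatedly and stop early once results are stable."""
    measurements = []
    for _ in range(max_iterations):
        measurements.append(measure())
        if is_stable(measurements, window, cv_threshold):
            break
    return measurements


# Simulated benchmark: slow warm-up iterations followed by steady values.
samples = iter([120.0, 110.0, 105.0, 100.5, 100.2, 100.1, 100.0, 100.1, 100.0, 100.1])
taken = run_until_stable(lambda: next(samples))
print(len(taken))  # 8
```

With these made-up samples the loop stops after 8 of the 10 available iterations, because the last five values differ by well under one percent of their mean.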
|
Carmine Vassallo, Sebastian Proksch, Anna Jancso, Harald C Gall, Massimiliano Di Penta, Configuration Smells in Continuous Delivery Pipelines: A Linter and a Six-Month Study on GitLab, In: ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, New York, United States, 2020-11-08. (Conference or Workshop Paper published in Proceedings)
An effective and efficient application of Continuous Integration (CI) and Delivery (CD) requires software projects to follow certain principles and good practices. Configuring such a CI/CD pipeline is challenging and error-prone. Therefore, automated linters have been proposed to detect errors in the pipeline. While existing linters identify syntactic errors, detect security vulnerabilities or misuse of the features provided by build servers, they do not support developers that want to prevent common misconfigurations of a CD pipeline that potentially violate CD principles (“CD smells”). To this end, we propose CD-Linter, a semantic linter that can automatically identify four different smells in pipeline configuration files. We have evaluated our approach through a large-scale and long-term study that consists of (i) monitoring 145 issues (opened in as many open-source projects) over a period of 6 months, (ii) manually validating the detection precision and recall on a representative sample of issues, and (iii) assessing the magnitude of the observed smells on 5,312 open-source projects on GitLab. Our results show that CD smells are accepted and fixed by most of the developers and our linter achieves a precision of 87% and a recall of 94%. Those smells can be frequently observed in the wild, as 31% of projects with long configurations are affected by at least one smell. |
|
Valerio Terragni, Pasquale Salza, Mauro Pezzè, Measuring Software Testability Modulo Test Quality, In: IEEE/ACM International Conference on Program Comprehension (ICPC), Seoul, South Korea, 2020. (Conference or Workshop Paper published in Proceedings)
|
|
Valerio Terragni, Pasquale Salza, Filomena Ferrucci, A Container-Based Infrastructure for Fuzzy-Driven Root Causing of Flaky Tests, In: ACM/IEEE International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), Seoul, South Korea, 2020. (Conference or Workshop Paper published in Proceedings)
|
|
Carmine Vassallo, Principle-driven continuous integration: simplifying failure discovery and raising anti-pattern awareness, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Dissertation)
Continuous Integration (CI) is a software development practice that enables developers to build software more reliably and quickly. Most organizations have started adopting CI; however, only a few of them achieve the expected benefits. The reason is that living up to the recommended practices (called principles) of CI is not easy and developers tend to follow anti-patterns, which are ineffective solutions to recurrent problems. Anti-patterns violate principles and lower the effectiveness of CI. In this dissertation, we characterize the problem of anti-patterns to implement solutions that help developers follow principles. We start with classifying the anti-patterns encountered by developers in practice and identifying their four root causes, which are (i) the poor knowledge of the prerequisites for adopting CI, (ii) the difficulty of inspecting build failure logs, (iii) the presence of bad configurations, and (iv) the wrong usage of a CI process. While only better coaching in CI can efficiently remove the first cause, we implement several approaches to address the other causes. To improve the understandability of build failure logs, we develop Bart, a tool that produces summaries for the most common build failure types. To identify anti-patterns caused by configuration smells that developers should remove, we propose CD-Linter, a semantic linter for CI/CD configuration files. We implement CI-Odor, an automated reporting tool that leverages information from repository and build history, to monitor the wrong adoption of CI over time. The results of multiple empirical studies conducted with professional developers show that the proposed approaches are effective at identifying and removing the aforementioned causes of anti-patterns and, consequently, at enforcing a principle-driven continuous integration practice. |
|
Giovanni Grano, Cristian De Iaco, Fabio Palomba, Harald Gall, Pizza versus Pinsa: On the Perception and Measurability of Unit Test Code Quality, In: IEEE International Conference on Software Maintenance and Evolution, ICSME 2020, IEEE, 2020-09-28. (Conference or Workshop Paper published in Proceedings)
Test cases are an essential asset to evaluate software quality. The research community has provided various alternatives to help developers assess the quality of tests, like code or mutation coverage. Despite the effort spent so far, however, little is known about how practitioners perceive unit test code quality and whether the existing metrics reflect their perception. This paper aims at addressing this knowledge gap. We first conduct semi-structured interviews and surveys with practitioners to establish a taxonomy of relevant factors for unit test quality and collect a dataset of tests rated by developers based on their perceived quality. Then, we devise a statistical model to measure how the metrics available in the literature reflect the perceived quality of test cases. The findings of our study show that readability and maintainability are the key aspects for developers to diagnose the outcome of test cases and drive debugging activities. On the contrary, code coverage metrics are necessary but not sufficient to evaluate the capability of tests. Finally, we discover that available metrics are effective in characterizing poor-quality tests, while limited in distinguishing high-quality ones. |
|
Marios Visos, Retail Product Classifier: A mobile app for retail packaged product image classification and dataset management, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
The performance of deep convolutional neural networks heavily depends on the quality as well as the quantity of the data used for training. This thesis focuses on the design and implementation of a software system to ease the traditional, tedious process of manually generating and managing meaningful labelled datasets. The software system includes a mobile client, an application server, and two computer vision components used for object detection. The mobile application gives users the power to generate and manage labelled image datasets, including useful metadata, and persist them in a database via the application server. In addition, users can easily trigger actions such as training, validating, and testing, as well as performing predictions on new unseen images. To evaluate the quality of the devised software system, we conducted a usability study in collaboration with a Swiss company named Valora at one of their supermarket stores called "avec" located at the Zurich main station. The overall results of our usability study were positive, and the feedback acquired from our 14 participants’ answers to our questionnaire indicates that our mobile application is valuable. As a result of our study, a publicly available dataset was generated, including a total of 88 product labels and 2630 images. Finally, we evaluated the generated datasets by using them as training data in two models, one for object localization and another for object detection. The results of testing the object detection model with our datasets suggest that our application is a sufficient replacement for the manual process of creating and annotating datasets used in an image classification task. |
|
Christoph Schwizer, Transfer Learning for Code Search: How Pre-training Improves Deep Learning on Source Code, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
The Transformer architecture and transfer learning have marked a quantum leap in natural language processing (NLP), improving on the state of the art across a range of NLP tasks. This thesis examines how these advancements can be applied to and improve code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and evaluate it on pairs of StackOverflow question titles and code answers. Our results show that the pre-trained models consistently outperform the models that were not pre-trained. In cases where the model was pre-trained on natural language and source code data, it also outperforms our Elasticsearch baseline. Furthermore, transfer learning is particularly effective in cases where a lot of pre-training data is available and fine-tuning data is limited.
We demonstrate that NLP models based on the Transformer architecture can be directly applied to source code analysis tasks, such as code search. With the development of Transformer models that are designed more specifically for dealing with source code data, we believe the results on source code analysis tasks can be further improved. |
|
Janik Lüchinger, A Cloud Framework for Polyglot Parallel Genetic Algorithms, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
Genetic Algorithms are a potent tool when computing an exact solution for a problem is too expensive, but a near-optimal approximation can be sufficient instead. Most Genetic Algorithms are sequential programs that are prone to scalability issues. Increasing their performance is possible by executing resource-intensive steps in parallel, therefore reducing the required computation time. Some previous proposals for cloud-based Genetic Algorithm distribution already provided frameworks exploiting well-known parallelization techniques. We devised a new, more flexible framework. We propose PGAcloud, a cloud framework capable of including and executing polyglot, i.e., multi-language, Genetic Algorithms and deploying them to a prepared cloud environment. Developers of Genetic Algorithms can include their custom implementations into a wrapping software container, effectively deploying a local algorithm to the cloud, without worrying about the underlying implementation details of the framework. Deploying a Genetic Algorithm to the cloud for parallelization makes it a Parallel Genetic Algorithm. Allowing developers to include any code into the framework directly makes our proposed framework very flexible. PGAcloud employs an easily scalable architecture and takes care of cloud orchestration, load balancing, provisioning, and deployment of the required software containers. After any adjustments to the provided Parallel Genetic Algorithm configuration template, the user simply needs to execute the desired commands from the local client’s command-line interface and point out the configuration file to be used. By basing the main capabilities of any Parallel Genetic Algorithm computation on a user-defined configuration file, we keep the possibilities for future additions as versatile as possible. |
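The general idea of running the resource-intensive step of a Genetic Algorithm in parallel can be sketched on a single machine as follows. This is not PGAcloud's implementation or API: the toy OneMax fitness function, the operator choices, and all names are assumptions made for illustration, and a local thread pool merely stands in for the containers a cloud framework would orchestrate.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(individual):
    """Toy OneMax objective (assumed for illustration): count the 1-bits."""
    return sum(individual)


def evaluate_population(population, workers=4):
    """Evaluate all individuals concurrently; result order matches the input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))


def next_generation(population, scores, rng, mutation_rate=0.05):
    """Binary tournament selection, one-point crossover, bit-flip mutation."""
    def pick():
        a = rng.randrange(len(population))
        b = rng.randrange(len(population))
        return population[a] if scores[a] >= scores[b] else population[b]

    children = []
    while len(children) < len(population):
        parent1, parent2 = pick(), pick()
        cut = rng.randrange(1, len(parent1))
        child = parent1[:cut] + parent2[cut:]
        child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
        children.append(child)
    return children


rng = random.Random(42)
population = [[rng.randint(0, 1) for _ in range(20)] for _ in range(30)]
for _ in range(25):
    scores = evaluate_population(population)  # the parallelized step
    population = next_generation(population, scores, rng)
best = max(evaluate_population(population))
print(best)
```

Only the fitness evaluation is distributed here, since it is typically the expensive step; selection, crossover, and mutation stay sequential. In CPython, threads do not speed up CPU-bound fitness functions, so a real deployment would use separate processes or, as in the thesis, separate containers.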
|
Christian Birchler, Identifying flaky tests by classifiers: A performance analysis of various machine learning models, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
Testing is a crucial part of software development. Most larger software projects today integrate testing into a continuous integration (CI) pipeline, where a failing test prevents the deployment of the software. Flaky tests, which may pass and fail non-deterministically without any change to the code or the environment, are therefore a problem: developers are likely to spend time looking for a bug even though the code under test is not defective. Previous studies focused mainly on the root causes of flakiness, but little research has been done on how to mitigate flaky tests. In this thesis, we investigate the impact of different memory-related JVM metrics on the predictability of flaky tests. For this purpose, we collected JVM metrics from 82 open-source Maven projects that already had recorded flaky tests. To take the measurements, we developed a toolchain script that injects the necessary code into the test code so that JVM metrics can be collected during test execution. The toolchain ran the test suites of each project ten times to identify flaky tests, i.e., tests with differing outcomes. It ran on machines with different RAM sizes to see whether this makes a difference in the data. We performed a PCA and produced a biplot to identify cluster structure in a lower-dimensional space, and applied various parametric and non-parametric classification models to the data. The results show that flaky tests are predictable to a certain degree by JVM metrics and that the RAM size also has an impact on the predictability of flakiness. These insights allow the development of new tools to handle flaky tests and motivate further research.
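The rerun-based labeling step described above, where a test counts as flaky if repeated runs disagree, can be sketched as follows. The data layout, test names, and function name are hypothetical and do not reflect the thesis toolchain, which additionally collects JVM metrics during each run.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the names of tests whose outcome differs across runs.

    `runs` is a list with one dict per suite execution, mapping a test
    name to its outcome ("pass" or "fail"); this layout is an assumption.
    """
    outcomes = defaultdict(set)
    for run in runs:
        for test_name, outcome in run.items():
            outcomes[test_name].add(outcome)
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)


# Three illustrative executions of the same suite on unchanged code.
runs = [
    {"testLogin": "pass", "testUpload": "pass", "testCache": "fail"},
    {"testLogin": "pass", "testUpload": "fail", "testCache": "fail"},
    {"testLogin": "pass", "testUpload": "pass", "testCache": "fail"},
]
print(find_flaky_tests(runs))  # ['testUpload']
```

Note that `testCache` fails consistently and is therefore not reported: a test that always fails is broken rather than flaky. A fixed number of reruns can also miss tests that fail only rarely.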
|
|
Emirald Mateli, Automatically repairing environmental build failures, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
Continuous Integration is a widely-used software engineering practice in both industry and open-source projects to automate compilation, testing, and quality assurance tasks. Recent studies reveal that troubleshooting build failures is the main barrier that developers encounter when adopting CI. Because of their complexity, developers usually spend at least one hour per day in fixing build failures. While the majority of build failures are caused by expected human mistakes such as the wrong implementation of a method, a non-negligible part of failures (33%) are caused by environmental factors such as flakiness of the build infrastructure. In this thesis, we want to understand how developers fix environmental failures and the extent to which they can be automatically repaired. We inspect 380 failed builds belonging to 42 different environmental failure types from 97 open-source projects written in Java and Ruby and built on Travis CI. Based on the analysis of the resolution patterns of these failures, we devise and implement an approach for automatically repairing 10 environmental build failure types. To show the applicability of our approach, we run our tool against 67 environmental build failures from popular GitHub projects achieving an overall success rate of 55.22%. To assess the usefulness of our automatic repair, we successfully fix 37 builds from GitHub projects and open issues on these projects where we propose to accept the generated patches for those failures. 66.6% agree with the proposed fixes and are willing to use our tool. |
|
Timothée Wildhaber, Visualizing Business Landscapes Using Maps: Acquiring, maintaining, processing, clustering and visualizing business data from public sources, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
The Swiss business landscape is vast and diverse, making it difficult to quickly gain a high-level overview of companies and industries in Switzerland. This thesis investigates how publicly available resources can be utilized to facilitate such a bird's-eye view. Different sources were considered and multiple continuous data-scraping applications were created. To cluster the data, for example according to legal forms and business sectors, different methods such as machine learning and keyword mapping were applied with varying degrees of success. In working with the data, numerous deficiencies were identified in the sources publicly available, such as incomplete, outdated, and inaccurate records. Nonetheless, via extensive data cleaning, insights have been obtained and visualized to create an overview of the Swiss business landscape, also giving the possibility to find similar businesses given a search query. It is concluded that, while the developed visualizations provide a broad overview of Swiss businesses, a cleaner data set would have given more space for differentiated clustering as well as allowing for more creative visualizations. |
|
Philipp Mockenhaupt, Generating Code Documentation from Context Information Using Deep Learning: A StackOverflow Case Study, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
The Transformer model structure is state-of-the-art in sequence generation, and BERT-like (Bidirectional Encoder Representations from Transformers) general language models are among the best-performing language models for general natural language processing tasks. We investigate whether these state-of-the-art deep learning model architectures can be used to generate code documentation from context information. We do this by defining StackOverflow content as our context information case study and we create the code documentation oracle from documentation that includes a link to StackOverflow found on GitHub. To understand if the structure of the context information affects the results we use two datasets: one is a simple concatenation of different StackOverflow page parts, whereas the other, in addition, indicates the different parts by a keyword. We compare different deep learning approaches and our results indicate that the state-of-the-art in deep learning in connection with natural language processing can be used to generate code documentation from context information, as we outperform the model structure created for the general natural language processing task. The best-performing model is a Transformer with a BERT encoder additionally trained on the StackOverflow data and equipped with additional vocabulary. This model structure outperforms our baseline model Transformer with classical BERT encoder, producing a 14% higher BLEU (Bilingual Evaluation Understudy) score, 8% higher accuracy and a 1.4% decrease in the word mover distance. An improved performance of the raw Transformer structure in connection with a more structured context information indicates the Transformer can gather context information in a flexible way by focusing on different context information parts based on keyword indication. |
|
Atif Ghulam Nabi, Investigating Bug Prediction and Static Analysis tools in the Cloud Applications: An empirical investigation of bug prediction and static analysis tools in the cloud applications, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
|
|
Adelina Ciurumelea, Sebastian Proksch, Harald C Gall, Suggesting Comment Completions for Python using Neural Language Models, In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, London, ON, Canada, 2020-03-18. (Conference or Workshop Paper published in Proceedings)
|
|