Contributions published at Software Evolution and Architecture Lab (Harald Gall)
Shubhankar Joshi, enseMbLer: Designing a Scalable Architecture for Ensemble Machine Learning & Collaboration, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis) The past few decades have seen a huge rise in the amount of data generated online and a significant boom in the adoption of big data analytics and machine learning techniques, with the aim of solving complex problems and driving innovation further. However, this is easier said than done, as big data presents significant challenges. This thesis explores the intersection of big data and machine learning, examining the challenges and reviewing current state-of-the-art techniques. We further design and develop our own cloud-ready scalable architecture, enseMbLer, building on the principles of data parallelism and massively parallel ensemble learning to tackle the big data problem. We demonstrate the capability of our solution by performing an empirical analysis. To this end, we augment a popular public dataset using WGAN-GP, a generative adversarial network. We then develop and train standard models and run a series of experiments on Google Cloud Platform, comparing our results to those of others. We successfully demonstrate the effectiveness of REST-based HTTP infrastructure for handling distributed machine learning without significant overhead and further provide evidence of the performance gains of ensemble-based techniques. Finally, we contribute a solution for real-time stream processing and machine learning by proposing a lambda architecture built on our solution.
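The data-parallel ensemble idea at the heart of this design can be sketched in a few lines: shard the training data, fit one model per shard in parallel, and combine predictions by majority vote. This is a hypothetical illustration, not code from the thesis; the trivial "majority-label" learner merely stands in for real models.

```python
# Illustrative sketch of data-parallel ensemble learning (not the thesis code).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_on_shard(shard):
    """Stand-in for real training: 'learn' the majority label of the shard."""
    majority = Counter(label for _, label in shard).most_common(1)[0][0]
    return lambda x: majority  # a trivial constant 'model'

def fit_ensemble(dataset, n_shards):
    """Split the data into shards and train one model per shard in parallel."""
    shards = [dataset[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train_on_shard, shards))

def ensemble_predict(models, x):
    """Combine shard models by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

In a real deployment each shard model would be trained on a separate worker (in the thesis, reached over REST-based HTTP), but the vote-combination logic stays the same.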
Hoàng Ben Lê Giang, Syntax Highlighting Web Service with Continuous Fine-tuning; A Usage-Driven Web Service, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Bachelor's Thesis) Syntax Highlighting (SH) plays a substantial role in the daily lives of software developers and can be found virtually everywhere code is developed and shared. It enhances productivity by assigning different colors to text, not only conveying information about the features and grammatical structure of a language but also increasing the readability of code. With the goal of providing a smart and user-friendly SH solution to the public for the mainstream programming languages Java, Kotlin and Python, we implement a web service that is quick to set up, easily accessible and requires no manual maintenance, since it autonomously and continuously improves with the number of submitted requests by incorporating a fine-tuning logic for deep learning models. We show that our SH solution delivers near-instantaneous response times and is able to continuously achieve decent to highly accurate SH results by learning from user requests.
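The usage-driven improvement loop described above can be caricatured as follows: each submitted request adds evidence about a token's lexical class, and the highlighter's decisions shift as that evidence accumulates. This is an illustrative sketch only; the actual service fine-tunes deep learning models rather than counting tokens, and all names here are invented.

```python
# Toy analogue of "continuous fine-tuning from user requests" via counting.
from collections import defaultdict

class UsageDrivenHighlighter:
    def __init__(self, seed_keywords):
        # counts[token][lexical_class] -> number of observations
        self.counts = defaultdict(lambda: defaultdict(int))
        for token in seed_keywords:
            self.counts[token]["keyword"] += 1

    def observe(self, token, lexical_class):
        """Incorporate one observation from a user-submitted snippet."""
        self.counts[token][lexical_class] += 1

    def classify(self, token):
        """Pick the most frequently observed class; default to identifier."""
        if token not in self.counts:
            return "identifier"
        classes = self.counts[token]
        return max(classes, key=classes.get)
```

The point of the sketch is the feedback loop: classification quality improves monotonically with the number of submitted requests, mirroring the "no manual maintenance" property claimed above.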
Remus Nichiteanu, Finding common use patterns for a React web-application; A user tracking library, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Bachelor's Thesis) Graphical User Interfaces (GUIs) are an integral part of any web-application and the first feature users interact with when using it. Thus, it is important for their behavior to be error-free, which can be achieved through rigorous testing. Most tests are created by scripting scenarios in order to verify the intended functionality of a GUI. Many state-of-the-art approaches like Model-Based Testing create testing scenarios through a model derived from the GUI. Since manually creating a model for an application is time-consuming and requires a lot of effort, models are abstracted, and thus tests derived from them do not reflect actual user behavior in a production environment. Traditional Record-Replay Testing like Selenium, a testing strategy based on recording actions and then replaying them to achieve, for example, regression testing, uses manually recorded scenarios as test cases, which also do not encompass real end-user behavior, due to being highly scripted. Synthetic End-User Testing is a novel strategy in which testing scenarios are created from real end-user data. It records and synthesizes end-user data into agents which, compared to traditional testing approaches that exhaustively analyze all possibilities, only consider action sequences that are likely to occur in production and thus test software in a smaller search space. Because the search space is restricted to scenarios real users might go through when using the application, improvements of the application are based on real use cases encountered. This work presents a prototype for collecting user data of React web-applications with Synthetic End-User Testing in mind. It is a React library that records each action a user performs on the GUI of the application and creates a sequence representing the path the user took through the GUI. Each object representing a user action contains data about the state of the GUI before and after the action was performed, thus giving insight into how the GUI of the application adjusts after each user interaction. The presented tool can be used as a React-specific implementation of the recording step in Synthetic End-User Testing, collecting as much information as possible about the GUI state as a valuable addition to the process. The presented library was applied to an open-source project as validation. The findings suggest that the proposed technique is a valid approach for recording action data of a React web-application.
Qasim Warraich, CLI-Tutor: Can interactive learning make the command line more approachable?, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis) Despite the arguably dated appearance, difficult learning curve and practical non-existence in the modern personal computing space, Command Line Interfaces (CLIs) have more than stood the test of time in the software development world. There are a multitude of extremely popular tools and applications that primarily focus on the command line as an interaction medium. Some examples include version control software like git, compilers and interpreters for programming languages, package managers and various core utilities that are popular in areas such as software development, scripting and system administration. Command line interfaces are also utilised in areas outside of software development, for example the infamous Bloomberg Terminal in the financial sector, and in general computing applications such as email (e.g. mutt, neomutt) and text editing (e.g. Vim, Neovim, WordStar). As mentioned before, the use of the command line as an interaction paradigm has effectively disappeared from mainstream personal computer usage. This reality contributes greatly to the intimidation factor and learning difficulty for those interested in getting into software engineering or system administration. This unfamiliarity, paired with the inevitability of CLI usage in the development space, highlights a need to make the command line more accessible to new users for whom text-based interaction with their computer is an alien concept. In recent years, interactive learning tools utilising features such as sandboxed environments have been gaining in popularity and have the potential to be a suitable medium for learning command line basics through actual usage, examples and practice. In this work, we have created just such an interactive tutoring tool tailored for the command line.
CLI-Tutor is a forgiving CLI application that aims to teach topics such as shell basics and Unix-like core-utility usage through guided lessons with interactive examples and feedback.
Alex Wolf, Assisted interactive programming: Generating context-aware single-line code from Natural Language, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis) The growing significance of programming languages is manifested by the change in educational curricula, which include programming lectures as early as primary school. This growing significance also demonstrates that the means of acquiring programming skills need to improve. We propose an interactive assistant that helps novice programmers acquire knowledge through natural language during their programming tasks: learning by doing, with an assistant that supports the user by providing source code in case of difficulties or missing know-how. Our approach targets single-line code generation, considering the natural-language intent and an extensive context to provide accurate and relevant recommendations in the form of a single line of source code. It is based on the idea that learning a programming language requires solving programming exercises combined with a fast feedback loop. Our approach aims to give the learner more opportunities to learn and understand the programming language by assisting them with minimal source code that helps them continue with their task, generating only single lines of source code instead of full functions. We implemented two models intended to contextualize single-line code generation using a custom context workflow. Our evaluations show that our approach is able to learn strong context representations from our custom context workflow, as well as an option to improve context compression. In summary, we contribute a trained context-sensitive model that takes natural-language input and context and predicts the next line; a custom workflow to deal with contexts of variable size; and a new Java-based dataset of over 200’000 samples tailored to our task of generating single-line Java code. Each sample contains context information, natural-language intent, and the target line, allowing us to contextualize our model. We compare our work with other state-of-the-art approaches.
Gabriela Eugenia López Magaña, Overcoming the gap between structured dependencies and change coupling, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis) In this thesis, we conduct an empirical analysis of the history of object-oriented complex cyber-physical systems to discover contextual metrics computed on top of software changes, based on call-graph analysis and the evolution of software entities (e.g., call-graph changes of a given calling function at a specific commit, release, or time frame). These metrics serve as proxies to measure how high (or low) the change coupling of subsequent software changes will be. Additionally, they should reveal whether the coupled changes happened within the call graph or outside of it. Specifically, we conjecture that such metrics are valuable indicators of complex types of changes that directly impact the maintainability of the system code. To support our investigation and future research, we developed an automatic approach to compute the designed metrics. To validate our research questions, we carry out a case study involving four open-source projects in the domains of home automation, health devices, small-robot control, and real-time robot vision. The results of this study highlight how these metrics, accompanied by a user-friendly tool, provide practitioners with quantitative views of dependencies and evolution based on call-graph analysis. As future work, we plan to quantitatively and qualitatively assess the change-proneness exposed by our metrics in further projects and organizations from different industrial domains. As motivation for this thesis, we pose three fundamental research questions: RQ1: To what extent is it possible to build evolutionary call graphs based on software version management information? RQ2: What is the relation between structural coupling (on a function level) and the evolution of the call graphs? RQ3: Is there a relation between conceptual (non-structural) coupling and the evolution of the call graphs?
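One metric family this line of work builds on, change coupling, can be illustrated with a minimal computation over commit history: count how often two entities change in the same commit, and normalize by how often each entity changes on its own. The commit data and entity names below are invented for illustration and do not reproduce the thesis's call-graph-aware metrics.

```python
# Minimal change-coupling computation over a list of commits,
# where each commit is the set of entities (e.g. functions) it changed.
from itertools import combinations
from collections import Counter

def change_coupling(commits):
    """Return {(a, b): support}, counting co-changes per unordered pair."""
    pairs = Counter()
    for changed in commits:
        for a, b in combinations(sorted(changed), 2):
            pairs[(a, b)] += 1
    return pairs

def coupling_strength(pairs, changes_per_entity, a, b):
    """Confidence that a change to `a` is accompanied by a change to `b`."""
    key = tuple(sorted((a, b)))
    return pairs.get(key, 0) / changes_per_entity[a]
```

The thesis's contribution is precisely to contextualize such counts with call-graph information (did the co-change happen inside or outside the call graph?), which this sketch deliberately leaves out.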
Olajoke Oladipo, Understanding User Reviews, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis) A product development team must continually upgrade their applications by understanding their users' experience and reviews, addressing bug reports and introducing new functionalities. Researchers have devised numerous methods to assist developers in retrieving relevant information from user reviews, including automatic extraction, categorization and crowd-sourcing. Some studies have attempted to cluster user reviews by categorizing each review as a bug report, feature request, enhancement, and more. These strategies use machine learning (ML) techniques in conjunction with Natural Language Processing (NLP) to extract meaningful information from the reviews. A significant gap remains in effectively understanding user reviews and their intentions. Transformer-based models (such as BERT, a pre-trained bidirectional transformer model developed by Google) effectively understand user reviews and aid in the extraction of valuable insights. This thesis proposes three unsupervised clustering models combined with BERT to capture in-depth user demands. The thesis aims to determine whether a transformer-based model, particularly BERT, can improve the clustering of user reviews in the absence of a priori information on the number of clusters. Additionally, we leveraged Latent Dirichlet Allocation (LDA) and a text summarization model to derive further insights from these clusters.
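Clustering without a priori knowledge of the number of clusters, the question posed above, can be sketched with a simple threshold-based scheme over embedding vectors: a review joins the first cluster whose centroid is similar enough, otherwise it opens a new one, so the cluster count emerges from the data. The short vectors below stand in for BERT sentence embeddings; the thesis's actual three models are not reproduced here.

```python
# Threshold-based clustering: the number of clusters is not fixed up front.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def threshold_cluster(embeddings, threshold=0.9):
    """Greedily assign each vector to a cluster whose centroid is within
    `threshold` cosine similarity, else start a new cluster.
    Returns clusters as lists of indices into `embeddings`."""
    clusters = []
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            centroid = [sum(embeddings[j][d] for j in cluster) / len(cluster)
                        for d in range(len(vec))]
            if cosine(vec, centroid) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

With real review embeddings, each resulting cluster could then be fed to LDA or a summarizer, as the thesis does, to label what the cluster is about.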
Nadine Muller, Interactive Command History Visualization for the REPL; Proof-of-concept implementation around the Python interactive mode, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) REPLs play an important part in the programming world. They have many useful features, but are lacking in user-friendliness. This thesis presents the design and implementation of a web application built around a Python console, aimed at improving the user experience of the console with additional features. The main addition is an interactive visualization of the command history, helping users keep an overview of what has been programmed already, letting them restore previous program states to try something else, and generating a script from the command history that can be used in other environments.
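The three history features named above (overview, state restoration, script export) can be sketched as a small data structure; the web application and its integration with the Python console are omitted, and all names here are illustrative rather than the thesis's API.

```python
# Minimal command-history model: record, rewind, export as a script.
class CommandHistory:
    def __init__(self):
        self.commands = []

    def record(self, command):
        """Append one console input to the history."""
        self.commands.append(command)

    def restore(self, index):
        """Drop everything after `index`, as if rewinding the session
        to a previous program state."""
        self.commands = self.commands[: index + 1]

    def to_script(self):
        """Export the retained history as a runnable script."""
        return "\n".join(self.commands) + "\n"
```

In the actual tool the history is additionally visualized interactively; this sketch only captures the underlying linear-history bookkeeping.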
Michael Brülisauer, GitHub Repository Search Bot; Design of a GitHub Repository Search Chatbot, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) Searching for code is a daily task for every software engineer. With the growing amount of data available on the internet, software engineers are actively researching new advanced techniques to find publicly available code for reuse. This thesis contributes to this active research by developing a new conversation-based approach for software engineers to find publicly available software. With the development of a conversational agent (chatbot), this thesis describes the design and implementation of a new approach built on the growing demand for conversational agents that fulfil a specific task. The chatbot is able to return the repository that best matches a project description provided by a user through a natural-language conversation. It is capable of asking the user questions about the repository to search for and remembers the user's past answers. The chatbot offers an easy-to-use interface for software engineers to retrieve a repository with certain specifications. The implementation presented in this thesis can be extended in future work by increasing the chatbot's knowledge domain.
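The question-asking and answer-remembering behaviour described above amounts to slot filling, which can be sketched as follows. The slot names and the final query format are invented for illustration and are not taken from the thesis.

```python
# Slot-filling dialog state: ask for missing attributes, remember answers,
# and build a search query once every slot is filled.
class RepoSearchDialog:
    SLOTS = ("language", "domain", "license")  # hypothetical slot names

    def __init__(self):
        self.answers = {}

    def next_question(self):
        """Return the next question to ask, or None when all slots are filled."""
        for slot in self.SLOTS:
            if slot not in self.answers:
                return f"Which {slot} should the repository have?"
        return None

    def answer(self, slot, value):
        """Remember a user's answer for later turns."""
        self.answers[slot] = value

    def build_query(self):
        """Turn the remembered answers into a repository search query."""
        return " ".join(self.answers[s] for s in self.SLOTS)
```

The memory across turns (the `answers` dict) is what distinguishes this conversational flow from a one-shot keyword search.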
Liburn Gjonbalaj, Command Line Interfaces – Loved or Loathed?, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) Graphical user interfaces (GUIs) have surpassed command line interfaces (CLIs) as the most widely used interface among software developers and are also recommended by the scientific literature as a less error-prone, easier-to-use alternative to the CLI. However, studies show that a certain percentage of software developers still choose to use the CLI on a daily basis. The goal of our work is to investigate the reasons that lead these developers to choose the CLI over the GUI, the difficulties developers face when learning the CLI, and how to overcome these difficulties. We collected responses and opinions from 165 software developers with the help of an online survey, and the experiences and thoughts of software developers from 11 one-to-one interviews. Most of our respondents are CLI users with over 10 years of experience using the CLI. Our results show that automation, scripting, flexibility and parallelizing work are all areas in which the CLI is superior to the GUI. With man pages/documentation, discoverability and remembering commands being the biggest difficulties learners face, we give recommendations such as online courses, cheat sheets, newer help pages such as tldr, and alternative shells such as the fish shell to overcome these hurdles and flatten the steep learning curve that comes with using the CLI.
Eleonora Pura, Designing a Chatbot; An investigation for the University of Zurich, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) This thesis presents a use case created in collaboration with the Faculty of Informatics (IfI) of the University of Zurich. Its student office interacts with students on a daily basis through emails, phone calls, and on-site meetings. However, many of these interactions are repetitive or concern topics that do not fall under the office's responsibility. These aspects could be improved by creating a chatbot, which would allow the office to handle this type of interaction in a more structured and faster way. Chatbots are systems capable of simulating a natural-language conversation with a human. Their responses can be generated either on the basis of predefined rules or with the help of machine learning approaches. Chatbots can be useful in different areas such as entertainment, information retrieval, and e-commerce. The aim of this thesis is, on the one hand, to find potential areas of interest in which a chatbot could help improve the efficiency of the interactions between the student office and the students. Most importantly, it serves as a design exercise to see how a chatbot could cover these areas. After conducting an interview and an in-depth analysis of it, it was possible to better understand the main activities of the student office. This allowed the application of a software engineering process, the first step of which was the definition of the requirements, and the second the conception of a first ad-hoc design for a chatbot. Finally, some suggestions for practical solutions are presented. The final result is a complete design of a rule-based chatbot, able to answer questions about the writing process of Bachelor's and Master's theses and other important topics.
Giovanni Grano, Christoph Laaber, Annibale Panichella, Sebastiano Panichella, Testing with Fewer Resources: An Adaptive Approach to Performance-Aware Test Case Generation, IEEE Transactions on Software Engineering, Vol. 47 (11), 2021. (Journal Article) Automated test case generation is an effective technique to yield high-coverage test suites. While the majority of research effort has been devoted to satisfying coverage criteria, a recent trend has emerged towards optimizing other, non-coverage aspects. In this regard, runtime and memory usage are two essential dimensions: less expensive tests reduce the resource demands of the generation process and of later regression testing phases. This study shows that performance-aware test case generation requires solving two main challenges: providing a good approximation of resource usage with minimal overhead, and avoiding detrimental effects on both final coverage and fault detection effectiveness. To tackle these challenges, we conceived a set of performance proxies (inspired by previous work on performance testing) that provide a reasonable estimation of the test execution costs (i.e., runtime and memory usage). On this basis, we propose an adaptive strategy, called aDynaMOSA, which leverages these proxies by extending DynaMOSA, a state-of-the-art evolutionary algorithm for unit testing. Our empirical study, involving 110 non-trivial Java classes, reveals that our adaptive approach generates test suites with statistically significant improvements in runtime (-25%) and heap memory consumption (-15%) compared to DynaMOSA. Additionally, aDynaMOSA has comparable results to DynaMOSA over seven different coverage criteria and similar fault detection effectiveness. Our empirical investigation also highlights that the usage of performance proxies alone (i.e., without the adaptiveness) is not sufficient to generate more performant test cases without compromising the overall coverage.
Seungwoo Han, CodexBot; A conversational agent suggesting code examples, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis) Finding the right code example quickly and easily can help developers perform better. However, current ways of searching for code examples, such as web search, have several drawbacks. First, the massive amount of information about source code on the web requires a considerable amount of time to preprocess before use. Furthermore, repeated search activity hinders performance, since it breaks the flow of work. To resolve these downsides, this Master's thesis proposes a chatbot, a conversational agent, for finding the right code example. By using a chatbot rather than searching the web, developers gain several advantages. First of all, they can directly check the relevant code example without any extraneous information, which helps them stay focused on their work. Second, it removes the time required to preprocess the information found via web search. Lastly, the communication component of the chatbot can guide developers to build a good query step by step, since it is not easy to formulate a good query in one attempt; thus, they end up with a more relevant code example. Our chatbot's implementation is based on three major components: Elasticsearch, FastAPI, and Dialogflow. Elasticsearch is a very powerful search engine; FastAPI is a modern way of creating APIs in Python; Dialogflow is a Google tool for building chatbot frames, with very easy integration into popular chat applications. Although the results are preliminary, the performance of our chatbot is promising in retrieving correct search output in a single-round conversation. To evaluate this, we create mock queries from the docstrings in our dataset. The results show over 60% search accuracy for the tf-idf-based mock queries and over 30% for mock queries created from randomly selected words. Last but not least, we also propose a possible experiment design to test how well our chatbot can guide developers to the right code example in multi-round conversations.
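The tf-idf-based mock-query construction used in the evaluation above can be approximated as follows: score each docstring term by term frequency times inverse document frequency and keep the top-scoring terms as the query. Tokenization and scoring are deliberately minimal, and the actual Elasticsearch pipeline is not reproduced; ties are broken alphabetically for determinism.

```python
# Build a mock query from a docstring by keeping its highest tf-idf terms.
import math
from collections import Counter

def tfidf_query(docstrings, index, top_k=3):
    """Return the top_k highest tf-idf terms of docstrings[index],
    scored against the whole docstring collection."""
    docs = [d.lower().split() for d in docstrings]
    n = len(docs)

    def idf(term):
        df = sum(1 for d in docs if term in d)  # document frequency
        return math.log(n / df)

    tf = Counter(docs[index])
    scored = {t: c * idf(t) for t, c in tf.items()}
    # Sort alphabetically first so equal scores break ties deterministically.
    return sorted(sorted(scored), key=scored.get, reverse=True)[:top_k]
```

Terms that occur in many docstrings (low idf) drop out of the query, which is why such mock queries resemble the distinctive words a developer might actually type.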
Andy Aidoo, Dependency Usage Analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) Software artifacts are frequently reused to develop software efficiently. Build tools greatly simplify the usage of dependencies. Previous research mainly focused on the providing side of dependencies, in an effort to reduce the size of the dependencies themselves. State-of-the-art build tools currently still package more dependencies than necessary for a build when unused dependencies are declared. We aim to pave the way for further research that focuses on reducing redundantly declared dependencies in build files of Java open-source projects. After determining a data set of 21 Java open-source projects using Maven or Gradle as build tools, we separate the used from the unused imports and assign the used imports to their corresponding dependency declared in the build files. We attribute the 9% of unassignable imports to the usage of transitive dependencies. 49% of the declared dependencies are neither directly nor indirectly used throughout our sample data. These results give reason to study a larger sample and to develop tools that reduce the amount of redundant code packaged during a build.
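The assignment step described above, mapping used imports to declared dependencies, can be sketched as prefix matching on package names; anything left unassigned plausibly comes from a transitive dependency. Real Maven/Gradle resolution is far more involved, and the dependency names and packages below are illustrative only.

```python
# Map used imports to declared dependencies by package-prefix matching.
def assign_imports(used_imports, declared_packages):
    """declared_packages: {dependency_name: base_package}.
    Returns (assignments, unassigned); unassigned imports likely come
    from transitive dependencies."""
    assignments, unassigned = {}, []
    for imp in used_imports:
        for dep, base in declared_packages.items():
            if imp == base or imp.startswith(base + "."):
                assignments[imp] = dep
                break
        else:
            unassigned.append(imp)
    return assignments, unassigned
```

A declared dependency that never appears as the target of any assignment is then a candidate for removal from the build file, which is the redundancy the thesis quantifies.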
Tim Moser, Comment Quality Assessment and Classification, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) Almost all tasks concerning the evolution and maintenance of software require a developer to understand the code. Multiple studies have shown that commented code is more readable than code without any comments, indicating that comments contain vital information about the implementation. However, not all comments are of equal quality: some are incomplete, inconsistent with the code, hard to read and understand, or entirely missing. In this thesis, we investigated how to provide an approach to analyze comments and rate them with respect to their quality in four different programming languages: Java, C, C++ and C#. The goal of this thesis is twofold. RQ1: Propose a deep learning approach to classify comments into different categories based on purpose and semantics; we show that our classification pipeline reaches an accuracy of over 90% (F1-score), outperforming traditional machine learning models on the same data set in the same environment. RQ2: Develop a tool for assessing comment quality with respect to readability, coherence, usefulness, completeness and consistency; with this tool we also demonstrate the evolution of comment quality in four repositories written in different programming languages, and suggest directions for future work, such as an empirical evaluation with real developers.
Janosch Baltensperger, Continuous Deep Learning; An in-depth investigation of the deep learning workflow, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) Deep learning has gained immense traction with the emergence of big data and advanced computing power. Through the use of artificial neural networks, various breakthroughs were achieved in fields such as language understanding and image recognition. Nevertheless, it has become clear that deep learning, and machine learning in general, imposes various additional challenges besides building an accurate model. Researchers have been highly active in investigating the classical machine learning workflow and integrating best practices from the software engineering lifecycle. However, deep learning exhibits deviations that are not yet covered in this conceptual development process, including the requirement for dedicated hardware, dispensable feature engineering, extensive hyperparameter optimization, large-scale data management, and model compression to reduce size and inference latency. Individual problems of deep learning are under thorough examination, and numerous concepts and implementations have gained traction; however, the complete end-to-end development process remains unspecified. In this thesis, we defined a detailed deep learning workflow that incorporates the aforementioned characteristics on the baseline of the classical machine learning workflow. We further transferred the conceptual idea into practice by building a prototypical deep learning system using the latest technologies on the market. To examine the feasibility of the workflow, two use cases were applied to the prototype: the first represented a text classification problem, while the second focused on image processing. We thereby successfully demonstrated the application of the workflow on distinct examples. In summary, it becomes apparent that the deep learning lifecycle comprises a large set of steps and involves various roles. With our defined workflow, we present a profound guideline for the deep learning development process. Moreover, we conclude that the technologies currently available on the market are not fully mature: great effort is required to manage all deep learning artifacts and keep versions aligned across continuous iterations over the lifecycle.
Fabio Greter, An isolated containerized infrastructure for flakiness discovery; Implementation of a Prototype, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) As software projects increase in complexity, testing becomes increasingly important. It is essential that tests be built so that a test failure reliably indicates problems in the production code, which can then easily be fixed. Because modern software systems have become inherently nondeterministic, intermittent test failures are also becoming more frequent, in what are known as flaky tests. Existing techniques to remedy this problem focus on efficient detection of flaky tests without identifying the root causes of their intermittent behaviour, or are specific to certain root causes. Other approaches rely on instrumentation of the production code, which may affect test outcomes. In this thesis, we present a prototype implementation of an architecture to induce test flakiness by executing tests under various circumstances, called execution scenarios. Our prototype allows for flexible implementation of these scenarios and provides an API to be used in various environments, such as continuous integration pipelines. We find that it can reproduce known flaky test behaviour, and that in some cases our prototype execution scenarios can exhibit different failure rates for certain test cases. We also propose future enhancements to continue development of the prototype.
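The execution-scenario idea can be sketched without containers: run the same test repeatedly under different (here, simulated) environments and compare failure rates. Everything below, including the deliberately order-sensitive test, is an invented illustration rather than the prototype's API.

```python
# Run a test under an execution scenario many times and measure failures.
def failure_rate(test, scenario, runs=100):
    """Execute `test` under environments produced by `scenario` and
    return the fraction of runs that failed with an AssertionError."""
    failures = 0
    for seed in range(runs):
        try:
            test(scenario(seed))
        except AssertionError:
            failures += 1
    return failures / runs

# A deliberately order-dependent "flaky" test for demonstration:
def flaky_test(env):
    items = env["items"]
    assert items[0] == "a"  # fails whenever the scenario reorders the list

def stable_scenario(seed):
    return {"items": ["a", "b"]}

def shuffling_scenario(seed):
    # Every other run perturbs the order, inducing intermittent failures.
    return {"items": ["a", "b"] if seed % 2 == 0 else ["b", "a"]}
```

A nonzero failure-rate difference between scenarios is exactly the signal the prototype looks for: it points at the scenario dimension (here, ordering) as a likely root cause of the flakiness.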
Dylan Puser, Flaky Tests Detection in a Continuous Integration Pipeline; Implementation of a Proof-of-Concept System, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis) Tests in software engineering are used to control the validity of code, to make sure newly written or modified code does not have unintended consequences, and to create a more maintainable project. Sometimes, however, a test can be flaky. A flaky test will fail occasionally, even though neither the test nor the code under test was modified. Such tests erode trust in the test suite, are difficult and costly to identify and rectify, and can have a considerable negative impact on companies and developers. Furthermore, they can be an indication of a deeper fault in the system itself. Based on a proposal for identifying the root cause of flaky tests using a container-based fuzzy-driven approach and an implementation of such a system, we discuss how to best make it available to a typical user. We then present an implementation of such a system and briefly evaluate its value.
Martin Grambow, Christoph Laaber, Philipp Leitner, David Bermbach, Using application benchmark call graphs to quantify and improve the practical relevance of microbenchmark suites, PeerJ Computer Science, Vol. 7, 2021. (Journal Article) Performance problems in applications should ideally be detected as soon as they occur, i.e., directly when the causing code modification is added to the code repository. To this end, complex and cost-intensive application benchmarks or lightweight but less relevant microbenchmarks can be added to existing build pipelines to ensure performance goals. In this paper, we show how the practical relevance of microbenchmark suites can be improved and verified based on the application flow during an application benchmark run. We propose an approach to determine the overlap of common function calls between application and microbenchmarks, describe a method which identifies redundant microbenchmarks, and present a recommendation algorithm which reveals relevant functions that are not covered by microbenchmarks yet. A microbenchmark suite optimized in this way can easily test all functions determined to be relevant by application benchmarks after every code change, thus, significantly reducing the risk of undetected performance problems. Our evaluation using two time series databases shows that, depending on the specific application scenario, application benchmarks cover different functions of the system under test. Their respective microbenchmark suites cover between 35.62% and 66.29% of the functions called during the application benchmark, offering substantial room for improvement. Through two use cases—removing redundancies in the microbenchmark suite and recommendation of yet uncovered functions—we decrease the total number of microbenchmarks and increase the practical relevance of both suites. 
Removing redundancies can significantly reduce the number of microbenchmarks (and thus the execution time as well) to ~10% and ~23% of the original microbenchmark suites, whereas the recommendation identifies up to 26 and 14 not yet covered functions to benchmark, improving the relevance. By utilizing the differences and synergies of application benchmarks and microbenchmarks, our approach potentially enables effective software performance assurance with performance tests of multiple granularities. |
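Once call graphs are reduced to sets of called functions, the three analyses from this abstract (coverage, redundancy detection, recommendation) become plain set operations. The sketch below uses this simplification; all function and benchmark names are invented for illustration, and the paper's actual analysis works on richer call-graph data.

```python
# Call graphs reduced to sets of called functions (names are illustrative).
app_bench_calls = {"db.insert", "db.query", "codec.encode", "codec.decode", "net.send"}
micro_suite = {
    "BenchInsert": {"db.insert", "codec.encode"},
    "BenchQuery":  {"db.query", "codec.decode"},
    "BenchEncode": {"codec.encode"},            # subset of BenchInsert's coverage
}

# Coverage: share of application-benchmark functions hit by any microbenchmark.
covered = set().union(*micro_suite.values()) & app_bench_calls
coverage = len(covered) / len(app_bench_calls)

# Redundancy: a microbenchmark whose relevant calls are fully covered by another.
redundant = {
    name for name, calls in micro_suite.items()
    if any(other != name and calls & app_bench_calls <= other_calls & app_bench_calls
           for other, other_calls in micro_suite.items())
}

# Recommendation: relevant functions no microbenchmark covers yet.
recommended = app_bench_calls - covered

print(coverage, redundant, recommended)
```

Here `BenchEncode` is redundant (everything relevant it exercises is already exercised by `BenchInsert`) and `net.send` would be recommended as a new microbenchmark target.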
|
Christoph Laaber, Deliberate microbenchmarking of software systems, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Dissertation) Software performance faults have severe consequences for users, developers, and companies. One way to unveil performance faults before they manifest in production is performance testing, which ought to be done on every new version of the software, ideally on every commit. However, performance testing faces multiple challenges that inhibit it from being applied early in the development process, on every new commit, and in an automated fashion. In this dissertation, we investigate three challenges of software microbenchmarks, a performance testing technique on unit granularity which is predominantly used for libraries and frameworks. The studied challenges affect the quality aspects (1) runtime, (2) result variability, and (3) performance change detection of microbenchmark executions. The objective is to understand the extent of these challenges in real-world software and to find solutions to address these. To investigate the challenges’ extent, we perform a series of experiments and analyses. We execute benchmarks in bare-metal as well as multiple cloud environments and conduct a large-scale mining study on benchmark configurations. The results show that all three challenges are common: (1) benchmark suite runtimes are often longer than 3 hours; (2) result variability can be extensive, in some cases up to 100%; and (3) benchmarks often only reliably detect large performance changes of 60% or more. 
To address the challenges, we devise targeted solutions as well as adapt well-known techniques from other domains for software microbenchmarks: (1) a solution that dynamically stops benchmark executions based on statistics to reduce runtime while maintaining low result variability; (2) a solution to identify unstable benchmarks that does not require execution, based on statically-computable source code features and machine learning algorithms; (3) traditional test case prioritization (TCP) techniques to execute benchmarks earlier that detect larger performance changes; and (4) specific execution strategies to detect small performance changes reliably even when executed in unreliable cloud environments. We experimentally evaluate the solutions and techniques on real-world benchmarks and find that they effectively deal with the three challenges. (1) Dynamic reconfiguration drastically reduces runtime, by between 48.4% and 86.0%, without changing the results of 78.8% to 87.6% of the benchmarks, depending on the project and statistic used. (2) The instability prediction model effectively identifies unstable benchmarks when relying on random forest classifiers, with a prediction performance between 0.79 and 0.90 area under the receiver operating characteristic curve (AUC). (3) TCP applied to benchmarks is effective and efficient, with APFD-P values for the best technique ranging from 0.54 to 0.71 and a computational overhead of 11%. (4) Batch testing, i.e., executing the benchmarks of two versions on the same instances interleaved and repeated, as well as repeated across instances, reliably detects performance changes of 10% or less, even when using unreliable cloud infrastructure as the execution environment. 
Overall, this dissertation shows that real-world software microbenchmarks are considerably affected by all three challenges (1) runtime, (2) result variability, and (3) performance change detection; however, deliberate planning and execution strategies effectively reduce their impact. |
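The "dynamically stops benchmark executions based on statistics" idea from the abstract can be sketched with one plausible stopping criterion: keep collecting measurements until their coefficient of variation falls below a threshold, subject to a hard iteration budget. The criterion, thresholds, and the simulated `measure_once` workload below are all illustrative assumptions, not the dissertation's exact method.

```python
import random
import statistics

def measure_once():
    """Stand-in for one benchmark invocation; returns a runtime sample in ms."""
    return 100 + random.gauss(0, 0.5)

def run_until_stable(cv_threshold=0.01, min_iters=5, max_iters=1000):
    """Dynamic stopping sketch: iterate until the coefficient of variation
    (stddev / mean) of the collected samples drops below the threshold,
    capped by a hard iteration budget."""
    samples = []
    while len(samples) < max_iters:
        samples.append(measure_once())
        if len(samples) >= min_iters:
            cv = statistics.stdev(samples) / statistics.mean(samples)
            if cv < cv_threshold:
                break
    return samples

random.seed(42)
samples = run_until_stable()
print(len(samples), round(statistics.mean(samples), 1))
```

A stable benchmark stops early and saves runtime, while a noisy one runs up to the budget; this is the trade-off behind the 48.4% to 86.0% runtime reductions reported above.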