Carol Alexandru, Sebastiano Panichella, Sebastian Proksch, Harald C Gall, Redundancy-free analysis of multi-revision software artifacts, Empirical Software Engineering, Vol. 24 (1), 2019. (Journal Article)
Researchers often analyze several revisions of a software project to obtain historical data about its evolution. For example, they statically analyze the source code and monitor the evolution of certain metrics over multiple revisions. The time and resource requirements for running these analyses often make it necessary to limit the number of analyzed revisions, e.g., by only selecting major revisions or by using a coarse-grained sampling strategy, which could remove significant details of the evolution. Most existing analysis techniques are not designed for the analysis of multi-revision artifacts and treat each revision individually. However, the actual difference between two subsequent revisions is typically very small. Thus, tools tailored to the analysis of multiple revisions should only analyze these differences, thereby preventing the re-computation and storage of redundant data, improving scalability, and enabling the study of a larger number of revisions. In this work, we propose the Lean Language-Independent Software Analyzer (LISA), a generic framework for representing and analyzing multi-revision software artifacts. It employs a redundancy-free, multi-revision representation for artifacts and avoids re-computation by only analyzing changed artifact fragments across thousands of revisions. The evaluation of our approach consists of measuring the effect of each of the incorporated techniques, an in-depth study of LISA's resource requirements, and a large-scale analysis over 7 million program revisions of 4,000 software projects written in four languages. We show that the time and space requirements for multi-revision analyses can be reduced by multiple orders of magnitude compared to traditional, sequential approaches.
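To make the idea of a redundancy-free representation concrete, here is a minimal Python sketch of the underlying principle: each artifact fragment is stored once and annotated with the revision range in which it exists, so an analysis visits each fragment a single time instead of once per revision. The names and structure are our own illustration, not LISA's actual design.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """One artifact fragment (e.g., an AST node), stored exactly once."""
    kind: str        # e.g., "method", "class"
    first_rev: int   # first revision that contains this fragment
    last_rev: int    # last revision that contains it (inclusive)

def count_methods(fragments, n_revisions):
    """Per-revision method count, visiting each unique fragment only once."""
    counts = [0] * n_revisions
    for f in fragments:
        if f.kind == "method":
            for rev in range(f.first_rev, f.last_rev + 1):
                counts[rev] += 1   # the fragment is shared across revisions
    return counts

# Three revisions; the unchanged method is stored and analyzed only once.
fragments = [Fragment("method", 0, 2), Fragment("method", 1, 1)]
print(count_methods(fragments, 3))  # [1, 2, 1]
```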
Giovanni Grano, Fabio Palomba, Dario Di Nucci, Andrea De Lucia, Harald C Gall, Scented Since the Beginning: On the Diffuseness of Test Smells in Automatically Generated Test Code, The Journal of Systems and Software, Vol. 156, 2019. (Journal Article)
Software testing represents a key software engineering practice to ensure source code quality and reliability. To support developers in this activity and reduce testing effort, several automated unit test generation tools have been proposed. Most of these approaches have the main goal of covering as many branches as possible. While these approaches perform well, little is known about the maintainability of the test code they produce, i.e., whether the generated tests have good code quality and whether they introduce issues threatening their effectiveness. To bridge this gap, in this paper we study to what extent existing automated test case generation tools produce potentially problematic test code. We consider seven test smells, i.e., suboptimal design choices applied by programmers during the development of test cases, as a measure of the code quality of the generated tests, and evaluate their diffuseness in the unit test classes automatically generated by three state-of-the-art tools: Randoop, JTExpert, and EvoSuite. Moreover, we investigate whether there are characteristics of test and production code that influence the generation of smelly tests. Our study shows that all the considered tools tend to generate a high number of instances of two specific test smell types, i.e., Assertion Roulette and Eager Test, which previous studies have shown to negatively impact the reliability of production code. We also discover that test size is correlated with the generation of smelly tests. Based on our findings, we argue that more effective automated generation algorithms that explicitly take test code quality into account should be investigated and devised.
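For readers unfamiliar with the two smells named above, the following rough Python sketch shows how they are commonly characterized; the regex-based heuristics and thresholds are simplifying assumptions of ours, not the detection rules used in the study.

```python
import re

def has_assertion_roulette(test_method_src: str) -> bool:
    """Assertion Roulette: several assertions without explanation messages."""
    args = re.findall(r"\bassert\w*\s*\(([^;]*)\)\s*;", test_method_src)
    # Heuristic: a JUnit assertion with a message passes a string literal first.
    unexplained = [a for a in args if not a.strip().startswith('"')]
    return len(unexplained) > 1

def is_eager_test(test_method_src: str, production_methods: set) -> bool:
    """Eager Test: the test exercises several methods of the class under test."""
    called = {m for m in production_methods
              if re.search(r"\b" + re.escape(m) + r"\s*\(", test_method_src)}
    return len(called) > 1

src = 'assertEquals(list.size(), 2); assertTrue(list.contains("a"));'
print(has_assertion_roulette(src))                        # True
print(is_eager_test(src, {"size", "contains", "clear"}))  # True
```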
Domenico Serra, Giovanni Grano, Fabio Palomba, Filomena Ferrucci, Harald C Gall, Alberto Bacchelli, On the Effectiveness of Manual and Automatic Unit Test Generation: Ten Years Later, In: Proceedings of the 16th International Conference on Mining Software Repositories, IEEE Press, Piscataway, NJ, USA, 2019-01-01. (Conference or Workshop Paper published in Proceedings)
Good unit tests play a paramount role when it comes to fostering and evaluating software quality. However, writing effective tests is an extremely costly and time-consuming practice. To reduce this burden for developers, researchers have devised ingenious techniques to automatically generate test suites for existing code bases. Nevertheless, how automatically generated test cases fare against manually written ones is an open research question. In 2008, Bacchelli et al. conducted an initial case study comparing automatically generated and manually written test suites. Since in the last ten years we have witnessed a huge amount of work on novel approaches and tools for automatic test generation, in this paper we revisit their study using current tools and complement their research method by evaluating these tools' ability to find regressions. Preprint [https://doi.org/10.5281/zenodo.2595232], dataset [https://doi.org/10.6084/m9.figshare.7628642].
Carol Alexandru, José J. Merchante, Sebastiano Panichella, Sebastian Proksch, Harald Gall, Gregorio Robles, On the Usage of Pythonic Idioms, In: Onward!, ACM, Boston, MA, USA, 2018-11-07. (Conference or Workshop Paper published in Proceedings)
Gerald Schermann, Philipp Leitner, Search-Based Scheduling of Experiments in Continuous Deployment, In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, Madrid, Spain, 2018-10-23. (Conference or Workshop Paper published in Proceedings)
Manuela Züger, Sensing and indicating interruptibility in office workplaces, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
In office workplaces, interruptions by co-workers, emails or instant messages are common. Many of these interruptions are useful as they might help resolve questions quickly and increase the productivity of the team. However, knowledge workers interrupted at inopportune moments experience longer task resumption times, lower overall performance, more negative emotions, and make more errors than if they were to be interrupted at more appropriate moments.
To reduce the cost of interruptions, several approaches have been suggested, ranging from simply closing office doors to automatically measuring and indicating a knowledge worker’s interruptibility - the availability for interruptions - to co-workers. When it comes to computer-based interruptions, such as emails and instant messages, several studies have shown that they can be deferred to automatically detected breakpoints during task execution, which reduces their interruption cost. For in-person interruptions, one of the most disruptive and time-consuming types of interruptions in office workplaces, the predominant approaches are still manual strategies to physically indicate interruptibility, such as wearing headphones or using manual busy lights. However, manual approaches are cumbersome to maintain and thus are not updated regularly, which reduces their usefulness.
To automate the measurement and indication of interruptibility, researchers have looked at a variety of data that can be leveraged, ranging from contextual data, such as audio and video streams, keyboard and mouse interaction data, or task characteristics, all the way to biometric data, such as heart rate data or eye traces. While studies have shown promise for the use of such sensors, they were predominantly conducted on small and controlled tasks over short periods of time and were mostly limited to either contextual or biometric sensors. Little is known about their accuracy and applicability for long-term usage in the field, in particular in office workplaces. In this work, we developed an approach to automatically measure interruptibility in office workplaces, using computer interaction sensors, one type of contextual sensor, and biometric sensors. In particular, we conducted one lab and two field studies with a total of 33 software developers. Using the collected computer interaction and biometric data, we applied machine learning to train interruptibility models. Overall, the results of our studies show that we can automatically predict interruptibility with a high accuracy of 75.3%, improving on a baseline majority classifier by 26.6%.
An automatic measure of interruptibility can consequently be used to indicate the status to others, allowing them to make a well-informed decision on when to interrupt. While there are some automatic approaches to indicate interruptibility on a computer in the form of contact list applications, they do not help to reduce in-person interruptions. Only very few researchers have combined the benefits of an automatic measurement with a physical indicator, and their effect in office workplaces over longer periods of time is unknown. In our research, we developed the FlowLight, an automatic interruptibility indicator in the form of a traffic-light-like LED placed on a knowledge worker's desk. We evaluated the FlowLight in a large-scale field study with 449 participants from 12 countries. The evaluation revealed that after the introduction of the FlowLight, the number of in-person interruptions decreased by 46% (based on 36 interruption logs), awareness of the potential harm of interruptions was raised and participants felt more productive (based on 183 survey responses and 23 interview transcripts), and 86% remained active users even after the two-month study period ended (based on 449 online usage logs).
Overall, our research shows that we can successfully reduce in-person interruption cost in office workplaces by sensing and indicating interruptibility. In addition, our research can be extended and opens up new opportunities to further support interruption management, for example, by integrating other, more accurate biometric sensors to improve the interruptibility model, or by using the model to reduce self-interruptions.
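As a sketch of what such an interruptibility model can look like in code, the following Python fragment trains a classifier on sensor features. The feature names, the data file, and the labeling scheme are illustrative assumptions, not the actual setup of the studies.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("sensor_windows.csv")  # one row per observation window (assumed)
features = ["keystrokes", "mouse_clicks", "window_switches",  # computer interaction
            "heart_rate_mean", "hrv_rmssd", "sleep_hours"]    # biometric
X, y = df[features], df["interruptible"]  # self-reported label (assumed)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # estimate accuracy via cross-validation
print("mean accuracy: %.1f%%" % (100 * scores.mean()))
```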
Faruk Acibal, Investigating Continuous Delivery Practices and their Effectiveness in Open Source Projects, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Continuous Integration (CI) and Continuous Deployment (CD) are heavily used practices in software development, in open source as well as industrial environments. To understand the effectiveness and efficiency of these practices, we start by defining a taxonomy of variables that could directly or indirectly influence the effectiveness and efficiency of CI/CD practices. By performing an extensive literature review, we extract around 77 variables from 42 sources and StackExchange posts. We then state possible theoretical effects between these variables. We continue with an empirical study that examines in depth the build failure rate and the build duration and how they are affected by other variables. We use the datasets provided by TravisTorrent as well as GHTorrent. Looking at over 1,200 projects and more than 680,000 builds, we confirm the findings of older studies but also contribute new ones. Our work should help identify problematic CI/CD practices that could influence CI/CD effectiveness. The taxonomy we defined should help with many upcoming research questions regarding the efficient and effective use of CI/CD practices. With these results, CI/CD effectiveness could be improved in industrial as well as open source environments by manually or even automatically inspecting these variables and warning the maintainers of projects if problematic instances of these variables are detected.
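As an illustration of the kind of analysis this thesis performs, the sketch below computes the per-project build failure rate and mean build duration from a TravisTorrent CSV export. The column names follow the published TravisTorrent schema, but should be verified against the concrete dataset version used.

```python
import pandas as pd

builds = pd.read_csv("travistorrent.csv",
                     usecols=["gh_project_name", "tr_status", "tr_duration"])

# Treat every non-passing build (failed, errored, ...) as a failure.
per_project = builds.groupby("gh_project_name").agg(
    builds=("tr_status", "size"),
    failure_rate=("tr_status", lambda s: (s != "passed").mean()),
    mean_duration_s=("tr_duration", "mean"),
)
print(per_project.sort_values("failure_rate", ascending=False).head())
```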
Sali Zumberi, Changelyzer: Learning Change Type Classifications for Software Evolution from Big Code on GitHub, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
Software needs to be adapted to its rapidly changing environment. A key issue in software evolution analysis is the identification of particular changes that occur across several versions of a program. In order to understand and analyze source code changes, it is crucial to make them tangible. While plain-text diffs are a straightforward way of keeping track of changes in a software project, they are poorly suited for understanding those changes: different semantic changes might be mixed together in a single diff, and it is difficult to further process diffs using automated tools. Approaches like ChangeDistiller extract changes between two revisions based on abstract syntax trees (ASTs) instead of plain-text source code. This allows them to recognize semantic changes, more specifically, whether certain elements (if conditions, classes, methods, etc.) have been added, removed, modified, or even moved to other locations in the source code. However, the change types identified by ChangeDistiller have been manually crafted by researchers. This thesis extracts common change types by applying big data analytics, including clustering, word embeddings, and topic modeling, to the changes of over 500 projects (18.6 GB). We were able to find more than 70 common change types, use a neural network to surface similar changes, group similar changes into 55 clusters, extract 35 topics with the help of topic modeling, and demonstrate the existence of change groups within larger diffs by implementing a dedicated algorithm. Finally, we propose tools and tasks based on the resulting data corpus.
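A compressed sketch of such a pipeline, embedding AST-level change descriptors and then clustering them, might look as follows. The tokenization, the parameters, and the tiny example corpus are our assumptions; the thesis works with 55 clusters over a far larger corpus.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Each change is a sequence of AST-edit tokens, e.g. from ChangeDistiller output.
changes = [["insert", "if_statement", "method_body"],
           ["update", "condition", "if_statement"],
           ["delete", "method", "class_body"]]

# Learn token embeddings, then represent a change as its mean token vector.
w2v = Word2Vec(changes, vector_size=50, min_count=1, epochs=20)
vectors = np.array([np.mean([w2v.wv[t] for t in c], axis=0) for c in changes])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # cluster assignment per change
```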
Carmine Vassallo, Fabio Palomba, Harald C Gall, Continuous Refactoring in CI: A Preliminary Study on the Perceived Advantages and Barriers, In: 2018 IEEE International Conference on Software Maintenance and Evolution, ICSME 2018, IEEE, Washington, DC, United States, 2018-09-23. (Conference or Workshop Paper published in Proceedings)
By definition, the practice of Continuous Integration (CI) promotes continuous software quality improvement. In systems adopting such a practice, quality assurance is usually performed by using static and dynamic analysis tools (e.g., SonarQube) that compute overall metrics such as maintainability or reliability measures. Furthermore, developers usually define quality gates, i.e., source code quality thresholds that must be reached by the software product after every newly committed change. If certain quality gates fail (e.g., a maintainability metric is below a set threshold), developers should refactor the code, possibly addressing some of the reported warnings. While previous research findings showed that refactoring is often not done in practice, it is still unclear whether and how the adoption of a CI philosophy has changed the way developers perceive and adopt refactoring. In this paper, we preliminarily study, by running a survey that involved 31 developers, how developers perform refactoring in CI, which needs they have, and which barriers they face while continuously refactoring source code.
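To make the quality gate concept concrete, here is a minimal sketch of such a check. The metric names and thresholds are invented for illustration and do not correspond to any particular SonarQube configuration.

```python
# Each gate is a threshold a newly committed change must satisfy.
QUALITY_GATES = {
    "maintainability_rating": ("<=", 2.0),   # e.g., SonarQube-style A=1 ... E=5
    "coverage":               (">=", 0.80),
    "duplicated_lines_pct":   ("<=", 0.03),
}

def gate_failures(metrics: dict) -> list:
    """Return the names of all gates the current metrics violate."""
    ops = {"<=": lambda v, t: v <= t, ">=": lambda v, t: v >= t}
    return [name for name, (op, threshold) in QUALITY_GATES.items()
            if not ops[op](metrics[name], threshold)]

# A failing gate signals that the developer should refactor before merging.
print(gate_failures({"maintainability_rating": 3.0,
                     "coverage": 0.85, "duplicated_lines_pct": 0.01}))
# -> ['maintainability_rating']
```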
Nik Zaugg, "What Should I Change in My Patch?": An Empirical Investigation of Relevant Changes and Automation Needs in Modern Code Review, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Recent research has shown that available tools for Modern Code Review (MCR) are still far from meeting the current expectations of developers. The objective of this thesis is to investigate the most recurrent change types in MCR as well as developers' expectations and needs regarding the automation of reviewing activities, by considering the current literature, manually analyzing code review changes, and conducting a survey. Additionally, we explore the approaches and tools that are still needed to facilitate MCR activities and extract various metrics describing a patch in Gerrit code review, with the goal of predicting required changes in the future. To that end, we first empirically elicited a taxonomy of recurrent review change types that characterize code reviews. The taxonomy was designed in three steps: (i) we generated an initial version of the taxonomy by qualitatively and quantitatively analyzing 211 review commits and 648 review comments of ten open source projects; (ii) we integrated topics and code review change types of an existing taxonomy from the literature into this initial taxonomy; and (iii) we surveyed 52 developers to integrate potentially missing change types into the taxonomy. We then evaluated the survey feedback to learn more about developers' current expectations towards code review and how code review activities can be facilitated by novel tools and approaches. The results of our taxonomy evaluation support previous research, showing that the majority of changes in code review are related to maintainability issues. Furthermore, our findings highlight that the availability of emerging development technologies (e.g., cloud-based technologies) and practices (e.g., continuous delivery and continuous integration) further widens the gap between the expectations developers have towards code review and its outcome. This has pushed developers to perform additional activities during code reviews and shows that additional types of feedback are expected from reviewers, especially regarding changes in non-source-code artifacts (e.g., configurations of Automated Static Analysis Tools). Our survey participants provided recommendations, specified techniques to employ, and highlighted the data to analyze for implementing approaches able to automate the code review activities related to our taxonomy. The most promising recommendations towards the automation of MCR involve the use of Machine Learning and Natural Language Processing techniques to study recurrent patterns and anti-patterns, as well as code, change, and object-oriented metrics. This study sheds more light on the most critical and recurring changes in code review, on developers' expectations and needs, and ultimately on the approaches and tools that are still needed to facilitate MCR activities. We believe that this is an essential step towards closing the gap between developers' expectations in code review and its outcome, as well as supporting the vision of full or partial automation in MCR.
Gerald Schermann, Sali Zumberi, Jürgen Cito, Structured Information on State and Evolution of Dockerfiles on GitHub, In: MSR'18 Proceedings of the 15th International Conference on Mining Software Repositories, ACM Press, New York, New York, USA, 2018-06-28. (Conference or Workshop Paper published in Proceedings)
Manuela Züger, Thomas Fritz, Sensing and Supporting Software Developers' Focus, In: 26th Conference on Program Comprehension, ACM Press, New York, New York, USA, 2018-06-28. (Conference or Workshop Paper)
Software developers regularly have to focus in order to successfully perform their work. At the same time, developers experience many disruptions to their focus, especially in today's highly demanding, collaborative and open office work environments. When these disruptions happen during tasks that require a lot of focus, such as comprehending a difficult piece of source code, they can be very costly, causing a decrease in performance and quality. By sensing how focused a developer is, we might be able to reduce the cost of such disruptions.
In our previous work, we investigated the use of biometric and computer interaction sensors to sense interruptibility - the availability for interruptions - and developed the FlowLight approach - a traffic-light-like LED indicator of a person's interruptibility - to reduce the cost of external in-person interruptions, a particularly expensive kind of disruption. Our results demonstrate the potential of accurately sensing interruptibility in the field and of reducing external interruption cost to increase the focus and productivity of knowledge workers.
Adelina Ciurumelea, Sebastiano Panichella, Harald C. Gall, Automated user reviews analyser, In: ICSE '18: 40th International Conference on Software Engineering, ACM, New York, NY, USA, 2018. (Conference or Workshop Paper published in Proceedings)
Christoph Laaber, Philipp Leitner, An Evaluation of Open-Source Software Microbenchmark Suites for Continuous Performance Assessment, In: MSR ’18: 15th International Conference on Mining Software Repositories, ACM, New York, NY, USA, 2018-05-28. (Conference or Workshop Paper published in Proceedings)
Continuous integration (CI) emphasizes quick feedback to developers. This is at odds with the current practice of performance testing, which predominantly focuses on long-running tests against entire systems in production-like environments. Alternatively, software microbenchmarking attempts to establish a performance baseline for small code fragments in short time. This paper investigates the quality of microbenchmark suites with a focus on their suitability to deliver quick performance feedback and on CI integration. We study ten open-source libraries written in Java and Go with benchmark suite sizes ranging from 16 to 983 tests, and runtimes between 11 minutes and 8.75 hours. We show that our study subjects include benchmarks with result variability of 50% or higher, indicating that not all benchmarks are useful for the reliable discovery of slowdowns. We further artificially inject actual slowdowns into public API methods of the study subjects and test whether the test suites are able to discover them. We introduce a performance-test quality metric called the API benchmarking score (ABS). ABS represents a benchmark suite's ability to find slowdowns among a set of defined core API methods. The resulting benchmarking scores (i.e., fraction of discovered slowdowns) vary between 10% and 100% for the study subjects. This paper's methodology and results can be used to (1) assess the quality of existing microbenchmark suites, (2) select a set of tests to be run as part of CI, and (3) suggest or generate benchmarks for currently untested parts of an API.
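Following the definition above, ABS can be computed as the fraction of injected core-API slowdowns that the benchmark suite discovers. A minimal sketch of that calculation, with method names invented for illustration:

```python
def api_benchmarking_score(injected: set, detected: set) -> float:
    """Fraction of injected core-API slowdowns discovered by the suite."""
    return len(detected & injected) / len(injected)

injected = {"Parser.parse", "Cache.get", "Codec.encode", "Pool.acquire"}
detected = {"Parser.parse", "Cache.get"}           # flagged by the benchmarks
print(api_benchmarking_score(injected, detected))  # 0.5
```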
Carmine Vassallo, Sebastian Proksch, Timothy Zemp, Harald C Gall, Un-Break My Build: Assisting Developers with Build Repair Hints, In: 26th Conference on Program Comprehension, ICPC 2018, ACM, New York, United States, 2018-05-27. (Conference or Workshop Paper published in Proceedings)
Continuous integration is an agile software development practice. Instead of integrating features right before a release, they are constantly being integrated in an automated build process. This shortens the release cycle, improves software quality, and reduces time to market. However, the whole process will come to a halt when a commit breaks the build, which can happen for several reasons, e.g., compilation errors or test failures, and fixing the build suddenly becomes a top priority. Developers not only have to find the cause of the build break and fix it, but they have to be quick in all of it to avoid a delay for others. Unfortunately, these steps require deep knowledge and are often time consuming. To support developers in fixing a build break, we propose Bart, a tool that summarizes the reasons of the build failure and suggests possible solutions found on the Internet. We will show in a case study with eight participants that developers find Bart useful to understand build breaks and that using Bart substantially reduces the time to fix a build break, on average by 41%. |
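The first step such a tool has to perform is recognizing why a build failed from its log. The sketch below illustrates this with a few regex rules for Maven-style logs; the patterns are our illustrative assumptions, not Bart's actual rules.

```python
import re

# Ordered (category, pattern) rules; first match wins.
FAILURE_PATTERNS = [
    ("compilation error", r"COMPILATION ERROR|error: cannot find symbol"),
    ("test failure",      r"There are test failures|Failures: [1-9]"),
    ("dependency error",  r"Could not resolve dependencies"),
]

def categorize_build_log(log: str) -> str:
    for category, pattern in FAILURE_PATTERNS:
        if re.search(pattern, log):
            return category
    return "unknown"

log = "[ERROR] COMPILATION ERROR : ... error: cannot find symbol"
print(categorize_build_log(log))  # compilation error
```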
Manuela Züger, Sebastian Müller, André Meyer, Thomas Fritz, Sensing Interruptibility in the Office: A Field Study on the Use of Biometric and Computer Interaction Sensors, In: CHI 2018, s.n., Montreal, QC, Canada, 2018-04-21. (Conference or Workshop Paper published in Proceedings)
Knowledge workers experience many interruptions during their work day. Especially when they happen at inopportune moments, interruptions can incur high costs, causing time loss and frustration. Knowing a person's interruptibility allows optimizing the timing of interruptions and minimizing disruption. Recent advances in technology provide the opportunity to collect a wide variety of data on knowledge workers to predict interruptibility. While prior work predominantly examined interruptibility based on a single data type and in short lab studies, we conducted a two-week field study with 13 professional software developers to investigate a variety of computer interaction, heart-, sleep-, and physical-activity-related data. Our analysis shows that computer interaction data is more accurate in predicting interruptibility at the computer than biometric data (74.8% vs. 68.3% accuracy), and that combining both yields the best results (75.7% accuracy). We discuss our findings and their practical applicability, also in light of the collected qualitative data.
Carmine Vassallo, Sebastiano Panichella, Fabio Palomba, Sebastian Proksch, Andy Zaidman, Harald C Gall, Context Is King: The Developer Perspective on the Usage of Static Analysis Tools, In: 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, IEEE Computer Society, Washington, DC, United States, 2018-03-20. (Conference or Workshop Paper published in Proceedings)
Automatic static analysis tools (ASATs) are tools that support automatic code quality evaluation of software systems with the aim of (i) avoiding and/or removing bugs and (ii) spotting design issues. Hindering their widespread acceptance are their (i) high false positive rates and (ii) the low comprehensibility of the generated warnings. Researchers and ASAT vendors have proposed solutions to prioritize such warnings with the aim of guiding developers toward the most severe ones. However, none of the proposed solutions considers the development context in which an ASAT is being used to further improve the selection of relevant warnings. To shed light on the impact of such contexts on warning configuration, usage, and the adopted prioritization strategies, we surveyed 42 developers (69% in industry and 31% in open source projects) and interviewed 11 industrial experts that integrate ASATs in their workflow. While we can confirm previous findings on the reluctance of developers to configure ASATs, our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context, and (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming. Our results clearly indicate ways to better assist developers by improving existing warning selection and prioritization strategies.
Sandro Wirth, Generating Documentation to Reflect Side Effects of Methods, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Studies have shown that reducing side effects in software projects has a variety of advantages: it has a positive impact on testability, code comprehension, verifiability, and the ease of performing refactorings. Although there are existing studies about detecting purity and side effects, there is no study about presenting this information to developers while they read and write code. This thesis presents an approach for automatically showing the purity and side effects of Java methods during the workflow of reading and writing code, in order to evaluate whether this positively influences code comprehension. It is based on the implementation of a prototype that uses purity information from existing tools and transforms this information into readable Javadoc. To measure the influence on code comprehension, this thesis presents an evaluation of the implemented prototype through a quantitative empirical study with 14 participants. The results show that having the information available can improve code comprehension in certain cases, but also that small details can lead to negative influences.
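A minimal sketch of the documentation step described above, i.e., rendering purity facts as a Javadoc fragment; the input format and wording are our assumptions, not the prototype's actual interface:

```python
def purity_javadoc(pure: bool, side_effects: list) -> str:
    """Render purity facts from an analysis tool as a Javadoc comment."""
    lines = ["/**", " * Purity: %s." % ("pure" if pure else "impure")]
    for effect in side_effects:  # e.g., "writes field this.cache"
        lines.append(" * Side effect: %s." % effect)
    lines.append(" */")
    return "\n".join(lines)

# Example: documentation for an impure put(K, V) method.
print(purity_javadoc(False, ["writes field this.cache",
                             "may perform I/O via logger"]))
```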
Gerald Schermann, Jürgen Cito, Philipp Leitner, Continuous Experimentation: Challenges, Implementation Techniques, and Current Research, IEEE Software, Vol. 35 (2), 2018. (Journal Article)
Carol Alexandru, Harald Gall, Of Cyborg Developers and Big Brother Programming AI, In: The Hawaii International Conference on System Sciences, s.n., Hawaii, USA, 2018-01-03. (Conference or Workshop Paper published in Proceedings)
The main reason modern machine learning mechanisms outperform hand-crafted solutions is the availability of high-quality data in large quantities. We observe that although many day-to-day activities in software engineering (such as bug triaging, reverting regressions, or even implementing code for properly scoped problems) could possibly be automated, we lack the necessary monitoring tools to capture all relevant information. Bug trackers and version control rely mostly on plain text, and specifications are informal or at best semi-structured. After setting the stage via a short excursion to the year 2047, we discuss how a ubiquitous AI, which can learn from every interaction a human developer has with a machine, could take over more and more of the mundane responsibilities in software engineering. We outline how this change will affect software engineering practice as well as education.