Patrick Knab, Martin Pinzger, Abraham Bernstein, Predicting Defect Densities in Source Code Files with Decision Tree Learners, In: MSR '06: Proceedings of the 2006 International Workshop on Mining Software Repositories, ACM, New York, NY, USA, May 2006. (Conference or Workshop Paper)

With the advent of open source software repositories, the data available for defect prediction in source files has increased tremendously. Although traditional statistics have been shown to produce reasonable results, the sheer amount of data and the problem context of defect prediction demand sophisticated analysis such as that provided by current data mining and machine learning techniques. In this work we focus on defect density prediction and present an approach that applies a decision tree learner to evolution data extracted from the Mozilla open source web browser project. The evolution data includes different source code, modification, and defect measures computed from seven recent Mozilla releases. Among the modification measures we also take into account change coupling, a measure of the number of change dependencies between source files. The main reason for choosing decision tree learners, instead of, for example, neural nets, was the goal of finding underlying rules that can be easily interpreted by humans. To find these rules, we set up a number of experiments to test common hypotheses regarding defects in software entities. Our experiments showed that a simple tree learner can produce good results with various sets of input data. |
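As an illustration of the kind of learner this abstract refers to, the following is a minimal sketch of a regression stump (a one-level decision tree) that splits files on a single evolution metric. The metric, the data values, and the function name are invented for illustration; they are not the features or the learner configuration used in the paper.

```python
# Regression-stump sketch: find the threshold on one evolution metric
# (a hypothetical "lines changed" count) that best separates low- from
# high-defect-density files by minimising summed squared error.

def best_split(samples):
    """samples: list of (metric_value, defect_density) pairs.
    Returns the split threshold with the lowest summed squared error."""
    best_err, best_threshold = float("inf"), None
    for t in sorted({v for v, _ in samples}):
        left = [d for v, d in samples if v <= t]
        right = [d for v, d in samples if v > t]
        if not left or not right:
            continue  # a split must put samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((d - left_mean) ** 2 for d in left)
               + sum((d - right_mean) ** 2 for d in right))
        if err < best_err:
            best_err, best_threshold = err, t
    return best_threshold

# Toy data: (lines changed, defects per kLOC) for four files.
files = [(5, 0.1), (8, 0.2), (120, 1.5), (200, 1.8)]
threshold = best_split(files)  # splits the two low-churn files off
```

A full tree learner applies this split search recursively to each resulting partition.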
|
Tobias Sager, Abraham Bernstein, Martin Pinzger, Christoph Kiefer, Detecting Similar Java Classes Using Tree Algorithms, In: Proceedings of the International Workshop on Mining Software Repositories, ACM, Shanghai, China, May 2006. (Conference or Workshop Paper)

Similarity analysis of source code is helpful during development to provide, for instance, better support for code reuse. Consider a development environment that analyzes code while typing and suggests similar code examples or existing implementations from a source code repository. Mining software repositories by means of similarity measures enables and encourages reuse of existing code and reduces development effort by creating a shared knowledge base of code fragments. In information retrieval, similarity measures are often used to find documents similar to a given query document. This paper extends this idea to source code repositories. It introduces our approach to detecting similar Java classes in software projects using tree similarity algorithms. We show how our approach finds similar Java classes based on an evaluation of three tree-based similarity measures in the context of five user-defined test cases as well as a preliminary software evolution analysis of a medium-sized Java project. Initial results of our technique indicate that it (1) is indeed useful for identifying similar Java classes, (2) successfully identifies the ex-ante and ex-post versions of refactored classes, and (3) provides some interesting insights into within-version and between-version dependencies of classes within a Java project. |
|
Abraham Bernstein, Christoph Kiefer, Imprecise RDQL: Towards Generic Retrieval in Ontologies Using Similarity Joins, In: 21st Annual ACM Symposium on Applied Computing (ACM SAC 2006), ACM, New York, NY, USA, April 2006. (Conference or Workshop Paper)
 
Traditional semantic web query languages support a logic-based access to the semantic web. They offer retrieval (or reasoning) of data based on facts. On the traditional web and in databases, however, exact querying often provides an incomplete answer, as queries are over-specified or the mix of multiple ontologies/modelling differences requires ``interpretational flexibility.'' Therefore, similarity measures or ranking approaches are frequently used to extend the reach of a query. This paper extends this idea to the semantic web. It introduces iRDQL---a semantic web query language with support for similarity joins. It is an extension of RDQL (RDF Data Query Language) that enables its users to query for similar resources, ranking the results using a similarity measure. We show how iRDQL extends the reach of a query by finding additional results. We quantitatively evaluated four similarity measures for their usefulness in iRDQL in the context of an OWL-S semantic web service retrieval test collection and compared the results to a specialized OWL-S matchmaker. Initial results of using iRDQL indicate that it is indeed useful for extending the reach of queries and that it is able to improve recall without overly sacrificing precision. We also found that our generic iRDQL approach was only slightly outperformed by the specialized algorithm. |
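The similarity-join idea behind iRDQL can be sketched as follows: rather than keeping only exact matches, rank all candidate resources by a similarity measure and return the top-k. The resource labels below and the use of difflib's edit-based ratio are illustrative stand-ins for RDF resources and for iRDQL's pluggable similarity measures; this is not the iRDQL implementation.

```python
import difflib

def similarity_join(query_label, candidates, k=3):
    """Rank candidate labels by string similarity to the query label
    and return the k best matches (an imprecise, ranked 'join')."""
    scored = [(difflib.SequenceMatcher(None, query_label, c).ratio(), c)
              for c in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]

# Toy 'repository' of service names; an exact query for "HotelService"
# would return nothing, while the similarity join still ranks candidates.
services = ["CityHotelService", "HotelBookingService", "WeatherService"]
ranked = similarity_join("HotelService", services)
```

In iRDQL the ranking score additionally flows back into the result set, so users can threshold or inspect it.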
|
Adrian Bachmann, Indoornavigation mittels Ortsinterpolation, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
Satellite navigation is ubiquitous in our daily life. Unfortunately, satellite navigation signals sometimes cannot be received. As a result, a new stream of research has emerged focusing on alternatives, especially in the area of indoor navigation. In this diploma thesis a new approach is developed and presented that derives navigation information from accelerometer and magnetometer sensor data in a novel way. The new approach overcomes the shortcoming of insufficient calibration, one of the major issues in current research. The main contribution of this work is an online calibration framework that adapts to changing boundary conditions. The result is a much more robust, precise, and up-to-date basis for the path extrapolation. |
|
Peter Vorburger, Abraham Bernstein, Entropy-based Concept Shift Detection, In: IEEE International Conference on Data Mining (ICDM), March 2006. (Conference or Workshop Paper)
 
|
|
Abraham Bernstein, Peter Vorburger, A Scenario-Based Approach for Direct Interruptability Prediction on Wearable Devices, Journal of Pervasive Computing and Communications, Vol. 3 (4), 2006. (Journal Article)
 
People are subjected to a multitude of interruptions. This situation is likely to get worse as technological devices are making us increasingly reachable. In order to manage the interruptions it is imperative to predict a person’s interruptability - his/her current readiness or inclination to be interrupted. In this paper we introduce the approach of direct interruptability inference from sensor streams (accelerometer and audio data) in a ubiquitous computing setup and show that it provides highly accurate and robust predictions. Furthermore, we argue that scenarios are central for evaluating the performance of ubiquitous computing devices (and interruptability-predicting devices in particular) and demonstrate this on our setup. We also show that scenarios provide the foundation for avoiding misleading results, for assessing the results’ generalizability, and for a stratified scenario-based learning model, which greatly speeds up the training of such devices. |
|
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit, In: 10th International Conference on Extending Database Technology (EDBT 2006), Springer, March 2006. (Conference or Workshop Paper)
 
Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. This paper presents the SOQA-SimPack Toolkit (SST), an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies. |
|
Markus Stocker, The Fundamentals of iSPARQL, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)

The growing amount of semantically annotated data and published ontologies opens an interesting and challenging application for similarity measures. In the face of limited knowledge about the distributed data on the Semantic Web, similarity measures enable a retrieval that is expected to improve performance compared to exact querying. This thesis presents iSPARQL, an extension to SPARQL which allows querying for similar resources in both RDF/RDFS and OWL ontologies, and supports the development of strategies to compute the similarity of ontological resources. Huge data volumes forced the development of query optimization techniques for relational database systems. Query engines for ontological data based on graph models, however, mostly execute user queries without considering any optimization. Especially for large ontologies, optimization techniques are required to ensure that query results can be delivered within reasonable time. OptARQ is a first prototype for iSPARQL query optimization based on the concept of triple pattern selectivity estimation. The evaluation we conduct demonstrates how reordering triple patterns according to their selectivity affects the query execution performance. |
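The selectivity-based optimization this abstract mentions can be sketched roughly as follows: estimate, for each triple pattern, what fraction of the data it is likely to match, and execute the most selective pattern first. The predicate-frequency table and the numbers below are made up for illustration; OptARQ's actual selectivity estimation is more elaborate.

```python
# Toy statistics: how many triples use each predicate in the dataset.
predicate_counts = {"rdf:type": 50000, "foaf:name": 12000, "ex:isbn": 300}
TOTAL_TRIPLES = 62300

def selectivity(pattern):
    """Estimate the fraction of triples a (s, p, o) pattern matches.
    A pattern with an unknown or variable predicate matches everything."""
    _, predicate, _ = pattern
    return predicate_counts.get(predicate, TOTAL_TRIPLES) / TOTAL_TRIPLES

def reorder(patterns):
    """Execute the most selective (lowest-estimate) pattern first."""
    return sorted(patterns, key=selectivity)

query = [("?book", "rdf:type", "ex:Book"),
         ("?book", "ex:isbn", "?isbn"),
         ("?book", "foaf:name", "?title")]
ordered = reorder(query)  # rare ex:isbn pattern moves to the front
```

Evaluating the rare pattern first shrinks the intermediate result set that the remaining patterns must be joined against.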
|
Iwan Stierli, Entropy-based, Semi-dynamic Regulation of Incremental Algorithms in the case of Instantaneous Concept Drifts, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
Incremental classifiers build their prediction rules from the known instances of a continuous data stream. As these algorithms learn from correct and incorrect predictions, their performance improves the more instances their rules are based on. In the case of an instantaneous concept drift, this assumption no longer holds, as the old concept’s instances falsify the rules that are to be built. It would therefore be ideal to forget these old instances. This thesis attempts to regulate this forgetting rate accurately by using an adapted form of the entropy term. First, a simple linear correlation between the entropy and the forgetting rate is ruled out. A second, semi-dynamic and noise-resistant switching strategy is then pursued. It is tested on a synthetic data set and compared with the applicable benchmarks according to two different quality measures. |
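The entropy term driving such a forgetting rate can be illustrated with a sliding window over class labels: a jump in label entropy after an instantaneous drift is a signal that older instances should be forgotten faster. The window contents below are invented toy data; this is plain Shannon entropy, not the thesis's adapted entropy formulation.

```python
import math
from collections import Counter

def window_entropy(labels):
    """Shannon entropy (in bits) of the class labels in a window."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A stable stream is dominated by one class (low entropy); right after
# an instantaneous drift the window mixes old and new concept labels
# (entropy rises toward its maximum of 1 bit for two classes).
stable = ["a"] * 9 + ["b"]
drifting = ["a"] * 5 + ["b"] * 5
```

A regulation scheme could map this entropy value onto the rate at which old instances are dropped from the model.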
|
Beat Sprenger, Semantic Crystal. Ein End-User-Interface zur Unterstützung von Ontologie-Abfragen mit SPARQL, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
Although standards for structuring semantic information have been available for a long time (XML, RDF, OWL, SPARQL), most typical users have not yet encountered the semantic web. There are mainly two reasons for this: there are not many semantic knowledge bases available that would be interesting to query, and there is not much software to support such queries either. This research paper therefore presents Semantic Crystal, a prototype of such query software. Semantic Crystal displays classes and relationships from OWL knowledge bases graphically and also supports the construction of queries graphically. Although the query is ultimately expressed in the intricate language SPARQL, it remains comprehensible to a broad range of users thanks to its graphical representation. |
|
Valentina Shcherba, An Experimental Analysis of Productivity in Ubiquitous Computing, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)

Mobile devices such as PDAs and smartphones are finding their way into ever more areas of our private and business lives. It is thereby automatically assumed that this new technology increases productivity. But is this really the case? This diploma thesis attempts to identify the effects of ubiquitous computing on labor productivity and offers a theoretical concept for researching and analyzing the changes in intra-corporate productivity (labor productivity) when ubiquitous computing is implemented. Furthermore, a concept is provided that can serve as a basis for designing further experimental setups, both for analyzing the impact of ubiquitous computing on productivity and for analyzing the impact of other IT tools. On the basis of this concept, a design for an experiment is prepared, which will be realized in the near future and will provide data for a statement about the impact of ubiquitous computing on labor productivity. |
|
Michael Polli, Untersuchung von Regime Shifts mittels Data Mining Methoden, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)

The goal of this thesis is the detection and identification of simultaneous and time-shifted regime shifts in economic data. To achieve this objective, artificial and economic datasets were created. The dependencies between their features and their inner structures are described. As the economic dataset needed to be prepared for the later analysis, all the necessary steps are explained. The problem of regime shift detection is a new field in data mining research. For this reason, tools are described which can handle related problems, and in addition new methods are introduced. Their results on the datasets are presented and illustrated. It is shown that these methods are capable of identifying simultaneous regime shifts. Although the time-shifted problem could not be solved, a substantial understanding of the underlying problem was built up. |
|
Lukas Kern, A Distributable Data Management Layer for Semantic Web Applications, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
This thesis introduces a generic data management framework capable of dealing with distributed knowledge represented in Semantic Web languages. Persistent data storage, data querying, retrieval, annotation, versioning, and security in terms of authentication and authorization are key features and are thoroughly discussed with regard to traditional software principles such as distribution, openness, robustness, and scalability. This work emerged from a process support system project named NExT, whose architecture called for a novel data management framework. First, the envisioned framework's specific requirements are determined. In the second part, appropriate overall concepts are elaborated and a complete system architecture is subsequently presented. The thesis closes with the presentation of a reference implementation that can be used by the NExT system. The implementation furthermore demonstrates the architecture's feasibility as a proof-of-concept prototype. |
|
Daniel Imhof, Framework für ein dynamisches Telefonsteuerungssystem, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
At work, one is exposed to various sources of disruption. The goal of this thesis is to reduce these disturbances. Earlier work has already shown that the interruptability degree of a person can be determined from sensor data with the assistance of data mining algorithms. However, these values could only be obtained after the fact. Therefore, a framework has to be developed that can manage the whole procedure in real time. In addition to programming the software components, the whole application also has to be evaluated for its suitability for daily use. |
|
Katrin Hunt, Evaluation of Novel Algorithms to Optimize Risk Stratification Scores in Myocardial Infarction, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
Risk predictors currently used in the field of Acute Myocardial Infarction (AMI) were developed on data cohorts collected in the early 1990s using traditional statistical methods. Considering the progress in the therapy of AMI as well as in the field of data mining, it was hypothesized that a better risk predictor could be developed. Working on the AMIS PLUS registry (n=7520), existing scores were evaluated and a new risk prediction model was developed using the AODE algorithm from the Bayesian family. The most accepted risk score (the TIMI Risk Score for ST-Elevation) yielded an area under the ROC curve (AUC) of 0.803. The newly developed risk model, called the AMIS Model, achieved an AUC of 0.875 using fewer input variables. Tests showed that the prediction capacity of the AMIS Model was especially good for patients undergoing PCI treatment (AUC=0.885 compared to AUC=0.783 for the TIMI Risk Score). |
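The area under the ROC curve (AUC) used to compare the models is equivalent to the probability that a randomly chosen positive case (here, a non-survivor) receives a higher risk score than a randomly chosen negative case. The following pair-counting sketch computes it directly from that definition; the risk scores and outcomes are made-up toy values, not AMIS data.

```python
def auc(scores, labels):
    """AUC via pairwise comparison: fraction of (positive, negative)
    pairs in which the positive case has the higher score.
    Ties count as half a win. labels: 1 = event, 0 = no event."""
    wins, total = 0.0, 0
    for s_pos, l_pos in zip(scores, labels):
        if l_pos != 1:
            continue
        for s_neg, l_neg in zip(scores, labels):
            if l_neg != 0:
                continue
            total += 1
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / total

# Toy cohort: predicted risk scores and observed in-hospital outcomes.
risk = [0.9, 0.8, 0.4, 0.3, 0.2]
died = [1, 1, 0, 1, 0]
result = auc(risk, died)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why the jump from 0.803 to 0.875 reported above is substantial.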
|
Markus Graf, Ausarbeitung und Umsetzung einer Experimentalumgebung zur Benutzeranalyse in mobilen Umgebungen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
In our daily lives, mobile communication devices have attained very high and still growing significance. They give us the ability to reach anybody, anywhere, at any time. Consequently, even insignificant calls reach and disturb us in unfavorable situations. The goal of this thesis is to determine the key parameters for a long-term experiment to gather data on user behavior. The analysis of these data may help to predict the interruptability level of a user and, subsequently, to take appropriate measures. These key parameters are determined on the one hand by the experimental method and on the other hand by the features of the implementation environment. An application has been created in order to assess the feasibility of the experiment setup. |
|
Lorenz Fischer, NLP-Reduce Ein natürlichsprachliches Suchsystem für “Semantic Web”-Daten, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
The Semantic Web is the vision of linking various databases and enriching them with semantic information. Users are able to query those databases using formal query languages. The major goal of the Semantic Web activities is to provide a collection of data that can be searched by intelligent systems rather than by simple full-text search engines. The main focus in the development of the Semantic Web therefore lies in making semantic data accessible to machines. Most people, however, are not used to querying databases through formal languages. A possible solution to this problem is the use of natural language processing systems. This thesis presents a system capable of translating natural language sentences into SPARQL, a formal query language for OWL ontologies. The developed conceptual design has been implemented and evaluated in a Java prototype. |
|
Marc Eichenberger, User-Interfaces für die Vermittlung von höherwertigen Kontextinformationen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
This diploma thesis deals with the research field of "Context Awareness". While common research focuses on determining context per se, delivering context information to the environment has received less attention. To increase productivity, it is very important to consider the delivery and use of this information. The goal of this thesis is to make context information available in a very intuitive way. The thesis is applied to an office-based setup. The derived context information corresponds to the availability or interruptability of the person in the office. This information is passed on to visitors. The developed application is based on current research findings about user interface design and usability. |
|
Vijay Victor D'Silva, Widening for Automata, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
Computing fixpoints of increasing sequences of sets is an important problem in many areas of computer science including algorithmic verification, program analysis, inductive inference and systems biology. For most problems, the fixpoint computation does not terminate, so an approximate solution has to be found. Widening is a technique to compute an over-approximation of an infinite, increasing sequence of sets. In this thesis, we present a framework for constructing widening operators for fixpoint computations over sets represented as automata. Many widening operators for automata that appear in the literature are instances of our framework. Moreover, two inductive inference algorithms in the literature naturally fall out as instances of this framework. We identify general criteria that characterise the effect of widening and use these criteria to study various properties of widening operators. We also provide several new results and generalise existing results about widening operators and inductive inference algorithms. Finally, we show how a widening operator defined in our framework can be combined with algorithms for automated verification of infinite state systems and provide a heuristic for generating counterexamples if verification fails. |
|
Marcel Camporelli, Using a Bayesian Classifier for Probability Estimation: Analysis of the AMIS Score for Risk Stratification in Myocardial Infarction, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
 
A recent publication presented the AMIS model, a novel model for risk prediction in acute myocardial infarction (AMI). The model proposed therein is based on AODE, a probabilistic Bayesian classifier. It outperforms TIMI, a widely accepted prediction model in the field, when classifying patients on their expected in-hospital survival or non-survival. It was hypothesized that the score which serves as the basis for the classification could be used as a probability estimator, allowing a more fine-grained stratification of patients into different mortality classes. An evaluation method for the fit of probabilistic models is developed and applied to the AMIS model. In the evaluation, the AMIS model clearly outperforms TIMI as a risk estimator. |
|