C Bird, A Bachmann, E Aune, J Duffy, Abraham Bernstein, V Filkov, P Devanbu, Fair and balanced? Bias in bug-fix datasets, In: ESEC/FSE '09: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering on European software engineering conference and foundations of software engineering, 2009-08. (Conference or Workshop Paper published in Proceedings)
 
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems and code version histories record when, how and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect: only a fraction of bug fixes are actually labelled in source code version histories and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data. |
|
K Reinecke, Abraham Bernstein, Tell me where you've lived, and I'll tell you what you like: adapting interfaces to cultural preferences, In: User Modeling, Adaptation, and Personalization (UMAP), 2009-06. (Conference or Workshop Paper published in Proceedings)
 
|
|
Thomas Scharrenbach, Abraham Bernstein, On the evolution of ontologies using probabilistic description logics, In: First ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web, 2009-06. (Conference or Workshop Paper published in Proceedings)
 
Exceptions play an important role in conceptualizing data, especially when new knowledge is introduced or existing knowledge changes. Furthermore, real-world data is often contradictory and uncertain. Current formalisms for conceptualizing data, such as Description Logics, rely upon first-order logic. As a consequence, they are poor at addressing exceptional, inconsistent and uncertain data, in particular when evolving the knowledge base over time. This paper investigates the use of Probabilistic Description Logics as a formalism for the evolution of ontologies that conceptualize real-world data. Different scenarios are presented for the automatic handling of inconsistencies during ontology evolution. |
|
Abraham Bernstein, Jiwen Li, From active towards InterActive learning: using consideration information to improve labeling correctness, In: Human Computation Workshop, 2009-06. (Conference or Workshop Paper published in Proceedings)
 
Active learning methods have been proposed to reduce the labeling effort of human experts: based on the initially available labeled instances and information about the unlabeled data, these algorithms choose only the most informative instances for labeling. They have been shown to significantly reduce the size of the labeled dataset required to generate a precise model [17]. However, the active learning framework assumes "perfect" labelers, which is not true in practice (e.g., [22, 23]). In particular, an empirical study on hand-written digit recognition [5] has shown that active learning works poorly when a human labeler is used. Thus, as active learning enters the realm of practical applications, it will need to confront the practicalities and inaccuracies of human expert decision-making. Specifically, active learning approaches will have to deal with the problem that human experts are likely to make mistakes when labeling the selected instances. |
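As a rough illustration of the selection step described above, the following sketch shows margin-based uncertainty sampling, one common active learning strategy; the variable names and the use of scikit-learn are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, batch_size=10):
    """Return indices of the unlabeled instances the model is least certain about."""
    proba = model.predict_proba(X_pool)
    sorted_proba = np.sort(proba, axis=1)
    # Small margin between the two most probable classes = high uncertainty.
    margin = sorted_proba[:, -1] - sorted_proba[:, -2]
    return np.argsort(margin)[:batch_size]

# Hypothetical usage: fit on the small labeled seed set, then ask the
# (possibly fallible) human expert to label only the selected instances.
# model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
# query_indices = uncertainty_sampling(model, X_pool)
```

The paper's point is precisely that the labels returned for such queries may be wrong, which standard active learning does not account for.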
|
Jonas Tappolet, Abraham Bernstein, Applied temporal RDF: efficient temporal querying of RDF data with SPARQL, In: 6th European Semantic Web Conference (ESWC), 2009-06. (Conference or Workshop Paper published in Proceedings)
 
Many applications operate on time-sensitive data. Some of these data are only valid for certain intervals (e.g., job assignments, versions of software code), others describe temporal events that happened at certain points in time (e.g., a person's birthday). Until recently, the only way to incorporate time into Semantic Web models was as a datatype property. Temporal RDF, however, considers time as an additional dimension in the data, preserving the semantics of time. In this paper we present a syntax and storage format based on named graphs to express temporal RDF. Given the restriction to preexisting RDF syntax, our approach can perform any temporal query using standard SPARQL syntax only. For convenience, we introduce a shorthand format called t-SPARQL for temporal queries and show how t-SPARQL queries can be translated to standard SPARQL. Additionally, we show that, depending on the nature of the underlying data, the temporal RDF approach vastly reduces the number of triples by eliminating redundancies, resulting in increased performance for processing and querying. Last but not least, we introduce a new indexing method that can significantly reduce the time needed to execute time-point queries (e.g., what happened on January 1st). |
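A minimal sketch of how such named-graph-based temporal data can be queried with standard SPARQL, here using Python and rdflib; the vocabulary (ex:validFrom, ex:validTo) and graph names are hypothetical, and the paper's actual t-SPARQL syntax and indexing scheme are not reproduced.

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # hypothetical vocabulary
ds = Dataset()

# Each validity interval becomes a named graph; the interval bounds are
# attached to the graph's name in the default graph.
g = ds.graph(URIRef("http://example.org/interval1"))
g.add((EX.alice, EX.worksFor, EX.acme))
ds.add((g.identifier, EX.validFrom, Literal("2001-01-01", datatype=XSD.date)))
ds.add((g.identifier, EX.validTo, Literal("2004-12-31", datatype=XSD.date)))

# A time-point query ("what held on 2003-06-01?") in plain SPARQL.
query = """
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?p ?o WHERE {
  GRAPH ?g { ?s ?p ?o }
  ?g ex:validFrom ?from ; ex:validTo ?to .
  FILTER (?from <= "2003-06-01"^^xsd:date && ?to >= "2003-06-01"^^xsd:date)
}
"""
for row in ds.query(query):
    print(row.s, row.p, row.o)
```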
|
Amancio Bouza, G Reif, Abraham Bernstein, Probabilistic partial user model similarity for collaborative filtering, In: 1st International Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (IRMLeS2009) at the 6th European Semantic Web Conference (ESWC2009), 2009-06-01. (Conference or Workshop Paper published in Proceedings)
 
Recommender systems play an important role in helping people find items they like. One type of recommender system is user-based collaborative filtering. Its fundamental assumption is that people who share similar preferences for common items will behave similarly in the future. The similarity of user preferences is computed globally on commonly rated items, so that partial preference similarities may be missed. Consequently, valuable ratings of partially similar users are ignored. Furthermore, two users may even have similar preferences, but the set of commonly rated items is too small to infer preference similarity. We propose, first, an approach that computes user preference similarities based on learned user preference models and, second, a method to compute partial user preference similarities based on partial user model similarities. For users with few commonly rated items, we show that user similarity based on preferences significantly outperforms user similarity based on commonly rated items. |
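As a back-of-the-envelope contrast between the two notions of similarity, the sketch below compares classic similarity over commonly rated items with similarity computed from learned preference models; the function names and the use of simple correlation are illustrative assumptions, and the paper's partial (per-submodel) similarity computation is not reproduced.

```python
import numpy as np

def similarity_common_items(ratings_a, ratings_b):
    """Classic user-based CF: Pearson correlation over commonly rated items."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # too few commonly rated items to infer anything
    a = np.array([ratings_a[i] for i in common], dtype=float)
    b = np.array([ratings_b[i] for i in common], dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def similarity_preference_models(model_a, model_b, item_features):
    """Model-based variant: compare the ratings that two learned user preference
    models predict for the same items, regardless of what was actually rated."""
    pred_a = np.array([model_a(f) for f in item_features], dtype=float)
    pred_b = np.array([model_b(f) for f in item_features], dtype=float)
    if pred_a.std() == 0 or pred_b.std() == 0:
        return 0.0
    return float(np.corrcoef(pred_a, pred_b)[0, 1])
```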
|
Stefan Amstein, Evaluation und Evolution von Pattern-Matching-Algorithmen zur Betrugserkennung, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)
 
Fraud detection often involves the analysis of large data sets originating from private companies or governmental agencies by means of artificial intelligence (such as data mining), but there are also pattern-matching approaches.
ChainFinder, an algorithm for graph-based pattern matching, is capable of detecting transaction chains within financial data that could indicate fraudulent behavior. In this work, relevant measurements of correctness and performance are acquired in order to evaluate and evolve the given implementation of ChainFinder. A series of tests on both synthetic and more realistic datasets is conducted and the results are discussed. Along the way, a number of derivative ChainFinder implementations emerged and are compared to each other.
Throughout this process, an evaluation framework application was developed to assist the evaluation of similar algorithms by providing a degree of automation.
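Since the ChainFinder implementation itself is not public, the following toy sketch only illustrates what detecting transaction chains in financial data could look like (following money from account to account within a time window); the data, parameters and logic are purely illustrative and are not the algorithm evaluated in this thesis.

```python
from collections import defaultdict

# Toy transactions: (sender, receiver, amount, day). Purely illustrative data.
transactions = [
    ("A", "B", 100, 1),
    ("B", "C", 95, 2),
    ("C", "D", 90, 3),
    ("X", "Y", 50, 1),
]

def find_chains(txs, min_length=3, max_gap=5):
    """A chain is a sequence of transactions where each one leaves the account
    the previous one entered, within `max_gap` days."""
    outgoing = defaultdict(list)
    for t in txs:
        outgoing[t[0]].append(t)

    chains = []

    def extend(chain):
        last = chain[-1]
        extended = False
        for nxt in outgoing[last[1]]:  # transactions leaving the receiving account
            if 0 < nxt[3] - last[3] <= max_gap:
                extend(chain + [nxt])
                extended = True
        if not extended and len(chain) >= min_length:
            chains.append(chain)

    for t in txs:
        extend([t])
    return chains

for chain in find_chains(transactions):
    print(" -> ".join(f"{s}->{r} ({amt})" for s, r, amt, _ in chain))
```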
|
|
J Ekanayake, Jonas Tappolet, H C Gall, Abraham Bernstein, Tracking concept drift of software projects using defect prediction quality, In: 6th IEEE Working Conference on Mining Software Repositories, 2009-05. (Conference or Workshop Paper published in Proceedings)
 
Defect prediction is an important task in the mining of software repositories, but the quality of predictions varies strongly within and across software projects. In this paper we investigate why the prediction quality fluctuates so strongly, attributing it to the changing nature of the bug (or defect) fixing process. To this end, we adopt the notion of concept drift, which denotes that a defect prediction model has become unsuitable because the set of influencing features has changed, usually due to a change in the underlying bug generation process (i.e., the concept). We explore four open source projects (Eclipse, OpenOffice, Netbeans and Mozilla) and construct file-level and project-level features for each of them from their respective CVS and Bugzilla repositories.
We then use this data to build defect prediction models and visualize the prediction quality along the time axis. These visualizations allow us to identify concept drifts and, as a consequence, phases of stability and instability expressed in the level of defect prediction quality. Further, we identify those project features that influence the defect prediction quality, using both a tree-induction algorithm and a linear regression model. Our experiments reveal that software systems are subject to considerable concept drifts in their evolution history. Specifically, we observe that the change in the number of authors editing a file and in the number of defects fixed by them contributes to a project's concept drift and therefore influences the defect prediction quality.
Our findings suggest that project managers using defect prediction models for decision making should be aware of the current phase of stability or instability due to a potential concept drift. |
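To make the evaluation setup concrete, the sketch below trains a defect prediction model on one time window and evaluates it on the next, yielding the kind of quality-over-time curve in which drifts become visible; the table layout, column names and choice of classifier are assumptions, not the paper's exact setup.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def prediction_quality_over_time(df, feature_cols, time_col="period", label_col="defective"):
    """Train on window t, test on window t+1; drops in AUC hint at concept drift."""
    periods = sorted(df[time_col].unique())
    scores = {}
    for train_p, test_p in zip(periods, periods[1:]):
        train = df[df[time_col] == train_p]
        test = df[df[time_col] == test_p]
        # AUC is undefined (or the model degenerate) with only one class present.
        if train[label_col].nunique() < 2 or test[label_col].nunique() < 2:
            continue
        model = DecisionTreeClassifier(max_depth=4, random_state=0)
        model.fit(train[feature_cols], train[label_col])
        pred = model.predict_proba(test[feature_cols])[:, 1]
        scores[test_p] = roc_auc_score(test[label_col], pred)
    # Plot this series along the time axis to see phases of (in)stability.
    return pd.Series(scores)
```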
|
D J Kurz, Abraham Bernstein, K Hunt, D Radovanovic, P Erne, Z Siudak, O Bertel, Simple point of care risk stratification in acute coronary syndromes: the AMIS model, Heart, Vol. 95 (8), 2009. (Journal Article)
 
Background: Early risk stratification is important in the management of patients with acute coronary syndromes (ACS).
Objective: To develop a rapidly available risk stratification tool for use in all ACS.
Design and methods: Application of modern data mining and machine learning algorithms to a derivation cohort of 7520 ACS patients included in the AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry between 2001 and 2005; prospective model testing in two validation cohorts.
Results: The most accurate prediction of in-hospital mortality was achieved with the “Averaged One-Dependence Estimators” (AODE) algorithm, with input of 7 variables available at first patient contact: age, Killip class, systolic blood pressure, heart rate, pre-hospital cardio-pulmonary resuscitation, history of heart failure, and history of cerebrovascular disease. The c-statistic for the derivation cohort (0.875) was essentially maintained in important subgroups, and calibration over five risk categories, ranging from <1% to >30% predicted mortality, was accurate. Results were validated prospectively against an independent AMIS-Plus cohort (n=2854, c-statistic 0.868) and the Krakow-Region ACS Registry (n=2635, c-statistic 0.842). The AMIS model significantly outperformed established “point-of-care” risk prediction tools in both validation cohorts. In comparison to a logistic regression-based model, the AODE-based model proved to be more robust when tested on the Krakow validation cohort (c-statistic 0.842 vs. 0.746). Accuracy of the AMIS model prediction was maintained at 12-month follow-up in an independent cohort (n=1972, c-statistic 0.877).
Conclusions: The AMIS model is a reproducibly accurate point-of-care risk stratification tool for the complete range of ACS, based on variables available at first patient contact. |
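For readers unfamiliar with the algorithm, the sketch below shows the core of Averaged One-Dependence Estimators for categorical features; it is a simplified illustration (add-one smoothing, no frequency threshold, no discretization of continuous inputs such as age or blood pressure) and not the model fitted on the AMIS-Plus registry.

```python
import numpy as np
from collections import Counter

class SimpleAODE:
    """Averaged One-Dependence Estimators, minimal version for categorical data."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.n_ = len(y)
        self.joint_ = Counter()  # counts of (class, feature index, value)
        self.pair_ = Counter()   # counts of (class, i, x_i, j, x_j)
        for row, c in zip(X, y):
            for i, xi in enumerate(row):
                self.joint_[(c, i, xi)] += 1
                for j, xj in enumerate(row):
                    self.pair_[(c, i, xi, j, xj)] += 1
        return self

    def predict_proba(self, X):
        probs = []
        for row in np.asarray(X):
            scores = []
            for c in self.classes_:
                total = 0.0
                for i, xi in enumerate(row):
                    # P(c, x_i) * prod_j P(x_j | c, x_i), averaged over super-parents i
                    parent = self.joint_[(c, i, xi)]
                    p = (parent + 1) / (self.n_ + 2)
                    for j, xj in enumerate(row):
                        p *= (self.pair_[(c, i, xi, j, xj)] + 1) / (parent + 2)
                    total += p
                scores.append(total)
            scores = np.array(scores)
            probs.append(scores / scores.sum())
        return np.array(probs)
```

The reported c-statistic corresponds to the area under the ROC curve of such predicted probabilities against observed in-hospital mortality.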
|
Michael Imhof, Optimization strategies for RDFS-aware data storage, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)

Indexing and storing triple-based Semantic Web data in a way that allows for efficient query processing has long been a difficult task. A recent approach to address this issue is the indexing scheme Hexastore. In this work, we propose two novel on-disk storage models for Hexastore that use RDF Schema information to group data that semantically belong together and store them contiguously. In the clustering approach, elements of the same classes are stored contiguously within the indices. In the subindex approach, data of the same categories are saved in separate subindices. In this way, we expect to simplify and accelerate the retrieval process of Hexastore. The experimental evaluation shows a clear advantage of the standard storage model over the proposed approaches in terms of index creation time and required disk space. |
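As background on the indexing scheme being optimized, the following toy in-memory sketch shows the basic Hexastore idea of maintaining one index per ordering of subject, predicate and object; the on-disk clustering and subindex layouts proposed in the thesis are not modelled, and all names are illustrative.

```python
from collections import defaultdict
from itertools import permutations

class MiniHexastore:
    """Toy in-memory Hexastore: one nested index per ordering of (s, p, o)."""

    ORDERS = list(permutations("spo"))  # spo, sop, pso, pos, osp, ops

    def __init__(self):
        self.indices = {order: defaultdict(lambda: defaultdict(set)) for order in self.ORDERS}

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            first, second, third = (triple[k] for k in order)
            self.indices[order][first][second].add(third)

    def objects(self, s, p):
        """All objects for a given subject and predicate: a single index lookup."""
        return self.indices[("s", "p", "o")][s][p]

store = MiniHexastore()
store.add("ex:alice", "rdf:type", "ex:Person")
store.add("ex:alice", "ex:knows", "ex:bob")
print(store.objects("ex:alice", "ex:knows"))  # {'ex:bob'}
```

An RDFS-aware layout would additionally place the entries of, e.g., all ex:Person subjects next to each other on disk (clustering) or in a subindex of their own.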
|
Adrian Bachmann, Abraham Bernstein, Data Retrieval, Processing and Linking for Software Process Data Analysis, No. IFI-2009.0003b, Version: 1, 2009. (Technical Report)
 
Many projects in the mining software repositories community rely on software process data gathered from bug tracking databases and the commit log files of version control systems. These data are then used to predict defects, gain insight into a project's life-cycle, and for other tasks. In this technical report we introduce the software systems which hold such data. Furthermore, we present our approach for retrieving, processing and linking these data. Specifically, we first introduce the bug fixing process and the software products used to support this process. We then present step-by-step guidance on our approach to retrieve, parse, convert and link the data sources. Additionally, we introduce an improved approach for linking the change log file with the bug tracking database, with which we achieve a higher linking rate than other approaches. |
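A minimal sketch of the linking step (matching bug identifiers mentioned in commit messages against the bug tracking database); the regular expression, data layout and validation shown here are simplifying assumptions and do not reflect the report's improved linking heuristics.

```python
import re

BUG_ID = re.compile(r"(?:bug|issue|#)\s*(\d{3,})", re.IGNORECASE)

def link_commits_to_bugs(commits, known_bug_ids):
    """Scan commit messages for bug-id patterns; keep only ids that actually
    exist in the bug tracking database."""
    links = []
    for commit in commits:
        for match in BUG_ID.findall(commit["message"]):
            if int(match) in known_bug_ids:
                links.append((commit["rev"], int(match)))
    return links

commits = [
    {"rev": "1.42", "message": "Fix NPE in parser, bug 12345"},
    {"rev": "1.43", "message": "Cleanup whitespace"},
    {"rev": "1.44", "message": "fixes #9876 and issue 9921"},
]
print(link_commits_to_bugs(commits, known_bug_ids={12345, 9876}))
# [('1.42', 12345), ('1.44', 9876)]
```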
|
Ausgezeichnete Informatikdissertationen 2008, Edited by: Abraham Bernstein, Steffen Hölldobler, et al, Gesellschaft für Informatik, Bonn, 2009. (Edited Scientific Work)

|
|
The Semantic Web - ISWC 2009, Edited by: Abraham Bernstein, D R Karger, T Heath, L Feigenbaum, D Maynard, E Motta, K Thirunarayan, Springer, Berlin, 2009. (Edited Scientific Work)

This book constitutes the refereed proceedings of the 8th International Semantic Web Conference, ISWC 2009, held in Chantilly, VA, USA, during October 25-29, 2009.
The volume contains 43 revised full research papers selected from a total of 250 submissions; 15 papers out of 59 submissions to the Semantic Web In-Use track, and 7 papers and 12 posters accepted out of 19 submissions to the doctoral consortium.
The topics covered in the research track are ontology engineering; data management; software and service engineering; non-standard reasoning with ontologies; semantic retrieval; OWL; ontology alignment; description logics; user interfaces; Web data and knowledge; semantic Web services; semantic social networks; and rules and relatedness. The Semantic Web In-Use track covers knowledge management; business applications; applications from home to space; and services and infrastructure. |
|
Esther Kaufmann, Talking to the semantic web - natural language query interfaces for casual end-users, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Dissertation)
 
|
|
Abraham Bernstein, E Kaufmann, C Kiefer, Querying the semantic web with ginseng - A guided input natural language search Engine, In: Searching answers : Festschrift in honour of Michael Hess on the occasion of his 60th birthday, MV-Wissenschaft (Monsenstein und Vannerdat), Münster, p. 1 - 10, 2009. (Book Chapter)

|
|
K Reinecke, Abraham Bernstein, S Hauske, To Make or to Buy? Sourcing Decisions at the Zurich Cantonal Bank, In: International Conference on Information Systems (ICIS), 2008-12-14. (Conference or Workshop Paper published in Proceedings)
 
The case study describes the IT situation at Zurich Cantonal Bank around the turn of the millennium. It shows how the legacy systems, incapable of fulfilling the company’s strategic goals, force the company to decide whether to modify the old systems or to replace them with standard software packages: to make or to buy? The case study introduces the bank’s strategic goals and their importance for the three make-or-buy alternatives. All solutions are described in detail; however, the bank’s decision is left open for students to decide. For a thorough analysis of the situation, students are required to put themselves in the position of the key decision maker at Zurich Cantonal Bank, calculating risks and balancing the advantages and disadvantages of each solution. Six video interviews reveal further technical and interpersonal aspects of the decision-making process at the bank, as well as of the situation today. |
|
Michael Meier, The extended GraphSlider-Framework, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2008. (Master's Thesis)
 
This diploma thesis addresses the automatic recognition of fraudulent activities in the transaction databases of a bank. To this end, the existing fraud detection program GraphSlider is extended with new functions. The first function addresses the recognition of fraud based on temporal data in the database, because this data is almost always available but very seldom used for fraud detection. The second new function addresses the recognition of internal fraud at the employee level. To achieve this, our approach tries to trace fraudulent actions back to the individual employee. Finally, the new approaches are tested on synthetic data to determine whether they are effective and whether they perform well. |
|
Raphael Pirker, Erweiterung des DBDoc-Systems um inkrementelle Dokumentationserstellung und Dokumentation von Schemaänderungen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2008. (Master's Thesis)

The database documentation tool DBDoc accurately portrays the status quo of a database schema and the database management system itself. When using database documentation, however, the evolution of the database schema is also of interest. While the documentation could be copied at given intervals to an archive and then compared manually, the changes would be very hard to detect and out of context.
Within this thesis the implementation of plugins that tie into the existing infrastructure of DBDoc is discussed. The plugins automatically detect and store modified schemas and augment the documentation with information on what was changed over time. |
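As a sketch of what the change detection amounts to, the snippet below diffs two schema snapshots; the data structures are hypothetical and DBDoc's actual plugin interface is not modelled.

```python
def diff_schema(old, new):
    """Each schema snapshot is a dict mapping table name -> set of column names."""
    changes = []
    for table in new.keys() - old.keys():
        changes.append(f"table added: {table}")
    for table in old.keys() - new.keys():
        changes.append(f"table dropped: {table}")
    for table in old.keys() & new.keys():
        for column in new[table] - old[table]:
            changes.append(f"column added: {table}.{column}")
        for column in old[table] - new[table]:
            changes.append(f"column dropped: {table}.{column}")
    return changes

old = {"customer": {"id", "name"}, "orders": {"id", "total"}}
new = {"customer": {"id", "name", "email"}, "orders": {"id", "total"}, "invoice": {"id"}}
print(diff_schema(old, new))
# ['table added: invoice', 'column added: customer.email']
```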
|
Basil Wirz, Dynamische Adaption von Benutzerschnittstellen an das Interaktionsverhalten, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2008. (Master's Thesis)

To meet users' demand for an individually adapted user interface that reflects their preferences, abilities, or cultural background, the interface can be changed either before or during use. This work follows the second approach and describes the interdependencies of such an adaptation system for user interfaces. For the adaptation, the users' interaction behavior is tracked in order to make appropriate dynamic adjustments. To demonstrate the applicability of the proposed solution, it is implemented in an existing web application. This implementation demonstrates the adaptation possibilities in a real example. An experiment on user interaction provides the baseline data required for the adaptation. |
|
Eirik Aune, Adrian Bachmann, Abraham Bernstein, Christian Bird, Premkumar Devanbu, Looking Back on Prediction: A Retrospective Evaluation of Bug-Prediction Techniques, November 2008. (Other Publication)
 
|
|