C Weiss, Abraham Bernstein, On-disk storage techniques for semantic web data - are B-trees always the optimal solution?, In: 5th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2009-10. (Conference or Workshop Paper published in Proceedings)

Since its introduction in 1971, the B-tree has become the dominant index structure in database systems.
Conventional wisdom dictated that the use of a B-tree index or one of its descendants would typically lead to good results.
The advent of XML-data, column stores, and the recent resurgence of typed-graph (or triple) stores motivated by the Semantic Web has changed the nature of the data typically stored.
In this paper we show that in the case of triple-stores the usage of B-trees is actually highly detrimental to query performance.
Specifically, we compare on-disk query performance of our triple-based Hexastore when using two different B-tree implementations, and our simple and novel vector storage that leverages offsets.
Our experimental evaluation with a large benchmark data set confirms that the vector storage outperforms the other approaches by at least a factor of four in load-time, by approximately a factor of three (and up to a factor of eight for some queries) in query-time, as well as by a factor of two in required storage.
The only drawback of the vector-based approach is its time-consuming need to reorganize parts of the data during inserts of new triples: a rare occurrence in many Semantic Web environments.
As such, this paper attempts to reopen the discussion about the trade-offs of using different types of indices in light of non-relational data and to contribute to the endeavor of building scalable and fast typed-graph databases. |
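A minimal sketch of the offset idea, under assumed names (this is not the paper's implementation): the sorted payload of one Hexastore index lives in a single contiguous vector, and a small offset table maps each leading ID to its slice, so a lookup is one offset jump plus a sequential read rather than a B-tree traversal.

```python
# Illustrative sketch only: a vector storage for one Hexastore index (e.g., SPO).
# Triples are sorted once at load time; per leading subject we record (offset,
# length) into one contiguous vector of (p, o) pairs.

class VectorIndex:
    def __init__(self, triples):
        triples = sorted(triples)                       # (s, p, o), sorted at load time
        self.vector = [(p, o) for s, p, o in triples]   # contiguous payload
        self.offsets = {}                               # s -> [start, length]
        for i, (s, _, _) in enumerate(triples):
            if s not in self.offsets:
                self.offsets[s] = [i, 0]
            self.offsets[s][1] += 1

    def lookup(self, s):
        """Return all (p, o) pairs for subject s with one offset jump."""
        if s not in self.offsets:
            return []
        start, length = self.offsets[s]
        return self.vector[start:start + length]        # sequential read, no tree traversal

if __name__ == "__main__":
    idx = VectorIndex([(1, 10, 100), (1, 11, 101), (2, 10, 102)])
    print(idx.lookup(1))   # [(10, 100), (11, 101)]
```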
|
Anthony Lymer, Ein Empfehlungsdienst für kulturelle Präferenzen in adaptiven Benutzerschnittstellen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)

This thesis addresses the refinement of adaptation rules in a web-based to-do management system named MOCCA. MOCCA is an adaptive system that adapts the user interface based on each user's cultural background information. To achieve the goal of this thesis, a recommender system was developed that clusters similar users into groups. In order to create new adaptation rules for similar users, the system calculates recommendations, which are assigned to the groups. The recommender system uses techniques such as collaborative filtering, k-means clustering, and the statistical χ² goodness-of-fit test. The system was designed in a modular fashion and divided into two parts: one part of the recommender system gathers similar users and groups them accordingly; the other part uses the generated groups and calculates recommendations. For each part, two concrete components were created. These components are interchangeable, so that the recommender system can be composed as desired. All possible compositions were evaluated with a set of test users. It could be shown that the developed recommender system generates a more accurate user interface than the initially given adaptation rules. |
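A minimal sketch of the two ingredients named above, under assumed data structures (this is not MOCCA's code; the preference vectors and the uniform expectation in the χ² statistic are illustrative only):

```python
# Illustrative sketch: cluster users by UI-preference vectors with a tiny
# k-means, then compute a chi-square goodness-of-fit statistic to check whether
# a group's observed choices for a UI option deviate from a uniform expectation.

import random

def kmeans(vectors, k, iters=20):
    centers = random.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            groups[i].append(v)
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

def chi_square(observed):
    expected = sum(observed) / len(observed)          # uniform expectation (assumption)
    return sum((o - expected) ** 2 / expected for o in observed)

if __name__ == "__main__":
    random.seed(1)
    users = [[1, 0, 3], [1, 1, 3], [5, 4, 0], [5, 5, 1]]   # hypothetical preference vectors
    print(kmeans(users, 2))
    print(chi_square([18, 2, 4]))   # large value -> clear preference within a group
```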
|
Linard Moll, Anti Money Laundering under real world conditions - Finding relevant patterns, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)
 
This Master's thesis deals with the search for new patterns to enhance the discovery of fraudulent activities within the jurisdiction of a financial institution. To this end, transactional data from a database is analyzed, scored, and processed for later use by an internal anti-money-laundering specialist. The findings are again stored in a database and processed by TV, the Transaction Visualizer, an existing and already commercially used tool. As a result of this thesis, the software module TMatch and the graphical user interface TMatchViz were developed. The interaction of these two tools was tested and evaluated using synthetically created datasets. Furthermore, the approximations made and their impact on the specification of the algorithms are addressed in this report. |
|
Bettina Bauer-Messmer, Lukas Wotruba, Kalin Müller, Sandro Bischof, Rolf Grütter, Thomas Scharrenbach, Rolf Meile, Martin Hägeli, Jürg Schenker, The Data Centre Nature and Landscape (DNL): Service Oriented Architecture, Metadata Standards and Semantic Technologies in an Environmental Information System, In: EnviroInfo 2009: Environmental Informatics and Industrial Environmental Protection: Concepts, Methods and Tools, Shaker Verlag, Aachen, 2009-09-01. (Conference or Workshop Paper published in Proceedings)
 
|
|
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, S Fischer, Towards cooperative planning of data mining workflows, In: Proc of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), 2009-09. (Conference or Workshop Paper published in Proceedings)
 
A major challenge for third-generation data mining and knowledge discovery systems is the integration of different data mining tools and services for data understanding, data integration, data preprocessing, data mining, evaluation and deployment, which are distributed across the network of computer systems. In this paper we outline how an intelligent assistant can be built that supports end-users in the difficult and time-consuming task of designing KDD workflows out of these distributed services. The assistant should support the user in checking the correctness of workflows, understanding the goals behind given workflows, enumerating workflow completions generated by an AI planner, and storing, retrieving, adapting and repairing previous workflows. It should also be an open, easily extendable system. This is achieved by basing the system on a data mining ontology (DMO) in which all services (operators), together with their inputs/outputs and pre-/postconditions, are described. This description is compatible with OWL-S, and new operators can be added by importing their OWL-S specification and classifying it into the operator ontology. |
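A hedged sketch of the planning idea described above, with hypothetical operator names and conditions (the DMO/OWL-S descriptions themselves are not reproduced): operators carry pre- and postconditions over abstract dataset states, and a simple forward search enumerates workflow completions that reach a goal.

```python
# Illustrative sketch: enumerate KDD-workflow completions by forward search over
# operators described by pre- and postconditions (names are hypothetical).

from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    pre: frozenset      # conditions required on the current state
    post: frozenset     # conditions added to the state

def completions(state, goal, operators, plan=(), depth=4):
    """Yield operator sequences that turn `state` into a superset of `goal`."""
    if goal <= state:
        yield plan
        return
    if depth == 0:
        return
    for op in operators:
        if op.pre <= state and not op.post <= state:
            yield from completions(state | op.post, goal, operators, plan + (op.name,), depth - 1)

OPS = [
    Operator("ReplaceMissingValues", frozenset({"raw"}), frozenset({"no_missing_values"})),
    Operator("Discretize", frozenset({"no_missing_values"}), frozenset({"discrete"})),
    Operator("DecisionTreeLearner", frozenset({"no_missing_values"}), frozenset({"model"})),
]

if __name__ == "__main__":
    for p in completions(frozenset({"raw"}), frozenset({"model"}), OPS):
        print(" -> ".join(p))
```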
|
A Bachmann, Abraham Bernstein, Software process data quality and characteristics - a historical view on open and closed source projects, In: IWPSE-Evol'09: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops, 2009-08. (Conference or Workshop Paper published in Proceedings)
 
Software process data gathered from bug tracking databases and version control system log files are a very valuable source for analyzing the evolution and history of a project or predicting its future. These data are used, for instance, to predict defects, gather insight into a project's life-cycle, and perform additional tasks. In this paper we survey five open source projects and one closed source project in order to provide a deeper insight into the quality and characteristics of these often-used process data. Specifically, we first define quality and characteristics measures, which allow us to compare the quality and characteristics of the data gathered for different projects. We then compute the measures and discuss the issues arising from these observations. We show that there are vast differences between the projects, particularly with respect to the quality of the link rate between bugs and commits. |
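One possible reading of the bug–commit link rate mentioned above, as a minimal sketch (the paper's exact measure definitions are not reproduced; the regular expression is illustrative):

```python
# Illustrative "link rate" style measure: the fraction of closed bug reports
# whose ID is referenced from at least one commit message.

import re

BUG_ID = re.compile(r"(?:bug|issue|#)\s*(\d+)", re.IGNORECASE)

def link_rate(commit_messages, closed_bug_ids):
    linked = set()
    for msg in commit_messages:
        for match in BUG_ID.finditer(msg):
            bug = int(match.group(1))
            if bug in closed_bug_ids:
                linked.add(bug)
    return len(linked) / len(closed_bug_ids) if closed_bug_ids else 0.0

if __name__ == "__main__":
    commits = ["Fix bug 101: NPE in parser", "Refactoring", "closes #103"]
    print(link_rate(commits, {101, 102, 103}))   # 0.666...
```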
|
C Bird, A Bachmann, E Aune, J Duffy, Abraham Bernstein, V Filkov, P Devanbu, Fair and balanced? Bias in bug-fix datasets, In: ESEC/FSE '09: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering on European software engineering conference and foundations of software engineering, 2009-08. (Conference or Workshop Paper published in Proceedings)
 
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems and code version histories record when, how and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data. |
|
K Reinecke, Abraham Bernstein, Tell me where you've lived, and I'll tell you what you like: adapting interfaces to cultural preferences, In: User Modeling, Adaptation, and Personalization (UMAP), 2009-06. (Conference or Workshop Paper published in Proceedings)
 
|
|
Thomas Scharrenbach, Abraham Bernstein, On the evolution of ontologies using probabilistic description logics, In: First ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web, 2009-06. (Conference or Workshop Paper published in Proceedings)
 
Exceptions play an important role in conceptualizing data, especially when new knowledge is introduced or existing knowledge changes. Furthermore, real-world data is often contradictory and uncertain. Current formalisms for conceptualizing data, such as Description Logics, rely upon first-order logic. As a consequence, they are poorly suited to addressing exceptional, inconsistent and uncertain data, in particular when evolving the knowledge base over time. This paper investigates the use of Probabilistic Description Logics as a formalism for the evolution of ontologies that conceptualize real-world data. Different scenarios are presented for the automatic handling of inconsistencies during ontology evolution. |
|
Abraham Bernstein, Jiwen Li, From active towards InterActive learning: using consideration information to improve labeling correctness, In: Human Computation Workshop, 2009-06. (Conference or Workshop Paper published in Proceedings)
 
Active learning methods have been proposed to reduce the labeling effort of human experts: based on the initially available labeled instances and information about the unlabeled data, those algorithms choose only the most informative instances for labeling. They have been shown to significantly reduce the size of the labeled dataset required to generate a precise model [17]. However, the active learning framework assumes "perfect" labelers, which is not true in practice (e.g., [22, 23]). In particular, an empirical study on hand-written digit recognition [5] has shown that active learning works poorly when a human labeler is used. Thus, as active learning enters the realm of practical applications, it will need to confront the practicalities and inaccuracies of human expert decision-making. Specifically, active learning approaches will have to deal with the problem that human experts are likely to make mistakes when labeling the selected instances. |
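For illustration, a minimal uncertainty-sampling query step in Python (not the paper's setup; the toy nearest-centroid model and its margin criterion are assumptions):

```python
# Illustrative sketch: ask the (possibly imperfect) human labeler about the
# unlabeled point the current model is least sure about -- here a toy
# nearest-centroid model whose uncertainty is the gap between the distances
# to the two closest class centroids.

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def most_uncertain(labeled, unlabeled):
    labels = {y for _, y in labeled}
    cents = {y: centroid([x for x, yy in labeled if yy == y]) for y in labels}
    def margin(x):
        d = sorted(dist2(x, c) for c in cents.values())
        return d[1] - d[0]                 # small gap = uncertain
    return min(unlabeled, key=margin)

if __name__ == "__main__":
    labeled = [([0.0, 0.0], "a"), ([4.0, 4.0], "b")]
    pool = [[0.5, 0.5], [2.1, 2.0], [3.8, 4.1]]
    print("query the labeler about:", most_uncertain(labeled, pool))
```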
|
Jonas Tappolet, Abraham Bernstein, Applied temporal RDF: efficient temporal querying of RDF data with SPARQL, In: 6th European Semantic Web Conference (ESWC), 2009-06. (Conference or Workshop Paper published in Proceedings)
 
Many applications operate on time-sensitive data. Some of these data are only valid for certain intervals (e.g., job assignments, versions of software code); others describe temporal events that happened at certain points in time (e.g., a person's birthday). Until recently, the only way to incorporate time into Semantic Web models was as a datatype property. Temporal RDF, however, considers time as an additional dimension in data, preserving the semantics of time. In this paper we present a syntax and storage format based on named graphs to express temporal RDF. Given the restriction to preexisting RDF syntax, our approach can perform any temporal query using standard SPARQL syntax only. For convenience, we introduce a shorthand format called t-SPARQL for temporal queries and show how t-SPARQL queries can be translated into standard SPARQL. Additionally, we show that, depending on the nature of the underlying data, the temporal RDF approach vastly reduces the number of triples by eliminating redundancies, resulting in increased performance for processing and querying. Last but not least, we introduce a new indexing method that can significantly reduce the time needed to execute time point queries (e.g., what happened on January 1st). |
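A hedged illustration of the rewriting idea: the paper's actual t-SPARQL syntax is not reproduced, and the ex:validFrom/ex:validUntil interval vocabulary below is hypothetical, but the sketch shows how a time-point query over named graphs can be expressed in standard SPARQL.

```python
# Illustrative sketch: rewrite "pattern, valid at <timepoint>" into plain SPARQL,
# assuming each named graph carries interval metadata in the default graph.

def timepoint_to_sparql(triple_pattern, timepoint):
    t = f'"{timepoint}"^^xsd:dateTime'
    return (
        "PREFIX ex: <http://example.org/>\n"
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
        "SELECT * WHERE {\n"
        f"  GRAPH ?g {{ {triple_pattern} }}\n"
        "  ?g ex:validFrom ?from ;\n"
        "     ex:validUntil ?until .\n"
        f"  FILTER (?from <= {t} && {t} <= ?until)\n"
        "}"
    )

if __name__ == "__main__":
    print(timepoint_to_sparql("?person ex:worksFor ?company .", "2009-01-01T00:00:00"))
```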
|
Amancio Bouza, G Reif, Abraham Bernstein, Probabilistic partial user model similarity for collaborative filtering, In: 1st International Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (IRMLeS2009) at the 6th European Semantic Web Conference (ESWC2009), 2009-06-01. (Conference or Workshop Paper published in Proceedings)
 
Recommender systems play an important role in helping people find items they like. One type of recommender system is user-based collaborative filtering. Its fundamental assumption is that people who share similar preferences for common items will behave similarly in the future. The similarity of user preferences is computed globally on commonly rated items, so that partial preference similarities might be missed. Consequently, valuable ratings of partially similar users are ignored. Furthermore, two users may even have similar preferences while the set of commonly rated items is too small to infer preference similarity. We propose, first, an approach that computes user preference similarities based on learned user preference models and, second, a method to compute partial user preference similarities based on partial user model similarities. For users with few commonly rated items, we show that user similarity based on preferences significantly outperforms user similarity based on commonly rated items. |
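A toy sketch of partial preference similarity under assumed user models (hypothetical per-genre scores; not the authors' formalism):

```python
# Illustrative sketch: each user model maps item features (hypothetical genres)
# to a learned preference score in [0, 1]; similarity is computed per partial
# model and then averaged, so two users can match on "comedy" even if they
# rated no common items and disagree elsewhere.

def partial_similarity(model_a, model_b, genre):
    return 1.0 - abs(model_a[genre] - model_b[genre])

def user_similarity(model_a, model_b):
    shared = set(model_a) & set(model_b)
    partials = {g: partial_similarity(model_a, model_b, g) for g in shared}
    overall = sum(partials.values()) / len(partials) if partials else 0.0
    return overall, partials

if __name__ == "__main__":
    alice = {"comedy": 0.9, "horror": 0.1}
    bob = {"comedy": 0.8, "horror": 0.9, "drama": 0.5}
    overall, partials = user_similarity(alice, bob)
    print(partials)   # high partial similarity on comedy despite the horror mismatch
    print(overall)
```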
|
Stefan Amstein, Evaluation und Evolution von Pattern-Matching-Algorithmen zur Betrugserkennung, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)
 
Fraud detection often involves the analysis of large data sets originating from private companies or governmental agencies by means of artificial intelligence techniques (such as data mining), but pattern-matching approaches exist as well. ChainFinder, an algorithm for graph-based pattern matching, is capable of detecting transaction chains within financial data that could indicate fraudulent behavior. In this work, relevant measurements of correctness and performance are acquired in order to evaluate and evolve the given implementation of the ChainFinder. A series of tests on both synthetic and more realistic datasets is conducted and the results are discussed. Along with this process, a number of derivative ChainFinder implementations emerged and are compared to each other. Throughout this process, an evaluation framework application was developed to assist the evaluation of similar algorithms by providing a degree of automation.
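For illustration only, a sketch of graph-based transaction-chain detection in the spirit of ChainFinder; the real algorithm's matching criteria are not reproduced, and the time-window and amount-tolerance thresholds below are purely hypothetical.

```python
# Illustrative sketch: a "chain" is a path of transactions where each hop
# forwards roughly the received amount within a few days.

from collections import defaultdict
from datetime import date

def find_chains(transactions, min_len=3, max_days=5, tolerance=0.1):
    by_sender = defaultdict(list)
    for t in transactions:                      # t = (sender, receiver, amount, day)
        by_sender[t[0]].append(t)

    chains = []

    def extend(chain):
        if len(chain) >= min_len:
            chains.append([t[:2] for t in chain])
        _, receiver, amount, day = chain[-1]
        for nxt in by_sender.get(receiver, []):
            days = (nxt[3] - day).days
            if 0 <= days <= max_days and abs(nxt[2] - amount) <= tolerance * amount:
                extend(chain + [nxt])

    for t in transactions:
        extend([t])
    return chains

if __name__ == "__main__":
    txs = [("A", "B", 1000, date(2009, 1, 1)),
           ("B", "C", 990, date(2009, 1, 2)),
           ("C", "D", 985, date(2009, 1, 4))]
    print(find_chains(txs))   # [[('A', 'B'), ('B', 'C'), ('C', 'D')]]
```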
|
|
J Ekanayake, Jonas Tappolet, H C Gall, Abraham Bernstein, Tracking concept drift of software projects using defect prediction quality, In: 6th IEEE Working Conference on Mining Software Repositories, 2009-05. (Conference or Workshop Paper published in Proceedings)
 
Defect prediction is an important task in the mining of software repositories, but the quality of predictions varies strongly within and across software projects. In this paper we investigate the reasons why the prediction quality fluctuates so strongly, attributing it to the changing nature of the bug (or defect) fixing process. To this end, we adopt the notion of a concept drift, which denotes that the defect prediction model has become unsuitable because the set of influencing features has changed – usually due to a change in the underlying bug generation process (i.e., the concept). We explore four open source projects (Eclipse, OpenOffice, Netbeans and Mozilla) and construct file-level and project-level features for each of them from their respective CVS and Bugzilla repositories. We then use these data to build defect prediction models and visualize the prediction quality along the time axis. These visualizations allow us to identify concept drifts and – as a consequence – phases of stability and instability expressed in the level of defect prediction quality. Further, we identify those project features which influence the defect prediction quality, using both a tree-induction algorithm and a linear regression model. Our experiments uncover that software systems are subject to considerable concept drifts in their evolution history. Specifically, we observe that changes in the number of authors editing a file and in the number of defects fixed by them contribute to a project's concept drift and therefore influence the defect prediction quality. Our findings suggest that project managers using defect prediction models for decision making should be aware of the actual phase of stability or instability due to a potential concept drift. |
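A toy sketch of the monitoring idea (not the paper's method): track a per-window prediction-quality score and flag windows where it drops sharply as candidate concept drifts.

```python
# Illustrative sketch: flag a potential concept drift whenever the quality of a
# defect prediction model drops by more than a chosen threshold between
# consecutive time windows (threshold and scores are hypothetical).

def detect_drift(quality_per_window, drop_threshold=0.15):
    """Return indices of windows where prediction quality fell sharply."""
    drifts = []
    for i in range(1, len(quality_per_window)):
        if quality_per_window[i - 1] - quality_per_window[i] > drop_threshold:
            drifts.append(i)
    return drifts

if __name__ == "__main__":
    # hypothetical AUC-like scores of a model evaluated on consecutive quarters
    scores = [0.82, 0.80, 0.81, 0.55, 0.58, 0.79]
    print(detect_drift(scores))   # [3] -> possible drift in the fourth quarter
```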
|
D J Kurz, Abraham Bernstein, K Hunt, D Radovanovic, P Erne, Z Siudak, O Bertel, Simple point of care risk stratification in acute coronary syndromes: the AMIS model, Heart, Vol. 95 (8), 2009. (Journal Article)
 
Background: Early risk stratification is important in the management of patients with acute coronary syndromes (ACS).
Objective: To develop a rapidly available risk stratification tool for use in all ACS.
Design and methods: Application of modern data mining and machine learning algorithms to a derivation cohort of 7520 ACS patients included in the AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry between 2001 and 2005; prospective model testing in two validation cohorts.
Results: The most accurate prediction of in-hospital mortality was achieved with the “Averaged One-Dependence Estimators” (AODE) algorithm, with input of 7 variables available at first patient contact: age, Killip class, systolic blood pressure, heart rate, pre-hospital cardio-pulmonary resuscitation, history of heart failure, history of cerebrovascular disease. The c-statistic for the derivation cohort (0.875) was essentially maintained in important subgroups, and calibration over five risk categories, ranging from <1% to >30% predicted mortality, was accurate. Results were validated prospectively against an independent AMIS-Plus cohort (n=2854, c-statistic 0.868) and the Krakow-Region ACS Registry (n=2635, c-statistic 0.842). The AMIS model significantly outperformed established “point-of-care” risk prediction tools in both validation cohorts. In comparison to a logistic regression-based model, the AODE-based model proved to be more robust when tested on the Krakow validation cohort (c-statistic 0.842 vs. 0.746). Accuracy of the AMIS model prediction was maintained at 12-month follow-up in an independent cohort (n=1972, c-statistic 0.877).
Conclusions: The AMIS model is a reproducibly accurate point-of-care risk stratification tool for the complete range of ACS, based on variables available at first patient contact. |
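For readers unfamiliar with the learner named in the Results, a generic sketch of the AODE scoring rule over discrete features follows (a textbook-style illustration with simple additive smoothing; this is not the AMIS model or its training data):

```python
# Illustrative sketch of Averaged One-Dependence Estimators (AODE) scoring:
# P(y, x) is estimated by averaging, over "super-parent" features x_i,
# P(y, x_i) * prod_j P(x_j | y, x_i), here from raw counts with eps-smoothing.

from collections import Counter

def train(data):
    """data: list of (features_tuple, label). Returns raw counts."""
    joint = Counter()                 # (label, i, x_i) occurrences
    triple = Counter()                # (label, i, x_i, j, x_j) occurrences
    n = 0
    for x, y in data:
        n += 1
        for i, xi in enumerate(x):
            joint[(y, i, xi)] += 1
            for j, xj in enumerate(x):
                triple[(y, i, xi, j, xj)] += 1
    return joint, triple, n

def score(x, y, joint, triple, n, eps=1.0):
    """Unnormalized P(y, x) averaged over the one-dependence estimators."""
    total = 0.0
    for i, xi in enumerate(x):
        parent = joint[(y, i, xi)]
        p = (parent + eps) / (n + eps)                                   # ~ P(y, x_i)
        for j, xj in enumerate(x):
            p *= (triple[(y, i, xi, j, xj)] + eps) / (parent + 2 * eps)  # ~ P(x_j | y, x_i)
        total += p
    return total / len(x)

if __name__ == "__main__":
    data = [((1, 0), "dead"), ((1, 1), "dead"),
            ((0, 0), "alive"), ((0, 1), "alive"), ((0, 0), "alive")]
    joint, triple, n = train(data)
    for label in ("dead", "alive"):
        print(label, round(score((1, 0), label, joint, triple, n), 4))
```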
|
Michael Imhof, Optimization strategies for RDFS-aware data storage, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)

Indexing and storing triple-based Semantic Web data in a way that allows for efficient query processing has long been a difficult task. A recent approach to address this issue is the indexing scheme Hexastore. In this work, we propose two novel on-disk storage models for Hexastore that use RDF Schema information to gather data that semantically belong together and store them contiguously. In the clustering approach, elements of the same classes are stored contiguously within the indices. In the subindex approach, data of the same categories are saved in separate subindices. Thus, we expect to simplify and accelerate the retrieval process of Hexastore. The experimental evaluation shows a clear advantage of the standard storage model over the proposed approaches in terms of index creation time and required disk space. |
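A minimal sketch of the clustering idea under an assumed vocabulary (not the thesis' implementation): rdf:type information is used to lay triples out so that instances of the same class end up contiguous, enabling sequential reads per class.

```python
# Illustrative sketch: reorder triples so that subjects of the same rdf:type
# class are adjacent, approximating one contiguous on-disk run per class.

from collections import defaultdict

RDF_TYPE = "rdf:type"

def cluster_by_class(triples):
    cls = {s: o for s, p, o in triples if p == RDF_TYPE}        # subject -> class
    buckets = defaultdict(list)
    for t in triples:
        buckets[cls.get(t[0], "unknown")].append(t)
    ordered = []
    for c in sorted(buckets):                                   # one run per class
        ordered.extend(sorted(buckets[c]))
    return ordered

if __name__ == "__main__":
    triples = [("p1", "rdf:type", "Person"), ("d1", "rdf:type", "Document"),
               ("p1", "name", "Ann"), ("d1", "title", "Report"),
               ("p2", "rdf:type", "Person"), ("p2", "name", "Bob")]
    for t in cluster_by_class(triples):
        print(t)
```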
|
Adrian Bachmann, Abraham Bernstein, Data Retrieval, Processing and Linking for Software Process Data Analysis, No. IFI-2009.0003b, Version: 1, 2009. (Technical Report)
 
Many projects in the mining software repositories community rely on software process data gathered from bug tracking databases and commit log files of version control systems. These data are then used to predict defects, gather insight into a project's life-cycle, and perform other tasks. In this technical report we introduce the software systems which hold such data. Furthermore, we present our approach for retrieving, processing and linking these data. Specifically, we first introduce the bug fixing process and the software products used to support this process. We then present step-by-step guidance on our approach to retrieve, parse, convert and link the data sources. Additionally, we introduce an improved approach for linking the change log file with the bug tracking database. Doing so, we achieve a higher linking rate than with other approaches. |
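For illustration, a sketch of the baseline linking heuristic this line of work builds on (the report's improved approach is not reproduced; the bug-reference pattern and the plausibility window are hypothetical):

```python
# Illustrative sketch: scan commit messages for bug-number patterns and accept
# a link only if the referenced bug exists and was resolved within a plausible
# window around the commit date.

import re
from datetime import date

BUG_REF = re.compile(r"(?:bug|fix(?:es|ed)?|#)\s*(\d{3,7})", re.IGNORECASE)

def link_commits(commits, bugs, window_days=14):
    """commits: [(rev, msg, commit_date)]; bugs: {bug_id: resolved_date}."""
    links = []
    for rev, msg, committed in commits:
        for m in BUG_REF.finditer(msg):
            bug_id = int(m.group(1))
            resolved = bugs.get(bug_id)
            if resolved and abs((committed - resolved).days) <= window_days:
                links.append((rev, bug_id))
    return links

if __name__ == "__main__":
    commits = [("r100", "Fixed bug 4711 in parser", date(2009, 3, 2)),
               ("r101", "cleanup, see #9999", date(2009, 3, 5))]
    bugs = {4711: date(2009, 3, 1)}
    print(link_commits(commits, bugs))   # [('r100', 4711)]
```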
|
Ausgezeichnete Informatikdissertationen 2008, Edited by: Abraham Bernstein, Steffen Hölldobler, et al, Gesellschaft für Informatik, Bonn, 2009. (Edited Scientific Work)

|
|
The Semantic Web - ISWC 2009, Edited by: Abraham Bernstein, D R Karger, T Heath, L Feigenbaum, D Maynard, E Motta, K Thirunarayan, Springer, Berlin, 2009. (Edited Scientific Work)

This book constitutes the refereed proceedings of the 8th International Semantic Web Conference, ISWC 2009, held in Chantilly, VA, USA, during October 25-29, 2009.
The volume contains 43 revised full research papers selected from a total of 250 submissions; 15 papers out of 59 submissions to the Semantic Web in-use track; and 7 papers and 12 posters accepted out of 19 submissions to the doctoral consortium.
The topics covered in the research track are ontology engineering; data management; software and service engineering; non-standard reasoning with ontologies; semantic retrieval; OWL; ontology alignment; description logics; user interfaces; Web data and knowledge; Semantic Web services; semantic social networks; and rules and relatedness. The Semantic Web in-use track covers knowledge management; business applications; applications from home to space; and services and infrastructure. |
|
Esther Kaufmann, Talking to the semantic web - natural language query interfaces for casual end-users, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Dissertation)
 
|
|