Manuel Kägi, Using Genetic Programming and SimPack to Learn Global Similarity Measures, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
For a growing number of applications, good similarity measures are crucial to ensure that the application works as desired. Similarity measures can be used to find the object most similar to another one, or to perform a categorisation task, whereby the calculated similarity value determines the category. But manually defining a good similarity measure, especially when complex and domain-specific objects have to be compared, can be a difficult task. It requires a lot of domain knowledge combined with knowledge in computer science (namely, how these similarity measures work internally), and no established methodology for it exists. The overall goal of this diploma thesis is therefore to learn similarity measures instead of defining them manually, and to evaluate the achieved results. To be able to learn similarity measures, a universal framework is used: the Local/Global Framework. The idea is to use the Local/Global principle to compare complex objects, whereby the local similarity measures and the amalgamation function can be learned. Another precondition is an evaluation method to estimate a particular similarity measure's soundness; typically this is done by comparing the similarity measure's results with a so-called gold standard. To learn, the evolutionary principles observed in nature are exploited in an artificial evolution. This artificial evolution can be implemented as a genetic algorithm, or a genetic programming approach can be used. In the first case, parameters of similarity measures are learned; in the second case, using the genetic programming approach, the algorithms themselves are learned. In both cases the goal is to find similarity measures that show only a small deviation from the gold standard. When a similarity measure is used for categorisation, the goal is to properly identify the category an object or a pair of compared objects belongs to. |
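The learning setup can be made concrete with a small sketch. The following Java fragment is illustrative only; the weighted-average amalgamation function and all names are assumptions, not the thesis code. It shows the kind of fitness function a genetic algorithm would minimise when learning amalgamation weights against a gold standard:

    import java.util.List;

    /** Illustrative sketch, not the thesis code: scores a candidate
     *  global similarity measure (a weighted amalgamation of local
     *  similarities) against a gold standard. */
    public class AmalgamationFitness {

        /** One training case: pre-computed local similarities for an
         *  object pair plus the gold-standard value for that pair. */
        public static class Case {
            final double[] localSims; // one value per local measure
            final double goldSim;     // target similarity
            Case(double[] localSims, double goldSim) {
                this.localSims = localSims;
                this.goldSim = goldSim;
            }
        }

        /** Global similarity as a normalised weighted average; the
         *  weights are what the genetic algorithm would evolve. */
        static double globalSim(double[] weights, double[] localSims) {
            double num = 0, den = 0;
            for (int i = 0; i < weights.length; i++) {
                num += weights[i] * localSims[i];
                den += weights[i];
            }
            return den == 0 ? 0 : num / den;
        }

        /** Fitness = mean squared deviation from the gold standard;
         *  smaller is better. */
        static double fitness(double[] weights, List<Case> cases) {
            double err = 0;
            for (Case c : cases) {
                double d = globalSim(weights, c.localSims) - c.goldSim;
                err += d * d;
            }
            return err / cases.size();
        }
    }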
|
Esther Kaufmann, Abraham Bernstein, Renato Zumstein, Querix: A Natural Language Interface to Query Ontologies Based on Clarification Dialogs, In: 5th International Semantic Web Conference (ISWC 2006), Springer, November 2006. (Conference or Workshop Paper)
The logic-based machine-understandable framework of the Semantic Web typically challenges casual users when they try to query ontologies. An often proposed solution to help casual users is the use of natural language interfaces. Such tools, however, suffer from one of the biggest problems of natural language: ambiguity. Furthermore, such systems are hardly adaptable to new domains. This paper addresses these issues by presenting Querix, a domain-independent natural language interface for the Semantic Web. The approach allows queries in natural language and asks the user for clarification in case of ambiguities. The preliminary evaluation showed good retrieval performance. |
|
Abraham Bernstein, Esther Kaufmann, GINO - A Guided Input Natural Language Ontology Editor, In: 5th International Semantic Web Conference (ISWC 2006), Springer, November 2006. (Conference or Workshop Paper)
The casual user is typically overwhelmed by the formal logic of the Semantic Web. The gap between the end user and the logic-based scaffolding has to be bridged if the Semantic Web's capabilities are to be utilized by the general public. This paper proposes that controlled natural languages offer one way to bridge the gap. We introduce GINO, a guided input natural language ontology editor that allows users to edit and query ontologies in a language akin to English. It uses a small static grammar, which it dynamically extends with elements from the loaded ontologies. The usability evaluation shows that GINO is well-suited for novice users when editing ontologies. We believe that the use of guided entry overcomes the habitability problem, which adversely affects most natural language systems. Additionally, the approach's dynamic grammar generation allows for easy adaptation to new ontologies. |
|
Reto Wettstein, Kundenverhalten in web-basierten sozialen Netzwerken Eine Evaluation von Vorhersagemodellen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
In every business, customer data are a big asset. Analyzing them allows you to segment, target, and position your offers in terms of price and channel. Data mining methods, as an explorative way to analyze customer data, made their way into corporate data warehouses more than ten years ago. Nowadays, as web-based social networks offer customer-created behavioural network data in real time, the mining community sees new applications for relational data mining approaches that take features of connected member-profiles and relations into their reasoning. Two freely available workbenches that incorporate such relational algorithms are NetKit-SRL and Proximity. Our work applied these two software packages to a data set of 42,044 interconnected member-profiles of a web-based social network and compared them with widely used propositional algorithms such as C5, logistic regression, and neural nets. The data were enriched with ego-net centrality and density measures from the corpus of measures commonly known in the social network analysis (SNA) field. We show that incorporating SNA measures does not necessarily improve the mining results, neither with traditional algorithms nor with relational ones. Furthermore, relational algorithms on networked data are not in every case superior to traditional algorithms on propositionalized data. Our work names the moderating variables that led to these outcomes. With our key finding of meaningful correlations between SNA and activity measures, we were able to design the "social mailing model", a direct mailing model that could lead to a substantial improvement in conversion rate. A real-world experiment would therefore be one of the proposed next steps. |
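As a sketch of the SNA enrichment mentioned above, ego-net density can be computed as the share of realised ties among a member's direct neighbours (hypothetical Java fragment; an undirected graph with a complete adjacency map is assumed, and conventions vary on whether the ego's own ties count):

    import java.util.Map;
    import java.util.Set;

    /** Sketch of one SNA enrichment measure: ego-network density,
     *  i.e. realised ties among the ego's neighbours divided by the
     *  number of possible ties between them. */
    public class EgoNetDensity {
        /** graph: member -> set of directly connected members. */
        static double density(String ego, Map<String, Set<String>> graph) {
            Set<String> alters = graph.get(ego);
            int n = alters.size();
            if (n < 2) return 0.0;
            int ties = 0;
            for (String a : alters)
                for (String b : alters)
                    if (a.compareTo(b) < 0 && graph.get(a).contains(b))
                        ties++;
            return (double) ties / (n * (n - 1) / 2);
        }
    }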
|
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Osmund Bertel, Inadequate performance of the TIMI risk prediction score for patients with ST-elevation myocardial infarction treated according to current guidelines, In: World Congress of Cardiology 2006, September 2006. (Book Chapter)
Background: Mortality prediction for patients admitted with ST-elevation myocardial infarction (STEMI) is currently based on models derived from randomised controlled trials performed in the 1990s, with selective inclusion and exclusion criteria. It is unclear whether such models remain valid in community-based populations in the modern era.
Methods: The AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. We analysed hospital mortality for patients with STEMI included in this registry between 1997 and 2005, and compared it to the mortality predicted by the benchmark risk score from the TIMI study group. This is an integer score calculated from 10 weighted parameters available at admission. Each score value delivers a hospital mortality risk prediction (from 0.7% for 0 points to 31.7% for >8 points).
Results: Among 7875 patients with STEMI, overall hospital mortality was 7.3%. The TIMI risk score overestimated mortality risk at each score level for the entire population. Subgroup analysis according to initial revascularisation treatment (PCI n=3358, thrombolysis n=1842, none n=2675) showed an especially poor performance of the TIMI risk score for patients treated by PCI. In this subgroup, no relevant increase in mortality was observed up to 5 points (actual mortality 2.7%, predicted 11.6%), and mortality remained below 5% up to 7 points (predicted 21.5%) (Figure 1).
Conclusions: The TIMI risk score overestimates the mortality risk and delivers poor stratification in real-life patients with STEMI treated according to current guidelines. |
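The score-to-risk mapping described in the Methods can be sketched as a simple lookup. Only the two anchor values quoted in the abstract are filled in; this is illustrative Java, not study code:

    /** Sketch of how an integer risk score maps to a predicted
     *  mortality. The intermediate levels are published in the TIMI
     *  score tables but not restated in the abstract, so they are
     *  deliberately left unfilled rather than invented. */
    public class TimiScoreLookup {
        /** Predicted hospital mortality (percent) for a TIMI score. */
        static double predictedMortality(int score) {
            if (score == 0) return 0.7;   // 0 points  -> 0.7%
            if (score > 8)  return 31.7;  // >8 points -> 31.7%
            throw new UnsupportedOperationException(
                "levels 1-8 not specified in the abstract");
        }
    }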
|
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Jean-Christophe Stauffer, Osmund Bertel, Development of a novel risk stratification model to improve mortality prediction in acute coronary syndromes: the AMIS (Acute Myocardial Infarction in Switzerland) model, In: World Congress of Cardiology 2006, September 2006. (Book Chapter)
Background: Current established models predicting mortality in acute coronary syndrome (ACS) patients are derived from randomised controlled trials performed in the 1990s, and are thus based on and predictive for selected populations. These scores perform inadequately in patients treated according to current guidelines. The aim of this study was to develop a model with improved predictive performance applicable to all kinds of ACS, based on outcomes in real-world patients from the new millennium.
Methods: The AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. Patients included in this registry between October 2001 and May 2005 (n = 7520) were the basis for model development. Modern data mining methods using new classification learning algorithms were tested to optimise mortality risk prediction using well-defined and non-ambiguous variables available at first patient contact. Predictive performance was quantified as the "area under the curve" (AUC, range 0-1) of a receiver operating characteristic, and was compared to the benchmark risk score from the TIMI study group. Results were verified using 10-fold cross-validation.
Results: Overall, hospital mortality was 7.5%. The final prediction model was based on the "Averaged One-Dependence Estimators" algorithm and included the following 7 input variables: 1) age, 2) Killip class, 3) systolic blood pressure, 4) heart rate, 5) pre-hospital mechanical resuscitation, 6) history of heart failure, 7) history of cerebrovascular disease. The output of the model was an estimate of in-hospital mortality risk for each patient. The AUC for the entire cohort was 0.875, compared to 0.803 for the TIMI risk score. The AMIS model performed equally well for patients with or without ST-elevation myocardial infarction (AUC 0.879 and 0.868, respectively). Subgroup analysis according to the initial revascularisation modality indicated that the AMIS model performed best in patients undergoing PCI (AUC 0.884 vs. 0.783 for TIMI) and worst in patients receiving no revascularisation therapy (AUC 0.788 vs. 0.673 for TIMI). The model delivered an accurate and reproducible prediction over the complete range of risks and for all kinds of ACS.
Conclusions: The AMIS model performs about 10% better than established risk prediction models for hospital mortality in patients with all kinds of ACS in the modern era. Modern data mining algorithms proved useful to optimise the model development. |
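A minimal sketch of the modelling step, assuming the Weka toolkit's AODE implementation and the 10-fold cross-validation named in the abstract (the file name, attribute layout, and class label are invented for illustration; the abstract does not specify the actual tooling):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.AODE;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    /** Illustrative sketch: train and cross-validate an Averaged
     *  One-Dependence Estimators model on registry data. */
    public class AmisModelSketch {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                new BufferedReader(new FileReader("amis-plus.arff")));
            data.setClassIndex(data.numAttributes() - 1); // in-hospital death

            // AODE requires nominal inputs, so numeric variables such as
            // age, blood pressure, and heart rate are discretised first.
            Discretize disc = new Discretize();
            disc.setInputFormat(data);
            Instances nominal = Filter.useFilter(data, disc);

            Evaluation eval = new Evaluation(nominal);
            eval.crossValidateModel(new AODE(), nominal, 10, new Random(1));
            // "died" is an assumed class label for the positive outcome
            System.out.println("AUC: " + eval.areaUnderROC(
                nominal.classAttribute().indexOfValue("died")));
        }
    }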
|
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Jean-Christophe Stauffer, Osmund Bertel, Inadequate performance of the TIMI risk prediction score for patients with ST-elevation myocardial infarction in the modern era, In: Gemeinsame Jahrestagung der Schweizerischen Gesellschaften für Kardiologie, für Pneumologie, für Thoraxchirurgie, und für Intensivmedizin, June 2006. (Book Chapter)
Background: Mortality prediction for patients admitted with ST-elevation myocardial infarction (STEMI) is currently based on models derived from randomised controlled trials performed in the 1990s, with selective inclusion and exclusion criteria. It is unclear whether such models remain valid in community-based populations in the modern era.
Methods: The AMIS-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. We analysed hospital mortality for patients with STEMI included in this registry between 1997 and 2005, and compared it to the mortality predicted by the benchmark risk score from the TIMI study group. This is an integer score calculated from 10 weighted parameters available at admission. Each score value delivers a hospital mortality risk prediction (from 0.7% for 0 points to 31.7% for >8 points).
Results: Among 7875 patients with STEMI, overall hospital mortality was 7.3%. The TIMI risk score overestimated mortality risk at each score level for the entire population. Subgroup analysis according to initial revascularisation treatment (PCI n=3358, thrombolysis n=1842, none n=2675) showed an especially poor performance for patients treated by PCI. In this subgroup, no relevant increase in mortality was observed up to 5 points (actual mortality 2.7%, predicted 11.6%), and mortality remained below 5% up to 7 points (predicted 21.5%) (Figure 1).
[Figure 1]
Conclusions: The TIMI risk score overestimates the mortality risk and delivers poor stratification in real-life patients with STEMI treated according to current guidelines. |
|
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Jean-Christophe Stauffer, Development of a novel risk stratification model to improve mortality prediction in acute coronary syndromes: the AMIS model, In: Gemeinsame Jahrestagung der Schweizerischen Gesellschaften für Kardiologie, für Pneumologie, für Thoraxchirurgie, und für Intensivmedizin, June 2006. (Book Chapter)
Background: Current established models predicting mortality in acute coronary syndrome (ACS) patients are derived from randomised controlled trials performed in the 1990s, and are thus based on and predictive for selected populations. These scores perform inadequately in patients treated according to current guidelines. The aim of this study was to develop a model with improved predictive performance applicable to all kinds of ACS, based on outcomes in real-world patients from the new millennium.
Methods: The AMIS-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. Patients included in this registry between October 2001 and May 2005 (n = 7520) were the basis for model development. Modern data mining methods using new classification learning algorithms were tested to optimise mortality risk prediction using well-defined and non-ambiguous variables available at first patient contact. Predictive performance was quantified as the "area under the curve" (AUC, range 0-1) of a receiver operating characteristic, and was compared to the benchmark risk score from the TIMI study group. Results were verified using 10-fold cross-validation.
Results: Overall, hospital mortality was 7.5%. The final prediction model was based on the "Averaged One-Dependence Estimators" algorithm and included the following 7 input variables: 1) age, 2) Killip class, 3) systolic blood pressure, 4) heart rate, 5) pre-hospital mechanical resuscitation, 6) history of heart failure, 7) history of cerebrovascular disease. The output of the model was an estimate of in-hospital mortality risk for each patient. The AUC for the entire cohort was 0.875, compared to 0.803 for the TIMI risk score. The AMIS model performed equally well for patients with or without ST elevation (AUC 0.879 and 0.868, respectively). Subgroup analysis according to the initial revascularisation modality indicated that the AMIS model performed best in patients undergoing PCI (AUC 0.884 vs. 0.783 for TIMI) and worst for patients receiving no revascularisation therapy (AUC 0.788 vs. 0.673 for TIMI). The model delivered an accurate and reproducible prediction over the complete range of risks and for all kinds of ACS.
Conclusions: The AMIS model performs about 10% better than established risk prediction models for hospital mortality in patients with all kinds of ACS in the modern era. Modern data mining algorithms proved useful to optimise the model development. |
|
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Generic Similarity Detection in Ontologies with the SOQA-SimPack Toolkit, In: SIGMOD Conference, ACM, New York, NY, USA, June 2006. (Conference or Workshop Paper)
Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. In this demo, we present the SOQA-SimPack Toolkit (SST), an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies. |
|
Abraham Bernstein, Esther Kaufmann, Christian Kaiser, Christoph Kiefer, Ginseng: A Guided Input Natural Language Search Engine for Querying Ontologies, In: 2006 Jena User Conference, May 2006. (Conference or Workshop Paper)
|
|
Patrick Knab, Martin Pinzger, Abraham Bernstein, Predicting Defect Densities in Source Code Files with Decision Tree Learners, In: MSR '06: Proceedings of the 2006 International Workshop on Mining Software Repositories, ACM, New York, NY, USA, May 2006. (Conference or Workshop Paper)
With the advent of open source software repositories, the data available for defect prediction in source files has increased tremendously. Although traditional statistics turned out to derive reasonable results, the sheer amount of data and the problem context of defect prediction demand sophisticated analysis such as that provided by current data mining and machine learning techniques. In this work we focus on defect density prediction and present an approach that applies a decision tree learner to evolution data extracted from the Mozilla open source web browser project. The evolution data includes different source code, modification, and defect measures computed from seven recent Mozilla releases. Among the modification measures we also take into account the change coupling, a measure for the number of change-dependencies between source files. The main reason for choosing decision tree learners, instead of, for example, neural nets, was the goal of finding underlying rules which can be easily interpreted by humans. To find these rules, we set up a number of experiments to test common hypotheses regarding defects in software entities. Our experiments showed that a simple tree learner can produce good results with various sets of input data. |
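A minimal sketch of such a defect-density experiment, assuming Weka's J48 as the decision tree learner (the paper does not mandate a specific implementation; the ARFF file and class attribute are assumptions):

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    /** Illustrative sketch: learn a decision tree over release-level
     *  evolution measures. Tree learners are chosen here precisely
     *  because the induced rules stay human-readable. */
    public class DefectDensityTree {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                new BufferedReader(new FileReader("mozilla-measures.arff")));
            // assumed class attribute: defect density bucket (low/high)
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setMinNumObj(10); // prune small leaves for readable rules
            tree.buildClassifier(data);

            // the printed tree is the set of if-then rules to inspect
            System.out.println(tree);
        }
    }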
|
Tobias Sager, Abraham Bernstein, Martin Pinzger, Christoph Kiefer, Detecting Similar Java Classes Using Tree Algorithms, In: Proceedings of the International Workshop on Mining Software Repositories, ACM, Shanghai, China, May 2006. (Conference or Workshop Paper)
Similarity analysis of source code is helpful during development to provide, for instance, better support for code reuse. Consider a development environment that analyzes code while typing and suggests similar code examples or existing implementations from a source code repository. Mining software repositories by means of similarity measures enables and encourages reuse of existing code and reduces the development effort by creating a shared knowledge base of code fragments. In information retrieval, similarity measures are often used to find documents similar to a given query document. This paper extends this idea to source code repositories. It introduces our approach to detecting similar Java classes in software projects using tree similarity algorithms. We show how our approach finds similar Java classes based on an evaluation of three tree-based similarity measures in the context of five user-defined test cases, as well as a preliminary software evolution analysis of a medium-sized Java project. Initial results of our technique indicate that it (1) is indeed useful for identifying similar Java classes, (2) successfully identifies the ex-ante and ex-post versions of refactored classes, and (3) provides some interesting insights into within-version and between-version dependencies of classes within a Java project. |
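One simple member of this family of measures can be sketched as follows (illustrative Java; not necessarily one of the three measures evaluated in the paper): the similarity of two ASTs as the fraction of label-matching nodes found by a parallel top-down walk.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of a simple tree-based similarity between two ASTs. */
    public class TreeSimilarity {

        static class Node {
            final String label; // e.g. AST node type
            final List<Node> children = new ArrayList<>();
            Node(String label) { this.label = label; }
            Node add(Node c) { children.add(c); return this; }
            int size() {
                int n = 1;
                for (Node c : children) n += c.size();
                return n;
            }
        }

        /** Number of label-matching node pairs on a parallel walk. */
        static int matches(Node a, Node b) {
            if (!a.label.equals(b.label)) return 0;
            int m = 1;
            int k = Math.min(a.children.size(), b.children.size());
            for (int i = 0; i < k; i++)
                m += matches(a.children.get(i), b.children.get(i));
            return m;
        }

        /** Normalised similarity in [0,1]; 1 means identical trees. */
        static double similarity(Node a, Node b) {
            return 2.0 * matches(a, b) / (a.size() + b.size());
        }
    }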
|
Abraham Bernstein, Christoph Kiefer, Imprecise RDQL: Towards Generic Retrieval in Ontologies Using Similarity Joins, In: 21st Annual ACM Symposium on Applied Computing (ACM SAC 2006), ACM, New York, NY, USA, April 2006. (Conference or Workshop Paper)
Traditional semantic web query languages support a logic-based access to the semantic web. They offer retrieval (or reasoning) of data based on facts. On the traditional web and in databases, however, exact querying often provides an incomplete answer, as queries are over-specified or the mix of multiple ontologies/modelling differences requires "interpretational flexibility." Therefore, similarity measures or ranking approaches are frequently used to extend the reach of a query. This paper extends this idea to the semantic web. It introduces iRDQL, a semantic web query language with support for similarity joins. It is an extension of RDQL (RDF Data Query Language) that enables its users to query for similar resources, ranking the results using a similarity measure. We show how iRDQL allows the reach of a query to be extended by finding additional results. We quantitatively evaluated four similarity measures for their usefulness in iRDQL in the context of an OWL-S semantic web service retrieval test collection and compared the results to a specialized OWL-S matchmaker. Initial results of using iRDQL indicate that it is indeed useful for extending the reach of queries and that it is able to improve recall without overly sacrificing precision. We also found that our generic iRDQL approach was only slightly outperformed by the specialized algorithm. |
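Conceptually, the similarity join behind this idea pairs query results with candidate resources, filters by a threshold, and ranks by similarity. A plain nested-loop sketch with a pluggable measure (the abstract does not give iRDQL's concrete syntax or implementation, so this is only the generic technique):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    /** Generic similarity-join sketch over two result sets. */
    public class SimilarityJoin {

        interface Similarity { double sim(String a, String b); }

        static class Match {
            final String left, right;
            final double score;
            Match(String l, String r, double s) { left = l; right = r; score = s; }
        }

        /** Keep all pairs above the threshold, ranked by similarity. */
        static List<Match> join(List<String> results, List<String> candidates,
                                Similarity measure, double threshold) {
            List<Match> out = new ArrayList<>();
            for (String r : results)
                for (String c : candidates) {
                    double s = measure.sim(r, c);
                    if (s >= threshold) out.add(new Match(r, c, s));
                }
            out.sort(Comparator.comparingDouble((Match m) -> m.score).reversed());
            return out;
        }
    }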
|
Adrian Bachmann, Indoornavigation mittels Ortsinterpolation, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
Satellite navigation is ubiquitous in our daily life. Unfortunately, satellite navigation signals sometimes cannot be received. As a consequence, a new stream of research has emerged that focuses on alternatives, especially in the area of indoor navigation. In this diploma thesis a new approach is developed and presented, one that derives navigation information from accelerometer and magnetometer sensor data in a novel way. The new approach overcomes the shortcoming of insufficient calibration, one of the major issues in current research. The main contribution of this work is an online calibration framework that adapts to changing boundary conditions. The result is a much more robust, precise, and up-to-date basis for path extrapolation. |
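One of the primitives such an approach builds on is deriving a heading from magnetometer readings. A minimal sketch (a flat-lying, already calibrated sensor is assumed here; the thesis adds online calibration and changing conditions on top):

    /** Sketch: compass heading from horizontal magnetometer components. */
    public class Heading {
        /** Heading in degrees, 0 = magnetic north, clockwise positive. */
        static double headingDeg(double mx, double my) {
            double deg = Math.toDegrees(Math.atan2(-my, mx));
            return deg < 0 ? deg + 360 : deg;
        }
    }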
|
Peter Vorburger, Abraham Bernstein, Entropy-based Concept Shift Detection, In: IEEE International Conference on Data Mining (ICDM), March 2006. (Conference or Workshop Paper)
|
|
Abraham Bernstein, Peter Vorburger, A Scenario-Based Approach for Direct Interruptability Prediction on Wearable Devices, Journal of Pervasive Computing and Communications, Vol. 3 (4), 2006. (Journal Article)
People are subjected to a multitude of interruptions. This situation is likely to get worse as technological devices make us increasingly reachable. In order to manage the interruptions, it is imperative to predict a person’s interruptability - his/her current readiness or inclination to be interrupted. In this paper we introduce the approach of direct interruptability inference from sensor streams (accelerometer and audio data) in a ubiquitous computing setup and show that it provides highly accurate and robust predictions. Furthermore, we argue that scenarios are central for evaluating the performance of ubiquitous computing devices (and interruptability-predicting devices in particular) and substantiate this on our setup. We also demonstrate that scenarios provide the foundation for avoiding misleading results and for assessing the results’ generalizability, and that they form the basis for a stratified scenario-based learning model, which greatly speeds up the training of such devices. |
|
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit, In: 10th International Conference on Extending Database Technology (EDBT 2006), Springer, March 2006. (Conference or Workshop Paper)
Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. This paper presents the SOQA-SimPack Toolkit (SST), an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies. |
|
Markus Stocker, The Fundamentals of iSPARQL, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
The growing amount of semantically annotated data and published ontologies opens an interesting and challenging application for similarity measures. In the face of limited knowledge about the distributed data on the Semantic Web, similarity measures allow a retrieval which is expected to improve performance compared to exact querying. This thesis presents iSPARQL, an extension to SPARQL which allows querying for similar resources in both RDF/RDFS and OWL ontologies, and supports the development of strategies to compute the similarity of ontological resources. Huge data volumes forced the development of query optimization techniques for relational database systems. However, query engines for ontological data based on graph models mostly execute user queries without considering any optimization. Especially for large ontologies, optimization techniques are required to ensure that query results can be delivered within reasonable time. OptARQ is a first prototype for iSPARQL query optimization based on the concept of triple pattern selectivity estimation. The evaluation we conduct demonstrates how triple pattern reordering according to selectivity affects query execution performance. |
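The selectivity-based reordering can be sketched as follows (the exact OptARQ estimation functions are defined in the thesis; the component statistics below are placeholders): the estimated selectivity of a triple pattern is taken as the product of its bound components' selectivities, and patterns are executed in ascending order of that estimate.

    import java.util.Comparator;
    import java.util.List;

    /** Sketch of selectivity-based triple pattern reordering. */
    public class TriplePatternReorder {

        static class TriplePattern {
            final String s, p, o; // null marks an unbound variable
            TriplePattern(String s, String p, String o) {
                this.s = s; this.p = p; this.o = o;
            }
        }

        /** Placeholder statistics interface; real estimates would come
         *  from counts and histograms over the queried ontology. */
        interface Stats {
            double subjectSel(String s);          // e.g. 1 / #resources
            double predicateSel(String p);        // #triples with p / #triples
            double objectSel(String p, String o); // e.g. from a histogram
        }

        /** Selectivity as the product of the bound components;
         *  unbound components are maximally unselective (1.0). */
        static double selectivity(TriplePattern t, Stats st) {
            double sel = 1.0;
            if (t.s != null) sel *= st.subjectSel(t.s);
            if (t.p != null) sel *= st.predicateSel(t.p);
            if (t.o != null) sel *= st.objectSel(t.p, t.o);
            return sel;
        }

        /** Execute the most selective (smallest) patterns first. */
        static void reorder(List<TriplePattern> patterns, Stats st) {
            patterns.sort(Comparator.comparingDouble(
                (TriplePattern t) -> selectivity(t, st)));
        }
    }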
|
Iwan Stierli, Entropy-based, Semi-dynamic Regulation of Incremental Algorithms in the case of Instantaneous Concept Drifts, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
Incremental classifiers build their prediction rules from the known instances of a continuous data stream. As these algorithms learn from correct and incorrect predictions, their performance improves the more instances their rules are based on. In the case of an instantaneous concept drift, this assumption is no longer valid, as the old concept's instances falsify the rules being built. It would therefore be ideal to forget those instances. In this thesis, we attempt to regulate this forgetting rate using an adapted form of the entropy term. First, a simple linear correlation between entropy and forgetting rate is ruled out. A second, semi-dynamic and noise-resistant switching strategy is then pursued. It is tested on a synthetic data set and compared with the applicable benchmarks according to two different quality measures. |
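One plausible form of such an adapted entropy term is the binary entropy of the classifier's error rate over a sliding window (a sketch only; the window size and the switching threshold are free parameters, and the thesis' exact formulation may differ):

    /** Sketch of an entropy signal for steering the forgetting rate. */
    public class EntropySignal {
        static double log2(double x) { return Math.log(x) / Math.log(2); }

        /** Binary entropy of the hit rate in the current window;
         *  values near 1 mean maximal uncertainty, a possible drift. */
        static double windowEntropy(boolean[] correctWindow) {
            int hits = 0;
            for (boolean c : correctWindow) if (c) hits++;
            double p = (double) hits / correctWindow.length;
            if (p == 0 || p == 1) return 0;
            return -p * log2(p) - (1 - p) * log2(1 - p);
        }
    }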
|
Beat Sprenger, Semantic Crystal. Ein End-User-Interface zur Unterstützung von Ontologie-Abfragen mit SPARQL, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2006. (Master's Thesis)
Although standards for structuring semantic information have been available for a long time (XML, RDF, OWL, SPARQL), most typical users have not yet encountered the Semantic Web. There are mainly two reasons for this: there are not many semantic knowledge bases available that would be interesting to query, and there is not much software to support such queries either. This research paper therefore presents Semantic Crystal, a prototype of such query software. Semantic Crystal displays classes and relationships from OWL knowledge bases graphically and likewise supports the graphical composition of queries. Although the query eventually exists in the intricate language SPARQL, it is comprehensible to a broad number of users thanks to its graphical representation. |
|