Isabelle Guyon, Jiwen Li, Theodor Mador, Patrick A. Pletscher, Gerold Schneider, Markus Uhr, Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Pattern Recognition Letters, Vol. 28 (12), 2007. (Journal Article)
 
We used the datasets of the NIPS 2003 challenge on feature selection as part of the practical work of an undergraduate course on feature
extraction. The students were provided with a toolkit implemented in Matlab. Part of the course requirements was that they should
outperform given baseline methods. The results were beyond expectations: the students matched or exceeded the performance of the
best challenge entries and achieved very effective feature selection with simple methods. We make available to the community the results
of this experiment and the corresponding teaching material. These results also provide a new baseline for researchers in feature selection. |
|
Peter Höltschi, Ein regel- und statistikbasiertes Empfehlungssystem für das Masterstudium in Informatik, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Bachelor's Thesis)
 
This bachelor's thesis specifies, designs, and prototypes a rule- and statistics-based recommendation
system for planning the master's program in informatics at the University of Zurich. The system supports
informatics students by automatically generating study plans. On the one hand, this guarantees compliance
with the study regulations; on the other hand, students quickly get a picture of what their master's
program could look like. To this end, students provide the data from their transcript of records and
state their preferences regarding specialization and module choice. From these data, the system generates
the desired study plans using several filtering and sorting functions. In an evaluation, students were
asked to create a study plan manually and to supply the data needed for automatic generation. An analysis
of the results and a comparison of the manually and automatically created study plans showed that the
quality of the latter depends strongly on the quality and quantity of the student's stated preferences.
The evaluation also showed that the system should be extended with additional features to be used
optimally. |
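The filter-and-sort pipeline described in the abstract can be pictured with a short sketch. This is a minimal illustration only; the module attributes, the credit rule, and the preference scoring below are hypothetical stand-ins, not the rules or data model specified in the thesis.

    from dataclasses import dataclass

    @dataclass
    class Module:
        name: str
        credits: int
        area: str  # e.g. "Information Systems"

    def satisfies_regulations(plan, min_credits=90):
        # Rule filter: a hypothetical regulation requiring a minimum credit total.
        return sum(m.credits for m in plan) >= min_credits

    def preference_score(plan, preferred_areas):
        # Statistical ranking: share of credits earned in preferred areas.
        total = sum(m.credits for m in plan)
        hit = sum(m.credits for m in plan if m.area in preferred_areas)
        return hit / total if total else 0.0

    def recommend(candidate_plans, preferred_areas):
        # Filter out plans violating the rules, then sort by preference fit.
        valid = [p for p in candidate_plans if satisfies_regulations(p)]
        return sorted(valid, key=lambda p: preference_score(p, preferred_areas),
                      reverse=True)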
|
Roman Zweifel, Developing a Web Portal for Case Studies, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
 
In the last few years, the e-learning offerings of higher-education institutions have increased: they make entire courses and learning materials available to their students on the internet, opening up an additional way of learning.
The more learning resources a portal contains, however, the more difficult it becomes for students to find the right materials and courses. Today the main approach is to offer a full-text search or a categorisation of course materials; newer developments such as faceted browsing or semantic annotation are rarely used. This thesis describes the CasIS portal, developed for the master's program in Computer Science at the University of Zurich. It provides case studies from different areas of Information Systems. For good usability, finding the right resources is essential. With the aid of Semantic Web technology and a faceted browsing tool, students can easily find the appropriate resources. |
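As a rough illustration of the faceted browsing idea: narrow a set of annotated resources by facet values and show the remaining facet counts. The resource records and facet names here are invented for the sketch and do not reflect CasIS' actual data model.

    case_studies = [
        {"title": "ERP rollout", "area": "Information Systems", "language": "en"},
        {"title": "Data warehouse design", "area": "Databases", "language": "de"},
    ]

    def facet_values(resources, facet):
        # Values a facet takes, with counts, for display next to the facet.
        counts = {}
        for r in resources:
            counts[r[facet]] = counts.get(r[facet], 0) + 1
        return counts

    def narrow(resources, **selected):
        # Keep only resources matching every selected facet value.
        return [r for r in resources
                if all(r.get(f) == v for f, v in selected.items())]

    print(facet_values(case_studies, "area"))
    print(narrow(case_studies, area="Databases"))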
|
Abraham Bernstein, Jayalath Ekanayake, Martin Pinzger, Improving defect prediction using temporal features and non linear models, In: Proceedings of the International Workshop on Principles of Software Evolution, IEEE Computer Society, Dubrovnik, Croatia, 2007-09-01. (Conference or Workshop Paper published in Proceedings)
 
Predicting the defects in the next release of a large software system is a very valuable asset for the project manager to plan her resources. In this paper we argue that temporal features (or aspects) of the data are central to prediction performance. We also argue that the use of non-linear models, as opposed to traditional regression, is necessary to uncover some of the hidden interrelationships between the features and the defects and to maintain the accuracy of the prediction in some cases. Using data obtained from the CVS and Bugzilla repositories of the Eclipse project, we extract a number of temporal features, such as the number of revisions and the number of reported issues within the last three months. We then use these data to predict both the location of defects (i.e., the classes in which defects will occur) and the number of reported bugs in the next month of the project. To that end we use standard tree-based induction algorithms in comparison with traditional regression. Our non-linear models uncover the hidden relationships between features and defects and present them in an easy-to-understand form. Results also show that, using the temporal features, our prediction model can predict whether a source file will have a defect with an accuracy of 99% (area under ROC curve 0.9251) and the number of defects with a mean absolute error of 0.019 (Spearman's correlation of 0.96). |
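As a toy illustration of the paper's setup, the sketch below trains a tree-based classifier on temporal features such as recent revisions and recently reported issues. scikit-learn stands in for whatever tool chain the authors used, and the tiny arrays are placeholders, not the Eclipse data.

    from sklearn.tree import DecisionTreeClassifier

    # One row per source file: [revisions in last 3 months, issues in last 3 months].
    # Tiny placeholder values, not the Eclipse data.
    X = [[12, 4], [1, 0], [7, 2], [0, 0]]
    y = [1, 0, 1, 0]  # 1 = file had a defect in the following month

    model = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(model.predict([[5, 1]]))  # defect-proneness guess for a new file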
|
Philippe Hungerbühler, The Influence of SPAM on Performance, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Bachelor's Thesis)

Almost every Internet user knows the problem of SPAM. At work especially, it costs time
to sort out irrelevant emails. This thesis deals with the problem of SPAM and its consequences
for productivity at work. For this reason, an experiment was conducted to examine the distraction
caused by SPAM and how it is perceived. Several hypotheses, stated in advance, were reviewed
on the basis of this experiment. The results and their interpretation are presented and discussed in this
thesis. |
|
Michael Imhof, Entwicklung eines RDF Parsers für transaktionsbasierte Daten, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Bachelor's Thesis)
 
The Java RDF Parser (JRP) is a program that reads files in RDF format and extracts
transactional data from them, which can subsequently be stored in a database. This thesis
describes the development of JRP and gives the reader an insight into the design of the
code, the database schema and connection, as well as an evaluation of Jena, the Java
library used to parse the input files. The program was tested with real data and thereby
demonstrated its correct functionality. Unfortunately, nothing can be said yet about the
scalability of the parser, because no large datasets were available for
performance tests. |
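The parse-then-store pipeline can be sketched as follows. The thesis uses Jena in Java; this illustration uses Python's rdflib and SQLite instead, and the ex: vocabulary and file names are invented for the example.

    import sqlite3
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/tx#")  # hypothetical vocabulary
    g = Graph()
    g.parse("transactions.rdf")  # assumed RDF/XML input file

    # Extract one (id, amount, timestamp) row per transaction resource.
    rows = [(str(s), str(amount), str(ts))
            for s in g.subjects(predicate=EX.amount)
            for amount in g.objects(s, EX.amount)
            for ts in g.objects(s, EX.timestamp)]

    with sqlite3.connect("transactions.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS tx(id TEXT, amount TEXT, ts TEXT)")
        db.executemany("INSERT INTO tx VALUES (?, ?, ?)", rows)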
|
Matthias Linherr, Data Mining auf Kundendaten, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
 
The aim of this thesis is to implement a platform that enables alumni associations to
analyse their member databases. Using statistical methods and data-mining algorithms,
the platform allows the visualization and appraisal of member behaviour and
member structure. Four different alumni organisations use the platform in the form of a
web-based application to maintain their databases; they form the basis of the
evaluations presented here.
|
|
Katharina Reinecke, Abraham Bernstein, Culturally Adaptive Software: Moving Beyond Internationalization, In: Proceedings of the HCI International (HCII), Springer, Beijing, China, July 2007. (Conference or Workshop Paper)
 
So far, culture has played a minor role in the design of software. Our experience with imbuto, a program designed for Rwandan agricultural advisors, has shown that cultural adaptation increased efficiency, but was extremely time-consuming and, thus, prohibitively expensive. In order to bridge the gap between cost savings on the one hand and international usability on the other, this paper promotes the idea of culturally adaptive software. In contrast to manual localization, adaptive software is able to acquire details about an individual's cultural identity during use. Combining insights from the related fields of international usability, user modeling, and user interface adaptation, we show how research findings can be exploited for an integrated approach to automatically adapting software to the user's cultural frame. |
|
Sinja Helfenstein, Visualizing Labor Market Dynamics based on Social Security Records: A Combination of Temporal and Visual Data Mining, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
 
The goal of this thesis is to understand temporal patterns in the Austrian Social Security Database in order to derive labor market dynamics. As these structures are very complex, conventional data mining approaches turned out to be inadequate for interpretation and knowledge discovery. The main challenge is the intuitive representation of the time dimension. Therefore, we keep the time dimension by generating movies of concatenated probabilistic model visualizations. Using this combination of temporal and visual data mining allows us to identify various effects such as seasonal hiring cycles, gender- and age-related employment dynamics, and demographic influences. |
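The movie-generation idea, one frame per time period concatenated into an animation, might look like the following sketch. The per-month values are random placeholders standing in for the probabilistic model output, and matplotlib's animation API is used for illustration, not necessarily the thesis' actual tooling.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.animation import FuncAnimation

    months = 24
    rng = np.random.default_rng(0)
    employment_by_age = rng.random((months, 10))  # placeholder model output

    fig, ax = plt.subplots()
    bars = ax.bar(range(10), employment_by_age[0])

    def draw(month):
        # Redraw the bar heights for one time slice; frames are chained in order.
        for bar, h in zip(bars, employment_by_age[month]):
            bar.set_height(h)
        ax.set_title(f"Month {month}")

    anim = FuncAnimation(fig, draw, frames=months)
    anim.save("labor_dynamics.mp4")  # requires ffmpeg to be installed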
|
Domenic Benz, Voraussage von Benutzerverhalten in dynamischen Umgebungen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
 
The increasing proliferation of mobile phones has a significant influence on our daily lives. Although the increasing use of mobile devices has brought several advantages, it also has the negative effect of unwanted disturbances and interruptions. It is desirable for a mobile phone to be able to adapt to the current situation it is in. For such an adaptation to become possible, the mobile phone needs information about its current context. To achieve this goal, we implemented software that gathers data from a variety of sensors on a mobile phone. This software is then used in a prototype experiment, in which we try to determine whether it is possible to predict a user's activity and location based on the collected data. The software implemented in this thesis and the results of the experiment help to prepare and conduct follow-up experiments in the fields of context awareness and human interruptibility research. |
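One simple baseline for the kind of prediction investigated here is a first-order Markov model over logged locations: predict the most frequent successor of the current location. The sketch below illustrates that baseline; it is not claimed to be the method used in the thesis.

    from collections import Counter, defaultdict

    def train(location_log):
        # Count observed transitions between consecutive locations.
        transitions = defaultdict(Counter)
        for here, nxt in zip(location_log, location_log[1:]):
            transitions[here][nxt] += 1
        return transitions

    def predict(transitions, here):
        # Predict the most frequent successor of the current location.
        follow = transitions.get(here)
        return follow.most_common(1)[0][0] if follow else None

    log = ["home", "train", "office", "train", "home", "train", "office"]
    print(predict(train(log), "train"))  # "office" in this toy log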
|
Christoph Kiefer, Imprecise SPARQL: Towards a Unified Framework for Similarity-Based Semantic Web Tasks, In: Proceedings of 2nd Knowledge Web PhD Symposium (KWEPSY) colocated with the 4th Annual European Semantic Web Conference (ESWC), June 2007. (Conference or Workshop Paper)
 
This proposal explores a unified framework to solve Semantic Web tasks that often require similarity measures, such as RDF retrieval, ontology alignment, and semantic service matchmaking. Our aim is to see how far it is possible to integrate user-defined similarity functions (UDSF) into SPARQL to achieve good results for these tasks. We present some research questions, summarize the experimental work conducted so far, and present our research plan that focuses on the various challenges of similarity querying within the Semantic Web. |
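To make the UDSF idea concrete, the sketch below registers a string-similarity function with rdflib's SPARQL engine and uses it in a FILTER. The proposal targets its own iSPARQL framework, so this is only one possible realization; the function URI and the input file are invented.

    from difflib import SequenceMatcher
    from rdflib import Graph, Literal, URIRef
    from rdflib.plugins.sparql.operators import register_custom_function

    SIM = URIRef("http://example.org/fn#similarity")  # hypothetical function URI

    def similarity(a, b):
        # A simple string-based UDSF; any similarity measure could go here.
        return Literal(SequenceMatcher(None, str(a), str(b)).ratio())

    register_custom_function(SIM, similarity)

    g = Graph()
    g.parse("data.rdf")  # assumed input
    q = """
    PREFIX fn: <http://example.org/fn#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label WHERE {
      ?s rdfs:label ?label .
      FILTER (fn:similarity(?label, "knowledge base") > 0.7)
    }"""
    for row in g.query(q):
        print(row.s, row.label)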
|
Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, Analyzing Software with iSPARQL, In: Proceedings of the 3rd International Workshop on Semantic Web Enabled Software Engineering (SWESE 2007), Springer, June 2007. (Conference or Workshop Paper)
 
|
|
Dennis Weiss, Mining Customer Networks and Inter-Product Relations in Internet / Digital Entertainment Provider Data, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)

Today's telecommunication companies have at their disposal large quantities of detailed transaction data. Methods of data mining can be utilized to generate information on product use, customer behaviour, and interaction between customers. Cross-selling analyses, customer segmentation, and social network analysis represent only some of the practices that can be employed to facilitate direct-marketing procedures. This thesis illustrates an approach for identifying customer groups and their networks, from which management implications may be derived by means of propositional as well as relational data mining. In this context, triple-play customers - i.e., subscribers of broadband internet, fixed-line telephony, and digital TV - were segmented on the basis of data generated from product use. In addition, network analysis and the search for multi-relational patterns provided further insight into customer types and their respective needs. |
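The network-analysis step can be illustrated with a minimal sketch: build a graph from pairwise customer interactions and inspect its groups and central members. The edge list is an invented placeholder, not provider data, and the analysis shown is far simpler than the multi-relational mining described above.

    import networkx as nx

    # Placeholder interaction data: who called whom.
    calls = [("anna", "ben"), ("ben", "cara"), ("anna", "cara"), ("dave", "eli")]
    g = nx.Graph()
    g.add_edges_from(calls)

    for community in nx.connected_components(g):
        print(sorted(community))      # customer groups linked by interaction
    print(nx.degree_centrality(g))    # who is central within the network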
|
Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, How Long will it Take to Fix This Bug?, In: Proceedings of the Fourth International Workshop on Mining Software Repositories, IEEE Computer Society, May 2007. (Conference or Workshop Paper)

Predicting the time and effort for a software problem has long been a difficult task. We present an approach that automatically predicts the fixing effort, i.e., the person-hours spent on fixing an issue. Our technique leverages existing issue tracking systems: given a new issue report, we use the Lucene framework to search for similar, earlier reports and use their average time as a prediction. Our approach thus allows for early effort estimation, helping in assigning issues and scheduling stable releases. We evaluated our approach using effort data from the JBoss project. Given a sufficient number of issue reports, our automatic predictions are close to the actual effort; for issues that are bugs, we are off by only one hour, beating naive predictions by a factor of four. |
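The paper's nearest-neighbor idea in miniature: index earlier reports as text, find the ones most similar to a new report, and average their recorded effort. The sketch uses scikit-learn's TF-IDF and cosine similarity in place of the Lucene framework the authors used; the reports and hours are placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder history: past report titles and the hours spent fixing them.
    reports = ["NPE in login handler", "UI freezes on save", "NPE when saving profile"]
    effort_hours = [4.0, 12.0, 5.0]

    vec = TfidfVectorizer()
    index = vec.fit_transform(reports)

    new_report = vec.transform(["NPE in profile page"])
    sims = cosine_similarity(new_report, index)[0]
    k = 2
    nearest = sims.argsort()[-k:]  # indices of the k most similar past reports
    print(sum(effort_hours[i] for i in nearest) / k)  # predicted person-hours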
|
Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, Predicting Effort to fix Software Bugs, In: Proceedings of the 9th Workshop Software Reengineering, May 2007. (Conference or Workshop Paper)

|
|
Jonas Tappolet, Mining Software Repositories - A Semantic Web Approach, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
 
Modern software development has become a complex task. Software systems grow larger and are densely interconnected with other systems, making extensive use of large communication frameworks. To cope with this complexity, software developers and project managers need the assistance of tools that extract information about flaws in the code as well as general information about the state of a project. In this thesis, we first introduce a data exchange format based on OWL/RDF, the Semantic Web's format of choice today, able to store data and metadata from the source code, the versioning system (i.e., CVS), and the bug tracking system (i.e., Bugzilla). In a next step, we present a tool that retrieves the data from the online software repositories and stores it in OWL/RDF. This tool is implemented as a plug-in for the Eclipse IDE and is able to harvest data from projects managed by Eclipse. Finally, we evaluated our data format and tools by applying a set of software metric calculations, pattern detections, and similarity measures using iSPARQL and SimPack. The results of the conducted experiments are promising and give a first proof of concept for our approach. |
|
Esther Kaufmann, Abraham Bernstein, How Useful are Natural Language Interfaces to the Semantic Web for Casual End-users?, In: 6th International Semantic Web Conference (ISWC 2007), March 2007. (Conference or Workshop Paper)
 
Natural language interfaces (NLIs) offer end-users a familiar and convenient option for querying ontology-based knowledge bases. Several studies have shown that they can achieve high retrieval performance as well as domain independence. This paper focuses on usability and investigates whether NLIs are useful from an end-user's point of view. To that end, we introduce four interfaces, each allowing a different query language, and present a usability study benchmarking these interfaces. The results of the study reveal a clear preference for full sentences as the query language and confirm that NLIs are useful for querying Semantic Web data. |
|
Christoph Kiefer, Abraham Bernstein, Markus Stocker, The Fundamentals of iSPARQL - A Virtual Triple Approach For Similarity-Based Semantic Web Tasks, In: Proceedings of the 6th International Semantic Web Conference (ISWC), Springer, March 2007. (Conference or Workshop Paper)
 
This research explores three SPARQL-based techniques to solve Semantic Web tasks that often require similarity measures, such as semantic data integration, ontology mapping, and Semantic Web service matchmaking. Our aim is to see how far it is possible to integrate customized similarity functions (CSF) into SPARQL to achieve good results for these tasks. Our first approach exploits virtual triples calling property functions to establish virtual relations among the resources under comparison; the second approach uses extension functions to filter out resources that do not meet the requested similarity criteria; finally, our third technique applies new solution modifiers to post-process a SPARQL solution sequence. The semantics of the three approaches are formally elaborated and discussed. We close the paper with a demonstration of the usefulness of our iSPARQL framework in the context of a data integration and an ontology mapping experiment. |
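The first of the three techniques, virtual triples, can be illustrated by the shape of the query alone. In the sketch below, the triple with the isparql: predicate is computed by a property function rather than matched against stored data; the namespace and function name are illustrative, and a standard SPARQL engine would not evaluate this pattern.

    # The isparql: predicate below denotes a property function, i.e., a
    # computed ("virtual") relation rather than a stored triple.
    VIRTUAL_TRIPLE_QUERY = """
    PREFIX isparql: <http://example.org/isparql#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?a ?b ?sim WHERE {
      ?a rdfs:label ?la . ?b rdfs:label ?lb .
      ?sim isparql:levenshtein (?la ?lb) .
      FILTER (?sim > 0.8)
    }
    """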
|
Abraham Bernstein, Michael Daenzer, The NExT System: Towards True Dynamic Adaptions of Semantic Web Service Compositions (System Description), In: Proceedings of the 4th European Semantic Web Conference (ESWC '07), Springer, March 2007. (Conference or Workshop Paper)
 
Traditional process support systems typically offer a static composition of atomic tasks into more powerful services. In the real world, however, processes change over time: business needs evolve rapidly, thus changing the work itself, and relevant information may be unknown until workflow execution time. Hence, the static approach does not sufficiently address the need for dynamism. Based on applications in the life science domain, this paper puts forward five requirements for dynamic process support systems. These demand a focus on tight user interaction throughout the whole process life cycle: the system and the user establish a continuous feedback loop, resulting in a mixed-initiative approach that requires a partial execution and resumption feature to adapt a running process to changing needs. Here we present our prototype implementation NExT and discuss a preliminary validation based on a real-world scenario. |
|
Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, Mining Software Repositories with iSPARQL and a Software Evolution Ontology, In: Proceedings of the 2007 International Workshop on Mining Software Repositories (MSR '07), IEEE Computer Society, March 2007. (Conference or Workshop Paper)
 
One of the most important decisions researchers face when analyzing the evolution of software systems is the choice of a proper data analysis/exchange format. Most existing formats have to be processed with special programs written specifically for that purpose and are not easily extensible. Most scientists, therefore, use their own database(s), requiring each of them to repeat the work of writing import/export programs for their format. We present EvoOnt, a software repository data exchange format based on the Web Ontology Language (OWL). EvoOnt includes software, release, and bug-related information. Since OWL describes the semantics of the data, EvoOnt is (1) easily extensible, (2) supported by many existing tools, and (3) allows assertions to be derived through its inherent Description Logic reasoning capabilities. The paper also presents iSPARQL, our SPARQL-based Semantic Web query engine containing similarity joins. Together with EvoOnt, iSPARQL can accomplish a sizable number of tasks sought in software repository mining projects, such as an assessment of the amount of change between versions or the detection of bad code smells. To illustrate the usefulness of EvoOnt (and iSPARQL), we perform a series of experiments with a real-world Java project. These show that a number of software analyses can be reduced to simple iSPARQL queries on an EvoOnt dataset. |
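As an illustration of reducing a software analysis to a query, the sketch below counts methods per class over an EvoOnt-style graph to flag "god class" candidates. The som: vocabulary, file name, and threshold are approximations invented for the example; the released EvoOnt format may use different names, and plain SPARQL (via rdflib) stands in for iSPARQL.

    from rdflib import Graph

    g = Graph()
    g.parse("project-evoont.rdf")  # assumed EvoOnt export of a Java project

    q = """
    PREFIX som: <http://example.org/evoont/som#>
    SELECT ?cls (COUNT(?m) AS ?methods) WHERE {
      ?cls a som:Class ;
           som:hasMethod ?m .
    }
    GROUP BY ?cls
    HAVING (COUNT(?m) > 40)
    ORDER BY DESC(?methods)
    """
    for row in g.query(q):
        print(row.cls, row.methods)  # classes with suspiciously many methods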
|