Ausgezeichnete Informatikdissertationen 2018, Edited by: Steffen Hölldobler, Sven Appel, Abraham Bernstein, Felix Freiling, Hans-Peter Lenhof, Paul Molitor, Gustaf Neumann, Rüdiger Reischuk, Björn Scheuermann, Nicole Schweikardt, Myra Spiliopoulou, Sabine Süsstrunk, Klaus Wehrle, Gesellschaft für Informatik, Bonn, 2019. (Edited Scientific Work)

|
|
KV Rosni, Manish Shukla, Vijayanand Banahatti, Sachin Lodha, Consent Recommender System: A Case Study on LinkedIn Settings, PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies As Part of the AAAI Spring Symposium Series (AAAI-SSS 2019), 2019. (Journal Article)

|
|
Oana Inel, Lora Aroyo, Validation methodology for expert-annotated datasets: Event annotation case study, In: 2nd Conference on Language, Data and Knowledge (LDK 2019), 2019. (Conference or Workshop Paper published in Proceedings)

|
|
Daniel Spicar, Efficient spectral link prediction on graphs: approximation of graph algorithms via coarse-graining, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Dissertation)
 
Spectral graph theory studies the properties of graphs in relation to their eigenpairs, that is, the eigenvalues and eigenvectors of associated graph matrices. Successful applications of spectral graph theory include the ranking of web search results and link prediction in graphs. The latter is used to predict the evolution of graphs and to discover previously unobserved edges. However, the computation of eigenpairs is computationally very demanding: the eigenvalue-eigenvector decomposition of graph matrices has cubic time complexity, which makes a full eigendecomposition infeasible as graphs or networks become large. This thesis addresses the complexity problem for one of the most accurate state-of-the-art spectral link prediction methods, which requires several eigenvalue-eigenvector decompositions and is therefore applicable to small graphs only.
Previous work on similar complexity bottlenecks has approached the problem by computing only a subset of the eigenpairs in order to obtain an approximation of the original method at lower computational cost. This thesis takes the same approach, but instead of modifying the original link prediction algorithm, it uses the eigenpair subset to approximate the graph itself, leaving the link prediction algorithm essentially unchanged. The graph is approximated by spectral coarse-graining, a method that shrinks graphs while preserving their dominant spectral properties. This approach is motivated by the hypothesis that results computed on a coarse-grained graph approximate the original link prediction results.
The main contribution presented in this dissertation is a new, coarse-grained spectral link-prediction approach. In a first part, the state-of-the-art link prediction method is combined with spectral coarse-graining, and its computational cost, complexity, and link prediction accuracy are evaluated. Theoretical analysis and experiments show that the coarse-grained approach produces accurate approximations of the original method at a significantly reduced time complexity. Thereafter, the spectral coarse-graining method is extended to make the complexity reduction more controllable and to avoid the computation of the eigenvalue-eigenvector decomposition altogether. This dramatically increases the efficiency of the proposed approach and makes it possible to compute more accurate graph approximations. As a result, the link prediction accuracy can be significantly improved while maintaining the reduced time complexity of the coarse-grained approach.
Furthermore, the proposed approach produces a valid graph of the same structure and type as the original graph. In principle, it can be used with many other graph applications without the need for major adaptations. Therefore, the approach is a step towards a more general approximation framework for spectral graph algorithms. |
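As a rough illustration of the recipe the abstract describes – compute only a few leading eigenpairs, group spectrally similar nodes, and shrink the graph – consider the following minimal Python sketch. It is not the thesis's actual algorithm: the quantile binning on the dominant eigenvector is a simplified stand-in for the real spectral coarse-graining criterion, and all sizes are illustrative.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def coarse_grain(adj, k=10, n_groups=50):
    # Compute only the k leading eigenpairs with an iterative Lanczos
    # solver instead of the O(n^3) full eigendecomposition.
    vals, vecs = eigsh(adj, k=k, which='LM')
    # Group nodes with similar coordinates in the dominant eigenvector;
    # quantile binning stands in for the real coarse-graining criterion.
    order = np.argsort(vecs[:, -1])
    groups = np.array_split(order, n_groups)
    # Build the averaging projector P (one row per group) and shrink
    # the adjacency matrix: the result keeps the dominant spectral
    # structure while being much smaller.
    rows = np.concatenate([np.full(len(m), g) for g, m in enumerate(groups)])
    cols = np.concatenate(groups)
    data = np.concatenate([np.full(len(m), 1.0 / len(m)) for m in groups])
    P = csr_matrix((data, (rows, cols)), shape=(n_groups, adj.shape[0]))
    return P @ adj @ P.T

rng = np.random.default_rng(1)
A = csr_matrix(np.triu(rng.integers(0, 2, size=(300, 300)), 1), dtype=float)
A = A + A.T                      # symmetric adjacency of a random graph
print(coarse_grain(A).shape)     # (50, 50)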
|
Proceedings of the ISWC 2019 Satellite Tracks (Posters & Demonstrations, Industry, and Outrageous Ideas) co-located with 18th International Semantic Web Conference (ISWC 2019), Edited by: Mari Carmen Suárez-Figueroa, Gong Cheng, Anna Lisa Gentile, Christophe Guéret, Maria Keet, Abraham Bernstein, CEUR Workshop Proceedings, CEUR, 2019. (Edited Scientific Work)

|
|
Laurenz Shi, Differential Privacy Algorithms for Data Streams on Top of Apache Flink, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
 
With the rise of computing systems that process large amounts of data, securing sensitive data has become a lasting concern. Research has shown that, with the help of other data sets, anonymized data sets can be reconstructed. Differential privacy is a construct for securing sensitive data: the data is perturbed with noise to ensure privacy. Differential privacy has been shown to be very effective against attempts to reconstruct anonymized data sets in combination with other data sets. Originally used on static data sets, differential privacy was later extended to data streams. This thesis examines four papers that introduce various such algorithms. The algorithms are often of a theoretical nature; to evaluate their real effects on the results of data streams, they are implemented and compared to each other. The focus is on the accuracy that the algorithms deliver. |
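As an illustration of the perturbation idea only (not of the specific streaming algorithms surveyed in the thesis, which use more refined mechanisms such as tree-based aggregation), here is a minimal Python sketch that releases a running sum over a stream with Laplace noise added at every step; all names and parameters are illustrative.

import numpy as np

def dp_running_sum(stream, epsilon=1.0, sensitivity=1.0):
    # Perturb each increment with Laplace noise of scale
    # sensitivity/epsilon, masking any single record's contribution.
    rng = np.random.default_rng()
    noisy_sum = 0.0
    for value in stream:
        noisy_sum += value + rng.laplace(0.0, sensitivity / epsilon)
        yield noisy_sum

# Values are assumed to be bounded in [0, 1]; a smaller epsilon means
# more noise, hence stronger privacy but lower accuracy.
for released in dp_running_sum([0.2, 0.9, 0.4], epsilon=0.5):
    print(released)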
|
Bibek Paudel, Suzanne Tolmeijer, Abraham Bernstein, Bringing Diversity in News Recommender Algorithms, In: ECREA 2018 - pre-conference workshop on Information, Diversity and Media Pluralism in the Age of Algorithms. 2018. (Conference Presentation)
 
In major political events and discussions, recommender algorithms play a large role in shaping opinions and setting the agenda for more traditional news media. These algorithms are used pervasively in social networks, media platforms and search engines. They determine what information is shown to the user and in which order.
Existing recommender systems focus on improving accuracy based on historical interaction data, an approach that has been criticized as detrimental to the goals of improving user experience and information diversity. Their accuracy is measured in terms of predicting future user behavior from past observations. The most popular algorithms suggest items to users based on the choices of similar users.
However, these systems are observed to promote a narrow set of already-popular items that are similar to past choices of the user, or are liked by many users. This limits users' exposure to diverse viewpoints and potentially increases polarization.
One reason for the lack of diversity in existing approaches is that they optimize for average accuracy and click rates. Other aspects of user experience, like new information and differing viewpoints, are not taken into account. Similarly, a high average accuracy does not necessarily mean that the algorithm is similarly accurate for different groups of users and niche items. Besides this, training data may be biased because of previous recommendations or the data collection process.
Although information diversity is important, its definition depends heavily on the specific application, and it is hard to find a general definition in the context of recommender systems. Nevertheless, there are definitions that pertain to some specific aspects of diversity, such as entropy, coverage, personalization, and surprisal.
There is a need to design recommender algorithms that balance accuracy and diversity in order to deal with the aforementioned problems. To approach this problem, we first investigated state-of-the-art algorithms and found that they promote already popular products. We also found that recommender diversity depends on the optimization parameters of the algorithms, a factor that had not received enough attention before. In this context, we developed a graph-based method that promotes niche items without sacrificing accuracy. We also developed a probabilistic latent-factor model that significantly improves the coverage and long-tail diversity of recommendations. Currently, we are using these insights to develop new machine learning algorithms for diverse news recommendation. We plan to deploy and test them in a large-scale experiment involving multiple news producers and consumers. |
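Two of the diversity notions mentioned above, coverage and entropy, can be operationalized roughly as in the following Python sketch; these simplified definitions are illustrative and not necessarily the exact measures used in this work.

import numpy as np
from collections import Counter

def coverage(rec_lists, catalog_size):
    # Fraction of the catalog recommended to at least one user.
    distinct = {item for rec in rec_lists for item in rec}
    return len(distinct) / catalog_size

def entropy(rec_lists):
    # Shannon entropy of recommendation frequencies: higher values
    # mean exposure is spread more evenly across items.
    counts = Counter(item for rec in rec_lists for item in rec)
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

recs = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]      # three users' top-3 lists
print(coverage(recs, catalog_size=10), entropy(recs))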
|
Romana Pernisch, Florian Ruosch, Daniele Dell'Aglio, Abraham Bernstein, Stream Processing: The Matrix Revolutions, In: SSWS 2018 Scalable Semantic Web Knowledge Base Systems, CEUR-WS.org, Aachen, 2018-10-09. (Conference or Workshop Paper published in Proceedings)
 
Analyzing data streams is a vital task in data science. Often, data comes in different shapes such as triples, tuples, relations, or matrices. Traditional stream processing systems, however, only process data in one of these formats.
To enable the processing of streams combining different shapes of data, we developed a system that parses SPARQL queries using the Apache Jena parser and transforms them into Apache Flink topologies. With a custom data type and tailored functions, we enabled the integration of matrices in Jena, thereby allowing graph, relational, and linear algebra operations to be mixed in an RDF graph. This provides a proof of concept that queries may be written for static data and – with the streaming engine Flink – can easily be run on data streams, even if those streams contain several of the aforementioned data shapes. |
|
Riccardo Tommasini, Yehia Abo Sedira, Daniele Dell'Aglio, Marco Balduini, Muhammed Intizar Ali, Danh Le Phuoc, Emanuele Della Valle, Jean-Paul Calbimonte, VoCaLS: Describing Streams on the Web, In: ISWC 2018 Posters & Demonstrations and Industry Tracks, CEUR-WS.org, Aachen, 2018-10-08. (Conference or Workshop Paper)
 
|
|
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Make restaurants pay your server bills, In: ISWC 2018 Posters & Demonstrations and Industry Tracks, CEUR-WS.org, Aachen, 2018-10-08. (Conference or Workshop Paper)
 
|
|
Riccardo Tommasini, Yehia Abo Sedira, Daniele Dell'Aglio, Marco Balduini, Muhammed Intizar Ali, Danh Le Phuoc, Emanuele Della Valle, Jean-Paul Calbimonte, VoCaLS: Vocabulary & Catalog of Linked Streams, In: The Semantic Web - ISWC 2018, Springer, Cham, 2018-10-08. (Conference or Workshop Paper published in Proceedings)
 
|
|
Matthias Baumgartner, Wen Zhang, Bibek Paudel, Daniele Dell'Aglio, Huajun Chen, Abraham Bernstein, Aligning Knowledge Base and Document Embedding Models using Regularized Multi-Task Learning, In: The Semantic Web – ISWC 2018, Springer, Cham, 2018-10-08. (Conference or Workshop Paper published in Proceedings)
 
Knowledge Bases (KBs) and textual documents contain rich and complementary information about real-world objects, as well as relations among them. While text documents describe entities in free form, KBs organize such information in a structured way. This makes the two information representation forms hard to compare and integrate, limiting the possibility of using them jointly to improve predictive and analytical tasks. In this article, we study this problem and propose KADE, a solution based on regularized multi-task learning of KB and document embeddings. KADE can potentially incorporate any KB and document embedding learning method. Our experiments on multiple datasets and methods show that KADE effectively aligns document and entity embeddings while maintaining the characteristics of the embedding models. |
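The following Python sketch illustrates the general idea of regularized alignment – a penalty that pulls each document embedding toward the embeddings of the entities it mentions, added to each model's own training loss in a multi-task fashion. It is a hypothetical simplification, not KADE's actual objective; all names, dimensions, and data are illustrative.

import numpy as np

rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(5, 16))       # 5 KB entities (e.g., TransE-style)
doc_emb = rng.normal(size=(3, 16))          # 3 documents (e.g., Doc2Vec-style)
mentions = {0: [1, 2], 1: [0], 2: [3, 4]}   # document -> mentioned entities

def alignment_penalty(doc_emb, entity_emb, mentions, lam=0.1):
    # Squared distance between each document and the mean embedding of
    # its mentioned entities, weighted by the regularization strength.
    loss = 0.0
    for d, ents in mentions.items():
        diff = doc_emb[d] - entity_emb[ents].mean(axis=0)
        loss += lam * float(diff @ diff)
    return loss

print(alignment_penalty(doc_emb, entity_emb, mentions))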
|
Melina Mast, Application Prototype for Recognizing High-Stress Situations, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
Fitness trackers and apps provide a new way to monitor the behaviour and health of a person. These technologies support psychiatrists in the treatment of patients. This thesis presents a prototype that aims to identify stressful situations with a fitness tracker and an app. The prototype will help medical professionals to evaluate whether a mobile healthcare system is accepted by patients. In addition to collecting physiological data, subjective perceptions of stress and well-being can be recorded via the app. Authorized specialists can view the gathered data in a web application. |
|
Simon Tännler, Framework for News Recommendations, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
A common issue in recommender systems is the lack of explicit feedback. In this thesis we build a news recommendation framework to collect explicit and implicit feedback from users reading news articles. Our goal is to find a generic way to interpret implicit feedback in order to approximate a user's interest in an article. After conducting a field study, we train machine learning models on the collected implicit feedback to predict the explicit feedback the user assigned to a given article. Our results reveal that the time spent reading and the reading progress are the two most influential features. These results have limitations due to UI issues with the way we collected explicit feedback in our iOS client: our design restricted the amount of negative explicit feedback, which led to a severely imbalanced data set. We conclude with suggestions to improve the UI design and recommend a more comprehensive experiment to collect an overall bigger data set. |
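A sketch, in Python with scikit-learn, of how such a prediction task might be set up; the feature values, labels, and choice of model are hypothetical stand-ins for the models actually trained in the thesis.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical implicit signals per (user, article) pair:
# [seconds spent reading, reading progress in 0..1].
X = np.array([[12.0, 0.10], [240.0, 0.90], [95.0, 0.60],
              [5.0, 0.05], [310.0, 1.00], [60.0, 0.40]])
y = np.array([0, 1, 1, 0, 1, 0])   # explicit feedback: 1 = liked

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
# Feature importances hint at which implicit signal matters most.
print(clf.feature_importances_, clf.score(X_te, y_te))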
|
Jérôme Oesch, Benchmarking Incremental Reasoner Systems, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
Benchmarking reasoner systems is an already widespread approach for comparing different ontology reasoners with each other. Yet with the emergence of incremental reasoners, no benchmarks have been proposed so far that are able to test competing incremental implementations. In this thesis, we not only propose a benchmarking framework that fills this gap, but also present a new kind of benchmark that is capable of generating both queries and subsequent ontology edits automatically, requiring just an existing ontology as input. Our implemented framework, ReasonBench++, uses the concepts of Competency Question-driven Ontology Authoring to generate queries, and of Ontology Evolution Mapping to generate edits. With these two concepts, we are able to show that ReasonBench++ generates close-to-real-life benchmarks that reflect an ontology intensively edited by ontology authors while simultaneously queried by users. |
|
Michael Feldman, Crowdsourcing data analysis: empowering non-experts to conduct data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
 
The development of the Internet-based ecosystem has led to the emergence of alternative recruitment models that are facilitated exclusively through the internet. With Online Labor Markets (OLMs) and crowdsourcing platforms it is possible to hire individuals online to conduct tasks and projects of different size and complexity. Crowdsourcing platforms are well suited for simple micro-tasks that take seconds or minutes and can be completed by a large number of participants working in parallel. OLMs, on the other hand, usually allow hiring experts in a flexible manner for more advanced projects that can take days, weeks, or even months. Due to the flexibility of such employment models it is possible to find various experts on OLMs, such as designers, lawyers, developers, or engineers. However, it is relatively rare to find data scientists – experts able to preprocess, analyze, and make sense of data. This scarcity is not surprising given the general shortage of data science experts. Moreover, for various reasons, such as extensive education and training requirements as well as soaring demand, the shortage of such experts is projected to grow over the next years.
In this dissertation we explored how the crowdsourcing approach can be leveraged to support data science projects. In particular, we presented three use cases where crowds and freelancers with different expertise levels can be involved in supporting data science projects. We classified crowds into low, intermediate, and high levels of expertise in data analysis and proposed use cases where every group might contribute in a crowdsourcing setting.
In the first case study we presented an approach for engaging crowds in the review of statistical assumptions in scientific publications. When researchers use statistical methods in scientific manuscripts, these methods are often valid only if their underlying assumptions are met; if these assumptions are compromised, the validity of the results is questionable. We presented an approach based on micro-tasking with layman crowds that reaches quality similar to expert-based review. We then conducted a longitudinal analysis of CHI conference proceedings to evaluate how standards of statistical reporting have changed throughout the years. Finally, we compared the CHI proceedings with five top journals in the fields of medicine, management, and psychology to compare the reporting of statistical assumptions across disciplines.
Our second case study addressed freelancers with intermediate expertise in data analysis. To better understand the skills that intermediate experts possess, we interviewed expert data scientists, asking what kinds of tasks could be outsourced to non-experts. Additionally, we conducted a survey in the most prominent OLMs to better understand the skills of freelancers active in data analysis. The conclusions of this study were twofold: 1) individuals with certain coding skills can, conservatively, be helpful in data science projects if integrated properly, and 2) data preprocessing is by far the biggest bottleneck activity that can be outsourced if the coordination between the involved parties is managed properly. Building on these results, we designed a proof-of-concept platform and conducted a number of experiments in which non-experts collaborated with experts by taking over data preprocessing activities. Our results suggest that the outcomes achieved by mixed-expertise teams are similar in quality to, and cheaper than, the work of experts.
Our last use case was directed not so much at alleviating the shortage of data scientists as at taking advantage of the crowdsourcing setting to address an inherent vulnerability of data-driven analysis. Recently, there has been a discussion among data analysis experts and researchers regarding the subjectivity of the outputs of data-driven analysis. Namely, it has been shown that when data analysts are provided with the same data and the same hypothesis within an NHST (Null Hypothesis Significance Testing) approach, they often reach markedly different results. Therefore, we conducted a study in which we provided 47 experts with the same data and hypotheses to answer. Through a specially designed platform we were able to elicit the rationale for every decision made throughout the data analysis. This fine-grained data allowed us to conduct a qualitative analysis exploring the underlying factors that lead to the variability of data analysis results.
Taken together, the case studies provide an overview of how the discipline of data science can benefit from the crowdsourcing approach. We hope that the solutions proposed in this dissertation will contribute to the discussion on how to reduce the entry barrier for laymen to participate in data-driven research, as well as how to improve the transparency of how results are reached. |
|
Robin Stohler, Document Embedding Models - A Comparison with Bag-of-Words, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
Word embeddings have completely changed the possibilities in the fields of Natural Language Processing and Machine Learning, opening new doors for many applications. One is the creation of document embeddings with the Doc2Vec algorithm, which is based on Word2Vec. These dense, distributed latent vectors make it possible to work with text in a more meaningful way than older text vectorization processes such as Bag-of-Words (BOW). In this thesis, a variety of baseline methods are compared against Doc2Vec in different categories in order to assess the usefulness of these older approaches after the recent upheaval in the Natural Language Processing field. Empirical results show that BOW used with a strong classifier beats Doc2Vec, especially on smaller datasets. Additionally, an approach to reduce the dimensionality of a BOW is presented. |
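A toy Python comparison of the two representations, assuming scikit-learn and gensim are available; the corpus and labels are fabricated for illustration and say nothing about the thesis's actual experiments.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["the market rallied today", "the team won the match",
        "stocks fell sharply", "a thrilling game last night"]
labels = [0, 1, 0, 1]   # 0 = finance, 1 = sports

# Bag-of-Words with a strong linear classifier.
bow_clf = make_pipeline(CountVectorizer(), LogisticRegression()).fit(docs, labels)

# Doc2Vec: dense document embeddings learned from the same corpus.
tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=40)
d2v_clf = LogisticRegression().fit([d2v.dv[i] for i in range(len(docs))], labels)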
|
Romana Pernisch, Impact of Changes on Operations over Knowledge Graphs, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
Knowledge graphs capture information in the form of a named graph. They consist of tens of thousands of nodes and edges. Because of their size, operations executed over them take a long time, and there is no desire to recompute results unnecessarily. I am interested in investigating the impact of the evolution of the graph on the results of such operations.
One such operation is materialization. I predicted the impact using descriptive graph measures and change actions as features in a support vector regression with a linear kernel. Only one model satisfied our requirements of an RMSE below 0.2 and an R-squared above 0.7. However, the data set used was too small to generalize the results.
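A minimal scikit-learn sketch of the regression setup described above; the feature layout and all numbers are fabricated for illustration only.

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical features per graph change: [nodes, edges, additions, removals].
X = np.array([[1000, 5000, 3, 1], [1200, 6100, 10, 2], [900, 4800, 1, 0],
              [1500, 8000, 7, 5], [1100, 5600, 2, 1], [1300, 7000, 12, 4]])
y = np.array([0.12, 0.35, 0.05, 0.41, 0.10, 0.48])  # impact on materialization

model = SVR(kernel='linear').fit(X, y)
pred = model.predict(X)
rmse = mean_squared_error(y, pred) ** 0.5
# The thesis required an RMSE below 0.2 and an R-squared above 0.7.
print(rmse, r2_score(y, pred))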
|
|
Elias Bernhaut, Publication of linked data streams on the Web, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
 
Access to information is an important factor in making informed choices, for example in the context of votes or business decisions. Free information is published on the Web as Open Data, including free datasets published by governments. This so-called Open Government Data (OGD) empowers citizens with access to information and builds the basis for new applications that use government data as a data source. In parallel to the Open Data movement, the Semantic Web expands the global graph of Linked Open Data, building the Linked Open Data (LOD) Cloud. Linked Data comes with the advantage of globally identifiable entities interlinked through relationships, and applications with access to the Web can query this graph to retrieve information from various sources.
Linked Open Data datasets are mostly published as static datasets, which means they do not change over time, in contrast to dynamic datasets. Ongoing research aims at publishing dynamic datasets as data streams. A notable example of a framework approaching the publication of Linked Data as data streams is TripleWave, which can interlink streamed input data and thus transform it into Linked Data streams. While TripleWave can transform streamed input data, it lacks the ability to transform static datasets that are updated frequently into Linked Data streams.
Throughout this thesis, I analyse OGD datasets and identify the requirements for their publication as Linked Open Data streams that are not yet covered by TripleWave. I show that no existing mapping language fulfills the requirements for the publication of the OGD datasets. A new mapping language is therefore necessary, and thus I introduce a new, RML-oriented mapping module named JRML. JRML is a JavaScript module for mapping data to Linked Data with an integrated, pull-based data-fetching strategy controlled by a scheduler. I show how JRML meets the requirements for transforming the surveyed OGD datasets, how I implemented JRML, and how I integrated it into TripleWave. Finally, I publish a range of transformed datasets to present the result and thus increase the number of Linked Open Data streams on the Web.
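JRML itself is a JavaScript module; purely as a language-agnostic illustration of its pull-based fetch-map-publish loop, here is a hypothetical Python sketch. The URL, CSV layout, and mapping rule are invented, and the real module uses RML-style mapping definitions rather than hard-coded logic.

import time
import urllib.request

def fetch(url):
    # Pull the current state of a frequently updated static dataset.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8')

def map_to_triples(csv_text):
    # Stand-in for the RML-style mapping step: one triple per CSV row.
    triples = []
    for line in csv_text.strip().splitlines()[1:]:   # skip header
        row_id, value = line.split(',')[:2]
        triples.append((f"ex:row/{row_id}", "ex:hasValue", value))
    return triples

def run_scheduler(url, interval_seconds=60, rounds=3):
    # The scheduler repeatedly pulls, maps, and emits, turning the
    # dataset into a Linked Data stream.
    for _ in range(rounds):
        for triple in map_to_triples(fetch(url)):
            print(triple)
        time.sleep(interval_seconds)

run_scheduler("https://example.org/ogd/dataset.csv")  # hypothetical endpoint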
|
|
Peter Giger, Improvement Of Word Embeddings By Joining Visual Features, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
 
In machine learning, embeddings are used to encode information in a vector space. Word2Vec is a popular method for creating word embeddings – vector representations of words – and can be used for semantic tasks such as finding the similarity between words. Similarly, image embeddings are vector representations of images. The concatenation of word and image vectors is one possible multi-modal model and has been shown to outperform the individual models. This work examines if and when a concatenation is beneficial and proposes an alternative model that does not use vector concatenation. |
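A minimal numpy sketch of the concatenation baseline the thesis compares against; the dimensions and vectors are toy values (real embeddings would come from, e.g., Word2Vec and a CNN).

import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

# Toy word and image vectors for the same concept.
word_vec = l2_normalize(np.array([0.2, -0.5, 0.1, 0.7]))
image_vec = l2_normalize(np.array([0.9, 0.3, -0.2]))

# Multi-modal representation: normalize each modality, then concatenate.
multimodal = np.concatenate([word_vec, image_vec])

def cosine(a, b):
    # Similarity between two (multi-modal) vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(multimodal, multimodal))  # 1.0 for identical vectors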
|