Wen Zhang, Bibek Paudel, Wei Zhang, Abraham Bernstein, Huajun Chen, Interaction Embeddings for Prediction and Explanation in Knowledge Graphs, In: International Conference on Web Search and Data Mining (WSDM), Association for Computing Machinery (ACM), New York, NY, 2019-02-11. (Conference or Workshop Paper published in Proceedings)
 
Knowledge graph embedding aims to learn distributed representations for entities and relations, and such representations have proven effective in many applications. Crossover interactions --- bi-directional effects between entities and relations --- help select related information when predicting a new triple, but they have not been formally discussed before.
In this paper, we propose CrossE, a novel knowledge graph embedding method which explicitly simulates crossover interactions. It not only learns one general embedding for each entity and relation, as in most previous methods, but also generates multiple triple-specific embeddings for both of them, named interaction embeddings.
We evaluate the embeddings on the typical link prediction task and find that CrossE achieves state-of-the-art results on complex and more challenging datasets.
Furthermore, we evaluate the embeddings from a new perspective --- giving explanations for predicted triples, which is important for real applications.
In this work, explanations for a triple are regarded as reliable closed paths between the head and tail entities. Compared to other baselines, we show experimentally that CrossE is more capable of generating reliable explanations to support its predictions, benefiting from interaction embeddings.
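
For illustration, the following is a minimal numpy sketch of a crossover-interaction score in the spirit of CrossE. The dimensions, the interaction vector c_r, and the exact scoring form are schematic stand-ins, not a faithful reproduction of the paper's model.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                                              # embedding dimension (illustrative)
    h, r, t = (rng.normal(size=d) for _ in range(3))   # general embeddings of head, relation, tail
    c_r = rng.normal(size=d)                           # relation-specific interaction vector
    b = rng.normal(size=d)                             # global bias

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Crossover interaction: the relation selects which parts of the head
    # entity's embedding matter for this triple, and the interacted head
    # in turn modulates the relation.
    h_i = c_r * h            # triple-specific (interaction) head embedding
    r_i = h_i * r            # triple-specific relation embedding
    score = sigmoid(np.tanh(h_i + r_i + b) @ t)   # plausibility of (h, r, t)
    print(f"plausibility score: {score:.3f}")
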
Te Tan, Online Optimization of Job Parallelization in Apache GearPump, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
 
Parameter tuning in the realm of distributed (streaming) systems is a popular research area, and many solutions have been proposed by the research community. Bayesian Optimization (BO) is one of them and has proven to be powerful. While the existing way to conduct the BO process is `offline' and involves shutting down the system as well as many inefficient manual steps, in this work we implement an optimizer which is able to do `online' BO. The optimizer is implemented within Apache Gearpump, a message-driven streaming engine. As runtime DAG manipulation is a prerequisite for `online' optimization, we inspect the existing features of Apache Gearpump and propose an improved approach, named Restart, for runtime DAG operations. Supported by the Restart approach, we then design and implement JobOptimizer, which enables `online' BO. The evaluation results show that, under the constraint of a maximum number of trials, JobOptimizer is not able to explore the parameter space adequately, but it still finds a better parameter set than random exploration. It also outperforms the Linear Ascent Optimizer in terms of throughput for comparatively larger DAG applications.
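
For illustration, here is a minimal Python sketch of one `online' BO trial loop under a fixed trial budget, using a Gaussian-process surrogate with an upper-confidence-bound acquisition. The measure_throughput stub and the single parallelism parameter are hypothetical stand-ins for redeploying a Gearpump job and sampling its throughput; this is not the thesis's implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def measure_throughput(parallelism):
        # Hypothetical stand-in: redeploy the running job with this degree of
        # parallelism (e.g. via a Restart-style DAG operation) and sample the
        # resulting throughput; here a noisy synthetic curve with optimum at 13.
        return -(parallelism - 13) ** 2 + np.random.normal(scale=2.0)

    candidates = np.arange(1, 33, dtype=float).reshape(-1, 1)  # parallelism 1..32
    X, y = [], []
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    max_trials = 10  # the kind of trial budget discussed in the evaluation
    for trial in range(max_trials):
        if len(X) < 2:  # bootstrap the surrogate with random trials
            x_next = candidates[np.random.randint(len(candidates))]
        else:
            gp.fit(np.array(X), np.array(y))
            mu, sigma = gp.predict(candidates, return_std=True)
            x_next = candidates[np.argmax(mu + 2.0 * sigma)]  # UCB acquisition
        X.append(x_next)
        y.append(measure_throughput(x_next[0]))

    print(f"best parallelism after {max_trials} trials: {X[int(np.argmax(y))][0]:.0f}")
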
Luca Rossetto, Mahnaz Amiri Parian, Ralph Gasser, Ivan Giangreco, Silvan Heller, Heiko Schuldt, Deep Learning-Based Concept Detection in vitrivr, In: MultiMedia Modeling, Springer, Heidelberg, p. 616 - 621, 2019-01-11. (Book Chapter)

This paper presents the most recent additions to the vitrivr retrieval stack, which will be put to the test in the context of the 2019 Video Browser Showdown (VBS). The vitrivr stack has been extended by approaches for detecting, localizing, or describing concepts and actions in video scenes using various convolutional neural networks. Leveraging those additions, we have added support for searching the video collection based on semantic sketches. Furthermore, vitrivr offers new types of labels for text-based retrieval. In the same vein, we have also improved upon vitrivr’s pre-existing capabilities for extracting text from video through scene text recognition. Moreover, the user interface has received a major overhaul so as to make it more accessible to novice users, especially for query formulation and result exploration.

Luca Rossetto, Heiko Schuldt, George Awad, Asad A Butt, V3C - A Research Video Collection, In: International Conference on Multimedia Modeling, Springer, Heidelberg, Germany, 2019-01-08. (Conference or Workshop Paper published in Proceedings)
 
Ausgezeichnete Informatikdissertationen 2018, Edited by: Steffen Hölldobler, Sven Appel, Abraham Bernstein, Felix Freiling, Hans-Peter Lenhof, Paul Molitor, Gustaf Neumann, Rüdiger Reischuk, Björn Scheuermann, Nicole Schweikardt, Myra Spiliopoulou, Sabine Süsstrunk, Klaus Wehrle, Gesellschaft für Informatik, Bonn, 2019. (Edited Scientific Work)

KV Rosni, Manish Shukla, Vijayanand Banahatti, Sachin Lodha, Consent Recommender System: A Case Study on LinkedIn Settings, PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies As Part of the AAAI Spring Symposium Series (AAAI-SSS 2019), 2019. (Journal Article)

Oana Inel, Lora Aroyo, Validation methodology for expert-annotated datasets: Event annotation case study, In: 2nd Conference on Language, Data and Knowledge (LDK 2019), 2019. (Conference or Workshop Paper published in Proceedings)

Daniel Spicar, Efficient spectral link prediction on graphs: approximation of graph algorithms via coarse-graining, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Dissertation)
 
Spectral graph theory studies the properties of graphs in relation to their eigenpairs, that is, the eigenvalues and eigenvectors of associated graph matrices. Successful applications of spectral graph theory include the ranking of web search results and link prediction in graphs. The latter is used to predict the evolution of graphs and to discover previously unobserved edges. However, the computation of eigenpairs is computationally very demanding: the eigenvalue-eigenvector decomposition of graph matrices has cubic time complexity, which makes a full eigendecomposition infeasible as graphs or networks become large. This complexity problem is addressed here for one of the most accurate state-of-the-art spectral link prediction methods, a method that requires several eigenvalue-eigenvector decompositions and is therefore applicable to small graphs only.
Previous work on similar complexity bottlenecks has approached the problem by computing only a subset of the eigenpairs in order to obtain an approximation of the original method at lower computational cost. This thesis takes the same approach, but instead of modifying the original link prediction algorithm, it uses the eigenpair subset to approximate the graph itself, leaving the link prediction algorithm essentially unchanged. The graph is approximated by spectral coarse-graining, a method that shrinks graphs while preserving their dominant spectral properties. This approach is motivated by the hypothesis that results computed on a coarse-grained graph approximate the original link prediction results.
The main contribution presented in this dissertation is a new, coarse-grained spectral link prediction approach. In a first part, the state-of-the-art link prediction method is combined with spectral coarse-graining, and the computational cost, complexity, and link prediction accuracy are evaluated. Theoretical analysis and experiments show that the coarse-grained approach produces accurate approximations of the original method with a significantly reduced time complexity. Thereafter, the spectral coarse-graining method is extended to make the complexity reduction more controllable and to avoid the computation of the eigenvalue-eigenvector decomposition. This dramatically increases the efficiency of the proposed approach and makes it possible to compute more accurate graph approximations. As a result, the link prediction accuracy can be significantly improved while maintaining the reduced time complexity of the coarse-grained approach.
Furthermore, the proposed approach produces a valid graph of the same structure and type as the original graph. In principle, it can therefore be used with many other graph applications without the need for major adaptations, making it a step towards a more general approximation framework for spectral graph algorithms.
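
For illustration, a toy numpy/scipy sketch of the underlying trade-off: a full eigendecomposition costs cubic time, whereas computing only k dominant eigenpairs and bucketing nodes with near-identical eigenvector entries (the core intuition behind spectral coarse-graining) is far cheaper. This is a schematic sketch, not the dissertation's algorithm.

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import eigsh

    n, k = 500, 10
    A = sparse_random(n, n, density=0.02, random_state=0)
    A = (A + A.T) / 2  # symmetric adjacency matrix of a toy graph

    # Full decomposition: cubic time, infeasible for large graphs.
    full_spectrum = np.linalg.eigvalsh(A.toarray())

    # Truncated decomposition: only the k largest-eigenvalue pairs.
    vals, vecs = eigsh(A, k=k, which="LA")

    # Coarse-graining intuition: nodes with nearly equal entries in the
    # dominant eigenvectors are spectrally interchangeable and can be
    # merged into super-nodes while preserving the dominant eigenpairs.
    buckets = np.round(vecs[:, -1] / 0.01).astype(int)
    print(f"{n} nodes grouped into {len(set(buckets))} spectrally similar buckets")
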
Proceedings of the ISWC 2019 Satellite Tracks (Posters & Demonstrations, Industry, and Outrageous Ideas) co-located with 18th International Semantic Web Conference (ISWC 2019), Edited by: Mari Carmen Suárez-Figueroa, Gong Cheng, Anna Lisa Gentile, Christophe Guéret, Maria Keet, Abraham Bernstein, CEUR Workshop Proceedings, CEUR, 2019. (Edited Scientific Work)

Laurenz Shi, Differential Privacy Algorithms for Data Streams on Top of Apache Flink, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
 
With the rise of computing systems that process large amounts of data, securing sensitive data has become a lasting concern. Research has shown that, with the help of other data sets, anonymized data sets can be reconstructed. Differential privacy is a construct to secure sensitive data: the data is perturbed with noise to ensure privacy. Differential privacy has been shown to be very effective against attempts to reconstruct anonymized data sets in combination with other data sets. Originally used on static data sets, differential privacy was later extended to data streams. Four papers that introduce such algorithms are examined in this thesis. Since the algorithms are often of a theoretical nature, they are implemented and compared to each other in order to evaluate their real effects on the results of data streams. The focus is on the accuracy that the algorithms deliver.
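
For illustration, a minimal Python sketch of the basic building block such stream algorithms share: releasing a running count perturbed with Laplace noise whose scale is calibrated to sensitivity/epsilon. The naive per-release noising shown here is exactly what more refined algorithms improve upon; all names are illustrative.

    import numpy as np

    def private_running_count(stream, epsilon=0.5, sensitivity=1.0):
        """Yield a differentially private running count of a 0/1 event stream.

        Naive scheme: every release gets fresh Laplace noise of scale
        sensitivity / epsilon. Tree-based aggregation schemes spend the
        same privacy budget far more economically over long streams.
        """
        count = 0
        for bit in stream:
            count += bit
            yield count + np.random.laplace(scale=sensitivity / epsilon)

    events = np.random.binomial(1, 0.3, size=10)
    for t, noisy in enumerate(private_running_count(events), start=1):
        print(f"t={t:2d}  noisy count = {noisy:6.2f}")
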
Bibek Paudel, Suzanne Tolmeijer, Abraham Bernstein, Bringing Diversity in News Recommender Algorithms, In: ECREA 2018 - pre-conference workshop on Information, Diversity and Media Pluralism in the Age of Algorithms. 2018. (Conference Presentation)
 
In major political events and discussions, recommender algorithms play a large role in shaping opinions and setting the agenda for more traditional news media. These algorithms are used pervasively in social networks, media platforms and search engines. They determine what information is shown to the user and in which order.
Existing recommender systems focus on improving accuracy based on historic interaction-data, which has received criticism for being detrimental to the goals of improving user experience and information diversity. Their accuracy is measured in terms of predicting future user behavior based on past observation. The most popular algorithms suggest items to users based on the choices of similar users.
However, these systems are observed to promote a narrow set of already-popular items that are similar to past choices of the user, or are liked by many users. This limits users' exposure to diverse viewpoints and potentially increases polarization.
One reason for the lack of diversity in existing approaches is that they optimize for average accuracy and click rates. Other aspects of user experience, like new information and differing viewpoints, are not taken into account. Similarly, a high average accuracy does not necessarily mean that the algorithm is similarly accurate for different groups of users and niche items. Besides this, training data may be biased because of previous recommendations or the data collection process.
Although information diversity is important, its definition depends heavily on the specific application, and it is hard to find a general definition in the context of recommender systems. Nevertheless, there are definitions that pertain to some specific aspects of diversity, such as entropy, coverage, personalization, and surprisal.
There is a need to design recommender algorithms that balance accuracy and diversity in order to deal with the aforementioned problems. To approach this problem, we first investigated state-of-the-art algorithms and found that they promote already popular products. We also found that recommender diversity depends on the optimization parameters of the algorithms, a factor that had not received enough attention before. In this context, we developed a graph-based method that promotes niche items without sacrificing accuracy. We also developed a probabilistic latent-factor model that significantly improves the coverage and long-tail diversity of recommendations. Currently, we are using these insights to develop new machine learning algorithms for diverse news recommendation. We plan to deploy and test them in a large-scale experiment involving multiple news producers and consumers.

Romana Pernisch, Florian Ruosch, Daniele Dell'Aglio, Abraham Bernstein, Stream Processing: The Matrix Revolutions, In: SSWS 2018 Scalable Semantic Web Knowledge Base Systems, CEUR-WS.org, Aachen, 2018-10-09. (Conference or Workshop Paper published in Proceedings)
 
Analyzing data streams is a vital task in data science. Often, data comes in different shapes such as triples, tuples, relations, or matrices. Traditional stream processing systems, however, only process data in one of these formats.
To enable the processing of streams combining different shapes of data, we developed a system that parses SPARQL queries using the Apache Jena parser and transforms them into Apache Flink topologies. With a custom data type and tailored functions, we enabled the integration of matrices in Jena and thereby allowed graph, relational, and linear algebra to be mixed in an RDF graph. This provided a proof of concept that queries may be written for static data and – with the use of the streaming engine Flink – can easily be run on data streams, even if they contain several of the aforementioned types.

Riccardo Tommasini, Yehia Abo Sedira, Daniele Dell'Aglio, Marco Balduini, Muhammed Intizar Ali, Danh Le Phuoc, Emanuele Della Valle, Jean-Paul Calbimonte, VoCaLS: Describing Streams on the Web, In: ISWC 2018 Posters & Demonstrations and Industry Tracks, CEUR-WS.org, Aachen, 2018-10-08. (Conference or Workshop Paper)
 
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Make restaurants pay your server bills, In: ISWC 2018 Posters & Demonstrations and Industry Tracks, CEUR-WS.org, Aachen, 2018-10-08. (Conference or Workshop Paper)
 
Riccardo Tommasini, Yehia Abo Sedira, Daniele Dell'Aglio, Marco Balduini, Muhammed Intizar Ali, Danh Le Phuoc, Emanuele Della Valle, Jean-Paul Calbimonte, VoCaLS: Vocabulary & Catalog of Linked Streams, In: The Semantic Web - ISWC 2018, Springer, Cham, 2018-10-08. (Conference or Workshop Paper published in Proceedings)
 
Matthias Baumgartner, Wen Zhang, Bibek Paudel, Daniele Dell'Aglio, Huajun Chen, Abraham Bernstein, Aligning Knowledge Base and Document Embedding Models using Regularized Multi-Task Learning, In: The Semantic Web – ISWC 2018, Springer, Cham, 2018-10-08. (Conference or Workshop Paper published in Proceedings)
 
Knowledge Bases (KBs) and textual documents contain rich and complementary information about real-world objects, as well as relations among them. While text documents describe entities in free form, KBs organize such information in a structured way. This makes the two information representation forms hard to compare and integrate, limiting the possibility to use them jointly to improve predictive and analytical tasks. In this article, we study this problem and propose KADE, a solution based on regularized multi-task learning of KB and document embeddings. KADE can potentially incorporate any KB and document embedding learning method. Our experiments on multiple datasets and methods show that KADE effectively aligns document and entity embeddings, while maintaining the characteristics of the embedding models.
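
For illustration, a numpy sketch of the regularization idea: for entity-document pairs that refer to the same object, an alignment penalty pulls the two embedding spaces together, while each model would otherwise keep training on its own loss. This is schematic, not KADE's actual objective or training code.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_pairs = 16, 100
    E = rng.normal(size=(n_pairs, d))  # KB entity embeddings (e.g. from a TransE-style model)
    D = rng.normal(size=(n_pairs, d))  # document embeddings (e.g. from a PV-DM-style model)
    lam, lr = 0.1, 0.05                # alignment weight, learning rate

    for step in range(200):
        # Each model would also take gradient steps on its own embedding
        # loss here; we show only the added alignment term lam * ||E - D||^2.
        grad = 2 * (E - D)
        E -= lr * lam * grad
        D += lr * lam * grad

    print(f"mean entity-document distance: {np.linalg.norm(E - D, axis=1).mean():.4f}")
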
Melina Mast, Application Prototype for Recognizing High-Stress Situations, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
Fitness trackers and apps provide a new way to monitor the behaviour and health of a person. These technologies support psychiatrists in the treatment of patients. This thesis presents a prototype that aims to identify stressful situations with a fitness tracker and an app. The prototype will help medical professionals to evaluate whether a mobile healthcare system is accepted by patients. In addition to collecting physiological data, subjective perceptions of stress and well-being can be recorded via the app. Authorized specialists can view the gathered data in a web application.

Simon Tännler, Framework for News Recommendations, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
A common issue in recommender systems is the lack of explicit feedback. In this thesis we build a news recommendation framework to collect explicit and implicit feedback from users reading news articles. Our goal is to find a generic way to interpret implicit feedback so as to approximate a user's interest in an article. After conducting a field study, we train machine learning models on the collected implicit feedback to predict the explicit feedback the user assigned to a given article. Our results reveal that the time spent reading and the reading progress are the two most influential features. These results have limitations due to UI issues with the way we collected explicit feedback in our iOS client: our design restricted the amount of negative explicit feedback, which led to a severely imbalanced data set. We conclude with suggestions to improve the UI design and recommend a more comprehensive experiment to collect an overall bigger data set.
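
For illustration, a small scikit-learn sketch of the prediction task described above: regressing explicit ratings on implicit signals such as reading time and reading progress. The data below is synthetic and the feature set is a simplified stand-in for the thesis's collected logs.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(7)
    n = 500
    read_time = rng.exponential(60, n)   # seconds spent on the article
    progress = rng.uniform(0, 1, n)      # fraction of the article scrolled
    # Synthetic ground truth: interest grows with both signals, plus noise.
    rating = np.clip(0.02 * read_time + 3 * progress + rng.normal(0, 0.5, n), 0, 5)

    X = np.column_stack([read_time, progress])
    X_tr, X_te, y_tr, y_te = train_test_split(X, rating, random_state=0)

    model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    print(f"R^2 on held-out articles: {model.score(X_te, y_te):.2f}")
    print(f"importances (time, progress): {model.feature_importances_}")
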
Jérôme Oesch, Benchmarking Incremental Reasoner Systems, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
 
Benchmarking reasoner systems is an already widespread approach for comparing different ontology reasoners with each other. Despite the emergence of incremental reasoners, however, no benchmarks have been proposed so far that are able to test competing incremental implementations. In this thesis, we not only propose a benchmarking framework that could fill this gap, but also present a new kind of benchmark that is capable of generating both queries and subsequent ontology edits automatically, requiring only an existing ontology as input. Our implemented framework, ReasonBench++, uses the concepts of Competency Question-driven Ontology Authoring to generate queries, and of Ontology Evolution Mapping to generate edits. With these two concepts, we are able to show that ReasonBench++ generates close-to-real-life benchmarks that reflect an ontology being intensively edited by ontology authors while simultaneously being queried by users.

Michael Feldman, Crowdsourcing data analysis: empowering non-experts to conduct data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
 
The development of the Internet-based ecosystem has led to the emergence of alternative recruitment models that are facilitated exclusively through the Internet. With Online Labor Markets (OLMs) and crowdsourcing platforms, it is possible to hire individuals online to conduct tasks and projects of different sizes and complexity. Crowdsourcing platforms are well suited for simple micro-tasks that take seconds or minutes and can be completed by a large number of participants working in parallel. OLMs, on the other hand, usually allow hiring experts in a flexible manner for more advanced projects that may take days, weeks, or even months. Due to the flexibility of such employment models, it is possible to find various experts on OLMs, such as designers, lawyers, developers, or engineers. However, it is relatively rare to find data scientists – experts able to preprocess, analyze, and make sense of data. This is not surprising given the general shortage of data science experts. Moreover, for reasons such as extensive education and training requirements as well as soaring demand, this shortage is expected to grow over the coming years.
In this dissertation we explored how the crowdsourcing approach can be leveraged to support data science projects. In particular, we presented three use cases where crowds and freelancers with different expertise levels can be involved in data science projects. We classified crowds into low, intermediate, and high levels of expertise in data analysis and proposed a use case in which each group contributes in a crowdsourcing setting.
In the first case study we presented an approach for engaging crowds in the review of statistical assumptions in scientific publications. When researchers use statistical methods in scientific manuscripts, these methods are often valid only if their underlying assumptions are met; if the assumptions are violated, the validity of the results is questionable. We presented a micro-tasking approach with laymen crowds that reaches a quality similar to expert-based review. We then conducted a longitudinal analysis of CHI conference proceedings to evaluate how standards of statistical reporting have developed over the years. Finally, we compared the CHI proceedings with five top journals in the fields of medicine, management, and psychology to examine the reporting of statistical assumptions across disciplines.
Our second case study addressed freelancers with intermediate expertise in data analysis. To better understand which skills intermediate experts possess, we interviewed expert data scientists and asked them what kinds of tasks could be outsourced to non-experts. Additionally, we conducted a survey on the most prominent OLMs to better understand the skills of freelancers active in data analysis. The conclusions of this study were twofold: 1) individuals with certain coding skills can, conservatively, be helpful in data science projects if integrated properly, and 2) data preprocessing is by far the biggest bottleneck activity that could be outsourced, provided the coordination between the involved parties is managed properly. Building on these results, we designed a proof-of-concept platform and ran a number of experiments in which non-experts collaborated with experts by taking over data preprocessing activities. Our results suggest that the outcomes achieved by mixed-expertise teams are similar in quality to, and cheaper than, the work of experts alone.
Our last use case was directed not so much at alleviating the shortage of data scientists as at taking advantage of the crowdsourcing setting to address an inherent vulnerability of data-driven analysis. Recently, there has been a discussion among data analysis experts and researchers regarding the subjectivity of the outputs of data-driven analysis. It has been shown that when data analysts are provided with the same data and the same hypothesis within an NHST (Null Hypothesis Significance Testing) approach, they often reach markedly different results. We therefore conducted a study in which we provided 47 experts with the same data and hypotheses. Through a specially designed platform, we were able to elicit the rationale for every decision made throughout the analysis. This fine-grained data allowed us to conduct a qualitative analysis exploring the underlying factors that lead to the variability of data analysis results.
Taken together, the case studies provide an overview of how the discipline of data science can benefit from the crowdsourcing approach. We hope that the solutions proposed in this dissertation will contribute to the discussion on how to reduce the entry barrier for laymen to participate in data-driven research, as well as how to improve the transparency of how results are reached.