Riccardo Tommasini, Yehia Abo Sedira, Daniele Dell'Aglio, Marco Balduini, Muhammed Intizar Ali, Danh Le Phuoc, Emanuele Della Valle, Jean-Paul Calbimonte, VoCaLS: Vocabulary & Catalog of Linked Streams, In: The Semantic Web - ISWC 2018, Springer, Cham, 2018-10-08. (Conference or Workshop Paper published in Proceedings)
|
|
Matthias Baumgartner, Wen Zhang, Bibek Paudel, Daniele Dell'Aglio, Huajun Chen, Abraham Bernstein, Aligning Knowledge Base and Document Embedding Models using Regularized Multi-Task Learning, In: The Semantic Web – ISWC 2018, Springer, Cham, 2018-10-08. (Conference or Workshop Paper published in Proceedings)
Knowledge Bases (KBs) and textual documents contain rich and complementary information about real-world objects and the relations among them. While text documents describe entities in free form, KBs organize such information in a structured way. This makes the two representation forms hard to compare and integrate, limiting the possibility of using them jointly to improve predictive and analytical tasks. In this article, we study this problem and propose KADE, a solution based on regularized multi-task learning of KB and document embeddings. KADE can potentially incorporate any KB and document embedding learning method. Our experiments on multiple datasets and methods show that KADE effectively aligns document and entity embeddings while maintaining the characteristics of the embedding models. |
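The core idea of regularized multi-task alignment can be sketched in a few lines: alongside each embedding model's own training loss, a penalty term pulls an entity embedding toward the embedding of the document that describes it. The sketch below is illustrative only; the variable names, dimensions, and plain L2 penalty are assumptions, not KADE's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy embeddings: one vector per KB entity and one per document that
# describes it (shapes and initialisation are illustrative only).
entity_emb = rng.normal(size=(5, dim))
doc_emb = rng.normal(size=(5, dim))

def alignment_penalty(E, D):
    """L2 regulariser pulling each entity embedding toward the embedding
    of the document describing the same real-world object."""
    return float(np.sum((E - D) ** 2))

def regularized_step(E, D, lr=0.1, lam=0.5):
    """One gradient step on the alignment term alone; in a full model this
    term is added to each embedding method's own training loss."""
    E_new = E - lr * 2 * lam * (E - D)
    D_new = D - lr * 2 * lam * (D - E)
    return E_new, D_new

before = alignment_penalty(entity_emb, doc_emb)
for _ in range(50):
    entity_emb, doc_emb = regularized_step(entity_emb, doc_emb)
after = alignment_penalty(entity_emb, doc_emb)
assert after < before  # the two embedding spaces moved closer together
```

In the full setting, the regularization weight trades off alignment quality against preserving each model's original structure.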
|
Melina Mast, Application Prototype for Recognizing High-Stress Situations, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
Fitness trackers and apps provide a new way to monitor the behaviour and health of a person. These technologies support psychiatrists in the treatment of patients. This thesis presents a prototype that aims to identify stressful situations with a fitness tracker and an app. The prototype will help medical professionals to evaluate whether a mobile healthcare system is accepted by patients. In addition to collecting physiological data, subjective perceptions of stress and well-being can be recorded via the app. Authorized specialists can view the gathered data in a web application. |
|
Simon Tännler, Framework for News Recommendations, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
A common issue in recommender systems is the lack of explicit feedback. In this thesis we build a news recommendation framework to collect explicit and implicit feedback from users reading news articles. Our goal is to find a generic way to interpret implicit feedback so as to approximate a user's interest in an article. After conducting a field study, we train machine learning models on the collected implicit feedback to predict the explicit feedback the user assigned to a given article. Our results reveal that time spent reading and reading progress are the two most influential features. These results have limitations due to UI issues in the way we collected explicit feedback in our iOS client. We realized that our design restricts the amount of negative explicit feedback, which led to a severely imbalanced data set. We conclude with suggestions to improve the UI design and recommend a more comprehensive experiment to collect an overall bigger data set. |
|
Jérôme Oesch, Benchmarking Incremental Reasoner Systems, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
Benchmarking reasoner systems is an already widespread approach for comparing ontology reasoners with each other. Despite the emergence of incremental reasoners, however, no benchmarks have been proposed so far that can test competing incremental implementations. In this thesis, we not only propose a benchmarking framework that fills this gap, but also present a new kind of benchmark that is capable of generating both queries and subsequent ontology edits automatically, requiring only an existing ontology as input. Our implemented framework, ReasonBench++, uses the concepts of Competency Question-driven Ontology Authoring to generate queries and of Ontology Evolution Mapping to generate edits. With these two concepts, we show that ReasonBench++ generates close-to-real-life benchmarks that reflect an ontology used intensively by ontology authors and queried simultaneously by users. |
|
Michael Feldman, Crowdsourcing data analysis: empowering non-experts to conduct data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
The development of the Internet-based ecosystem has led to the emergence of alternative recruitment models that are facilitated exclusively through the internet. With Online Labor Markets (OLMs) and crowdsourcing platforms it is possible to hire individuals online to conduct tasks and projects of different size and complexity. Crowdsourcing platforms are well suited for simple micro-tasks that take seconds or minutes and can be completed by a large number of participants working in parallel. OLMs, on the other hand, usually allow hiring experts in a flexible manner for more advanced projects that may take days, weeks, or even months. Thanks to the flexibility of such employment models it is possible to find various experts on OLMs, such as designers, lawyers, developers, or engineers. However, it is relatively rare to find data scientists: experts able to preprocess, analyze, and make sense of data. This scarcity is not surprising given the general shortage of data science experts. Moreover, due to factors such as extensive education and training requirements as well as soaring demand, this shortage is projected to grow over the next years.
In this dissertation we explored how the crowdsourcing approach can be leveraged to support data science projects. In particular, we presented three use cases in which crowds and freelancers with different expertise levels can be involved in data science projects. We classified crowds into low, intermediate, and high levels of expertise in data analysis and proposed use cases in which every group might contribute in a crowdsourcing setting.
In the first case study we presented an approach for engaging crowds in the review of statistical assumptions in scientific publications. When researchers use statistical methods in scientific manuscripts, these methods are often valid only if their underlying assumptions are met. If these assumptions are violated, the validity of the results is questionable. We presented a micro-tasking approach with laymen crowds that reaches quality similar to expert-based review. We then conducted a longitudinal analysis of CHI conference proceedings to evaluate how standards of statistical reporting have evolved over the years. Finally, we compared the CHI proceedings with five top journals in medicine, management, and psychology to examine how statistical assumptions are reported across disciplines.
Our second case study addressed freelancers with intermediate expertise in data analysis. To better understand which skills intermediate experts possess, we interviewed expert data scientists, asking them what kinds of tasks could be outsourced to non-experts. Additionally, we conducted a survey on the most prominent OLMs to better understand the skills of freelancers active in data analysis. The conclusions of this study were twofold: 1) individuals with certain coding skills can be helpful in data science projects if integrated properly, and 2) data preprocessing is by far the biggest bottleneck activity that can be outsourced, provided the coordination between the involved parties is managed properly. Building on these results, we designed a proof-of-concept platform and ran a number of experiments in which non-experts collaborated with experts by taking over data preprocessing activities. Our results suggest that the outcomes achieved by mixed-expertise teams are similar in quality to, and cheaper than, the work of experts.
Our last use case was directed less at alleviating the shortage of data scientists than at exploiting the crowdsourcing setting to address an inherent vulnerability of data-driven analysis. Recently, there has been a discussion among data analysis experts and researchers regarding the subjectivity of the outputs of data-driven analysis. It has been shown that when data analysts are provided with the same data and the same hypothesis within an NHST (Null Hypothesis Significance Testing) approach, they often reach markedly different results. We therefore conducted a study in which we provided 47 experts with the same data and hypotheses. Through a specially designed platform we elicited the rationale for every decision made throughout the analysis. This fine-grained data allowed us to conduct a qualitative analysis exploring the underlying factors that lead to the variability of data analysis results.
Taken together, the case studies provide an overview of how the discipline of data science can benefit from the crowdsourcing approach. We hope that the solutions proposed in this dissertation will contribute to the discussion on how to lower the entry barrier for laymen to participate in data-driven research, as well as how to improve the transparency of how results are reached. |
|
Robin Stohler, Document Embedding Models - A Comparison with Bag-of-Words, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
Word embeddings have completely changed the possibilities in the fields of Natural Language Processing and Machine Learning, opening new doors for many applications. One such application is the creation of document embeddings with the Doc2Vec algorithm, which is based on Word2Vec. These dense, distributed latent vectors allow working with text in a more meaningful way than older text vectorization processes such as Bag-of-Words (BOW). In this thesis, a variety of baseline methods are compared against Doc2Vec in different categories, in order to assess the usefulness of these older approaches after the recent upheaval in the Natural Language Processing field. Empirical results show that BOW combined with a strong classifier outperforms Doc2Vec, especially on smaller datasets. Additionally, an approach to reduce the dimensionality of a BOW representation is presented. |
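For reference, a Bag-of-Words baseline is straightforward to build by hand. The sketch below is a toy illustration, not the thesis's pipeline: the corpus, labels, and the nearest-centroid classifier are invented for brevity, whereas the thesis pairs BOW with stronger classifiers.

```python
import numpy as np

# Tiny illustrative corpus; the thesis evaluates on much larger datasets.
docs = ["the movie was great", "a wonderful film great",
        "terrible acting awful", "the plot was awful"]
labels = np.array([1, 1, 0, 0])   # 1 = positive, 0 = negative review

# Build the Bag-of-Words vocabulary and count vectors by hand.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def bow(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:            # words outside the vocabulary are dropped
            v[index[w]] += 1
    return v

X = np.array([bow(d) for d in docs])

# Nearest-centroid stand-in for a classifier: assign to the closer class mean.
centroids = {c: X[labels == c].mean(axis=0) for c in (0, 1)}

def predict(text):
    v = bow(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(predict("a great movie"))
```

Unlike Doc2Vec's dense vectors, these count vectors are sparse and grow with the vocabulary, which motivates the dimensionality-reduction approach the thesis presents.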
|
Romana Pernisch, Impact of Changes on Operations over Knowledge Graphs, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
Knowledge graphs capture information in the form of a named graph. They consist of tens of thousands of nodes and edges. Because of their size, operations executed over them take a long time, so their results should not be recomputed unnecessarily. I am interested in investigating the impact of the evolution of the graph on the results of such operations.
One such operation is materialization. I predicted the impact using descriptive graph measures and change actions as features for a support vector regression with a linear kernel. Only one model satisfied our requirements of an RMSE below 0.2 and an R-squared above 0.7. However, the dataset used was too small to generalize the results.
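The acceptance criteria above (RMSE below 0.2, R-squared above 0.7) can be computed as follows. This sketch uses synthetic data and ordinary least squares as a stand-in for the linear-kernel support vector regression; the feature interpretation and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the features: descriptive graph measures and
# counts of change actions (names and scales are assumptions).
X = rng.uniform(size=(100, 3))
true_w = np.array([0.4, -0.2, 0.3])
y = X @ true_w + 0.05 * rng.normal(size=100)  # impact score, small noise

# Ordinary least squares as a stand-in for the linear-kernel SVR.
Xb = np.c_[X, np.ones(len(X))]                # add an intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
pred = Xb @ w

rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
r2 = float(1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))

# The acceptance criteria used in the thesis.
acceptable = rmse < 0.2 and r2 > 0.7
print(rmse, r2, acceptable)
```

A model failing either threshold would be rejected; on real, evolving knowledge graphs the fit is considerably harder to achieve than on this synthetic example.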
|
|
Elias Bernhaut, Publication of linked data streams on the Web, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
The access to information is an important factor in making informed choices, for example in the context of votes or business decisions. Free information is published on the Web as Open Data, including free datasets published by governments. This so-called Open Government Data (OGD) gives citizens access to information and builds the basis for new applications that use government data as a data source. In parallel to the Open Data movement, the Semantic Web expands the global graph of Linked Open Data, building the Linked Open Data (LOD) Cloud. Linked Data comes with the advantage of globally identifiable entities interlinked through relationships. Applications with access to the Web can query the LOD Cloud to retrieve information from various sources.
Linked Open Data datasets are mostly published as static datasets, which means they do not change over time, in contrast to dynamic datasets. Ongoing research aims to publish dynamic datasets as data streams. A notable example of a framework approaching the publication of Linked Data as data streams is TripleWave. It can interlink streamed input data and thus transform it into Linked Data streams. While TripleWave can transform streamed input data, it lacks the ability to transform static datasets that are updated frequently into Linked Data streams. In this thesis, I analyse the OGD datasets and identify the requirements for their publication as Linked Open Data streams that are not yet covered by TripleWave. I show that no existing mapping language fulfills the requirements for the publication of the OGD datasets. A new mapping language is therefore necessary, and thus I introduce a new, RML-oriented mapping module named JRML. JRML is a JavaScript module for mapping data to Linked Data with an integrated, pull-based data-fetching strategy controlled by a scheduler. I show how JRML meets the requirements for transforming the OGD datasets of the survey, how I implemented JRML, and how I integrated it into TripleWave. Finally, I publish a range of transformed datasets to present the result and thus increase the number of Linked Open Data streams on the Web.
|
|
Peter Giger, Improvement Of Word Embeddings By Joining Visual Features, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
In machine learning, embeddings are used to encode information in a vector space. Word2Vec is a popular method for creating word embeddings, vector representations of words, which can be used for semantic tasks such as finding the similarity between words. Similarly, image embeddings are vector representations of images. The concatenation of word and image vectors is one possible multi-modal model and has been shown to outperform the individual models. This work examines if and when such a concatenation is beneficial and proposes an alternative model that does not use vector concatenation. |
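The concatenation model mentioned above can be sketched minimally: normalise each modality's vector and stack them, so that similarity in the joint space reflects both modalities. The toy vectors below are invented for illustration; real word and image embeddings would come from models such as Word2Vec and a convolutional network.

```python
import numpy as np

def fuse(word_vec, image_vec):
    """Multi-modal embedding: L2-normalise each modality, then concatenate,
    so that neither modality dominates distances in the joint space."""
    w = word_vec / np.linalg.norm(word_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return np.concatenate([w, v])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: "cat" and "dog" are similar in both modalities, "car" is not.
cat = fuse(np.array([1.0, 0.2, 0.0]), np.array([0.9, 0.1]))
dog = fuse(np.array([0.9, 0.3, 0.1]), np.array([0.8, 0.2]))
car = fuse(np.array([0.0, 0.1, 1.0]), np.array([0.1, 0.9]))

assert cosine(cat, dog) > cosine(cat, car)
```

Per-modality normalisation is one common design choice; without it, the modality with the larger vector norms would dominate every distance computation.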
|
Shen Gao, Daniele Dell'Aglio, Jeff Z Pan, Abraham Bernstein, Distributed Stream Consistency Checking, In: Web Engineering - 18th International Conference, ICWE 2018, Cáceres, Spain, June 5-8, 2018, Proceedings, Springer, Cham, 2018-06-05. (Conference or Workshop Paper published in Proceedings)
|
|
Alessandro Margara, Gianpaolo Cugola, Dario Collavini, Daniele Dell'Aglio, Efficient Temporal Reasoning on Streams of Events with DOTR, In: The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Springer, Cham, 2018-06-03. (Conference or Workshop Paper published in Proceedings)
|
|
Michael Feldman, Cristian Anastasiu, Abraham Bernstein, Towards Collaborative Data Analysis with Diverse Crowds – a Design Science Approach, In: 13th International Conference on Design Science Research in Information Systems and Technology, s.n., Heidelberg, DE, 2018-06-03. (Conference or Workshop Paper published in Proceedings)
Recent years have witnessed an increasing shortage of data experts capable of analyzing the omnipresent data and producing meaningful insights. Furthermore, some data scientists report that data preprocessing takes up to 80% of the whole project time. This paper proposes a method for collaborative data analysis that involves a crowd without data analysis expertise. Orchestrated by an expert, the team of novices conducts the analysis through iterative refinement of results up to its successful completion. To evaluate the proposed method, we implemented a tool that supports collaborative data analysis for teams with mixed levels of expertise. Our evaluation demonstrates that, with proper guidance, data analysis tasks, especially preprocessing, can be distributed and successfully accomplished by non-experts. Following the design science approach, iterative development also revealed some important features for the collaboration tool, such as support for dynamic development, code deliberation, and a project journal. As such, we pave the way for building tools that leverage the crowd to address the shortage of data analysts. |
|
Céline Faverjon, Abraham Bernstein, Rolf Grütter, Heiko Nathues, Cristina Sarasua, Martin Sterchi, Maria-Elena Vargas, John Berezowski, PIG DATA: transdisciplinary approach for health analytics of the Swiss Swine Industry, In: ‘INNOVATION in Health Surveillance’ International Forum. 2018. (Conference Presentation)
|
|
Abraham Bernstein, Fabrizio Gilardi, Jetzt experimentieren!, Schweizer Monat (1056), 2018. (Journal Article)
Federal Switzerland is ideally suited to pioneering experimental work on the digitalization of democracy. Why doing so is worthwhile. |
|
Tobias Grubenmann, Monetization Strategies for the Web of Data, In: The 2018 Web Conference PhD Symposium, IW3C2, New York, NY, USA, 2018-04-23. (Conference or Workshop Paper published in Proceedings)
Inspired by the World Wide Web, the Web of Data is a network of interlinked data fragments. One of the main advantages of the Web of Data is that all of its content is processable by machines. However, this also has its drawbacks when it comes to monetizing the content: advertisements and donations, two important financial motors of the World Wide Web, do not translate to the Web of Data, as they rely on exposing the user to the advertisement or call for donations.
To remedy this situation, we propose two different monetization strategies for the Web of Data. The first strategy involves a marketplace where users can buy data in an integrated way. The second strategy allows third parties to promote certain data; in return, the sponsors pay money whenever a user follows a link contained in the sponsored data. We identified two different kinds of data, commercial and sponsored, which can benefit from the two respective monetization strategies. With our work, we propose solutions to the problem of financing the creation and maintenance of content in the Web of Data. |
|
Tobias Grubenmann, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Financing the Web of Data with Delayed-Answer Auctions, In: WWW 2018: The 2018 Web Conference, International World Wide Web Conference Committee, New York, NY, USA, 2018-04-23. (Conference or Workshop Paper published in Proceedings)
The World Wide Web is a massive network of interlinked documents. One of the reasons the World Wide Web is so successful is the fact that most of its content is available free of charge. Inspired by the success of the World Wide Web, the Web of Data applies the same strategy of interlinking to data. To this point, most of the data in the Web of Data is also free of charge. The fact that the data is freely available, however, raises the question of how to finance these services. As we discuss in this paper, advertisement and donations cannot easily be applied to this new setting.
To create incentives to subsidize data providers, we propose that sponsors should pay the providers to promote sponsored data. In return, sponsored data will be privileged over non-sponsored data. Since it is not possible to enforce a certain ordering on the data the user will receive, we propose to split up the data into different batches and deliver these batches with different delays. In this way, we can privilege sponsored data without withholding any non-sponsored data from the user.
In this paper, we introduce the new concept of a delayed-answer auction, where sponsors can pay to prioritize their data. We introduce a new model that captures the particular situation in which a user accesses data in the Web of Data. We show how the weighted Vickrey-Clarke-Groves auction mechanism can be applied to our scenario, and we discuss how certain parameters influence the nature of our auction. With our new concept, we take a first step towards a free yet financially sustainable Web of Data. |
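To illustrate the mechanism family involved, a delayed-answer auction can be viewed as a position auction in which the "positions" are delivery batches with decreasing delay weights. The sketch below is an illustrative assumption, not the paper's exact model: it reduces the setting to a standard weighted-VCG position auction, where each sponsor pays the value loss its presence imposes on the others.

```python
def vcg_position_auction(bids, weights):
    """Assign sponsors (sorted by bid) to delivery batches and compute VCG
    payments: each sponsor pays the value loss its presence causes others."""
    order = sorted(range(len(bids)), key=lambda i: -bids[i])
    w = list(weights) + [0.0] * len(bids)   # pad: unassigned => weight 0
    payments = {}
    for pos, i in enumerate(order):
        # Without sponsor i, every lower-ranked sponsor moves up one batch.
        payments[i] = sum(bids[order[j]] * (w[j - 1] - w[j])
                          for j in range(pos + 1, len(order)))
    return order, payments

# Three sponsors bidding for two privileged (low-delay) batches.
order, payments = vcg_position_auction([10.0, 4.0, 6.0], [1.0, 0.5])
print(order, payments)
```

In this toy run the highest bidder wins the earliest batch and pays the externality it imposes on the two sponsors pushed into later batches; the lowest-ranked sponsor pays nothing because it displaces no one.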
|
Leon Ruppen, Dependent Learning of Entity Vectors for Entity Alignment on Knowledge Graphs, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
The linking of corresponding entities between multiple knowledge graphs (KGs) is known as entity alignment. This thesis introduces Dependent Learning of Entity Vectors (DELV), an embedding-based method for entity alignment. In an iterative fashion, the method learns a low-dimensional vector representation for the entities in a satellite model in dependence on a pretrained central model. Word2vec and rdf2vec constitute the basis of the embedding learning process. DELV is evaluated on real-world datasets originating from three knowledge graphs: DBpedia, Wikidata, and Freebase. DELV outperforms most of its baselines in terms of mean rank, hits@1, and hits@10. While entity alignment is normally performed on two KGs, this thesis also demonstrates how DELV can be used efficiently to align an arbitrary number of KGs. |
|
Benedikt Bleyer, Exploring Context-aware Stream Processing, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
Today's data is continuously produced by companies, private individuals, and sensors, so it should also be processed continuously. An increasing number of streaming use cases require models and systems that can adapt their processing to changes in the application context and that can integrate various information types, such as context, facts, and background information, to deliver valuable near-real-time insights. This thesis proposes a model for Context-aware, Facts and Background integrated dynamic Stream Processing (CoFaBidSP). The evaluation results for the implemented prototype show that the metrics run time and events per second remain nearly constant, even when additional functionality, such as the integration of various information types, is included in dynamic stream processing. |
|
Sandro Luck, Utilizing Eccentric User Preferences and Negative Feedback to Improve Recommendation Quality, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
User satisfaction in recommender systems depends on many factors other than prediction accuracy. People also value qualities like variety, novelty, and diversity. In this work, we explore two different areas to increase the quality and diversity of recommendations using well-known Collaborative Filtering techniques.
In the first problem, we focus on Two-Class Collaborative Filtering, where the goal is to recommend more positive items, while reducing the number of negative items at the top of recommendation lists.
Modeling user behavior by accounting for negative preferences has been shown to produce more diverse and accurate recommendations.
In this work, we extend the recently developed Collaborative Metric Learning by modeling negative choices.
We show with experimental results on openly available datasets that our method is able to improve recommendation quality and reduce the number of negative recommendations at the top.
In the second problem, we address improving recommendation diversity.
Not all users prefer niche items to the same extent, and it is important to diversify recommendations accordingly. We explore the concepts of item controversy and eccentricity and develop a new method to recommend niche items to users based on their inclination towards such items. Our experiments show that our method is able to diversify the recommendations while achieving competitive or better accuracy in most cases.
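The extension described in the first part, treating explicitly disliked items as hard negatives in a metric-learning objective, can be sketched with a triplet hinge update. Everything below (dimensions, learning rate, margin, and the toy interaction) is an illustrative assumption, not the thesis's exact model.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_users, n_items = 4, 3, 6

# Toy user and item embeddings in a shared metric space.
U = rng.normal(scale=0.1, size=(n_users, dim))
V = rng.normal(scale=0.1, size=(n_items, dim))

def step(u, pos, neg, lr=0.05, margin=0.5):
    """One hinge-loss update: pull the liked item toward the user and push
    the explicitly disliked item away, in the style of metric learning."""
    d_pos = np.sum((U[u] - V[pos]) ** 2)
    d_neg = np.sum((U[u] - V[neg]) ** 2)
    if d_pos + margin > d_neg:               # margin violated: update
        U[u] -= lr * 2 * (V[neg] - V[pos])   # gradient of d_pos - d_neg
        V[pos] -= lr * 2 * (V[pos] - U[u])   # move liked item closer
        V[neg] -= lr * 2 * (U[u] - V[neg])   # move disliked item away

def gap(u, pos, neg):
    return float(np.sum((U[u] - V[pos]) ** 2) - np.sum((U[u] - V[neg]) ** 2))

before = gap(0, 1, 4)
for _ in range(100):
    step(0, pos=1, neg=4)
after = gap(0, 1, 4)
assert after < before   # the distance gap d_pos - d_neg has shrunk
```

Because disliked items are pushed explicitly beyond the margin, they are less likely to surface at the top of the user's recommendation list, which is the behaviour the first part of the thesis targets.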
|
|