Tobias Grubenmann, Monetization strategies for the Web of Data, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
 
|
|
Lukas Vollenweider, Topic Extraction and Visualisation of Digitalisation Related Research from ZORA, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
 
Due to the rapid increase in the number of documents available on the Internet, methods are needed that can convey the content of the data without requiring the documents themselves to be read. Such methods, called topic models, already exist, but they tend to work well only for long documents. This work analyses current state-of-the-art topic models and also presents our own context-sensitive approaches on a restricted data set built from abstracts. The best results are then visualised to improve the interpretability of the data. |
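As a point of reference for the kind of model compared in the thesis, the following minimal sketch applies a standard (non-context-sensitive) topic model, LDA from scikit-learn, to a handful of short, abstract-like texts. The example texts and parameter values are illustrative and not taken from the thesis.

```python
# Minimal LDA baseline on short, abstract-like texts (illustrative data and
# parameters; not the context-sensitive models developed in the thesis).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "topic models extract latent themes from document collections",
    "neural networks learn representations for image classification",
    "latent dirichlet allocation assigns topics to words in documents",
    "convolutional networks improve accuracy on vision benchmarks",
]

# Bag-of-words representation; short texts give sparse counts.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Two topics for this tiny corpus; real runs would tune n_components.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic to make the result interpretable.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")
```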
|
Shen Gao, Efficient Processing and Reasoning of Semantic Streams, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
 
The digitalization of our society creates a large number of data streams, such as stock tickers, tweets, and sensor data. Making use of these streams has tremendous value. In the Semantic Web context, live information is queried from the streams in real time. Knowledge is discovered by integrating streams with data from heterogeneous sources. Moreover, insights hidden in the streams are inferred and extracted by logical reasoning.
Handling large and complex streams in real-time challenges the capabilities of current systems. Therefore, this thesis studies how to improve the efficiency of processing and reasoning over semantic streams. It is composed of three projects that deal with different research problems motivated by real-world use cases. We propose new methods to address these problems and implement systems to test our hypotheses based on real datasets.
The first project focuses on the problem that sudden increases in the input stream rate overload the system, causing reduced or unacceptable performance. We propose an eviction technique that, when a spike in the input data rate happens, discards data from the system to maintain the response latency at the cost of lower recall. The novelty of our solution lies in a data-aware approach that carefully prioritizes the data and evicts the less important items to achieve a high result recall.
The second project studies complex queries that need to integrate streams with remote and external background data (BGD). Accessing remote BGD is a very expensive process in terms of both latency and financial cost. We propose several methods to minimize the cost by exploiting the query and the data patterns. Our system only needs to retrieve data that are more critical to answer the query and avoids wasting resources on the remaining data in BGD.
Lastly, as noise is inevitable in real-world semantic streams, the third project investigates how to use logical reasoning to identify and exclude the noise from high-volume streams. We adopt a distributed stream processing engine (DSPE) to achieve scalability. On top of a DSPE, we optimize the reasoning procedures by balancing the costs of computation and communication. Therefore, reasoning tasks are compiled into efficient DSPE workflows that can be deployed across large-scale computing clusters. |
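To make the eviction idea of the first project concrete, here is a minimal sketch of a bounded buffer that, when full, drops the lowest-priority item instead of blocking. The priority scores and capacity are placeholders; the actual system's data-aware prioritization is more sophisticated.

```python
import heapq

class PriorityEvictingBuffer:
    """Bounded buffer that evicts the lowest-priority item when full.

    Illustrative stand-in for data-aware eviction: 'priority' here is an
    arbitrary importance score supplied by the caller, not the scoring
    used in the actual system.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []          # min-heap of (priority, seq, item)
        self._seq = 0            # tie-breaker so entries stay comparable

    def push(self, item, priority):
        entry = (priority, self._seq, item)
        self._seq += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return None                      # nothing evicted
        # Buffer full: keep the new item only if it outranks the current minimum.
        if priority > self._heap[0][0]:
            evicted = heapq.heappushpop(self._heap, entry)
            return evicted[2]                # the evicted (less important) item
        return item                          # the new item itself is dropped

    def drain(self):
        return [item for _, _, item in sorted(self._heap)]

# During a spike, low-priority observations are dropped, trading recall
# for a bounded buffer size (and hence bounded latency).
buf = PriorityEvictingBuffer(capacity=3)
for i, prio in enumerate([0.9, 0.1, 0.5, 0.8, 0.2]):
    buf.push(f"obs-{i}", prio)
print(buf.drain())   # keeps the three highest-priority observations
```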
|
Cristina Sarasua, Alessandro Checco, Gianluca Demartini, Djellel Difallah, Michael Feldman, Lydia Pintscher, The Evolution of Power and Standard Wikidata Editors: Comparing Editing Behavior over Time to Predict Lifespan and Volume of Edits, Journal of Computer Supported Cooperative Work, 2018. (Journal Article)

Knowledge bases are becoming a key asset leveraged for various types of applications on the Web, from search engines presenting 'entity cards' as the result of a query, to the use of structured data from knowledge bases to empower virtual personal assistants. Wikidata is an open general-interest knowledge base that is collaboratively developed and maintained by a community of thousands of volunteers. One of the major challenges faced in such a crowdsourcing project is to attain a high level of editor engagement. In order to intervene and encourage editors to be more committed to editing Wikidata, it is important to be able to predict at an early stage whether or not an editor will become an engaged editor. In this paper, we investigate this problem and study the evolution that editors with different levels of engagement exhibit in their editing behaviour over time. We measure an editor's engagement in terms of (i) the volume of edits provided by the editor and (ii) their lifespan (i.e., the length of time for which an editor is present at Wikidata). The large-scale longitudinal data analysis that we perform covers Wikidata edits over almost 4 years. We monitor evolution on a session-by-session and monthly basis, observing how the participation, the volume, and the diversity of edits done by Wikidata editors change. Using the findings of our exploratory analysis, we define and implement prediction models that use these multiple evolution indicators. |
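As a rough illustration of the prediction task described above (not the paper's actual features, data, or models), the following sketch trains a logistic-regression classifier on invented early-activity indicators to predict whether an editor remains engaged.

```python
# Illustrative engagement prediction on synthetic data; the features
# (edits in the first week, gaps between sessions) and labels are
# invented, not the evolution indicators used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
edits_first_week = rng.poisson(5, n)
mean_gap_days = rng.exponential(3, n)
# Synthetic rule: frequent early editors with short gaps tend to stay.
stays_engaged = (edits_first_week > 4) & (mean_gap_days < 3)

X = np.column_stack([edits_first_week, mean_gap_days])
y = stays_engaged.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```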
|
Markus Christen, Sabine Müller, The ethics of expanding applications of deep brain stimulation, In: The Routledge Handbook of Neuroethics, Taylor & Francis, New York, USA, p. 51 - 65, 2018. (Book Chapter)
 
|
|
Michael Feldman, Adir Even, Yisrael Parmet, A Methodology for Quantifying the Effect of Missing Data on Decision Quality in Classification Problems, Communications in Statistics - Theory and Methods, Vol. 47 (11), 2018. (Journal Article)
 
Decision-making is often supported by decision models. This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels: data, model, and decision. The general framework is first developed at a high level, and then extended further toward understanding the effect of incomplete datasets on Linear Discriminant Analysis (LDA) classifiers. The interplay between the three quality levels is evaluated analytically, initially for a one-dimensional case and then for multiple dimensions. The impact is then further analyzed through several simulative experiments with artificial and real-world datasets. The experimental results support the analytical development and reveal a nearly exponential decline in the decision error as the completeness level increases. To conclude, we discuss the framework and the empirical findings, and elaborate on the implications of our model for data quality management and for the use of data in decision-model estimation. |
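To make the relationship between completeness and decision error concrete, the following sketch trains an LDA classifier on synthetic data at several completeness levels, using simple mean imputation. It is only an illustration, not the paper's analytical framework or experimental setup.

```python
# Synthetic illustration of how decreasing completeness degrades an LDA
# classifier; mean imputation and the data-generating process are
# simplifications, not the paper's methodology.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 5
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, d)) + y[:, None] * 1.5   # two shifted Gaussian classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for completeness in [1.0, 0.8, 0.6, 0.4]:
    X_miss = X_train.copy()
    mask = rng.random(X_miss.shape) > completeness   # MCAR missingness
    X_miss[mask] = np.nan
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_miss)
    lda = LinearDiscriminantAnalysis().fit(X_imputed, y_train)
    error = 1 - lda.score(X_test, y_test)
    print(f"completeness={completeness:.1f}  test error={error:.3f}")
```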
|
Daniele Dell'Aglio, Emanuele Della Valle, Frank van Harmelen, Abraham Bernstein, Stream reasoning: A survey and outlook: A summary of ten years of research and a vision for the next decade, Data Science, Vol. 1 (1-2), 2017. (Journal Article)
 
Stream reasoning studies the application of inference techniques to data characterised by being highly dynamic. It can find application in several settings, from Smart Cities to Industry 4.0, from the Internet of Things to Social Media analytics. This year stream reasoning turns ten, and in this article we analyse its growth. In the first part, we trace the main results obtained so far by presenting the most prominent studies. We start with an overview of the most relevant studies developed in the context of the semantic web, and then we extend the analysis to include contributions from adjacent areas, such as databases and artificial intelligence. Looking at the past is useful to prepare for the future: in the second part, we present a set of open challenges and issues that stream reasoning will face in the near future. |
|
Jennifer Duchetta, Optimization of a Monitoring System for Preterm Infants, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
 
For the optimization of a monitoring system for preterm infants, two specific objectives were pursued in this thesis. On the one hand, the possibility of measuring heart rate and arterial oxygenation with a NIRS device was examined; various techniques were tested and compared. The second objective was to compare classifiers for the specific task of lowering the false alarm rate without missing any real alarms. Based on the analysis of the ROC curve, the 2-Nearest Neighbor proved to be the most effective classifier. |
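A minimal version of the classifier comparison described above might look like the sketch below: a 2-nearest-neighbour classifier fitted on synthetic, imbalanced alarm data, followed by a ROC analysis that looks for the lowest false-alarm rate achievable without missing a real alarm. The data are generated, not the thesis's monitoring signals.

```python
# Illustrative 2-NN alarm classifier with ROC analysis on synthetic data;
# the real thesis works with preterm-infant monitoring signals.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)
scores = knn.predict_proba(X_test)[:, 1]          # probability of a real alarm

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))

# Pick the lowest false-positive rate among operating points that keep
# recall at 1.0, i.e. reduce false alarms without missing any real alarm.
candidates = [(f, thr) for f, t, thr in zip(fpr, tpr, thresholds) if t == 1.0]
if candidates:
    print("lowest FPR at full recall:", min(candidates)[0])
```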
|
Daniele Dell'Aglio, Danh Le Phuoc, Anh Le-Tuan, Muhammed Intizar Ali, Jean-Paul Calbimonte, On a Web of Data Streams, In: ISWC2017 workshop on Decentralizing the Semantic Web, s.n., 2017-10-22. (Conference or Workshop Paper published in Proceedings)
 
With the growing adoption of IoT and sensor technologies, an enormous amount of data is being produced at a very rapid pace and in different application domains. This sensor data consists mostly of live data streams containing sensor observations, generated in a distributed fashion by multiple heterogeneous infrastructures with minimal or no interoperability. RDF streams emerged as a model to represent data streams, and RDF Stream Processing (RSP) refers to a set of technologies to process such data. RSP research has produced several successful results and scientific output, but in most cases the Web dimension is marginal or missing. There is also a noticeable lack of proper infrastructure to enable the exchange of RDF streams across heterogeneous and different types of RSP systems, whose features may vary from data generation to querying, and from reasoning to visualisation. This article defines a set of requirements related to the creation of a web of RDF stream processors. These requirements are then used to analyse the current state of the art and to build a novel proposal, WeSP, which addresses these concerns. |
|
Matt Dennis, Kees van Deemter, Daniele Dell'Aglio, Jeff Z Pan, Computing Authoring Tests from Competency Questions: Experimental Validation, In: 16th International Semantic Web Conference, Springer International Publishing, Cham, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
 
|
|
Tobias Grubenmann, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Challenges of source selection in the WoD, In: ISWC 2017 - The 16th International Semantic Web Conference, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
 
Federated querying, the idea to execute queries over several distributed knowledge bases, lies at the core of the semantic web vision. To accommodate this vision, SPARQL provides the SERVICE keyword that allows one to allocate sub-queries to servers. In many cases, however, data may be available from multiple sources resulting in a combinatorially growing number of alternative allocations of subqueries to sources.
Running a federated query on all possible sources might not be very attractive from a user's point of view if extensive execution times or fees are involved in accessing the sources' data. To address this shortcoming, federated join-cardinality approximation techniques have been proposed to narrow down the number of possible allocations to a few most promising (or result-yielding) ones.
In this paper, we analyze the usefulness of cardinality approximation for source selection. We empirically compare both the runtime and accuracy of Bloom Filters and elaborate on their suitability and limitations for different kinds of queries. As we show, the performance of cardinality approximations of federated SPARQL queries degenerates when applied to queries with multiple joins of low selectivity. We generalize our results analytically to any estimation technique exhibiting false positives.
These findings argue for a renewed effort to find novel join-cardinality approximation techniques, or for a change of paradigm in query execution towards settings where such estimations play a less important role. |
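The false-positive effect analysed above can be illustrated with a toy Bloom-filter join-cardinality estimate: one source's join keys populate a deliberately small filter, and probing it with the other source's keys overestimates the overlap. The filter size, hash scheme, and data below are illustrative, not the federated SPARQL setting of the paper.

```python
# Toy Bloom-filter join-cardinality estimate; false positives inflate the
# estimate, and the effect grows when joins are not selective. Sizes and
# data are illustrative.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

left_keys = {f"entity{i}" for i in range(60)}
right_keys = {f"entity{i}" for i in range(55, 200)}   # true overlap: 5 keys

bf = BloomFilter()
for key in left_keys:
    bf.add(key)

estimated = sum(1 for key in right_keys if bf.might_contain(key))
true = len(left_keys & right_keys)
print(f"true join keys: {true}, Bloom-filter estimate: {estimated}")
# With a small, saturated filter the estimate overshoots: every false
# positive suggests a source worth contacting even though it yields nothing.
```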
|
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitry Moor, Sven Seuken, Decentralizing the Semantic Web: Who will pay to realize it?, In: ISWC2017 workshop on Decentralizing the Semantic Web, OpenReview, 2017-10-20. (Conference or Workshop Paper published in Proceedings)
 
Fueled by the enthusiasm of volunteers, government subsidies, and open data legislation, the Web of Data (WoD) has enjoyed phenomenal growth. Commercial data, however, has been stuck in proprietary silos, as the monetization strategy for sharing data in the WoD is unclear. This is in contrast to the traditional Web, where advertisement fueled much of the growth. This raises the question of how the WoD can (i) maintain its success when government subsidies disappear and (ii) convince commercial entities to share their wealth of data.
In this talk based on a paper, we propose a marketplace for decentralized data following basic WoD principles. Our approach allows a customer to buy data from different, decentralized providers in a transparent way. As such, our marketplace presents a first step towards an economically viable WoD beyond subsidies. |
|
Patrick De Boer, Marcel C. Bühler, Abraham Bernstein, Expert estimates for feature relevance are imperfect, In: DSAA2017 - The 4th IEEE International Conference on Data Science and Advanced Analytics, Tokyo, 2017. (Conference or Workshop Paper published in Proceedings)
 
An early step in the knowledge discovery process is deciding on what data to look at when trying to predict a given target variable. Most of KDD so far is focused on the workflow after data has been obtained, or settings where data is readily available and easily integrable for model induction. However, in practice, this is rarely the case, and many times data requires cleaning and transformation before it can be used for feature selection and knowledge discovery. In such environments, it would be costly to obtain and integrate data that is not relevant to the predicted target variable. To reduce the risk of such scenarios in practice, we often rely on experts to estimate the value of potential data based on its meta information (e.g. its description). However, as we will find in this paper, experts perform abysmally at this task. We therefore developed a methodology, KrowDD, to help humans estimate how relevant a dataset might be based on such meta data. We evaluate KrowDD on 3 real-world problems and compare its relevancy estimates with data scientists’ and domain experts’. Our findings indicate large possible cost savings when using our tool in bias-free environments, which may pave the way for lowering the cost of classifier design in practice. |
|
Patrick De Boer, Crowd process design : how to coordinate crowds to solve complex problems, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Dissertation)
 
The Internet facilitates an on-demand workforce, able to dynamically scale up and down depending on the requirements of a given project. Such crowdsourcing is increasingly used to engage workers available online. Similar to organizational design, where business processes are used to organize and coordinate employees, so-called crowd processes can be employed to facilitate work on a given problem. But as with business processes, it is unclear which crowd process performs best for a problem at hand. Aggravating the problem further, the impersonal, usually short-lived, relationship between an employer and crowd workers leads to major challenges in the organization of (crowd-) labor in general.
In this dissertation, we explore crowd process design. We start by finding a crowd process for a specific use case. We then outline a potential remedy for the more general problem of finding a crowd process for any use case.
The specific use case we focus on first is an expert task: part of the review of the statistical validity of research papers. Researchers often use statistical methods, such as the t-test or ANOVA, to evaluate hypotheses. Recently, the use of such methods has been called into question. One of the reasons is that many studies fail to check the underlying assumptions of the employed statistical methods. This threatens the statistical validity of a study and hampers the reuse of results. We propose an automated approach for checking the reporting of statistical assumptions. Our crowd process identifies reported assumptions in research papers with 85% accuracy.
Finding this crowd process took us more than a year, due to the trial-and-error approach underlying current crowd process design, where in some cases a candidate crowd process was not reliable enough, in some cases it was too expensive, and in others it took too long to complete. We address this issue in a more generic manner, through the automatic recombination of crowd processes for a given problem at hand based on an extensible repository of existing crowd process fragments. The potentially large number of candidate crowd processes derived for a given problem is subjected to Auto-Experimentation in order to identify a candidate matching a user’s performance requirements. We implemented our approach as an Open Source system and called it PPLib (pronounced “People Lib”). PPLib is validated in two real-world experiments corresponding to two common crowdsourcing problems, where PPLib successfully identified crowd processes performing well for the respective problem domains.
In order to reduce the search cost for Auto-Experimentation, we then propose to use black-box optimization to identify a well-performing crowd process among a set of candidates. Specifically, we adopt Bayesian Optimization to approximate the maximum of a utility function quantifying the user’s (business-) objectives while minimizing search cost. Our approach was implemented as an extension to PPLib and validated in a simulation and three real-world experiments.
Through an effective means to generate crowd process candidates for a given problem by recombination and by reducing the entry barriers to using black-box optimization for crowd process selection, PPLib has the potential to automate the tedious trial-and-error underlying the construction of a large share of today’s crowd powered systems. Given the trends of an ever more connected future, where on-demand labor likely plays a key role, an efficient approach to organizing crowds is paramount. PPLib helps pave the way to an automated solution for this problem. |
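As a simplified stand-in for the automatic selection of a crowd process among candidates, the sketch below uses a UCB bandit over simulated utilities rather than the Bayesian Optimization implemented in PPLib; the candidate names and utility values are invented.

```python
# Simplified stand-in for automatic crowd-process selection: a UCB bandit
# chooses which candidate process to run next, balancing exploration and
# exploitation. PPLib itself uses Bayesian Optimization; the utilities
# below are simulated, not real crowd results.
import math
import random

random.seed(0)

# Hypothetical candidate processes with unknown true utility (simulated here).
true_utility = {"iterative": 0.55, "find-fix-verify": 0.72, "contest": 0.61}

def run_crowd_process(name):
    """Simulate one noisy experiment with the given crowd process."""
    return true_utility[name] + random.gauss(0, 0.1)

counts = {name: 0 for name in true_utility}
totals = {name: 0.0 for name in true_utility}

for t in range(1, 61):
    def ucb(name):
        if counts[name] == 0:
            return float("inf")                  # try every candidate once
        mean = totals[name] / counts[name]
        return mean + math.sqrt(2 * math.log(t) / counts[name])
    choice = max(true_utility, key=ucb)          # highest upper confidence bound
    reward = run_crowd_process(choice)
    counts[choice] += 1
    totals[choice] += reward

best = max(true_utility, key=lambda n: totals[n] / counts[n])
print("selected crowd process:", best, "runs per candidate:", counts)
```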
|
Cyrus Einsele, SPARQL query evaluation in Big Data processors, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
This thesis investigates the use of Big Data processors, namely Apache Beam, to extract information from Semantic Web data, specifically RDF. We present a Java application that can match SPARQL basic graph patterns within a stream of triples. It does so using the open-source Big Data processing framework Apache Beam, which is largely based upon Google Cloud Dataflow. We describe the principles of the algorithm used to achieve this goal and evaluate the correctness and performance of the application by comparing it to a traditional SPARQL query execution engine, namely Apache Jena ARQ. |
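The core matching step can be illustrated without Beam: the sketch below evaluates a two-pattern basic graph pattern over a list of triples with a hash join on the shared variable. It is plain Python for brevity; the thesis expresses the equivalent logic as Apache Beam transforms, and the data and pattern here are invented.

```python
# Plain-Python illustration of matching a two-pattern SPARQL BGP
#   ?person <knows> ?friend .  ?friend <worksAt> ?org .
# over a stream of triples; the thesis implements the same join as
# Apache Beam transforms. Data and pattern are invented.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "UZH"),
    ("alice", "knows", "carol"),
    ("carol", "worksAt", "ETH"),
    ("dave", "worksAt", "UZH"),
]

def match_pattern(triple, pattern):
    """Return a variable binding if the triple matches the pattern, else None."""
    binding = {}
    for term, pat in zip(triple, pattern):
        if pat.startswith("?"):
            binding[pat] = term
        elif pat != term:
            return None
    return binding

p1 = ("?person", "knows", "?friend")
p2 = ("?friend", "worksAt", "?org")

# Symmetric hash join on the shared variable ?friend.
left = {}    # ?friend -> bindings from p1
right = {}   # ?friend -> bindings from p2
results = []
for t in triples:                      # in the Beam version this is a PCollection
    b1 = match_pattern(t, p1)
    if b1:
        left.setdefault(b1["?friend"], []).append(b1)
        for b2 in right.get(b1["?friend"], []):
            results.append({**b1, **b2})
    b2 = match_pattern(t, p2)
    if b2:
        right.setdefault(b2["?friend"], []).append(b2)
        for b1 in left.get(b2["?friend"], []):
            results.append({**b1, **b2})

print(results)   # e.g. {'?person': 'alice', '?friend': 'bob', '?org': 'UZH'}
```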
|
Christoph Weber, A platform to integrate heterogeneous data, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
Heterogeneity is a very common and complex problem that arises in information systems. Gathering data from different data sources leads to data heterogeneity at the system, format, schema, and semantic levels. The PigData Project aims to tackle the adoption of Big Data methods by the Swiss swine and pork production industry. Many actors in this supply chain collect data with overlapping targets but differing views and representations. To integrate this data for Big Data use, this heterogeneity must be overcome. We elicit the stakeholder requirements for a data integration platform for the PigData Project and apply these requirements to a prototype. The scope of this data integration platform covers data upload over the web, integration of data into a common representation, and the ability for data analysts to execute queries on the integrated data. |
|
Vidyadhar Rao, KV Rosni, Vineet Padmanabhan, Divide and Transfer: Understanding Latent Factors for Recommendation Tasks, In: Proceedings of the 1st Workshop on Intelligent Recommender Systems by Knowledge Transfer & Learning co-located with ACM Conference on Recommender Systems (RecSys 2017), Como, Italy, 2017. (Conference or Workshop Paper published in Proceedings)

|
|
Bibek Paudel, Thilo Haas, Abraham Bernstein, Fewer Flops at the Top: Accuracy, Diversity, and Regularization in Two-Class Collaborative Filtering, In: 11th ACM Conference on Recommender Systems RecSys 2017, ACM Press, New York, NY, USA, 2017-08-27. (Conference or Workshop Paper published in Proceedings)
 
In most existing recommender systems, implicit or explicit interactions are treated as positive links and all unknown interactions are treated as negative links. The goal is to suggest new links that will be perceived as positive by users. However, as signed social networks and newer content services become common, it is important to distinguish between positive and negative preferences. Even in existing applications, the cost of a negative recommendation could be high when people are looking for new jobs, friends, or places to live.
In this work, we develop novel probabilistic latent factor models to recommend positive links and compare them with existing methods on five different openly available datasets. Our models are able to produce better ranking lists and are effective in the task of ranking positive links at the top, with fewer negative links (flops). Moreover, we find that modeling signed social networks and user preferences this way has the advantage of increasing the diversity of recommendations. We also investigate the effect of regularization on the quality of recommendations, a matter that has not received enough attention in the literature. We find that the regularization parameter heavily affects the quality of recommendations in terms of both accuracy and diversity. |
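The flavour of such a model and the role of the regularization parameter can be sketched with a tiny signed-link matrix factorization trained by SGD. The dimensions, learning rate, and data are illustrative; the paper's probabilistic latent factor models are more elaborate.

```python
# Tiny signed-link matrix factorization with logistic loss and L2
# regularization, trained by SGD. Illustrative only: the paper's
# probabilistic latent factor models and datasets are more elaborate.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 30, 4

# Signed observations: (user, item, +1 positive link / -1 negative link).
observations = [(rng.integers(n_users), rng.integers(n_items),
                 rng.choice([1, -1])) for _ in range(300)]

def train(reg, epochs=50, lr=0.05):
    U = rng.normal(0, 0.1, (n_users, k))
    V = rng.normal(0, 0.1, (n_items, k))
    for _ in range(epochs):
        for u, i, s in observations:
            score = U[u] @ V[i]
            # Gradient of the logistic loss log(1 + exp(-s * score)).
            grad = -s / (1 + np.exp(s * score))
            U[u] -= lr * (grad * V[i] + reg * U[u])
            V[i] -= lr * (grad * U[u] + reg * V[i])
    return U, V

for reg in [0.0, 0.01, 0.1]:
    U, V = train(reg)
    scores = np.array([U[u] @ V[i] for u, i, _ in observations])
    labels = np.array([s for _, _, s in observations])
    accuracy = np.mean(np.sign(scores) == labels)
    print(f"reg={reg:<5} training accuracy={accuracy:.2f}")
```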
|
Nicola Staub, Revealing the inherent variability in data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
 
The variation in data analysis has recently been recognized as one of the major reasons for the reproducibility crisis in science. Many scientific findings have been shown to be statistically significant, which, however, is not necessarily an indication that the results are indeed meaningful. Many factors play a role in the analytical choices a data analyst makes during an analysis. The goal of this thesis is to find potential factors which can explain the variability in data analysis. With a platform designed specifically for this thesis, rationales for different analytical choices along the path of a data analysis were elicited. The result of this thesis is a system of factors that allows data analysis workflows to be examined at different levels of depth. |
|
Michael Feldman, Frida Juldaschewa, Abraham Bernstein, Data Analytics on Online Labor Markets: Opportunities and Challenges, In: ArXiv.org, No. 1707.01790, 2017. (Working Paper)
 
The data-driven economy has led to a significant shortage of data scientists. To address this shortage, this study explores the prospects of outsourcing data analysis tasks to freelancers available on online labor markets (OLMs) by identifying the essential factors for this endeavor. Specifically, we explore the skills required from freelancers, collect information about the skills present on major OLMs, and identify the main hurdles for out-/crowd-sourcing data analysis. Adopting a sequential mixed-method approach, we interviewed 20 data scientists and subsequently surveyed 80 respondents from OLMs. Besides confirming the need for expected skills such as technical/mathematical capabilities, it also identifies less known ones such as domain understanding, an eye for aesthetic data visualization, good communication skills, and a natural understanding of the possibilities/limitations of data analysis in general. Finally, it elucidates obstacles for crowdsourcing like the communication overhead, knowledge gaps, quality assurance, and data confidentiality, which need to be mitigated. |
|