Cristina Sarasua, Alessandro Checco, Gianluca Demartini, Djellel Difallah, Michael Feldman, Lydia Pintscher, The Evolution of Power and Standard Wikidata Editors: Comparing Editing Behavior over Time to Predict Lifespan and Volume of Edits, Journal of Computer Supported Cooperative Work, 2018. (Journal Article)

Knowledge bases are becoming a key asset leveraged for various types of applications on the Web, from search engines presenting ‘entity cards’ as the result of a query, to the use of the structured data of knowledge bases to empower virtual personal assistants. Wikidata is an open general-interest knowledge base that is collaboratively developed and maintained by a community of thousands of volunteers. One of the major challenges faced in such a crowdsourcing project is to attain a high level of editor engagement. In order to intervene and encourage editors to be more committed to editing Wikidata, it is important to be able to predict at an early stage whether or not an editor will become an engaged editor. In this paper, we investigate this problem and study the evolution that editors with different levels of engagement exhibit in their editing behaviour over time. We measure an editor's engagement in terms of (i) the volume of edits provided by the editor and (ii) their lifespan (i.e., the length of time for which an editor is present at Wikidata). The large-scale longitudinal data analysis that we perform covers Wikidata edits over almost 4 years. We monitor evolution on a session-by-session and monthly basis, observing how the participation, volume, and diversity of edits made by Wikidata editors change. Using the findings of our exploratory analysis, we define and implement prediction models that use these multiple evolution indicators. |
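To make the prediction task above concrete, the following minimal sketch (not the authors' actual pipeline) derives two simple engagement signals, early edit volume and number of active days, from a hypothetical edit log and labels editors as engaged when their lifespan exceeds one year. The log schema, the 30-day early window, and the one-year threshold are illustrative assumptions.

```python
# Hedged sketch: predicting editor engagement from early activity.
# The edit-log schema (editor_id, timestamp), the 30-day early window, and the
# one-year lifespan threshold are illustrative assumptions, not the paper's setup.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def engagement_features(edits: pd.DataFrame, early_days: int = 30) -> pd.DataFrame:
    """Aggregate per-editor features from the first `early_days` of activity."""
    edits = edits.sort_values("timestamp")
    first_edit = edits.groupby("editor_id")["timestamp"].transform("min")
    early = edits[edits["timestamp"] <= first_edit + pd.Timedelta(days=early_days)]
    feats = early.groupby("editor_id").agg(
        early_edit_volume=("timestamp", "size"),
        active_days=("timestamp", lambda t: t.dt.date.nunique()),
    )
    lifespan = edits.groupby("editor_id")["timestamp"].agg(lambda t: t.max() - t.min())
    feats["engaged"] = (lifespan >= pd.Timedelta(days=365)).astype(int)
    return feats

# Usage with a hypothetical edit log:
# log = pd.read_csv("wikidata_edits.csv", parse_dates=["timestamp"])
# feats = engagement_features(log)
# X, y = feats[["early_edit_volume", "active_days"]], feats["engaged"]
# print(cross_val_score(RandomForestClassifier(), X, y, cv=5).mean())
```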
|
Steffen Hölldobler, Ausgezeichnete Informatikdissertationen 2016, Köllen Druck + Verlag GmbH, Bonn, 2018. (Book/Research Monograph)

|
|
Markus Christen, Sabine Müller, The ethics of expanding applications of deep brain stimulation, In: The Routledge Handbook of Neuroethics, Taylor & Francis, New York, USA, p. 51 - 65, 2018. (Book Chapter)
 
|
|
Michael Feldman, Adir Even, Yisrael Parmet, A Methodology for Quantifying the Effect of Missing Data on Decision Quality in Classification Problems, Communications in Statistics. Theory and Methods, Vol. 47 (11), 2018. (Journal Article)
 
Decision-making is often supported by decision models. This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels – data, model, and decision. The general framework is first developed at a high level, and then extended further toward understanding the effect of incomplete datasets on Linear Discriminant Analysis (LDA) classifiers. The interplay between the three quality levels is evaluated analytically, initially for a one-dimensional case and then for multiple dimensions. The impact is then further analyzed through several simulative experiments with artificial and real-world datasets. The experimental results support the analytical development and reveal a nearly exponential decline in the decision error as the completeness level increases. To conclude, we discuss the framework and the empirical findings, and elaborate on the implications of our model for data quality management and for the use of data in decision-model estimation. |
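The reported nearly exponential decline can be probed with a toy simulation along the following lines; this is a sketch under assumed settings (synthetic two-class data, mean imputation, an arbitrary completeness grid), not the paper's experimental design.

```python
# Hedged sketch: decision error of an LDA classifier vs. data completeness.
# Synthetic data, mean imputation, and the completeness grid are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(1.5, 1, (500, 3))])
y = np.repeat([0, 1], 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for completeness in [0.5, 0.7, 0.9, 1.0]:
    X_miss = X_tr.copy()
    mask = rng.random(X_miss.shape) > completeness        # cells to delete
    X_miss[mask] = np.nan
    col_means = np.nanmean(X_miss, axis=0)                # mean imputation
    X_miss = np.where(np.isnan(X_miss), col_means, X_miss)
    err = 1 - LinearDiscriminantAnalysis().fit(X_miss, y_tr).score(X_te, y_te)
    print(f"completeness={completeness:.1f}  decision error={err:.3f}")
```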
|
Daniele Dell'Aglio, Emanuele Della Valle, Frank van Harmelen, Abraham Bernstein, Stream reasoning: A survey and outlook: A summary of ten years of research and a vision for the next decade, Data Science, Vol. 1 (1-2), 2017. (Journal Article)
 
Stream reasoning studies the application of inference techniques to data characterised by being highly dynamic. It can find application in several settings, from Smart Cities to Industry 4.0, from the Internet of Things to Social Media analytics. This year stream reasoning turns ten, and in this article we analyse its growth. In the first part, we trace the main results obtained so far by presenting the most prominent studies. We start with an overview of the most relevant studies developed in the context of the semantic web, and then extend the analysis to include contributions from adjacent areas, such as databases and artificial intelligence. Looking at the past is useful to prepare for the future: in the second part, we present a set of open challenges and issues that stream reasoning will face in the near future. |
|
Jennifer Duchetta, Optimization of a Monitoring System for Preterm Infants, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
 
For the optimization of a monitoring system for preterm infants, two specific objectives were pursued in this thesis. On the one hand, the possibility of measuring heart rate and arterial oxygenation with a NIRS device was examined; various techniques were tested and compared. The second objective was to compare classifiers for the specific task of lowering the false-alarm rate without missing any real alarms. Based on the analysis of the ROC curve, the 2-Nearest Neighbor classifier proved to be the most effective. |
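As a hedged illustration of the evaluation described above, a 2-Nearest Neighbor classifier can be compared against an alternative via ROC analysis as in the sketch below. The synthetic features and alarm labels are placeholders, since the thesis's monitoring data is not part of the abstract.

```python
# Hedged sketch: comparing a 2-NN classifier against an alternative via ROC/AUC
# for an alarm-vs-false-alarm task. The synthetic data stands in for the
# thesis's monitoring features, which are not given in the abstract.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

for name, clf in [("2-NN", KNeighborsClassifier(n_neighbors=2)),
                  ("logistic regression", LogisticRegression())]:
    scores = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: ROC AUC = {roc_auc_score(y_te, scores):.3f}")
```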
|
Daniele Dell'Aglio, Danh Le Phuoc, Anh Le-Tuan, Muhammed Intizar Ali, Jean-Paul Calbimonte, On a Web of Data Streams, In: ISWC2017 workshop on Decentralizing the Semantic Web, s.n., 2017-10-22. (Conference or Workshop Paper published in Proceedings)
 
With the growing adoption of IoT and sensor technologies, an enormous amount of data is being produced at a very rapid pace and in different application domains. This sensor data consists mostly of live data streams containing sensor observations, generated in a distributed fashion by multiple heterogeneous infrastructures with minimal or no interoperability. RDF streams emerged as a model to represent data streams, and RDF Stream Processing (RSP) refers to a set of technologies to process such data. RSP research has produced several successful results and scientific output, but in most cases the Web dimension is marginal or missing. The lack of proper infrastructure to enable the exchange of RDF streams across heterogeneous RSP systems, whose features may vary from data generation to querying, and from reasoning to visualisation, is also noticeable. This article defines a set of requirements related to the creation of a web of RDF stream processors. These requirements are then used to analyse the current state of the art and to build a novel proposal, WeSP, which addresses these concerns. |
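One minimal way to picture an RDF stream as discussed above, a sequence of timestamped RDF triples with a simple time-window selection, is sketched below; the vocabulary and window semantics are illustrative assumptions and do not describe WeSP itself.

```python
# Hedged sketch: an RDF stream as a sequence of timestamped triples, plus a
# simple time-window selection. Vocabulary and window semantics are assumptions.
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

@dataclass(frozen=True)
class TimestampedTriple:
    triple: Triple
    timestamp: float  # seconds since epoch

def window(stream: Iterable[TimestampedTriple],
           start: float, width: float) -> Iterator[TimestampedTriple]:
    """Yield the stream elements falling inside [start, start + width)."""
    for item in stream:
        if start <= item.timestamp < start + width:
            yield item

# Usage with a tiny hypothetical sensor stream:
stream = [
    TimestampedTriple((":sensor1", ":hasReading", '"21.5"'), 100.0),
    TimestampedTriple((":sensor1", ":hasReading", '"21.7"'), 105.0),
]
print(list(window(stream, start=100.0, width=5.0)))
```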
|
Matt Dennis, Kees van Deemter, Daniele Dell'Aglio, Jeff Z Pan, Computing Authoring Tests from Competency Questions: Experimental Validation, In: 16th International Semantic Web Conference, Springer International Publishing, Cham, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
 
|
|
Tobias Grubenmann, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Challenges of source selection in the WoD, In: ISWC 2017 - The 16th International Semantic Web Conference, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
 
Federated querying, the idea of executing queries over several distributed knowledge bases, lies at the core of the semantic web vision. To accommodate this vision, SPARQL provides the SERVICE keyword, which allows one to allocate sub-queries to servers. In many cases, however, data may be available from multiple sources, resulting in a combinatorially growing number of alternative allocations of sub-queries to sources.
Running a federated query on all possible sources might not be very attractive from a user's point of view if extensive execution times or fees are involved in accessing the sources' data. To address this shortcoming, federated join-cardinality approximation techniques have been proposed to narrow down the number of possible allocations to the few most promising (or result-yielding) ones.
In this paper, we analyze the usefulness of cardinality approximation for source selection. We empirically compare both the runtime and accuracy of Bloom filters and elaborate on their suitability and limitations for different kinds of queries. As we show, the performance of cardinality approximations of federated SPARQL queries degenerates when applied to queries with multiple joins of low selectivity. We generalize our results analytically to any estimation technique exhibiting false positives.
These findings argue for a renewed effort to find novel join-cardinality approximation techniques, or for a change of paradigm in query execution toward settings where such estimations play a less important role. |
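The degeneration described above can be made concrete with a small sketch: the cardinality of a federated join is estimated by probing one source's join values against a Bloom filter built over the other source. The filter parameters and the example data are assumptions; the point is that false positives turn the estimate into an over-approximation.

```python
# Hedged sketch: approximating the cardinality of a federated join by probing
# one source's join values against a Bloom filter built over the other source.
# Filter size, hash scheme, and data are illustrative assumptions.
import hashlib

class BloomFilter:
    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, value: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value: str):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(value))

# Join values exposed by two hypothetical SPARQL endpoints:
source_a = {f"http://example.org/e{i}" for i in range(0, 2000, 2)}
source_b = {f"http://example.org/e{i}" for i in range(0, 2000, 3)}

bloom_a = BloomFilter(size=2048)
for v in source_a:
    bloom_a.add(v)

estimate = sum(1 for v in source_b if bloom_a.might_contain(v))
exact = len(source_a & source_b)
print(f"estimated join cardinality: {estimate}, exact: {exact}")
# False positives make the estimate an over-approximation; the gap widens as
# the filter fills up, i.e. for large, low-selectivity joins.
```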
|
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitry Moor, Sven Seuken, Decentralizing the Semantic Web: Who will pay to realize it?, In: ISWC2017 workshop on Decentralizing the Semantic Web, OpenReview, 2017-10-20. (Conference or Workshop Paper published in Proceedings)
 
Fueled by the enthusiasm of volunteers, government subsidies, and open data legislation, the Web of Data (WoD) has enjoyed phenomenal growth. Commercial data, however, has been stuck in proprietary silos, as the monetization strategy for sharing data in the WoD is unclear. This is in contrast to the traditional Web, where advertising fueled much of the growth. This raises the question of how the WoD can (i) maintain its success when government subsidies disappear and (ii) convince commercial entities to share their wealth of data.
In this talk, based on a paper, we propose a marketplace for decentralized data following basic WoD principles. Our approach allows a customer to buy data from different, decentralized providers in a transparent way. As such, our marketplace presents a first step towards an economically viable WoD beyond subsidies. |
|
Patrick De Boer, Marcel C. Bühler, Abraham Bernstein, Expert estimates for feature relevance are imperfect, In: DSAA2017 - The 4th IEEE International Conference on Data Science and Advanced Analytics, Tokyo, 2017. (Conference or Workshop Paper published in Proceedings)
 
An early step in the knowledge discovery process is deciding on what data to look at when trying to predict a given target variable. Most KDD work so far has focused on the workflow after data has been obtained, or on settings where data is readily available and easily integrable for model induction. In practice, however, this is rarely the case, and data often requires cleaning and transformation before it can be used for feature selection and knowledge discovery. In such environments, it would be costly to obtain and integrate data that is not relevant to the predicted target variable. To reduce the risk of such scenarios in practice, we often rely on experts to estimate the value of potential data based on its meta information (e.g., its description). However, as we find in this paper, experts perform abysmally at this task. We therefore developed a methodology, KrowDD, to help humans estimate how relevant a dataset might be based on such metadata. We evaluate KrowDD on three real-world problems and compare its relevancy estimates with those of data scientists and domain experts. Our findings indicate large possible cost savings when using our tool in bias-free environments, which may pave the way for lowering the cost of classifier design in practice. |
|
Patrick De Boer, Crowd process design : how to coordinate crowds to solve complex problems, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Dissertation)
 
The Internet facilitates an on-demand workforce, able to dynamically scale up and down depending on the requirements of a given project. Such crowdsourcing is increasingly used to engage workers available online. Similar to organizational design, where business processes are used to organize and coordinate employees, so-called crowd processes can be employed to facilitate work on a given problem. But as with business processes, it is unclear which crowd process performs best for a problem at hand. Aggravating the problem further, the impersonal, usually short-lived, relationship between an employer and crowd workers leads to major challenges in the organization of (crowd-) labor in general.
In this dissertation, we explore crowd process design. We start by finding a crowd process for a specific use case. We then outline a potential remedy for the more general problem of finding a crowd process for any use case.
The specific use case we focus on first is an expert task: part of the review of the statistical validity of research papers. Researchers often use statistical methods, such as t-tests or ANOVA, to evaluate hypotheses. Recently, the use of such methods has been called into question. One of the reasons is that many studies fail to check the underlying assumptions of the employed statistical methods. This poses a threat to the statistical validity of a study and hampers the reuse of results. We propose an automated approach for checking the reporting of statistical assumptions. Our crowd process identifies reported assumptions in research papers with 85% accuracy.
Finding this crowd process took us more than a year, due to the trial-and-error approach underlying current crowd process design, where in some cases a candidate crowd process was not reliable enough, in some cases it was too expensive, and in others it took too long to complete. We address this issue in a more generic manner, through the automatic recombination of crowd processes for a given problem at hand based on an extensible repository of existing crowd process fragments. The potentially large number of candidate crowd processes derived for a given problem is subjected to Auto-Experimentation in order to identify a candidate matching a user’s performance requirements. We implemented our approach as an Open Source system and called it PPLib (pronounced “People Lib”). PPLib is validated in two real-world experiments corresponding to two common crowdsourcing problems, where PPLib successfully identified crowd processes performing well for the respective problem domains.
In order to reduce the search cost for Auto-Experimentation, we then propose to use black-box optimization to identify a well-performing crowd process among a set of candidates. Specifically, we adopt Bayesian Optimization to approximate the maximum of a utility function quantifying the user’s (business-) objectives while minimizing search cost. Our approach was implemented as an extension to PPLib and validated in a simulation and three real-world experiments.
Through an effective means to generate crowd process candidates for a given problem by recombination and by reducing the entry barriers to using black-box optimization for crowd process selection, PPLib has the potential to automate the tedious trial-and-error underlying the construction of a large share of today’s crowd powered systems. Given the trends of an ever more connected future, where on-demand labor likely plays a key role, an efficient approach to organizing crowds is paramount. PPLib helps pave the way to an automated solution for this problem. |
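A stripped-down version of the black-box selection step mentioned above might look like the sketch below, where a Gaussian-process surrogate with an upper-confidence-bound rule decides which candidate crowd process to try next. The candidates, the noisy utility responses, and the acquisition rule are hypothetical stand-ins, not PPLib's actual machinery.

```python
# Hedged sketch: Bayesian-optimisation-style selection among candidate crowd
# processes. The candidates, their noisy "utility" responses, and the
# acquisition rule are hypothetical; PPLib's actual implementation differs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(42)
n_candidates = 20
true_utility = np.sin(np.linspace(0, 3, n_candidates)) + rng.normal(0, 0.05, n_candidates)

def run_crowd_experiment(candidate: int) -> float:
    """Stand-in for an expensive crowd experiment returning a noisy utility."""
    return true_utility[candidate] + rng.normal(0, 0.05)

tried = [0, n_candidates - 1]                       # seed with two experiments
observed = [run_crowd_experiment(c) for c in tried]

for _ in range(6):                                  # budget of 6 more experiments
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), observed)
    mean, std = gp.predict(np.arange(n_candidates).reshape(-1, 1), return_std=True)
    ucb = mean + 1.0 * std                          # upper-confidence-bound acquisition
    ucb[tried] = -np.inf                            # do not repeat experiments
    nxt = int(np.argmax(ucb))
    tried.append(nxt)
    observed.append(run_crowd_experiment(nxt))

best = tried[int(np.argmax(observed))]
print(f"best candidate crowd process found: #{best}")
```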
|
Cyrus Einsele, SPARQL query evaluation in Big Data processors, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
This thesis investigates the use of Big Data processors, namely Apache Beam, to extract information from Semantic Web data, specifically RDF. We present a Java application that can match SPARQL basic graph patterns within a stream of triples. It does so using the open-source Big Data processing framework Apache Beam, which is largely based upon Google Cloud Dataflow. We describe the principles of the algorithm used to achieve this goal and evaluate the correctness and performance of the application by comparing it to a traditional SPARQL query execution engine, namely Apache Jena ARQ. |
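A toy version of the matching step (plain Python rather than the thesis's Apache Beam pipeline) is sketched below: a basic graph pattern is a conjunction of triple patterns with variables, and matching it against a batch of triples yields variable bindings. The variable syntax and the sample triples are assumptions.

```python
# Hedged sketch: matching a SPARQL basic graph pattern (a conjunction of triple
# patterns with variables) against a batch of triples. This is a plain-Python
# toy, not the Apache Beam pipeline described in the thesis.
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` if `triple` matches `pattern` (terms starting with '?' are variables)."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if result.get(p_term, t_term) != t_term:
                return None
            result[p_term] = t_term
        elif p_term != t_term:
            return None
    return result

def match_bgp(bgp: List[Triple], triples: List[Triple]) -> List[Binding]:
    bindings: List[Binding] = [{}]
    for pattern in bgp:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings

# Usage on a tiny hypothetical triple batch:
triples = [(":alice", ":knows", ":bob"), (":bob", ":knows", ":carol")]
bgp = [("?x", ":knows", "?y"), ("?y", ":knows", "?z")]
print(match_bgp(bgp, triples))   # [{'?x': ':alice', '?y': ':bob', '?z': ':carol'}]
```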
|
Christoph Weber, A platform to integrate heterogeneous data, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
Heterogeneity is a very common and complex problem in information systems. Gathering data from different data sources leads to heterogeneity at the system, format, schema, and semantic levels. The PigData Project aims to foster the adoption of Big Data methods in the Swiss swine and pork production industry. Many actors in this supply chain collect data with overlapping targets but differing views and representations. To integrate this data for Big Data uses, this heterogeneity must be overcome. We elicit the stakeholder requirements for a data integration platform for the PigData Project and apply these requirements to a prototype. The scope of this data integration platform covers data upload over the web, the integration of data into a common representation, and the ability for data analysts to execute queries on the integrated data. |
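A minimal sketch of the "common representation" idea follows: records from two hypothetical, differently structured sources are mapped into one shared schema so they can be queried together. The field names and the target schema are illustrative assumptions, not the PigData schema.

```python
# Hedged sketch: mapping records from two hypothetical, heterogeneous sources
# into one common representation so they can be queried together. Field names
# and the target schema are illustrative assumptions, not the PigData schema.
from typing import Dict, List

# Source A reports weight in kilograms, source B in grams with different keys.
source_a = [{"AnimalID": "A1", "WeightKg": 95.0, "Farm": "F1"}]
source_b = [{"id": "B7", "weight_g": 101500, "producer": "F2"}]

def from_source_a(rec: Dict) -> Dict:
    return {"animal_id": rec["AnimalID"], "weight_kg": rec["WeightKg"], "farm": rec["Farm"]}

def from_source_b(rec: Dict) -> Dict:
    return {"animal_id": rec["id"], "weight_kg": rec["weight_g"] / 1000, "farm": rec["producer"]}

integrated: List[Dict] = [from_source_a(r) for r in source_a] + \
                         [from_source_b(r) for r in source_b]

# A simple query over the common representation:
heavy = [r for r in integrated if r["weight_kg"] > 100]
print(heavy)
```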
|
Bibek Paudel, Thilo Haas, Abraham Bernstein, Fewer Flops at the Top: Accuracy, Diversity, and Regularization in Two-Class Collaborative Filtering, In: 11th ACM Conference on Recommender Systems RecSys 2017, ACM Press, New York, NY, USA, 2017-08-27. (Conference or Workshop Paper published in Proceedings)
 
In most existing recommender systems, implicit or explicit interactions are treated as positive links and all unknown interactions are treated as negative links. The goal is to suggest new links that will be perceived as positive by users. However, as signed social networks and newer content services become common, it is important to distinguish between positive and negative preferences. Even in existing applications, the cost of a negative recommendation could be high when people are looking for new jobs, friends, or places to live.
In this work, we develop novel probabilistic latent factor models to recommend positive links and compare them with existing methods on five different openly available datasets. Our models produce better ranking lists and are effective in the task of ranking positive links at the top, with fewer negative links (flops). Moreover, we find that modeling signed social networks and user preferences in this way has the advantage of increasing the diversity of recommendations. We also investigate the effect of regularization on the quality of recommendations, a matter that has not received enough attention in the literature. We find that the regularization parameter heavily affects the quality of recommendations in terms of both accuracy and diversity. |
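For illustration, the sketch below trains a simple logistic latent-factor model on a signed interaction matrix with plain gradient descent; it conveys the general idea of two-class collaborative filtering with L2 regularization, not the paper's specific models or datasets.

```python
# Hedged sketch: a logistic latent-factor model over signed (positive/negative)
# interactions, trained with plain gradient descent. It illustrates the general
# idea of two-class collaborative filtering, not the paper's specific models.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 30, 40, 5
# Observed signed interactions: +1 (liked), -1 (disliked), 0 (unknown).
signs = rng.choice([-1, 0, 1], size=(n_users, n_items), p=[0.1, 0.8, 0.1])

U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_items, k))
lr, reg = 0.05, 0.1                     # learning rate and L2 regularization strength
observed = signs != 0
targets = (signs > 0).astype(float)     # 1 for positive, 0 for negative links

for epoch in range(200):
    scores = U @ V.T
    probs = 1.0 / (1.0 + np.exp(-scores))
    grad = np.where(observed, probs - targets, 0.0)   # logistic loss gradient
    U -= lr * (grad @ V + reg * U)
    V -= lr * (grad.T @ U + reg * V)

# Rank unobserved items for user 0; higher score = more likely a positive link.
user0_scores = U[0] @ V.T
candidates = np.argsort(-user0_scores)
print("top suggestions for user 0:", [i for i in candidates if signs[0, i] == 0][:5])
```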
|
Nicola Staub, Revealing the inherent variability in data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
 
The variation in data analysis has recently been recognized as one of the major reasons for the reproducibility crisis in science. Many scientific findings have been shown to be statistically significant, which, however, is not necessarily an indication that the results are indeed meaningful. Many factors play a role in the analytical choices a data analyst makes during an analysis. The goal of this thesis is to find potential factors that can explain the variability in data analysis. With a platform designed specifically for this thesis, rationales for different analytical choices along the path of a data analysis were elicited. The result of this thesis is a system of factors that allows for examining data analysis workflows at different levels of depth. |
|
Michael Feldman, Frida Juldaschewa, Abraham Bernstein, Data Analytics on Online Labor Markets: Opportunities and Challenges, In: arXiv.org, No. 1707.01790, 2017. (Working Paper)
 
The data-driven economy has led to a significant shortage of data scientists. To address this shortage, this study explores the prospects of outsourcing data analysis tasks to freelancers available on online labor markets (OLMs) by identifying the essential factors for this endeavor. Specifically, we explore the skills required from freelancers, collect information about the skills present on major OLMs, and identify the main hurdles for out-/crowd-sourcing data analysis. Adopting a sequential mixed-method approach, we interviewed 20 data scientists and subsequently surveyed 80 respondents from OLMs. Besides confirming the need for expected skills such as technical/mathematical capabilities, the study also identifies lesser-known ones such as domain understanding, an eye for aesthetic data visualization, good communication skills, and a natural understanding of the possibilities and limitations of data analysis in general. Finally, it elucidates obstacles to crowdsourcing, such as communication overhead, knowledge gaps, quality assurance, and data confidentiality, which need to be mitigated. |
|
Patrick Muntwyler, Increasing the number of open data streams on the Web, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
Open Government Data refers to data from governments that is available to everyone and can often be used for any purpose. Linked Data was invented to increase the value of data published on the Web. TripleWave is a framework that applies the concept of Linked Data to streaming data. However, there has been no prototype showing how TripleWave can be used to publish Open Government Data as Linked Data streams.
In the context of this Bachelor's thesis, we increase the number of open data streams on the Web. We examine the available Open Government Data portals. We then develop an application that fetches and transforms several suitable Open Government Data sets and finally publishes them as Linked Data streams on the Web. |
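As a hedged sketch of the transformation step, the snippet below turns a hypothetical Open Government Data record into a small set of timestamped triples, roughly the shape of data a TripleWave-style stream would publish. The property URIs and the sample record are assumptions.

```python
# Hedged sketch: turning a (hypothetical) Open Government Data record into a
# small set of timestamped triples, the shape of data a TripleWave-style stream
# would publish. Property URIs and the sample record are assumptions.
import json
import time

sample_record = json.loads("""
{"station": "Zurich-Nord", "pm10": 18.2, "measured_at": "2017-06-01T10:00:00Z"}
""")

def record_to_triples(rec: dict):
    subject = f"http://example.org/observation/{rec['station']}/{rec['measured_at']}"
    yield (subject, "http://example.org/ontology#station", rec["station"])
    yield (subject, "http://example.org/ontology#pm10", str(rec["pm10"]))
    yield (subject, "http://example.org/ontology#measuredAt", rec["measured_at"])

stream_element = {"timestamp": time.time(),
                  "triples": list(record_to_triples(sample_record))}
print(json.dumps(stream_element, indent=2))
```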
|
Johannes Schneider, Abraham Bernstein, Jan vom Brocke, Kostadin Damevski, David C Shepherd, Detecting Plagiarism based on the Creation Process, IEEE Transactions on Learning Technologies, Vol. 11 (3), 2017. (Journal Article)
 
All methodologies for detecting plagiarism to date have focused on the final digital “outcome”, such as a document or source code. Our novel approach takes the creation process into account, using logged events collected by special software or by the macro recorders found in most office applications. We look at an author's interaction logs with the software used to create the work. Detection relies on comparing the histograms of command use across multiple logs. A work is classified as plagiarism if its log deviates too much from the logs of “honestly created” works or if its log is too similar to another log. The technique supports the detection of plagiarism for digital outcomes that stem from unique tasks, such as theses, as well as from equal tasks, such as assignments for which the same problem sets are solved by multiple students. Focusing on the latter case, we evaluate this approach using logs collected by an interactive development environment (IDE) from more than 60 students who completed three programming assignments. |
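The core detection idea, comparing histograms of command use across logs, can be illustrated with the sketch below; the commands, the similarity measure, and the flagging threshold are assumptions rather than the paper's calibrated detector.

```python
# Hedged sketch: comparing histograms of IDE command use across authors' logs.
# The commands, threshold, and similarity measure are illustrative assumptions.
from collections import Counter
from math import sqrt

logs = {
    "student_a": ["edit", "edit", "run", "edit", "debug", "run", "edit"],
    "student_b": ["paste", "paste", "run", "paste", "run"],
    "student_c": ["paste", "paste", "run", "paste", "run"],   # suspiciously similar to b
}

def histogram(events):
    counts = Counter(events)
    total = sum(counts.values())
    return {cmd: n / total for cmd, n in counts.items()}

def cosine(h1, h2):
    keys = set(h1) | set(h2)
    dot = sum(h1.get(k, 0) * h2.get(k, 0) for k in keys)
    norm = sqrt(sum(v * v for v in h1.values())) * sqrt(sum(v * v for v in h2.values()))
    return dot / norm

hists = {name: histogram(events) for name, events in logs.items()}
names = list(hists)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine(hists[a], hists[b])
        if sim > 0.95:                       # assumed "too similar" threshold
            print(f"flag: {a} and {b} have near-identical command histograms ({sim:.2f})")
```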
|
Shima Zahmatkesh, Emanuele Della Valle, Daniele Dell'Aglio, Using Rank Aggregation in Continuously Answering SPARQL Queries on Streaming and Quasi-static Linked Data, In: Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems, DEBS 2017, ACM, New York, NY, USA, 2017-06-19. (Conference or Workshop Paper published in Proceedings)
 
|
|