Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik, Elena Simperl, Evangelos Kanoulas, Lora Aroyo, Studying topical relevance with evidence-based crowdsourcing, In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018. (Conference or Workshop Paper published in Proceedings)
|
|
Proceedings of the 1st Workshop SAD and CrowdBias co-located with HCOMP 2018, Zurich, Switzerland, July 5, 2018, Edited by: Lora Aroyo, Anca Dumitrache, Praveen Paritosh, Alexander J. Quinn, Chris Welty, Alessandro Checco, Gianluca Demartini, Ujwal Gadiraju, Cristina Sarasua, CEUR-WS.org, Zurich, Switzerland, 2018. (Proceedings)
|
|
Bibek Paudel, Improving recommendation diversity and identifying cultural biases for personalized ranking in large networks, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
Personalized ranking and filtering algorithms, also known as recommender systems, form the backbone of many modern web applications. They are used to tailor and rank suggestions for users in search engines, e-commerce sites, social networks, and news aggregators. As such systems gain prevalence in people’s day-to-day lives, they also affect people’s behavior in several ways.
Among the several concerns regarding these systems, the diversity of the choices they offer to users is one of the most important. Exposure to diverse items is considered important for many reasons: it improves the user experience by adding richness, novelty, and variety; it reduces polarization; and it can improve political participation through exposure to diverse viewpoints. It is therefore important to investigate ways to make recommender algorithms serve more diverse content. In this thesis, we present three new recommender algorithms for increasing the diversity of suggestions. We also present a new method to detect biases in knowledge bases, which are often used as an input data source by recommender systems.
The first algorithm uses a local exploration of the user-item feedback graph to increase the long-tail diversity of items. Long-tail items form the bulk of many product catalogs, but compared to the few popular items that dominate recommendation lists, they are rarely recommended. Our random-walk-based method of promoting such long-tail items results in both more accurate and more diverse recommendations. In the second algorithm, we use a probabilistic latent-factor model to differentiate between positive and negative items in recommender systems. We find that state-of-the-art algorithms not only place more negative items at the top of their recommendations, but also have low diversity and coverage. The recommendations produced by our approach place fewer negative items at the top and are also more diverse. In the third strategy, we look into the problem of diversifying political content recommendation. We collected data from the popular social network Twitter and created datasets that can be used to study political content recommendations. Based on these datasets, we first develop a new method to identify the ideological positions not just of users and political elites, but also of web content. We then use the identified ideological positions to diversify recommendations according to diversification strategies that can be specified by the service provider. Our method is able to correctly identify political ideologies and to diversify recommendations of political content. Finally, since knowledge bases are used as input in many systems, including recommender algorithms, we investigate them for the presence of human-like biases related to gender and race. We develop a new method based on cultural dimensions that can identify such biases in knowledge bases. Using our approach, it is possible to develop methods that learn unbiased representations from knowledge bases, which can then be used by recommender algorithms.
With our work, we present new ways to diversify and de-bias the output of recommender systems and we hope this will enable them to better serve the diverse needs of our societies. |
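The random-walk idea behind the first algorithm can be illustrated with a minimal sketch: a random walk with restart over the user-item feedback graph, where long-tail items reachable through niche co-consumption accumulate visit counts that a pure popularity ranking would never give them. All names, parameters, and the toy data below are hypothetical illustrations, not the thesis's actual implementation.

```python
import random
from collections import defaultdict

def random_walk_recommend(interactions, user, walk_length=10000, restart_prob=0.15, seed=0):
    """Score items for a user by random walks with restart on the
    user-item bipartite graph; return unseen items ranked by visits."""
    rng = random.Random(seed)
    user_items, item_users = defaultdict(set), defaultdict(set)
    for u, i in interactions:
        user_items[u].add(i)
        item_users[i].add(u)
    visits = defaultdict(int)
    current, at_user = user, True
    for _ in range(walk_length):
        if rng.random() < restart_prob:
            current, at_user = user, True  # restart keeps walks local
            continue
        if at_user:
            current = rng.choice(sorted(user_items[current]))
            visits[current] += 1
        else:
            current = rng.choice(sorted(item_users[current]))
        at_user = not at_user
    seen = user_items[user]
    return sorted((i for i in visits if i not in seen), key=lambda i: -visits[i])

# toy graph: u1 shares item "a" with u2, who also likes a long-tail item
data = [("u1", "a"), ("u2", "a"), ("u2", "tail")]
print(random_walk_recommend(data, "u1"))  # -> ['tail']
```

The restart probability controls locality: higher values keep the walk near the target user's neighbourhood, lower values let it diffuse further into the long tail.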
|
Silvio Frankhauser, Detecting and Mitigating Social Biases in Knowledge Bases, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Master's Thesis)
This master's thesis investigates social biases in knowledge bases. We examine, among other things, different professions and their association with a person's gender, race, or region. We present three methods to detect such biases. The differences between individual regions, and the varying distributions of professions across genders, are significant. With experiments on two large and widely used knowledge bases, we demonstrate the different kinds of biases they can contain. The purpose of this work is to raise awareness that these social biases can affect the usage of such databases, given that mitigating them is not a trivial task. |
|
Roland Schläfli, Analysis of Weather Data using Graph-based and Neural Network Methods, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Each year, the Indian Summer Monsoon affects more than one billion people, making clear the importance of accurate statistical analysis of its behavior. In this work, we analyze the spatial distribution of extreme monsoon rainfall and propose a new way of predicting monsoon onset dates. We build networks of correlated locations on the Indian subcontinent, analyzing them with established centrality measures. These measures reveal the relative importance of locations like the Indian Ocean, the Tibetan Plateau, and Northern Pakistan. We additionally adopt recent advances in the area of neural networks to predict monsoon onset dates based on spatiotemporal meteorological datasets. With experiments on these datasets, we show that our model is able to predict onset dates more accurately than existing methods several days in advance. |
|
Ausgezeichnete Informatikdissertationen 2017, Edited by: Steffen Hölldobler, Abraham Bernstein, et al, Gesellschaft für Informatik, Bonn, 2018. (Edited Scientific Work)
|
|
Tobias Grubenmann, Monetization strategies for the Web of Data, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
|
|
Lukas Vollenweider, Topic Extraction and Visualisation of Digitalisation Related Research from ZORA, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Due to the rapid increase of available documents on the Internet, methods are needed that can summarize the content of the data without requiring one to read it. Such methods, called topic models, already exist but tend to work well only for large documents. This work analyses current state-of-the-art topic models and presents several of our own context-sensitive approaches on a restricted data set built from abstracts. The best results are then visualised to improve the interpretability of the data. |
|
Shen Gao, Efficient Processing and Reasoning of Semantic Streams, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Dissertation)
The digitalization of our society creates a large number of data streams, such as stock tickers, tweets, and sensor data. Making use of these streams has tremendous value. In the Semantic Web context, live information is queried from the streams in real time. Knowledge is discovered by integrating streams with data from heterogeneous sources. Moreover, insights hidden in the streams are inferred and extracted by logical reasoning.
Handling large and complex streams in real-time challenges the capabilities of current systems. Therefore, this thesis studies how to improve the efficiency of processing and reasoning over semantic streams. It is composed of three projects that deal with different research problems motivated by real-world use cases. We propose new methods to address these problems and implement systems to test our hypotheses based on real datasets.
The first project focuses on the problem that sudden increases in the input stream rate overload the system, causing a reduced or unacceptable performance. We propose an eviction technique that, when a spike in the input data rate happens, discards data from the system to ensure the response latency at the cost of a lower recall. The novelty of our solution lies in a data-aware approach that carefully prioritizes the data and evicts the less important ones to achieve a high result recall.
The second project studies complex queries that need to integrate streams with remote and external background data (BGD). Accessing remote BGD is a very expensive process in terms of both latency and financial cost. We propose several methods to minimize the cost by exploiting the query and the data patterns. Our system only needs to retrieve data that are more critical to answer the query and avoids wasting resources on the remaining data in BGD.
Lastly, as noise is inevitable in real-world semantic streams, the third project investigates how to use logical reasoning to identify and exclude the noise from high-volume streams. We adopt a distributed stream processing engine (DSPE) to achieve scalability. On top of a DSPE, we optimize the reasoning procedures by balancing the costs of computation and communication. Therefore, reasoning tasks are compiled into efficient DSPE workflows that can be deployed across large-scale computing clusters. |
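The eviction idea in the first project can be sketched as a bounded buffer that sheds the least important data when an input spike exceeds capacity; this is a minimal, hypothetical illustration of data-aware load shedding, not the system described in the thesis.

```python
import heapq

class DataAwareBuffer:
    """Bounded buffer: when a rate spike pushes it past capacity, the
    items with the lowest importance score are evicted first, trading
    result recall for bounded response latency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap keyed by (priority, seq); ties shed FIFO
        self._seq = 0

    def push(self, item, priority):
        """Insert an item; return the list of evicted (shed) items."""
        heapq.heappush(self._heap, (priority, self._seq, item))
        self._seq += 1
        evicted = []
        while len(self._heap) > self.capacity:
            evicted.append(heapq.heappop(self._heap)[2])
        return evicted

    def contents(self):
        return [item for _, _, item in sorted(self._heap)]

buf = DataAwareBuffer(capacity=2)
buf.push("reading-1", priority=5)
buf.push("reading-2", priority=1)
print(buf.push("reading-3", priority=9))  # -> ['reading-2'] (least important)
```

The essential point is that eviction is driven by an importance score rather than arrival order, which is what lets recall stay high under load.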
|
Cristina Sarasua, Alessandro Checco, Gianluca Demartini, Djellel Difallah, Michael Feldman, Lydia Pintscher, The Evolution of Power and Standard Wikidata Editors: Comparing Editing Behavior over Time to Predict Lifespan and Volume of Edits, Journal of Computer Supported Cooperative Work, 2018. (Journal Article)
Knowledge bases are becoming a key asset leveraged for various types of applications on the Web, from search engines presenting 'entity cards' as the result of a query, to the use of the structured data of knowledge bases to empower virtual personal assistants. Wikidata is an open general-interest knowledge base that is collaboratively developed and maintained by a community of thousands of volunteers. One of the major challenges faced in such a crowdsourcing project is to attain a high level of editor engagement. In order to intervene and encourage editors to be more committed to editing Wikidata, it is important to be able to predict at an early stage whether or not an editor will become an engaged editor. In this paper, we investigate this problem and study the evolution that editors with different levels of engagement exhibit in their editing behaviour over time. We measure an editor's engagement in terms of (i) the volume of edits provided by the editor and (ii) their lifespan (i.e., the length of time for which an editor is present at Wikidata). The large-scale longitudinal data analysis that we perform covers Wikidata edits over almost four years. We monitor evolution on a session-by-session and month-by-month basis, observing how the participation, the volume, and the diversity of edits done by Wikidata editors change. Using the findings of our exploratory analysis, we define and implement prediction models that use these multiple evolution indicators. |
|
Markus Christen, Sabine Müller, The ethics of expanding applications of deep brain stimulation, In: The Routledge Handbook of Neuroethics, Taylor & Francis, New York, USA, p. 51 - 65, 2018. (Book Chapter)
|
|
Michael Feldman, Adir Even, Yisrael Parmet, A Methodology for Quantifying the Effect of Missing Data on Decision Quality in Classification Problems, Communications in Statistics. Theory and Methods, Vol. 47 (11), 2018. (Journal Article)
Decision-making is often supported by decision models. This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels: data, model, and decision. The general framework is first developed at a high level, and then extended toward understanding the effect of incomplete datasets on Linear Discriminant Analysis (LDA) classifiers. The interplay between the three quality levels is evaluated analytically, initially for a one-dimensional case and then for multiple dimensions. The impact is then further analyzed through several simulative experiments with artificial and real-world datasets. The experimental results support the analytical development and reveal a nearly exponential decline in the decision error as the completeness level increases. To conclude, we discuss the framework and the empirical findings, and elaborate on the implications of our model for data quality management and for the use of data in decision-model estimation. |
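The data-model-decision chain can be illustrated with a toy simulation in the spirit of the paper's framework: estimate a one-dimensional two-class LDA boundary from an incomplete training sample, then measure the decision error on fully observed test data. All distributions and parameters below are invented for illustration, not taken from the paper.

```python
import random
import statistics

def lda_error(completeness, n=2000, seed=0):
    """Train a 1-D equal-variance LDA on a sample where each record
    survives with probability `completeness`; return test error."""
    rng = random.Random(seed)
    def draw(mu, k):
        return [rng.gauss(mu, 1.0) for _ in range(k)]
    train0, train1 = draw(0.0, n), draw(2.0, n)
    # data level: simulate missingness on the training records
    kept0 = [x for x in train0 if rng.random() < completeness] or train0[:1]
    kept1 = [x for x in train1 if rng.random() < completeness] or train1[:1]
    # model level: equal-variance LDA boundary is the midpoint of the means
    threshold = (statistics.fmean(kept0) + statistics.fmean(kept1)) / 2
    # decision level: classification error on a complete test sample
    test0, test1 = draw(0.0, n), draw(2.0, n)
    errors = sum(x >= threshold for x in test0) + sum(x < threshold for x in test1)
    return errors / (2 * n)

# the error approaches the Bayes rate of this toy setup as completeness grows
print(lda_error(0.02), lda_error(1.0))
```

With few surviving records the estimated threshold drifts away from the optimal midpoint, which is the biased-model-estimation mechanism the paper identifies.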
|
Daniele Dell'Aglio, Emanuele Della Valle, Frank van Harmelen, Abraham Bernstein, Stream reasoning: A survey and outlook : A summary of ten years of research and a vision for the next decade, Data Science, Vol. 1 (1-2), 2017. (Journal Article)
Stream reasoning studies the application of inference techniques to data characterised by being highly dynamic. It finds application in several settings, from Smart Cities to Industry 4.0, from the Internet of Things to social media analytics. This year stream reasoning turns ten, and in this article we analyse its growth. In the first part, we trace the main results obtained so far by presenting the most prominent studies. We start with an overview of the most relevant studies developed in the context of the Semantic Web, and then extend the analysis to include contributions from adjacent areas, such as databases and artificial intelligence. Looking at the past is useful to prepare for the future: in the second part, we present a set of open challenges and issues that stream reasoning will face in the near future. |
|
Jennifer Duchetta, Optimization of a Monitoring System for Preterm Infants, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
For the optimization of a monitoring system for preterm infants, two specific objectives were pursued in this thesis. First, we examined the possibility of measuring heart rate and arterial oxygenation with a NIRS device; various techniques were tested and compared. The second objective was to compare classifiers for the specific task of lowering the false alarm rate without missing any real alarms. Through analysis of the ROC curve, the 2-Nearest-Neighbor classifier proved to be the most effective. |
|
Daniele Dell'Aglio, Danh Le Phuoc, Anh Le-Tuan, Muhammed Intizar Ali, Jean-Paul Calbimonte, On a Web of Data Streams, In: ISWC2017 workshop on Decentralizing the Semantic Web, s.n., 2017-10-22. (Conference or Workshop Paper published in Proceedings)
With the growing adoption of IoT and sensor technologies, an enormous amount of data is being produced at a very rapid pace and in different application domains. This sensor data consists mostly of live data streams containing sensor observations, generated in a distributed fashion by multiple heterogeneous infrastructures with minimal or no interoperability. RDF streams emerged as a model to represent data streams, and RDF Stream Processing (RSP) refers to a set of technologies to process such data. RSP research has produced several successful results and scientific output, but in most cases the Web dimension is marginal or missing. There is also a noticeable lack of proper infrastructure to enable the exchange of RDF streams across heterogeneous RSP systems, whose features may vary from data generation to querying, and from reasoning to visualisation. This article defines a set of requirements related to the creation of a web of RDF stream processors. These requirements are then used to analyse the current state of the art and to build a novel proposal, WeSP, which addresses these concerns. |
|
Matt Dennis, Kees van Deemter, Daniele Dell'Aglio, Jeff Z Pan, Computing Authoring Tests from Competency Questions: Experimental Validation, In: 16th International Semantic Web Conference, Springer International Publishing, Cham, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
|
|
Tobias Grubenmann, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Challenges of source selection in the WoD, In: ISWC 2017 - The 16th International Semantic Web Conference, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
Federated querying, the idea to execute queries over several distributed knowledge bases, lies at the core of the semantic web vision. To accommodate this vision, SPARQL provides the SERVICE keyword that allows one to allocate sub-queries to servers. In many cases, however, data may be available from multiple sources resulting in a combinatorially growing number of alternative allocations of subqueries to sources.
Running a federated query on all possible sources might not be very lucrative from a user's point of view if extensive execution times or fees are involved in accessing the sources' data. To address this shortcoming, federated join-cardinality approximation techniques have been proposed to narrow down the number of possible allocations to a few most promising (or results-yielding) ones.
In this paper, we analyze the usefulness of cardinality approximation for source selection. We empirically compare both the runtime and accuracy of Bloom filters and elaborate on their suitability and limitations for different kinds of queries. As we show, the performance of cardinality approximations of federated SPARQL queries degenerates when applied to queries with multiple joins of low selectivity. We generalize our results analytically to any estimation technique exhibiting false positives.
These findings argue for a renewed effort to find novel join-cardinality approximation techniques or a change of paradigm in query execution to settings, where such estimations play a less important role. |
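Why false positives matter here can be seen in a minimal sketch: summarize one source's join keys in a Bloom filter and estimate join cardinality by probing it with the other source's keys. Since a Bloom filter has no false negatives, every true match is counted, but every false positive inflates the estimate. The class below is a generic illustration, not the paper's implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over string keys (illustrative sizes)."""
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit vector stored as a Python int

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

def estimate_join_cardinality(left_keys, right_filter):
    # count left keys that *might* join; false positives can only
    # overestimate, never underestimate, the true join size
    return sum(right_filter.might_contain(k) for k in left_keys)
```

Chaining several such estimates for a multi-join query multiplies the per-filter overestimation, which is one intuition for why accuracy degenerates on queries with multiple low-selectivity joins.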
|
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitry Moor, Sven Seuken, Decentralizing the Semantic Web: Who will pay to realize it?, In: ISWC2017 workshop on Decentralizing the Semantic Web, OpenReview, 2017-10-20. (Conference or Workshop Paper published in Proceedings)
Fueled by the enthusiasm of volunteers, government subsidies, and open data legislation, the Web of Data (WoD) has enjoyed phenomenal growth. Commercial data, however, has been stuck in proprietary silos, as the monetization strategy for sharing data in the WoD is unclear. This is in contrast to the traditional web, where advertisement fueled much of the growth. This raises the question of how the WoD can (i) maintain its success when government subsidies disappear and (ii) convince commercial entities to share their wealth of data.
In this talk based on a paper, we propose a marketplace for decentralized data following basic WoD principles. Our approach allows a customer to buy data from different, decentralized providers in a transparent way. As such, our marketplace presents a first step towards an economically viable WoD beyond subsidies. |
|
Patrick De Boer, Marcel C. Bühler, Abraham Bernstein, Expert estimates for feature relevance are imperfect, In: DSAA2017 - The 4th IEEE International Conference on Data Science and Advanced Analytics, Tokyo, 2017. (Conference or Workshop Paper published in Proceedings)
An early step in the knowledge discovery process is deciding on what data to look at when trying to predict a given target variable. Most of KDD so far is focused on the workflow after data has been obtained, or settings where data is readily available and easily integrable for model induction. However, in practice, this is rarely the case, and many times data requires cleaning and transformation before it can be used for feature selection and knowledge discovery. In such environments, it would be costly to obtain and integrate data that is not relevant to the predicted target variable. To reduce the risk of such scenarios in practice, we often rely on experts to estimate the value of potential data based on its meta information (e.g. its description). However, as we will find in this paper, experts perform abysmally at this task. We therefore developed a methodology, KrowDD, to help humans estimate how relevant a dataset might be based on such meta data. We evaluate KrowDD on 3 real-world problems and compare its relevancy estimates with data scientists’ and domain experts’. Our findings indicate large possible cost savings when using our tool in bias-free environments, which may pave the way for lowering the cost of classifier design in practice. |
|
Patrick De Boer, Crowd process design : how to coordinate crowds to solve complex problems, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Dissertation)
The Internet facilitates an on-demand workforce, able to dynamically scale up and down depending on the requirements of a given project. Such crowdsourcing is increasingly used to engage workers available online. Similar to organizational design, where business processes are used to organize and coordinate employees, so-called crowd processes can be employed to facilitate work on a given problem. But as with business processes, it is unclear which crowd process performs best for a problem at hand. Aggravating the problem further, the impersonal, usually short-lived, relationship between an employer and crowd workers leads to major challenges in the organization of (crowd-) labor in general.
In this dissertation, we explore crowd process design. We start by finding a crowd process for a specific use case. We then outline a potential remedy for the more general problem of finding a crowd process for any use case.
The specific use case we focus on first is an expert task: part of reviewing the statistical validity of research papers. Researchers often use statistical methods, such as the t-test or ANOVA, to evaluate hypotheses. Recently, the use of such methods has been called into question. One of the reasons is that many studies fail to check the underlying assumptions of the employed statistical methods. This threatens the statistical validity of a study and hampers the reuse of its results. We propose an automated approach for checking the reporting of statistical assumptions. Our crowd process identifies reported assumptions in research papers, achieving 85% accuracy.
Finding this crowd process took us more than a year, due to the trial-and-error approach underlying current crowd process design, where in some cases a candidate crowd process was not reliable enough, in some cases it was too expensive, and in others it took too long to complete. We address this issue in a more generic manner, through the automatic recombination of crowd processes for a given problem at hand based on an extensible repository of existing crowd process fragments. The potentially large number of candidate crowd processes derived for a given problem is subjected to Auto-Experimentation in order to identify a candidate matching a user’s performance requirements. We implemented our approach as an Open Source system and called it PPLib (pronounced “People Lib”). PPLib is validated in two real-world experiments corresponding to two common crowdsourcing problems, where PPLib successfully identified crowd processes performing well for the respective problem domains.
In order to reduce the search cost for Auto-Experimentation, we then propose to use black-box optimization to identify a well-performing crowd process among a set of candidates. Specifically, we adopt Bayesian Optimization to approximate the maximum of a utility function quantifying the user’s (business-) objectives while minimizing search cost. Our approach was implemented as an extension to PPLib and validated in a simulation and three real-world experiments.
Through an effective means to generate crowd process candidates for a given problem by recombination and by reducing the entry barriers to using black-box optimization for crowd process selection, PPLib has the potential to automate the tedious trial-and-error underlying the construction of a large share of today’s crowd powered systems. Given the trends of an ever more connected future, where on-demand labor likely plays a key role, an efficient approach to organizing crowds is paramount. PPLib helps pave the way to an automated solution for this problem. |
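The budget-allocation idea behind the search over candidate crowd processes can be conveyed with a much-simplified stand-in: a UCB1 bandit that spends a fixed experimentation budget across candidates and returns the one with the best observed mean utility. The thesis uses Bayesian optimization; this sketch, with invented candidate names and utilities, only illustrates the trade-off between exploring candidates and exploiting the best one so far.

```python
import math
import random

def select_crowd_process(candidates, run_trial, budget=60, seed=0):
    """Allocate `budget` trials among candidate crowd processes using
    UCB1, then return the candidate with the best mean observed utility."""
    rng = random.Random(seed)
    counts = {c: 0 for c in candidates}
    sums = {c: 0.0 for c in candidates}
    for t in range(1, budget + 1):
        def ucb(c):
            if counts[c] == 0:
                return float("inf")  # try every candidate at least once
            return sums[c] / counts[c] + math.sqrt(2 * math.log(t) / counts[c])
        choice = max(candidates, key=ucb)
        sums[choice] += run_trial(choice, rng)
        counts[choice] += 1
    return max(candidates, key=lambda c: sums[c] / counts[c])

# hypothetical utilities: process "B" is best, observations are noisy
quality = {"A": 0.55, "B": 0.75, "C": 0.40}
noisy = lambda c, rng: quality[c] + rng.gauss(0, 0.05)
print(select_crowd_process(list(quality), noisy))
```

In the real setting each trial is an expensive crowd experiment, which is exactly why the thesis replaces naive exhaustive Auto-Experimentation with a sample-efficient black-box search.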
|