Jennifer Duchetta, Optimization of a Monitoring System for Preterm Infants, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
 
For the optimization of a monitoring system for preterm infants, two specific objectives were pursued in this thesis. The first was to examine the possibility of measuring heart rate and arterial oxygenation with a NIRS device; various techniques were tested and compared. The second was to compare classifiers for the specific task of lowering the false alarm rate without missing any real alarms. Based on an analysis of the ROC curve, the 2-Nearest Neighbor proved to be the most effective classifier. |
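As a purely illustrative sketch of the kind of classifier comparison the thesis describes (selecting a k-Nearest Neighbor model via ROC analysis), the snippet below uses scikit-learn on synthetic data; the data, class weights, and values of k are hypothetical and not taken from the thesis.

```python
# Illustrative sketch only: compares k-NN classifiers by ROC AUC on synthetic
# alarm data, mirroring the kind of analysis described in the abstract.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical stand-in for labeled alarm data (1 = real alarm, 0 = false alarm).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 2, 3, 5, 7):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    print(f"{k}-NN  ROC AUC = {roc_auc_score(y_test, scores):.3f}")
```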
|
Daniele Dell'Aglio, Danh Le Phuoc, Anh Le-Tuan, Muhammed Intizar Ali, Jean-Paul Calbimonte, On a Web of Data Streams, In: ISWC2017 workshop on Decentralizing the Semantic Web, s.n., 2017-10-22. (Conference or Workshop Paper published in Proceedings)
 
With the growing adoption of IoT and sensor technologies, an enormous amount of data is being produced at a very rapid pace and in different application domains. This sensor data consists mostly of live data streams containing sensor observations, generated in a distributed fashion by multiple heterogeneous infrastructures with minimal or no interoperability. RDF streams emerged as a model to represent data streams, and RDF Stream Processing (RSP) refers to a set of technologies to process such data. RSP research has produced several successful results and scientific output, but in most cases the Web dimension is marginal or missing. The lack of proper infrastructures to enable the exchange of RDF streams across heterogeneous types of RSP systems, whose features may vary from data generation to querying and from reasoning to visualisation, is also noticeable. This article defines a set of requirements related to the creation of a web of RDF stream processors. These requirements are then used to analyse the current state of the art, and to build a novel proposal, WeSP, which addresses these concerns. |
|
Matt Dennis, Kees van Deemter, Daniele Dell'Aglio, Jeff Z Pan, Computing Authoring Tests from Competency Questions: Experimental Validation, In: 16th International Semantic Web Conference, Springer International Publishing, Cham, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
 
|
|
Tobias Grubenmann, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Challenges of source selection in the WoD, In: ISWC 2017 - The 16th International Semantic Web Conference, 2017-10-21. (Conference or Workshop Paper published in Proceedings)
 
Federated querying, the idea of executing queries over several distributed knowledge bases, lies at the core of the Semantic Web vision. To accommodate this vision, SPARQL provides the SERVICE keyword, which allows one to allocate sub-queries to servers. In many cases, however, data may be available from multiple sources, resulting in a combinatorially growing number of alternative allocations of sub-queries to sources.
Running a federated query on all possible sources might not be very lucrative from a user's point of view if extensive execution times or fees are involved in accessing the sources' data. To address this shortcoming, federated join-cardinality approximation techniques have been proposed to narrow down the number of possible allocations to a few most promising (or results-yielding) ones.
In this paper, we analyze the usefulness of cardinality approximation for source selection. We empirically compare both the runtime and accuracy of Bloom Filters and elaborate on their suitability and limitations for different kinds of queries. As we show, the performance of cardinality approximations of federated SPARQL queries degrades when applied to queries with multiple joins of low selectivity. We generalize our results analytically to any estimation technique exhibiting false positives.
These findings argue for a renewed effort to find novel join-cardinality approximation techniques or a change of paradigm in query execution to settings, where such estimations play a less important role. |
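The effect described above can be made concrete with a small sketch: a deliberately simple Bloom filter built over the join keys of one source is used to estimate the join cardinality with a second source, and its false positives inflate the estimate. All keys and parameters below are synthetic; this is not the evaluated implementation.

```python
# Illustrative sketch: a tiny Bloom filter shows how false positives inflate
# join-cardinality estimates between two (synthetic) sources.
import hashlib

class BloomFilter:
    def __init__(self, size=2048, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

source_a = {f"entity{i}" for i in range(0, 5000, 2)}   # hypothetical join keys
source_b = {f"entity{i}" for i in range(0, 5000, 3)}

bf = BloomFilter()
for key in source_a:
    bf.add(key)

estimated = sum(1 for key in source_b if bf.might_contain(key))
true_size = len(source_a & source_b)
print(f"true join cardinality: {true_size}, Bloom-filter estimate: {estimated}")
```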
|
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitry Moor, Sven Seuken, Decentralizing the Semantic Web: Who will pay to realize it?, In: ISWC2017 workshop on Decentralizing the Semantic Web, OpenReview, 2017-10-20. (Conference or Workshop Paper published in Proceedings)
 
Fueled by the enthusiasm of volunteers, government subsidies, and open data legislation, the Web of Data (WoD) has enjoyed phenomenal growth. Commercial data, however, has been stuck in proprietary silos, as the monetization strategy for sharing data in the WoD is unclear. This is in contrast to the traditional Web, where advertising fueled much of the growth. This raises the question of how the WoD can (i) maintain its success when government subsidies disappear and (ii) convince commercial entities to share their wealth of data.
In this talk, which is based on a paper, we propose a marketplace for decentralized data following basic WoD principles. Our approach allows a customer to buy data from different, decentralized providers in a transparent way. As such, our marketplace presents a first step towards an economically viable WoD beyond subsidies. |
|
Patrick De Boer, Marcel C. Bühler, Abraham Bernstein, Expert estimates for feature relevance are imperfect, In: DSAA2017 - The 4th IEEE International Conference on Data Science and Advanced Analytics, Tokyo, 2017. (Conference or Workshop Paper published in Proceedings)
 
An early step in the knowledge discovery process is deciding what data to look at when trying to predict a given target variable. Most KDD research so far has focused on the workflow after data has been obtained, or on settings where data is readily available and easily integrable for model induction. In practice, however, this is rarely the case, and data often requires cleaning and transformation before it can be used for feature selection and knowledge discovery. In such environments, it would be costly to obtain and integrate data that is not relevant to the target variable. To reduce the risk of such scenarios in practice, we often rely on experts to estimate the value of potential data based on its meta information (e.g. its description). However, as we will find in this paper, experts perform abysmally at this task. We therefore developed a methodology, KrowDD, to help humans estimate how relevant a dataset might be based on such meta data. We evaluate KrowDD on 3 real-world problems and compare its relevancy estimates with those of data scientists and domain experts. Our findings indicate large possible cost savings when using our tool in bias-free environments, which may pave the way for lowering the cost of classifier design in practice. |
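Purely as an illustration of the gap between expert estimates and data-driven relevance (this is not KrowDD), one could contrast a hypothetical expert ranking of features with a simple univariate relevance score and measure their agreement:

```python
# Illustrative only: compares a (hypothetical) expert ranking of features with a
# simple data-driven relevance score via rank correlation. Not the KrowDD method.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=1)
data_relevance = mutual_info_classif(X, y, random_state=1)

# Hypothetical expert estimates of relevance for the same six features (0-1).
expert_estimates = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])

rho, _ = spearmanr(expert_estimates, data_relevance)
print(f"rank correlation between expert and data-driven relevance: {rho:.2f}")
```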
|
Patrick De Boer, Crowd process design : how to coordinate crowds to solve complex problems, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Dissertation)
 
The Internet facilitates an on-demand workforce, able to dynamically scale up and down depending on the requirements of a given project. Such crowdsourcing is increasingly used to engage workers available online. Similar to organizational design, where business processes are used to organize and coordinate employees, so-called crowd processes can be employed to facilitate work on a given problem. But as with business processes, it is unclear which crowd process performs best for a problem at hand. Aggravating the problem further, the impersonal, usually short-lived, relationship between an employer and crowd workers leads to major challenges in the organization of (crowd-) labor in general.
In this dissertation, we explore crowd process design. We start by finding a crowd process for a specific use case. We then outline a potential remedy for the more general problem of finding a crowd process for any use case.
The specific use case we focus on first is an expert task that is part of reviewing the statistical validity of research papers. Researchers often use statistical methods, such as the t-test or ANOVA, to evaluate hypotheses. Recently, the use of such methods has been called into question. One of the reasons is that many studies fail to check the underlying assumptions of the employed statistical methods. This threatens the statistical validity of a study and hampers the reuse of results. We propose an automated approach for checking the reporting of statistical assumptions. Our crowd process identifies reported assumptions in research papers with 85% accuracy.
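To make concrete what "checking the underlying assumptions" of a method such as the t-test involves, the sketch below tests normality and equality of variances with SciPy on synthetic samples; it illustrates the assumptions themselves, not the crowd process proposed in the dissertation.

```python
# Illustration of the kind of assumptions behind a two-sample t-test
# (normality and equal variances), checked here with SciPy on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)   # hypothetical measurements
group_b = rng.normal(loc=5.5, scale=1.2, size=40)

# Assumption 1: each group is approximately normally distributed.
for name, sample in (("A", group_a), ("B", group_b)):
    stat, p = stats.shapiro(sample)
    print(f"Shapiro-Wilk group {name}: p = {p:.3f}")

# Assumption 2: the groups have equal variances.
stat, p = stats.levene(group_a, group_b)
print(f"Levene's test: p = {p:.3f}")

# Only if both assumptions hold is the classic t-test appropriate.
t, p = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t:.2f}, p = {p:.3f}")
```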
Finding this crowd process took us more than a year, due to the trial-and-error approach underlying current crowd process design, where in some cases a candidate crowd process was not reliable enough, in some cases it was too expensive, and in others it took too long to complete. We address this issue in a more generic manner, through the automatic recombination of crowd processes for a given problem at hand based on an extensible repository of existing crowd process fragments. The potentially large number of candidate crowd processes derived for a given problem is subjected to Auto-Experimentation in order to identify a candidate matching a user’s performance requirements. We implemented our approach as an Open Source system and called it PPLib (pronounced “People Lib”). PPLib is validated in two real-world experiments corresponding to two common crowdsourcing problems, where PPLib successfully identified crowd processes performing well for the respective problem domains.
In order to reduce the search cost for Auto-Experimentation, we then propose to use black-box optimization to identify a well-performing crowd process among a set of candidates. Specifically, we adopt Bayesian Optimization to approximate the maximum of a utility function quantifying the user’s (business-) objectives while minimizing search cost. Our approach was implemented as an extension to PPLib and validated in a simulation and three real-world experiments.
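A minimal, generic sketch of Bayesian Optimization with a Gaussian process surrogate and an expected-improvement acquisition function is shown below; it is not the PPLib implementation, and the one-dimensional utility function is a hypothetical stand-in for a user's (business-) objectives.

```python
# Generic Bayesian Optimization sketch (not the PPLib implementation): a GP
# surrogate plus expected improvement selects which candidate to evaluate next.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def utility(x):                      # hypothetical, expensive-to-evaluate utility
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
X_obs = np.array([[0.1], [0.9]])     # two initial evaluations
y_obs = utility(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(8):                   # small evaluation budget
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    improvement = mu - y_obs.max()
    z = improvement / np.maximum(sigma, 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, utility(x_next).ravel())

print(f"best parameter found: {X_obs[np.argmax(y_obs)][0]:.3f}")
```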
Through an effective means of generating crowd process candidates for a given problem by recombination, and by reducing the entry barriers to using black-box optimization for crowd process selection, PPLib has the potential to automate the tedious trial-and-error underlying the construction of a large share of today's crowd-powered systems. Given the trends of an ever more connected future, where on-demand labor likely plays a key role, an efficient approach to organizing crowds is paramount. PPLib helps pave the way to an automated solution for this problem. |
|
Cyrus Einsele, SPARQL query evaluation in Big Data processors, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
This thesis investigates the use of Big Data processors, namely Apache Beam, to extract information from Semantic Web data, specifically RDF. We present a Java application that can match SPARQL basic graph patterns within a stream of triples. It does so by utilizing the open-source Big Data processing framework Apache Beam, which is largely based upon Google Cloud Dataflow. We describe the principles of the algorithm used to achieve this goal and evaluate the correctness and the performance of the application by comparing it to a traditional SPARQL query execution engine, namely Apache Jena ARQ. |
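Independent of Apache Beam, the core idea of matching a basic graph pattern against a stream of triples can be illustrated with a simplified sketch: bindings are collected per triple pattern and then joined on shared variables. The triples and the two-pattern BGP below are hypothetical.

```python
# Simplified illustration of basic graph pattern matching over a triple stream
# (independent of Apache Beam): variables start with '?', and the bindings of
# all triple patterns are joined on shared variables.
def match_pattern(pattern, triple):
    """Return a variable binding if the triple matches the pattern, else None."""
    binding = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if binding.get(p_term, t_term) != t_term:
                return None
            binding[p_term] = t_term
        elif p_term != t_term:
            return None
    return binding

def join(bindings_a, bindings_b):
    """Join two sets of bindings on their shared variables."""
    return [{**a, **b} for a in bindings_a for b in bindings_b
            if all(a[k] == b[k] for k in a.keys() & b.keys())]

# Hypothetical triple stream and BGP: ?person knows ?friend . ?friend livesIn Zurich
stream = [("alice", "knows", "bob"), ("bob", "livesIn", "Zurich"),
          ("alice", "knows", "carol"), ("carol", "livesIn", "Berlin")]
bgp = [("?person", "knows", "?friend"), ("?friend", "livesIn", "Zurich")]

partial = [[b for t in stream if (b := match_pattern(p, t)) is not None] for p in bgp]
results = partial[0]
for bindings in partial[1:]:
    results = join(results, bindings)
print(results)   # [{'?person': 'alice', '?friend': 'bob'}]
```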
|
Christoph Weber, A platform to integrate heterogeneous data, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
Heterogeneity is a very common and complex problem in information systems. Gathering data from different data sources leads to data heterogeneity at the system, format, schematic, and semantic levels. The PigData Project aims to foster the adoption of Big Data methods by the Swiss swine and pork production industry. Many actors in this supply chain collect data with overlapping targets but differing views and representations. To integrate this data for Big Data use, this heterogeneity must be overcome. We elicit the stakeholder requirements for a data integration platform for the PigData Project and apply these requirements to a prototype. The scope of this data integration platform covers data upload over the Web, the integration of data into a common representation, and the ability for data analysts to execute queries on the integrated data. |
|
Vidyadhar Rao, KV Rosni, Vineet Padmanabhan, Divide and Transfer: Understanding Latent Factors for Recommendation Tasks, In: Proceedings of the 1st Workshop on Intelligent Recommender Systems by Knowledge Transfer & Learning co-located with ACM Conference on Recommender Systems (RecSys 2017), Como, Italy, 2017. (Conference or Workshop Paper published in Proceedings)

|
|
Bibek Paudel, Thilo Haas, Abraham Bernstein, Fewer Flops at the Top: Accuracy, Diversity, and Regularization in Two-Class Collaborative Filtering, In: 11th ACM Conference on Recommender Systems RecSys 2017, ACM Press, New York, NY, USA, 2017-08-27. (Conference or Workshop Paper published in Proceedings)
 
In most existing recommender systems, implicit or explicit interactions are treated as positive links and all unknown interactions are treated as negative links. The goal is to suggest new links that will be perceived as positive by users. However, as signed social networks and newer content services become common, it is important to distinguish between positive and negative preferences. Even in existing applications, the cost of a negative recommendation could be high when people are looking for new jobs, friends, or places to live.
In this work, we develop novel probabilistic latent factor models to recommend positive links and compare them with existing methods on five different openly available datasets. Our models produce better ranking lists and are effective in the task of ranking positive links at the top, with fewer negative links (flops). Moreover, we find that modeling signed social networks and user preferences this way has the advantage of increasing the diversity of recommendations. We also investigate the effect of regularization on the quality of recommendations, a matter that has not received enough attention in the literature. We find that the regularization parameter heavily affects the quality of recommendations in terms of both accuracy and diversity. |
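The abstract does not spell out the models; as a generic, hedged illustration of a latent factor model on signed (positive/negative) feedback with L2 regularization, the following small SGD sketch on synthetic data shows the basic mechanics (it is not the paper's exact formulation).

```python
# Generic latent factor sketch for signed (+1/-1) user-item feedback with L2
# regularization, trained by SGD on synthetic data. Not the paper's exact model.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim, reg, lr = 50, 80, 8, 0.05, 0.05

# Hypothetical observed signed interactions: (user, item, +1 or -1).
observations = [(rng.integers(n_users), rng.integers(n_items), rng.choice([1, -1]))
                for _ in range(2000)]

U = 0.1 * rng.standard_normal((n_users, dim))
V = 0.1 * rng.standard_normal((n_items, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(20):
    for u, i, sign in observations:
        label = 1.0 if sign > 0 else 0.0          # logistic loss on the sign
        pred = sigmoid(U[u] @ V[i])
        grad = pred - label
        u_old = U[u].copy()
        U[u] -= lr * (grad * V[i] + reg * U[u])   # L2 regularization terms
        V[i] -= lr * (grad * u_old + reg * V[i])

# Rank items for user 0: higher scores should correspond to positive links.
scores = sigmoid(U[0] @ V.T)
print("top items for user 0:", np.argsort(-scores)[:5])
```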
|
Nicola Staub, Revealing the inherent variability in data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
 
The variation in data analysis has recently been recognized as one of the major reasons for the reproducibility crisis in science. Many scientific findings are reported as statistically significant, which, however, is not necessarily an indication that the results are indeed meaningful. Many factors play a role in the analytical choices a data analyst makes during an analysis. The goal of this thesis is to identify potential factors that can explain the variability in data analysis. Using a platform designed for this thesis, rationales for the different analytical choices along the path of a data analysis were elicited. The result of this thesis is a system of factors that allows for examining data analysis workflows at different levels of depth. |
|
Michael Feldman, Frida Juldaschewa, Abraham Bernstein, Data Analytics on Online Labor Markets: Opportunities and Challenges, In: arXiv.org, No. 1707.01790, 2017. (Working Paper)
 
The data-driven economy has led to a significant shortage of data scientists. To address this shortage, this study explores the prospects of outsourcing data analysis tasks to freelancers available on online labor markets (OLMs) by identifying the essential factors for this endeavor. Specifically, we explore the skills required from freelancers, collect information about the skills present on major OLMs, and identify the main hurdles for out-/crowd-sourcing data analysis. Adopting a sequential mixed-method approach, we interviewed 20 data scientists and subsequently surveyed 80 respondents from OLMs. Besides confirming the need for expected skills such as technical/mathematical capabilities, the study also identifies lesser-known ones such as domain understanding, an eye for aesthetic data visualization, good communication skills, and a natural understanding of the possibilities and limitations of data analysis in general. Finally, it elucidates obstacles to crowdsourcing, such as communication overhead, knowledge gaps, quality assurance, and data confidentiality, which need to be mitigated. |
|
Patrick Muntwyler, Increasing the number of open data streams on the Web, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
Open Government Data refers to data from governments that is available to everyone and can often be used for any purpose. Linked Data was invented to increase the value of data published on the Web. TripleWave is a framework that applies the concept of Linked Data to streaming data. However, it lacks a prototype that shows how TripleWave can be used to publish Open Government Data as Linked Data streams.
In the context of this Bachelor's Thesis we increase the number of open data streams on the Web. We examine the available Open Government Data portals. We then develop an application that fetches and transforms several suitable Open Government Data sets and finally publishes them as Linked Data streams on the Web. |
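As an illustration of the fetch-and-transform step described above (not the actual application), a data set could be pulled from an Open Government Data API and mapped to timestamped triples for a Linked Data stream; the portal URL, namespace, and field names below are hypothetical placeholders.

```python
# Illustrative sketch of fetching an Open Government Data set and mapping it to
# timestamped triples for a Linked Data stream. The URL and field names are
# hypothetical placeholders, not the ones used in the thesis.
from datetime import datetime, timezone

import requests

PORTAL_URL = "https://example.org/ogd/api/air-quality.json"   # hypothetical
BASE = "http://example.org/resource/"                         # hypothetical namespace

def to_stream_element(record):
    subject = BASE + "observation/" + str(record["id"])
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "triples": [
            (subject, BASE + "station", record["station"]),
            (subject, BASE + "pm10", record["pm10"]),
        ],
    }

def fetch_and_transform():
    response = requests.get(PORTAL_URL, timeout=10)
    response.raise_for_status()
    for record in response.json():
        yield to_stream_element(record)   # each element would be published on the Web

if __name__ == "__main__":
    for element in fetch_and_transform():
        print(element)
```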
|
Johannes Schneider, Abraham Bernstein, Jan vom Brocke, Kostadin Damevski, David C Shepherd, Detecting Plagiarism based on the Creation Process, IEEE Transactions on Learning Technologies, Vol. 11 (3), 2017. (Journal Article)
 
All methodologies for detecting plagiarism to date have focused on the final digital “outcome”, such as a document or source code. Our novel approach takes the creation process into account, using logged events collected by special software or by the macro recorders found in most office applications. We look at an author's interaction logs with the software used to create the work. Detection relies on comparing the histograms of command use across multiple logs. A work is classified as plagiarism if its log deviates too much from the logs of “honestly created” works or if its log is too similar to another log. The technique supports the detection of plagiarism for digital outcomes that stem from unique tasks, such as theses, as well as identical tasks, such as assignments for which the same problem sets are solved by multiple students. Focusing on the latter case, we evaluate this approach using logs collected by an interactive development environment (IDE) from more than 60 students who completed three programming assignments. |
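A minimal sketch of the histogram-comparison idea follows: command-use histograms are built from interaction logs, and pairs of logs that are suspiciously similar are flagged. The logs, the choice of cosine similarity, and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of the histogram-comparison idea: build command-use histograms
# from interaction logs and flag pairs that are suspiciously similar. The logs
# and the cosine-similarity threshold are made up for illustration.
from collections import Counter
from math import sqrt

logs = {
    "student_a": ["edit", "edit", "run", "edit", "debug", "run", "edit"],
    "student_b": ["paste", "paste", "run", "paste", "run"],
    "student_c": ["edit", "edit", "run", "edit", "debug", "run", "edit"],
}

def cosine(h1, h2):
    commands = set(h1) | set(h2)
    dot = sum(h1[c] * h2[c] for c in commands)
    norm = sqrt(sum(v * v for v in h1.values())) * sqrt(sum(v * v for v in h2.values()))
    return dot / norm

hists = {name: Counter(events) for name, events in logs.items()}
names = sorted(hists)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine(hists[a], hists[b])
        flag = "suspicious" if sim > 0.95 else "ok"
        print(f"{a} vs {b}: similarity {sim:.2f} ({flag})")
```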
|
Shima Zahmatkesh, Emanuele Della Valle, Daniele Dell'Aglio, Using Rank Aggregation in Continuously Answering SPARQL Queries on Streaming and Quasi-static Linked Data, In: Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems, DEBS 2017, ACM, New York, NY, USA, 2017-06-19. (Conference or Workshop Paper published in Proceedings)
 
|
|
Christian Ineichen, Markus Christen, Hypo- and Hyperagentic Psychiatric States, Next-Generation Closed-Loop DBS, and the Question of Agency, AJOB Neuroscience, Vol. 8 (2), 2017. (Journal Article)
 
|
|
Kürsat Aydinli, SSE - An Automated Sample Size Extractor for Empirical Studies, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
This thesis describes SSE, a system for automatically retrieving the sample size from an empirical study. SSE employs a three-stage pipelined architecture. The first stage utilizes pattern matching to extract potentially relevant sentence fragments from a document. The second stage performs rule-based filtering of the matches returned by the first stage. The last and most important stage applies case-specific heuristics to return the correct sample size for the document. A strength of SSE is that it is applicable to a variety of research publications. |
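The first, pattern-matching stage can be illustrated with a simplified sketch in which a couple of regular expressions pull candidate sample-size fragments out of a text; the patterns below are illustrative and not SSE's actual rule set.

```python
# Simplified sketch of the pattern-matching stage: regular expressions extract
# candidate sample-size fragments from text. These patterns are illustrative,
# not SSE's actual rule set.
import re

text = ("We recruited 124 participants for the study. "
        "The final sample (N = 118) completed all tasks.")

patterns = [
    r"\bN\s*=\s*(\d+)",                                   # e.g. "N = 118"
    r"\b(\d+)\s+(?:participants|subjects|students)\b",    # e.g. "124 participants"
]

candidates = []
for pattern in patterns:
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        candidates.append(int(match.group(1)))

print("candidate sample sizes:", candidates)   # later stages would filter these
```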
|
David Ackermann, Predicting SPARQL Query Performance with TensorFlow, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
 
As the Semantic Web receives increasing attention, there is a challenge in managing large RDF datasets efficiently. In this thesis, we address the problem of predicting SPARQL query performance using machine learning. We build a feature vector describing the query structurally and train different machine learning models with it. We explore ways to optimize our model's performance and analyze TensorFlow deployed on the IFI cluster.
While we adopt known feature modeling, we are able to reduce the vector size and save computation time. Our approach significantly outperforms existing approaches while being more efficient. |
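A minimal, generic sketch of the kind of model described (a small feed-forward regressor in TensorFlow/Keras that maps a structural query feature vector to an execution time) is given below; the feature dimensions and targets are random stand-ins, not the thesis's actual feature model.

```python
# Generic sketch: a small feed-forward regressor in TensorFlow/Keras mapping a
# structural SPARQL feature vector to an execution time. Features and targets
# below are random stand-ins for the real data.
import numpy as np
import tensorflow as tf

n_queries, n_features = 1000, 12          # e.g. counts of triple patterns, joins, filters
X = np.random.rand(n_queries, n_features).astype("float32")
y = np.random.rand(n_queries, 1).astype("float32")   # stand-in for log runtimes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),              # predicted (log) execution time
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print("predicted runtime for one query:", model.predict(X[:1], verbose=0)[0, 0])
```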
|
Jorge Goncalves, Michael Feldman, Subingqian Hu, Vassilis Kostakos, Abraham Bernstein, Task Routing and Assignment in Crowdsourcing based on Cognitive Abilities, In: World Wide Web Conference - Web Science Track, Geneva, 2017-04-03. (Conference or Workshop Paper published in Proceedings)
 
Appropriate task routing and assignment is an important, but often overlooked, element in crowdsourcing research and practice. In this paper, we explore and evaluate a mechanism that matches crowdsourcing tasks to suitable crowd workers based on their cognitive abilities. We measure participants’ visual and fluency cognitive abilities with the well-established Kit of Factor-Referenced Cognitive Tests, and measure crowdsourcing performance with our own set of developed tasks. Our results indicate that participants’ cognitive abilities correlate well with their crowdsourcing performance. We also built two predictive models (beta and linear regression) for crowdsourcing task performance, using performance on the cognitive tests as explanatory variables. The model results suggest that it is feasible to predict crowdsourcing performance based on cognitive abilities. Finally, we discuss the benefits and challenges of leveraging workers’ cognitive abilities to improve task routing and assignment in crowdsourcing environments. |
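As an illustration of the regression-based prediction idea, the sketch below fits an ordinary linear regression of task performance on synthetic cognitive test scores; the beta regression model also used in the paper is not shown, and all data is made up.

```python
# Illustrative only: ordinary linear regression of crowdsourcing task performance
# on cognitive test scores, with synthetic data. The paper additionally uses a
# beta regression model, which is not shown here.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 80
visual_ability = rng.normal(size=n)          # hypothetical cognitive test scores
fluency_ability = rng.normal(size=n)
performance = 0.4 * visual_ability + 0.3 * fluency_ability + rng.normal(scale=0.5, size=n)

X = np.column_stack([visual_ability, fluency_ability])
model = LinearRegression().fit(X, performance)

print("coefficients:", model.coef_)
print("R^2:", model.score(X, performance))
```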
|