Cyrus Einsele, SPARQL query evaluation in Big Data processors, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
This thesis investigates the use of Big Data processors, namely Apache Beam, to extract information from Semantic Web data, specifically RDF. We present a Java application that can match SPARQL basic graph patterns within a stream of triples. It does so using the open-source Big Data processing framework Apache Beam, which is largely based on Google Cloud Dataflow. We describe the principles of the algorithm used to achieve this goal and evaluate the correctness and performance of the application by comparing it to a traditional SPARQL query execution engine, namely Apache Jena ARQ.
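As an illustration of the basic-graph-pattern matching idea, here is a minimal sketch using the Apache Beam Python SDK (not the thesis's Java implementation); the example triples, the two patterns, and the join logic are invented for illustration.

```python
import apache_beam as beam

# A triple is a (subject, predicate, object) tuple (toy data).
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("alice", "worksAt", "uzh"),
]

def match_pattern(triple, pattern):
    """Return a variable binding if the triple matches the pattern, else None."""
    binding = {}
    for value, term in zip(triple, pattern):
        if term.startswith("?"):
            binding[term] = value
        elif term != value:
            return None
    return binding

with beam.Pipeline() as p:
    triples = p | "Read" >> beam.Create(TRIPLES)

    # Pattern 1: ?x knows ?y -- key each binding by the shared join variable ?y.
    p1 = (triples
          | "Match1" >> beam.Map(match_pattern, ("?x", "knows", "?y"))
          | "Keep1" >> beam.Filter(lambda b: b is not None)
          | "Key1" >> beam.Map(lambda b: (b["?y"], b)))

    # Pattern 2: ?y knows ?z -- also keyed by ?y.
    p2 = (triples
          | "Match2" >> beam.Map(match_pattern, ("?y", "knows", "?z"))
          | "Keep2" >> beam.Filter(lambda b: b is not None)
          | "Key2" >> beam.Map(lambda b: (b["?y"], b)))

    def merge(kv):
        # Combine the partial bindings that agree on the join variable.
        _, grouped = kv
        for left in grouped["p1"]:
            for right in grouped["p2"]:
                yield {**left, **right}

    ({"p1": p1, "p2": p2}
     | "Join" >> beam.CoGroupByKey()
     | "Merge" >> beam.FlatMap(merge)
     | "Print" >> beam.Map(print))
```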
Christoph Weber, A platform to integrate heterogeneous data, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
Heterogeneity is a common and complex problem in information systems. Gathering data from different data sources leads to data heterogeneity at the system, format, schematic, and semantic level. The PigData Project aims to advance the adoption of Big Data methods by the Swiss swine and pork production industry. Many actors in this supply chain collect data with overlapping targets but differing views and representations. To integrate this data for Big Data uses, this heterogeneity must be overcome. We elicit the stakeholder requirements for a data integration platform of the PigData Project and apply these requirements to a prototype. The scope of this data integration platform covers uploading data over the web, integrating the data into a common representation, and allowing data analysts to execute queries on the integrated data.
Vidyadhar Rao, KV Rosni, Vineet Padmanabhan, Divide and Transfer: Understanding Latent Factors for Recommendation Tasks, In: Proceedings of the 1st Workshop on Intelligent Recommender Systems by Knowledge Transfer & Learning co-located with ACM Conference on Recommender Systems (RecSys 2017), Como, Italy, 2017. (Conference or Workshop Paper published in Proceedings)
Bibek Paudel, Thilo Haas, Abraham Bernstein, Fewer Flops at the Top: Accuracy, Diversity, and Regularization in Two-Class Collaborative Filtering, In: 11th ACM Conference on Recommender Systems (RecSys 2017), ACM Press, New York, NY, USA, 2017-08-27. (Conference or Workshop Paper published in Proceedings)
In most existing recommender systems, implicit or explicit interactions are treated as positive links and all unknown interactions are treated as negative links. The goal is to suggest new links that will be perceived as positive by users. However, as signed social networks and newer content services become common, it is important to distinguish between positive and negative preferences. Even in existing applications, the cost of a negative recommendation could be high when people are looking for new jobs, friends, or places to live.
In this work, we develop novel probabilistic latent factor models to recommend positive links and compare them with existing methods on five different openly available datasets. Our models produce better ranking lists and are effective at ranking positive links at the top, with fewer negative links (flops). Moreover, we find that modeling signed social networks and user preferences in this way has the advantage of increasing the diversity of recommendations. We also investigate the effect of regularization on the quality of recommendations, a matter that has not received enough attention in the literature. We find that the regularization parameter heavily affects the quality of recommendations in terms of both accuracy and diversity.
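As a rough illustration of latent-factor ranking for positive links (a generic pairwise-ranking sketch, not the paper's specific probabilistic models), the following trains user and item factors so that positively linked items score above negative or unknown ones; the toy data, dimensions, learning rate, and regularization strength are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 100, 8          # toy sizes (illustrative)
lr, reg, epochs = 0.05, 0.01, 20          # regularization affects accuracy and diversity

# Toy observations: (user, positively linked item, negative-or-unknown item).
triples = [(rng.integers(n_users), rng.integers(n_items), rng.integers(n_items))
           for _ in range(2000)]

U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_items, k))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for u, i, j in triples:
        # Score difference between the positive item i and the "flop" item j.
        x = U[u] @ (V[i] - V[j])
        g = sigmoid(-x)                   # gradient weight of the pairwise loss
        u_f = U[u].copy()
        U[u] += lr * (g * (V[i] - V[j]) - reg * U[u])
        V[i] += lr * (g * u_f - reg * V[i])
        V[j] += lr * (-g * u_f - reg * V[j])

# Rank all items for one user: higher scores should put positive links at the top.
scores = V @ U[0]
print(np.argsort(-scores)[:10])
```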
Nicola Staub, Revealing the inherent variability in data analysis, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Master's Thesis)
The variability in data analysis has recently been recognized as one of the major reasons for the reproducibility crisis in science. Many scientific findings have been shown to be statistically significant, which, however, is not necessarily an indication that the results are indeed meaningful. Many factors play a role in the analytical choices a data analyst makes during an analysis. The goal of this thesis is to find potential factors that can explain the variability in data analysis. With a platform designed specifically for this thesis, rationales for different analytical choices along the path of a data analysis were elicited. The result of this thesis is a system of factors that allows data analysis workflows to be examined at different levels of depth.
Michael Feldman, Frida Juldaschewa, Abraham Bernstein, Data Analytics on Online Labor Markets: Opportunities and Challenges, In: ArXiv.org, No. 1707.01790, 2017. (Working Paper)
The data-driven economy has led to a significant shortage of data scientists. To address this shortage, this study explores the prospects of outsourcing data analysis tasks to freelancers available on online labor markets (OLMs) by identifying the essential factors for this endeavor. Specifically, we explore the skills required from freelancers, collect information about the skills present on major OLMs, and identify the main hurdles for out-/crowd-sourcing data analysis. Adopting a sequential mixed-method approach, we interviewed 20 data scientists and subsequently surveyed 80 respondents from OLMs. Besides confirming the need for expected skills such as technical and mathematical capabilities, the study also identifies lesser-known ones such as domain understanding, an eye for aesthetic data visualization, good communication skills, and a natural understanding of the possibilities and limitations of data analysis in general. Finally, it elucidates obstacles for crowdsourcing, such as communication overhead, knowledge gaps, quality assurance, and data confidentiality, which need to be mitigated.
Patrick Muntwyler, Increasing the number of open data streams on the Web, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
Open Government Data refers to data from governments that is available to everyone and can often be used for any purpose. Linked Data was introduced to increase the value of data published on the Web. TripleWave is a framework that applies the concept of Linked Data to streaming data. However, it lacks a prototype showing how TripleWave can be used to publish Open Government Data as Linked Data streams.
In the context of this Bachelor's thesis we increase the number of open data streams on the Web. We examine the available Open Government Data portals. We then develop an application that fetches and transforms several suitable Open Government Data sets and finally publishes them as Linked Data streams on the Web.
Johannes Schneider, Abraham Bernstein, Jan vom Brocke, Kostadin Damevski, David C Shepherd, Detecting Plagiarism based on the Creation Process, IEEE Transactions on Learning Technologies, Vol. 11 (3), 2017. (Journal Article)
All methodologies for detecting plagiarism to date have focused on the final digital “outcome”, such as a document or source code. Our novel approach takes the creation process into account, using logged events collected by special software or by the macro recorders found in most office applications. We look at an author's interaction logs with the software used to create the work. Detection relies on comparing histograms of command use across multiple logs. A work is classified as plagiarism if its log deviates too much from the logs of “honestly created” works or if its log is too similar to another log. The technique supports the detection of plagiarism for digital outcomes that stem from unique tasks, such as theses, as well as from equal tasks, such as assignments in which the same problem sets are solved by multiple students. Focusing on the latter case, we evaluate this approach using logs collected by an interactive development environment (IDE) from more than 60 students who completed three programming assignments.
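A minimal sketch of the histogram-comparison idea described above (the paper's actual features, distance measure, and thresholds may differ): each log is reduced to a normalized command-use histogram, and a submission is flagged if it is either too close to another log or too far from the average of all logs.

```python
from collections import Counter
import math

def histogram(log):
    """Normalized command-use frequencies for one interaction log."""
    counts = Counter(log)
    total = sum(counts.values())
    return {cmd: c / total for cmd, c in counts.items()}

def distance(h1, h2):
    """Euclidean distance between two command histograms."""
    cmds = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(c, 0) - h2.get(c, 0)) ** 2 for c in cmds))

def flag_suspicious(logs, copy_threshold=0.05, outlier_threshold=0.6):
    """Flag logs that are too similar to another log or deviate too much
    from the average profile (thresholds are illustrative assumptions)."""
    hists = [histogram(log) for log in logs]
    flagged = set()
    # Near-identical creation processes are suspicious.
    for i in range(len(hists)):
        for j in range(i + 1, len(hists)):
            if distance(hists[i], hists[j]) < copy_threshold:
                flagged.update({i, j})
    # Deviation from the typical ("honestly created") command-use profile.
    all_cmds = {c for h in hists for c in h}
    mean = {c: sum(h.get(c, 0) for h in hists) / len(hists) for c in all_cmds}
    for i, h in enumerate(hists):
        if distance(h, mean) > outlier_threshold:
            flagged.add(i)
    return sorted(flagged)

# Toy logs: sequences of IDE commands (illustrative).
logs = [["edit", "run", "edit", "debug"], ["edit", "run", "edit", "debug"],
        ["paste", "paste", "run"]]
print(flag_suspicious(logs))   # -> [0, 1]
```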
Shima Zahmatkesh, Emanuele Della Valle, Daniele Dell'Aglio, Using Rank Aggregation in Continuously Answering SPARQL Queries on Streaming and Quasi-static Linked Data, In: Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems, DEBS 2017, ACM, New York, NY, USA, 2017-06-19. (Conference or Workshop Paper published in Proceedings)
Christian Ineichen, Markus Christen, Hypo- and Hyperagentic Psychiatric States, Next-Generation Closed-Loop DBS, and the Question of Agency, AJOB Neuroscience, Vol. 8 (2), 2017. (Journal Article)
Kürsat Aydinli, SSE - An Automated Sample Size Extractor for Empirical Studies, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
This thesis describes SSE, a system for automatically retrieving the sample size from an empirical study. SSE employs a three-stage pipelined architecture. The first stage uses pattern matching to extract potentially relevant sentence fragments from a document. The second stage is responsible for rule-based filtering of the matches returned by the first stage. The last and most important stage applies case-specific heuristics in order to return the correct sample size for the document. A key strength of SSE is that it is applicable to a wide variety of research publications.
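A rough sketch of the pipeline described above (the patterns, filter rules, and final heuristic below are illustrative assumptions, not SSE's actual rules): regular expressions extract candidate fragments mentioning participant counts, a rule-based filter drops obvious false positives, and a simple heuristic picks the reported sample size.

```python
import re

# Stage 1: pattern matching -- fragments that look like a sample-size statement.
CANDIDATE = re.compile(
    r"\b(?:n\s*=\s*(\d+)|(\d+)\s+(?:participants|subjects|respondents|students))\b",
    re.IGNORECASE)

# Stage 2: rule-based filtering -- discard fragments that typically do not
# describe the study's own sample (illustrative rules).
EXCLUDE = re.compile(r"\b(previous|prior|pilot|excluded)\b", re.IGNORECASE)

def extract_sample_size(text):
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in CANDIDATE.finditer(sentence):
            if EXCLUDE.search(sentence):
                continue
            candidates.append(int(match.group(1) or match.group(2)))
    # Stage 3 (heuristic placeholder): report the largest surviving candidate.
    return max(candidates) if candidates else None

doc = ("A pilot study with 5 participants informed the design. "
       "We recruited 63 students (n = 63) for the main experiment.")
print(extract_sample_size(doc))   # -> 63
```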
David Ackermann, Predicting SPARQL Query Performance with TensorFlow, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
As the Semantic Web receives increasing attention, managing large RDF datasets efficiently becomes a challenge. In this thesis, we address the problem of predicting SPARQL query performance using machine learning. We build a feature vector describing the structure of each query and train different machine learning models on it. We explore ways to optimize our models' performance and analyze TensorFlow deployed on the IFI cluster.
While we adopt known feature modeling, we reduce the vector size and save computation time. Our approach significantly outperforms existing approaches while being more efficient.
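A minimal sketch of the overall idea (using today's tf.keras API and a made-up structural feature vector, not the thesis's feature set or original TensorFlow code): each query is summarized by a few structural counts and a small network regresses its execution time.

```python
import numpy as np
import tensorflow as tf

# Illustrative structural features per SPARQL query:
# [#triple patterns, #joins, #filters, #optionals, has LIMIT]
X = np.array([
    [1, 0, 0, 0, 1],
    [3, 2, 1, 0, 0],
    [8, 7, 2, 1, 0],
    [2, 1, 0, 1, 1],
], dtype=np.float32)
y = np.array([0.01, 0.12, 1.9, 0.08], dtype=np.float32)  # runtimes in seconds (toy)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),           # predicted runtime
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=200, verbose=0)

print(model.predict(np.array([[4, 3, 1, 0, 0]], dtype=np.float32)))
```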
Jorge Goncalves, Michael Feldman, Subingqian Hu, Vassilis Kostakos, Abraham Bernstein, Task Routing and Assignment in Crowdsourcing based on Cognitive Abilities, In: World Wide Web Conference - Web Science Track, Geneva, 2017-04-03. (Conference or Workshop Paper published in Proceedings)
Appropriate task routing and assignment is an important, but often overlooked, element in crowdsourcing research and practice. In this paper, we explore and evaluate a mechanism for matching crowdsourcing tasks to suitable crowd workers based on their cognitive abilities. We measure participants’ visual and fluency cognitive abilities with the well-established Kit of Factor-Referenced Cognitive Tests, and measure crowdsourcing performance with a set of tasks we developed. Our results indicate that participants’ cognitive abilities correlate well with their crowdsourcing performance. We also built two predictive models (beta and linear regression) for crowdsourcing task performance, using performance on the cognitive tests as explanatory variables. The model results suggest that it is feasible to predict crowdsourcing performance based on cognitive abilities. Finally, we discuss the benefits and challenges of leveraging workers’ cognitive abilities to improve task routing and assignment in crowdsourcing environments.
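As a small illustration of the modelling step (toy data; the paper's actual variables and the beta-regression variant are not reproduced here), an ordinary least squares fit can relate cognitive test scores to crowdsourcing task performance.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
visual = rng.normal(size=n)                 # visual-ability test score (toy)
fluency = rng.normal(size=n)                # fluency test score (toy)
performance = 0.5 * visual + 0.3 * fluency + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([visual, fluency]))
model = sm.OLS(performance, X).fit()
print(model.summary())                      # coefficients indicate predictive value
```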
Alessandro Margara, Daniele Dell'Aglio, Abraham Bernstein, Break the Windows: Explicit State Management for Stream Processing Systems, In: EDBT, OpenProceedings.org, 2017-03-21. (Conference or Workshop Paper published in Proceedings)
Bill Bosshard, Exploring the Variability in Data Analysis: the case of TopCoder.com, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
Crowdsourcing has raised interest in both the scientific and the industrial community as an online distributed problem-solving model. TopCoder is one of the biggest crowdsourcing platforms, regularly hosting different types of competitions. This paper analyzes ten different data science competitions on TopCoder. Following the grounded theory method, we identified key factors leading to different results. We found a low diversity of high-quality results and investigate the reasons for it. We further discuss the influence of the competition structure on the results and suggest a less restrictive format to improve the quality of results.
Patrick De Boer, Abraham Bernstein, Efficiently identifying a well-performing crowd process for a given problem, In: 20th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2017), s.n., Portland, OR, 2017-02-25. (Conference or Workshop Paper published in Proceedings)
With the increasing popularity of crowdsourcing and crowd computing, the question of how to select a well-performing crowd process for a problem at hand is growing ever more important. Prior work cast crowd process selection as an optimization problem whose solution is the crowd process that performs best for a user’s problem. However, existing approaches require users to probabilistically model aspects of the problem, which may entail a substantial investment of time and may be error-prone. We propose to use black-box optimization instead, a family of techniques that does not require probabilistic modelling by the end user. Specifically, we adopt Bayesian Optimization to approximate the maximum of a utility function quantifying the user’s (business) objectives while minimizing search cost. Our approach is validated in a simulation and three real-world experiments.
The black-box nature of our approach may enable us to reduce the entry barrier for efficiently building crowdsourcing solutions.
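To give a flavour of the black-box approach (a sketch with scikit-optimize, not the authors' own implementation; the parameter space and utility function are invented), Bayesian Optimization proposes crowd-process configurations, observes their measured utility, and converges on a well-performing process in few evaluations.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hypothetical crowd-process parameters: redundant workers and pay per task.
space = [Integer(1, 9, name="workers"), Real(0.05, 1.0, name="pay")]

def negative_utility(params):
    """Run (or simulate) the crowd process and return -utility,
    since gp_minimize minimizes. Here: a synthetic quality/cost trade-off."""
    workers, pay = params
    quality = 1.0 - 0.5 ** workers * (1.0 - pay)   # toy quality model
    cost = workers * pay
    return -(quality - 0.2 * cost)

result = gp_minimize(negative_utility, space, n_calls=25, random_state=0)
print("best configuration:", result.x, "utility:", -result.fun)
```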
Thomas W Malone, Jeffrey V Nickerson, Robert J Laubacher, Laur Hesse Fisher, Patrick De Boer, Yue Han, W Ben Towne, Putting the Pieces Back Together Again: Contest Webs for Large-Scale Problem Solving, In: 20th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2017), s.n., Portland, OR, 2017-02-25. (Conference or Workshop Paper published in Proceedings)
A key issue, whenever people work together to solve a complex problem, is how to divide the problem into parts done by different people and combine the parts into a solution for the whole problem. This paper presents a novel way of doing this with groups of contests called contest webs. Based on the analogy of supply chains for physical products, the method provides incentives for people to (a) reuse work done by themselves and others, (b) simultaneously explore multiple ways of combining interchangeable parts, and (c) work on parts of the problem where they can contribute the most.
The paper also describes a field test of this method in an online community of over 50,000 people who are developing proposals for what to do about global climate change. The early results suggest that the method can, indeed, work at scale as intended.
Marcel C. Bühler, KrowDD: Estimating Feature Relevance before Obtaining Data, University of Zurich, Faculty of Business, Economics and Informatics, 2017. (Bachelor's Thesis)
Before building a classifier to make predictions about a target variable, one must decide what input data to use. Most scientific publications about feature selection deal with methods that can be used once training data has been collected. Yet, in the real world, one has to collect, clean, and transform data before it can be used to create predictive models. Collecting data is a very expensive and time-consuming process. Going through this process for data that is not relevant to the target variable is very inefficient. A common approach to minimize the effort for feature selection is asking domain experts for their opinion. However, experts have been shown to perform worse at this task than one might expect. In this paper, I present a tool, KrowDD, that is able to identify relevant features among a number of feature ideas before obtaining data. An evaluation using three datasets shows that KrowDD performs significantly better than human experts. KrowDD is the first step on the way to more efficient feature selection: feature selection before obtaining training data.
Lenz Baumann, communitweet - Analyzing twitter communities, 2017. (Other Publication)
As the importance of data from the social web increases, research on and analysis of such data gain more and more relevance in science. In a time when the digital and the real world are connected like never before in history, such data provide important insights into social developments on a local and a global scale. The ability to handle this kind of data is hindered not only by its enormous size and complexity, but also by researchers’ ability to handle it programmatically. This is despite the fact that the most needed operations include very basic and generically reusable scripts. The following work provides a package that includes mechanisms to process, transform, enrich, and visualize data gathered from the Twitter API. It can be useful to data scientists, social scientists, or journalists.
We describe the package and its capabilities through a number of examples.
Ausgezeichnete Informatikdissertationen 2016, Edited by: Abraham Bernstein, Steffen Hölldobler, et al., Gesellschaft für Informatik, Bonn, 2017. (Edited Scientific Work)