Abraham Bernstein, Jan Marco Leimeister, Natasha Noy, Cristina Sarasua, Elena Simperl, Crowdsourcing and the Semantic Web (Dagstuhl Seminar 14282), Dagstuhl Reports, Vol. 4 (7), 2014. (Journal Article)
 
Semantic technologies provide flexible and scalable solutions to master and make sense of an increasingly vast and complex data landscape. However, while this potential has been acknowledged for various application scenarios and domains, and a number of success stories exist, it is equally clear that the development and deployment of semantic technologies will always remain reliant on human input and intervention. This is due to the very nature of some of the tasks associated with the semantic data management life cycle, which are known for their knowledge-intensive and/or context-specific character; examples range from conceptual modeling in almost any flavor, to labeling resources (in different languages), describing their content using ontological terms, or recognizing similar concepts and entities. For this reason, the Semantic Web community has always looked into applying the latest theories, methods and tools from CSCW (Computer Supported Cooperative Work), participatory design, Web 2.0, social computing, and, more recently, crowdsourcing to find ways to engage with users and encourage their involvement in the execution of technical tasks. Existing approaches include the usage of wikis as semantic content authoring environments and leveraging folksonomies to create formal ontologies, but also human computation approaches such as games with a purpose or micro-tasks. This document provides a summary of the Dagstuhl Seminar 14282: Crowdsourcing and the Semantic Web, which in July 2014 brought together researchers of the emerging scientific community at the intersection of crowdsourcing and Semantic Web technologies. We collect the position statements written by the participants of the seminar, which played a central role in the discussions about the evolution of our research field. |
|
Abraham Bernstein, Mit Computer Sprechen: Unterschiede und Gemeinsamkeiten zwischen menschlicher und maschineller Sprache, In: Sprache(n) verstehen, vdf, Zurich, p. 197 - 214, 2014. (Book Chapter)

|
|
The Semantic Web – ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II, Edited by: Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig Knoblock, Denny Vrandečić, Paul Groth, Natasha Noy, Krzysztof Janowicz, Carole Goble, Springer, Heidelberg, 2014. (Proceedings)

|
|
The Semantic Web – ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, Edited by: Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig Knoblock, Denny Vrandečić, Paul Groth, Natasha Noy, Krzysztof Janowicz, Carole Goble, Springer, Heidelberg, 2014. (Proceedings)

|
|
Sai Tung On, Shen Gao, Bingsheng He, Ming Wu, Qiong Luo, Jianliang Xu, FD-Buffer: A Cost-Based Adaptive Buffer Replacement Algorithm for Flash Memory Devices, IEEE Transactions on Computers, Vol. 63 (9), 2014. (Journal Article)

In this paper, we present a design and implementation of FD-Buffer, a cost-based adaptive buffer manager for flash memory devices. Due to its unique hardware characteristics, flash memory has an inherent read-write asymmetry: writes involve expensive erase operations, which usually makes them much slower than reads. To address this read-write asymmetry, we revisit buffer management and consider the average I/O cost per page access as the main cost metric, as opposed to the traditional miss rate. While there have been a number of buffer management algorithms that take the read-write asymmetry into consideration, most algorithms fail to effectively adapt to the runtime workload or different degrees of asymmetry. In this paper, we develop a new replacement algorithm in which we separate clean and dirty pages into two pools. The size ratio of the two pools is automatically adapted based on the read-write asymmetry and the runtime workload. We evaluate FD-Buffer with trace-driven experiments on real flash memory devices. Our evaluation results show that our algorithm achieves a 4.0-33.4 percent improvement in I/O performance on flash memory, compared to state-of-the-art flash-aware replacement policies. |
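To make the two-pool replacement idea concrete, the following is a minimal Python sketch, not the paper's implementation: clean and dirty pages live in separate LRU pools, and the clean pool's share of the buffer is derived from an assumed read/write cost ratio (the paper additionally adapts this ratio to the runtime workload).

```python
from collections import OrderedDict

class FDBuffer:
    """Toy two-pool buffer: clean and dirty pages in separate LRU pools."""

    def __init__(self, capacity, read_cost=1.0, write_cost=10.0):
        self.capacity = capacity
        self.read_cost = read_cost      # cost of re-reading an evicted clean page
        self.write_cost = write_cost    # cost of flushing an evicted dirty page
        self.clean = OrderedDict()      # page_id -> None, in LRU order
        self.dirty = OrderedDict()

    def _clean_quota(self):
        # The more expensive writes are, the more room dirty pages get
        # (an assumed static rule; FD-Buffer adapts this at runtime).
        return self.capacity * self.read_cost / (self.read_cost + self.write_cost)

    def access(self, page_id, is_write):
        pool = self.dirty if page_id in self.dirty else self.clean if page_id in self.clean else None
        if pool is not None:                      # hit: refresh LRU position
            pool.pop(page_id)
        if is_write or pool is self.dirty:        # writes (or already-dirty pages) stay dirty
            self.dirty[page_id] = None
        else:
            self.clean[page_id] = None
        self._evict_if_needed()

    def _evict_if_needed(self):
        while len(self.clean) + len(self.dirty) > self.capacity:
            # Evict from the clean pool while it exceeds its cost-based quota.
            if len(self.clean) > self._clean_quota() or not self.dirty:
                self.clean.popitem(last=False)    # drop least-recently-used clean page
            else:
                self.dirty.popitem(last=False)    # flush + drop least-recently-used dirty page

buf = FDBuffer(capacity=3)
for page, write in [(1, False), (2, True), (3, False), (4, False)]:
    buf.access(page, write)
print(sorted(buf.clean), sorted(buf.dirty))   # page 1 was evicted as the cheap (clean) victim
```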
|
Abraham Bernstein, Natasha Noy, Is This Really Science? The Semantic Webber’s Guide to Evaluating Research Contributions, Version: 1, 2014. (Technical Report)
 
The Semantic Web is an extremely diverse research area. Unlike scientists in other research fields, we investigate a diverse set of questions using a plethora of methods. The goal of this primer is to provide context for scientists in the Semantic Web and Linked Data domain about the purpose of research questions and their associated hypotheses, the tension between rigor and relevance, the evaluation approaches typically used, and pitfalls in terms of reliability and validity.
For example, where is the scientific problem in developing a system or a tool? How do we frame the discussion of generating linked data from a given corpus such that others will actually care about our work? When is it a good idea to use our (Semantic Web) technology when the problem has already been successfully attacked by other means?
We strive to make this primer as practical as possible. Hence, after a short, more theoretical introduction, we pick up a series of examples from our research domain and use them to exemplify the implications of our introductory theoretical treatment. We hope that this text will help the reader to explore the scientific basis of their research more systematically. |
|
Amancio Bouza, Abraham Bernstein, (Partial) user preference similarity as classification-based model similarity, Semantic Web, Vol. 5 (1), 2014. (Journal Article)
 
Recommender systems play an important role in helping people find items they like. One type of recommender system is collaborative filtering, which considers the feedback of like-minded people. The fundamental assumption of collaborative filtering is that people who previously shared similar preferences behave similarly later on. This paper introduces several novel, classification-based similarity metrics that are used to compare user preferences. Furthermore, the concept of partial preference similarity based on a machine learning model is presented. For evaluation, the cold-start behavior of the presented classification-based similarity metrics is assessed in a large-scale experiment. It is shown that classification-based similarity metrics with machine learning significantly outperform other similarity approaches in different cold-start situations under different degrees of data sparseness. |
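As an illustration of the classification-based idea, the following Python sketch, assuming scikit-learn is available, trains one classifier per user on that user's ratings and measures similarity as the agreement of the two models' predictions on a common probe set; the feature vectors and the agreement measure are illustrative choices, not the paper's exact metrics.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def preference_model(item_features, likes):
    """Train a per-user model mapping item features -> like/dislike."""
    model = DecisionTreeClassifier(max_depth=4)
    model.fit(item_features, likes)
    return model

def model_similarity(model_a, model_b, probe_items):
    """Fraction of probe items on which both users' models predict the same label."""
    return float(np.mean(model_a.predict(probe_items) == model_b.predict(probe_items)))

# Usage with synthetic data: two users, 5-dimensional item features, binary ratings.
rng = np.random.default_rng(0)
items = rng.random((200, 5))
user_a = preference_model(items[:80], (items[:80, 0] > 0.5).astype(int))
user_b = preference_model(items[80:160], (items[80:160, 0] > 0.4).astype(int))
print(model_similarity(user_a, user_b, items[160:]))
```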
|
Daniel Strebel, Scalable forensic transaction matching: and its application for detecting patterns of fraudulent financial transactions, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
The detection of fraudulent patterns in large sets of financial transaction data is a crucial task in forensic investigations of money laundering, employee fraud and various other illegal activities. Scalable and flexible tools are needed to be able to analyze these large amounts of data and express the complex structures of the patterns that should be detected.
This thesis presents a novel approach of locally identifying associations between incoming and outgoing transactions for each participant of the transaction network and then aggregating these associations into larger patterns. The identified patterns can be pruned and visualized in a graphical user interface to conduct further investigations.
The evaluation of our approach shows that it allows stream processing of real-world financial transactions with a throughput of more than one million transactions per minute. Furthermore, we demonstrate the capability of our approach to express six sophisticated money laundering patterns, as reported by the Egmont Group, and successfully retrieve components that correspond to these patterns.
To the best of our knowledge, this approach is the first to scalably identify dependent financial transactions based on local transaction matching, while providing a flexible query language to cover a broad range of financial fraud cases. |
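The following is a hypothetical Python sketch of the local matching step described above: for each account, an incoming transaction is associated with later outgoing transactions of a similar amount within a time window. The window and amount tolerance are invented parameters for illustration; the thesis's query language and pattern aggregation are not shown.

```python
from collections import namedtuple, defaultdict

Tx = namedtuple("Tx", "src dst amount time")

def local_matches(transactions, window=3, tolerance=0.1):
    """Yield (incoming, outgoing) pairs that may belong to the same money flow."""
    incoming = defaultdict(list)              # account -> transactions it received
    for tx in sorted(transactions, key=lambda t: t.time):
        for received in incoming[tx.src]:
            close_in_time = 0 <= tx.time - received.time <= window
            close_in_amount = abs(tx.amount - received.amount) <= tolerance * received.amount
            if close_in_time and close_in_amount:
                yield received, tx            # money possibly passed through tx.src
        incoming[tx.dst].append(tx)

txs = [Tx("A", "B", 100.0, 1), Tx("B", "C", 98.0, 2), Tx("B", "D", 5.0, 3)]
print(list(local_matches(txs)))   # (A->B, B->C) is reported as a candidate flow
```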
|
Minh Khoa Nguyen, Thomas Scharrenbach, Abraham Bernstein, Eviction strategies for semantic flow processing, In: SSWS 2013 - Scalable Semantic Web Knowledge Base Systems 2013, CEUR-WS, Aachen, Germany, 2013-10-21. (Conference or Workshop Paper published in Proceedings)
 
In order to cope with the ever-increasing data volume, continuous processing of incoming data via Semantic Flow Processing systems has been proposed. These systems answer queries over streams of RDF triples. To achieve this goal they match (triple) patterns against the incoming stream and generate/update variable bindings. Yet, given the continuous nature of the stream, the number of bindings can explode and exceed memory, in particular when computing aggregates. To make information processing practical, Semantic Flow Processing systems therefore typically limit the considered data to a (moving) window. Whilst this technique is simple, it may not be able to find patterns spread further than the window or may still cause memory overruns when data is highly bursty. In this paper we propose to maintain bindings (and thus memory) based not on recency (i.e., a window) but on the likelihood of contributing to a complete match. We propose to base the eviction decision on matching likelihood rather than creation time (fifo) or random choice. Furthermore, we propose to drop variable bindings instead of data, as load shedding approaches do. Specifically, we systematically investigate deterministic and matching-likelihood-based probabilistic eviction strategies for dropping variable bindings in terms of recall. We find that matching-likelihood-based eviction can outperform fifo and random eviction strategies on synthetic as well as real-world data. |
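A minimal Python sketch of the eviction idea, under the assumption that some estimate of a binding's completion likelihood is available: partial bindings are kept in a bounded store and, on overflow, the binding least likely to complete is dropped instead of the oldest (fifo) or a random one.

```python
import heapq, itertools

class BindingStore:
    """Bounded store of partial query bindings with likelihood-based eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                       # (likelihood, tie-breaker, binding)
        self.counter = itertools.count()

    def add(self, binding, completion_likelihood):
        heapq.heappush(self.heap, (completion_likelihood, next(self.counter), binding))
        if len(self.heap) > self.capacity:
            return heapq.heappop(self.heap)[2]   # evict least-likely-to-complete binding
        return None

store = BindingStore(capacity=2)
store.add({"?x": "alice"}, 0.9)
store.add({"?x": "bob"}, 0.2)
print(store.add({"?x": "carol"}, 0.6))   # evicts the 0.2 binding: {'?x': 'bob'}
```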
|
Cosmin Basca, Abraham Bernstein, Distributed SPARQL Throughput Increase: On the effectiveness of Workload-driven RDF partitioning, In: International Semantic Web Conference, CEUR-WS.org. 2013. (Conference Presentation)
 
The current size and expansion of the Web of Data or WoD, as shown by the staggering growth of the Linked Open Data (LOD) project, which reached over 31 billion triples towards the end of 2011, leaves federated and distributed Semantic DBMS' or SDBMS' facing the open challenge of scalable SPARQL query processing. Traditionally, SDBMS' push the burden of efficiency at runtime on the query optimizer. This is in many cases too late (i.e., queries with many and/or non-trivial joins). Extensive research in the general field of Databases has identified partitioning, in particular horizontal partitioning, as a primary means to achieve scalability. Similarly to [2] we adopt the assumption that minimizing the number of distributed joins as a result of reorganizing the data over participating nodes will lead to increased throughput in distributed SDBMS'. Consequently, the benefit of reducing the number of distributed joins in this context is twofold:
A) Query optimization becomes simpler. Generally regarded as a hard problem in a distributed setup, query optimization benefits, at all execution levels, from fewer distributed joins. During source selection the optimizer can use specialized indexes like in [5], while during query planning better query plans can be devised quicker, since much of the optimization burden and complexity is shifted away from the distributed optimizer to local optimizers.
B) Query execution becomes faster. Not having to pay for the overhead of shipping partial results around naturally reduces the time spent waiting for usually higher-latency network transfers. Furthermore, federated SDBMS' incur higher costs as they have to additionally serialize and deserialize data.
The main contributions of this poster are: i) the presentation of a novel and naïve workload-based RDF partitioning method and ii) an evaluation and study using a large real-world query log and dataset. Specifically, we investigate the impact of various method-specific parameters and query log sizes, comparing the performance of our method with traditional partitioning approaches. |
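To illustrate the objective (co-locating data so that frequent joins stay local), here is a hypothetical Python sketch that greedily places predicates that are often joined together in a query log on the same node; it only conveys the intuition and is not the partitioning method presented in the poster.

```python
from collections import Counter
from itertools import combinations

def partition_predicates(query_log, num_nodes):
    """query_log: list of queries, each given as the list of predicates it joins."""
    join_freq = Counter()
    for query in query_log:
        join_freq.update(combinations(sorted(set(query)), 2))
    assignment, load = {}, Counter()
    # Place the most frequently co-joined predicate pairs first, keeping partners together.
    for (p, q), _ in join_freq.most_common():
        for pred in (p, q):
            if pred not in assignment:
                partner = assignment.get(q if pred == p else p)
                node = partner if partner is not None else min(range(num_nodes), key=load.__getitem__)
                assignment[pred] = node
                load[node] += 1
    # Predicates that never co-occur with others are placed by load balancing alone.
    for pred in set().union(*map(set, query_log)) - assignment.keys():
        assignment[pred] = min(range(num_nodes), key=load.__getitem__)
        load[assignment[pred]] += 1
    return assignment

log = [["worksFor", "locatedIn"], ["worksFor", "locatedIn"], ["knows", "name"]]
print(partition_predicates(log, num_nodes=2))   # co-joined predicates share a node
```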
|
Lorenz Fischer, Thomas Scharrenbach, Abraham Bernstein, Scalable linked data stream processing via network-aware workload scheduling, In: 9th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2013-10-21. (Conference or Workshop Paper published in Proceedings)
 
In order to cope with the ever-increasing data volume, distributed stream processing systems have been proposed. To ensure scalability most distributed systems partition the data and distribute the workload among multiple machines. This approach does, however, raise the question of how the data and the workload should be partitioned and distributed. A uniform scheduling strategy — a uniform distribution of computation load among available machines — as typically used by stream processing systems, disregards network load as one of the major bottlenecks for throughput, resulting in an immense load in terms of inter-machine communication. In this paper we propose a graph-partitioning-based approach for workload scheduling within stream processing systems. We implemented a distributed triple-stream processing engine on top of the Storm realtime computation framework and evaluate its communication behavior using two real-world datasets. We show that the application of graph partitioning algorithms can decrease inter-machine communication substantially (by 40% to 99%) whilst maintaining an even workload distribution, even using very limited data statistics. We also find that processing RDF data as single triples at a time, rather than as graph fragments (containing multiple triples), may decrease throughput, indicating the usefulness of semantics. |
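The scheduling idea can be sketched as follows in Python, with a simple greedy heuristic standing in for a real graph partitioner such as METIS: tasks and their pairwise message counts form a weighted graph, and each task is placed on the machine where most of its communication becomes local, subject to a balance constraint. Task names and counts are illustrative assumptions.

```python
import math

def schedule(tasks, message_counts, k):
    """message_counts: {(task_a, task_b): messages}; returns {task: machine}."""
    capacity = math.ceil(len(tasks) / k)        # keep the load per machine balanced
    assignment, load = {}, [0] * k
    volume = {t: 0 for t in tasks}
    for (a, b), msgs in message_counts.items():
        volume[a] += msgs
        volume[b] += msgs
    # Assign tasks in decreasing order of total communication volume.
    for task in sorted(tasks, key=lambda t: -volume[t]):
        def local_traffic(machine):
            # Messages that would stay machine-local if 'task' is placed here.
            return sum(m for (a, b), m in message_counts.items()
                       if task in (a, b)
                       and assignment.get(b if a == task else a) == machine)
        candidates = [m for m in range(k) if load[m] < capacity]
        best = max(candidates, key=local_traffic)
        assignment[task] = best
        load[best] += 1
    return assignment

counts = {("spout", "filter"): 100, ("filter", "join"): 90, ("join", "sink"): 10}
print(schedule(["spout", "filter", "join", "sink"], counts, k=2))
```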
|
Philip Stutz, Coralia-Mihaela Verman, Lorenz Fischer, Abraham Bernstein, TripleRush: a fast and scalable triple store, In: 9th International Workshop on Scalable Semantic Web Knowledge Base Systems, CEUR Workshop Proceedings, http://ceur-ws.org, Aachen, Germany, 2013-10-21. (Conference or Workshop Paper published in Proceedings)
 
TripleRush is a parallel in-memory triple store designed to address the need for efficient graph stores that answer queries over large-scale graph data fast. To that end it leverages a novel, graph-based architecture. Specifically, TripleRush is built on our parallel and distributed graph processing framework Signal/Collect. The index structure is represented as a graph where each index vertex corresponds to a triple pattern. Partially matched copies of a query are routed in parallel along different paths of this index structure. We show experimentally that TripleRush takes less than a third of the time to answer queries compared to the fastest of three state-of-the-art triple stores, when measuring time as the geometric mean of all queries for two benchmarks. On individual queries, TripleRush is up to three orders of magnitude faster than other triple stores. |
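The routing intuition can be illustrated with a toy, single-machine Python sketch (not the Signal/Collect-based implementation): each triple pattern is matched against the store, and a partially bound copy of the query is created per match and forwarded to the next pattern.

```python
def matches(pattern, triple, bindings):
    """Try to extend 'bindings' so that 'pattern' matches 'triple'."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):                 # variable: bind or check consistency
            if new.setdefault(p, t) != t:
                return None
        elif p != t:                          # constant must match exactly
            return None
    return new

def route(query, triples, bindings=None):
    """Expand the first pattern of the query against the store, then recurse."""
    bindings = bindings or {}
    if not query:
        yield bindings
        return
    pattern = [bindings.get(x, x) for x in query[0]]   # substitute known variables
    for triple in triples:
        new = matches(pattern, triple, bindings)
        if new is not None:
            yield from route(query[1:], triples, new)  # forward the partially matched copy

store = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
query = [("?x", "knows", "?y"), ("?y", "knows", "?z")]
print(list(route(query, store)))   # [{'?x': 'alice', '?y': 'bob', '?z': 'carol'}]
```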
|
Genc Mazlami, Scaling message passing algorithms for Distributed Constraint Optimization Problems in Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
The concept of Distributed Constraint Optimization Problems (DCOPs) is becoming more and more relevant to research in fields such as computer science, engineering, game theory and others. Many real-world problems, such as congestion management in data communication or traffic, and applications on sensor networks, are potential application fields for DCOPs. Hence, there is a need for research on different algorithms and approaches to this class of problems.
This thesis considers the evaluation and distribution of the Max-Sum algorithm. Specifically, the thesis first illustrates a detailed example computation of the algorithm in order to contribute to the understanding of the algorithm. The main contribution of the thesis is the implementation of the Max-Sum algorithm in the novel graph processing framework Signal/Collect. In addition, a theoretical complexity analysis of this implementation is performed. Based on the implementation, a second contribution of the thesis follows: the benchmarking of the Max-Sum algorithm and its comparison to the DSA-A, DSA-B and Best-Response algorithms. The benchmarks first try to reproduce the results found in [Farinelli et al., 2008] by analyzing the conflicts over the execution cycles and the cycles until convergence. Then, the thesis contributes new empirical results by evaluating and comparing synchronous and asynchronous Max-Sum with respect to conflicts over time and time to convergence. The analysis of the relation between execution cycles and execution time is also part of this novel contribution.
Another main contribution of the thesis is the distributed evaluation of the algorithm on a multi-machine cluster. The benchmarks on multiple machines first compare the solution quality of asynchronous and synchronous Max-Sum. This is followed by an analysis of how the number of machines used in the execution impacts the results for the conflicts over time. The thesis also addresses performance questions raised by the theoretical complexity analysis by analyzing the influence of the average vertex degree on the solution quality. |
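For readers unfamiliar with Max-Sum, the following small Python sketch shows the factor-to-variable message update on a tiny factor graph with one factor that rewards disagreement (as in graph coloring); it is illustrative only and unrelated to the Signal/Collect implementation benchmarked in the thesis.

```python
from itertools import product

DOMAIN = [0, 1]

def factor_to_variable(factor, target, var_messages):
    """R_{f->x}(d): maximize f(...) plus the other variables' incoming messages."""
    others = [v for v in factor["vars"] if v != target]
    msg = {}
    for d in DOMAIN:
        best = float("-inf")
        for combo in product(DOMAIN, repeat=len(others)):
            assign = dict(zip(others, combo), **{target: d})
            utility = factor["fn"](assign) + sum(var_messages[v][assign[v]] for v in others)
            best = max(best, utility)
        msg[d] = best
    return msg

disagree = {"vars": ["x", "y"], "fn": lambda a: 1.0 if a["x"] != a["y"] else 0.0}
# With a single factor, the variable-to-function messages carry only unary preferences.
prefs = {"x": {0: 0.5, 1: 0.0}, "y": {0: 0.0, 1: 0.0}}
r_to_y = factor_to_variable(disagree, "y", prefs)
print(r_to_y, "->", max(DOMAIN, key=r_to_y.get))   # y prefers value 1, disagreeing with x
```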
|
Markus Christen, Florian Faller, Ulrich Götz, Cornelius Müller, Serious Moral Games in Bioethics, In: Workshop on “Ubiquitous games and gamification for promoting behavior change and wellbeing”, 2013-09-16. (Conference or Workshop Paper published in Proceedings)
 
|
|
Damian Schärli, Coordinating large crowds in a complex language - translation task, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
The number of Human Computation Systems (HCS), in which people and computers cooperate in complex solution processes, has increased massively in recent years and offers great opportunities. Research and practical work in recent years confirm the benefits of this collaboration, which aims at solving difficult problems with the support of people. In this work, we show how the execution of such processes is simplified within a workflow engine that coordinates the processes. With an example, the simplicity and strength of this framework are shown. The evaluation in this thesis shows that the framework is fault-tolerant and scalable and can be used to implement processes with little effort. |
|
Livio Hobi, Benchmarking Big Data Streams: Joins in Informationsstrom-Verarbeitungssystemen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
This thesis investigates an approach for systematically stress-testing information flow processing (IFP) systems with sequential joins. The main contribution of this work is not only the result of the evaluation of a five-way sequential join but also the provided methodical approach, covering the properties and challenges of joins in the context of IFP systems, the described data preparation, and the statistical evaluation based on defined key performance indicators. This approach can simplify future work. Furthermore, this work emphasizes the immense importance of knowing and selecting the dataset for a benchmark. The developer of a benchmark needs to know some statistics about the dataset and then choose the parameters that fit the defined requirements. |
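To make the join setting concrete, here is a minimal Python sketch of a single windowed stream-join operator; a five-way sequential join as evaluated in the thesis would chain several such operators. The merged input format, keys and window size are illustrative assumptions, not the thesis's setup.

```python
from collections import deque, defaultdict

def windowed_join(stream, window):
    """stream: time-ordered (timestamp, side, key, value) with side 'L' or 'R'.
    Emits joined tuples whose timestamps are at most `window` apart."""
    buffers = {"L": defaultdict(deque), "R": defaultdict(deque)}
    for t, side, key, value in stream:
        other = "R" if side == "L" else "L"
        partner_buf = buffers[other][key]
        while partner_buf and t - partner_buf[0][0] > window:
            partner_buf.popleft()                 # expire tuples outside the window
        for _, v_other in partner_buf:
            yield (key, value, v_other) if side == "L" else (key, v_other, value)
        buffers[side][key].append((t, value))

events = [(1, "L", "a", "l1"), (2, "R", "a", "r1"), (10, "R", "a", "r2")]
print(list(windowed_join(events, window=5)))      # [('a', 'l1', 'r1')]; r2 arrives too late
```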
|
Markus Christen, The neuroethical challenges of brain simulations, In: Meeting of the International Association for Computing and Philosophy, 2013-07-15. (Conference or Workshop Paper published in Proceedings)
 
|
|
Thomas Keller, Graph partitioning for Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
Signal/Collect is a vertex-centric programming model and framework for graph processing. The communication between the vertices of a graph impairs the performance of distributed framework executions due to message serializations and network limitations. In this thesis, Signal/Collect is extended to support algorithms that reduce the number of remote messages by moving vertices between compute nodes during the computation. Several algorithms are evaluated and the best performing candidate is discussed in detail. The evaluation results indicate an improvement of the runtime performance in one of two cases. However, the performed evaluations are not sufficient to draw final conclusions about the implemented approach. |
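A hypothetical illustration of the migration criterion described above, in Python: a vertex is moved to the compute node it exchanges the most messages with, provided that traffic clearly dominates its current local traffic. The threshold and bookkeeping are invented for the sketch and do not reflect the algorithms evaluated in the thesis.

```python
from collections import Counter

def migration_target(current_node, messages_per_node, min_gain=2.0):
    """messages_per_node: Counter of messages this vertex exchanges with each node."""
    if not messages_per_node:
        return current_node
    best_node, best_traffic = messages_per_node.most_common(1)[0]
    local_traffic = messages_per_node.get(current_node, 0)
    # Only migrate when remote traffic to the best node clearly dominates local traffic.
    return best_node if best_traffic >= min_gain * max(local_traffic, 1) else current_node

print(migration_target("node0", Counter({"node1": 40, "node0": 5})))   # -> node1
```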
|
Robin Hafen, Benchmarking algorithms for distributed constraint optimization problems in Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
Many real world problems, such as network congestion control, can be mapped to the concept of a distributed constraint optimization problem (DCOP). By analyzing a class of DCOP algorithms known as local iterative approximate best-response (LIBR) algorithms, [Chapman et al., 2011b] constructed a framework enabling the study and modular design of new hybrid algorithms. In [Chapman et al., 2011a], several classical, as well as new hybrid algorithms, were benchmarked in a series of graph coloring experiments. It was found that the modular approach to algorithm design allowed the creation of new, better performing algorithms.
In this thesis a similar approach was taken: selected existing LIBR algorithms, such as the distributed stochastic algorithm and distributed simulated annealing, were implemented and benchmarked using the graph processing framework Signal/Collect [Stutz et al., 2010], with which no such benchmark had previously been conducted. As a further contribution, an existing, non-distributed algorithm from the computer science literature called tabu search [Nurmela, 1993] was modularized and distributed in the same manner. |
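For illustration, here is a minimal Python sketch of a DSA-style local search for graph coloring, the kind of LIBR algorithm benchmarked in the thesis; the activation probability, synchronous update scheme and toy graph are illustrative assumptions rather than the thesis's exact configuration.

```python
import random

def dsa_coloring(neighbors, colors, rounds=50, p=0.6, seed=1):
    """DSA-style local search: each vertex occasionally moves to a less conflicting color."""
    random.seed(seed)
    assignment = {v: random.choice(colors) for v in neighbors}
    for _ in range(rounds):
        updates = {}
        for v, nbrs in neighbors.items():
            conflicts = lambda c: sum(assignment[n] == c for n in nbrs)
            best = min(colors, key=conflicts)
            # Move (with probability p) only if it does not increase conflicts.
            if conflicts(best) <= conflicts(assignment[v]) and random.random() < p:
                updates[v] = best
        assignment.update(updates)        # all agents decide on the same snapshot
    return assignment

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(dsa_coloring(graph, colors=[0, 1, 2]))   # conflicts typically vanish after a few rounds
```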
|
Markus Christen, Deborah A Vitacco, Lara Huber, Julie Harboe, Sara I Fabrikant, Peter Brugger, Colorful brains: 14 years of display practice in functional neuroimaging, NeuroImage, Vol. 73, 2013. (Journal Article)
 
Neuroimaging results are typically graphically rendered and color-coded, which influences the process of knowledge generation within neuroscience as well as the public perception of brain research. Analyzing these issues requires empirical information on the display practice in neuroimaging. In our study we evaluated more than 9000 functional images (fMRI and PET) published between 1996 and 2009 with respect to the use of color, image structure, image production software and other factors that may determine the display practice. We demonstrate a variety of display styles despite a remarkable dominance of a few image production sites and software systems, outline some tendencies of standardization, and identify shortcomings with respect to color scale explication in neuroimages. We discuss the importance of these findings for knowledge production in neuroimaging, and we make suggestions to improve the display practice, especially regarding regimes of color coding. |
|