Minh Khoa Nguyen, Thomas Scharrenbach, Abraham Bernstein, Eviction strategies for semantic flow processing, In: SSWS 2013 - Scalable Semantic Web Knowledge Base Systems 2013, CEUR-WS, Aachen, Germany, 2013-10-21. (Conference or Workshop Paper published in Proceedings)
 
In order to cope with the ever-increasing data volume, Semantic Flow Processing systems, which continuously process incoming data, have been proposed. These systems answer queries on streams of RDF triples. To achieve this goal they match (triple) patterns against the incoming stream and generate/update variable bindings. Yet, given the continuous nature of the stream, the number of bindings can explode and exceed memory, in particular when computing aggregates. To make the information processing practical, Semantic Flow Processing systems therefore typically limit the considered data to a (moving) window. Whilst this technique is simple, it may not be able to find patterns spread further than the window or may still cause memory overruns when data is highly bursty. In this paper we propose to maintain bindings (and thus memory) based not on recency (i.e., a window) but on the likelihood of contributing to a complete match. We propose to base the eviction decision on the matching likelihood rather than on creation time (FIFO) or at random. Furthermore, we propose to drop variable bindings instead of data, as load shedding approaches do. Specifically, we systematically investigate deterministic and matching-likelihood-based probabilistic eviction strategies for dropping variable bindings in terms of recall. We find that a matching-likelihood-based eviction can outperform FIFO and random eviction strategies on synthetic as well as real-world data.
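A minimal Python sketch of the three eviction policies, assuming a fixed-capacity store of (binding, likelihood) pairs; the class name and the inverse-likelihood weighting are illustrative assumptions, not the paper's actual implementation:

    import random

    class BindingStore:
        """Illustrative fixed-capacity store for partial variable bindings."""

        def __init__(self, capacity, policy="likelihood"):
            self.capacity = capacity
            self.policy = policy          # 'fifo', 'random' or 'likelihood'
            self.bindings = []            # (binding, likelihood) in arrival order

        def add(self, binding, likelihood):
            if len(self.bindings) >= self.capacity:
                self._evict()
            self.bindings.append((binding, likelihood))

        def _evict(self):
            if self.policy == "fifo":
                self.bindings.pop(0)      # deterministic: drop the oldest binding
            elif self.policy == "random":
                self.bindings.pop(random.randrange(len(self.bindings)))
            else:
                # probabilistic: bindings unlikely to complete a match are
                # more likely to be dropped (epsilon avoids all-zero weights)
                weights = [1.0 - lik + 1e-9 for (_, lik) in self.bindings]
                victim = random.choices(range(len(self.bindings)), weights=weights)[0]
                self.bindings.pop(victim)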
Cosmin Basca, Abraham Bernstein, Distributed SPARQL Throughput Increase: On the effectiveness of Workload-driven RDF partitioning, In: International Semantic Web Conference, CEUR-WS.org. 2013. (Conference Presentation)
 
The current size and expansion of the Web of Data (WoD), as shown by the staggering growth of the Linked Open Data (LOD) project, which reached over 31 billion triples towards the end of 2011, leaves federated and distributed Semantic DBMS' (SDBMS') facing the open challenge of scalable SPARQL query processing. Traditionally, SDBMS' push the burden of efficiency onto the query optimizer at runtime. This is in many cases too late (e.g., for queries with many and/or non-trivial joins). Extensive research in the general field of databases has identified partitioning, in particular horizontal partitioning, as a primary means to achieve scalability. Similarly to [2] we adopt the assumption that minimizing the number of distributed joins by reorganizing the data over participating nodes will lead to increased throughput in distributed SDBMS'. Consequently, the benefit of reducing the number of distributed joins in this context is twofold:
A) Query optimization becomes simpler. Generally regarded as a hard problem in a distributed setup, query optimization benefits, at all execution levels, from fewer distributed joins. During source selection the optimizer can use specialized indexes like in [5], while during query planning better query plans can be devised more quickly, since much of the optimization burden and complexity is shifted away from the distributed optimizer to local optimizers.
B) Query execution becomes faster. Not having to pay for the overhead of shipping partial results around naturally reduces the time spent waiting for higher-latency network transfers. Furthermore, federated SDBMS' incur higher costs as they additionally have to serialize and deserialize data.
The main contributions of this poster are: i) the presentation of a novel and naïve workload-based RDF partitioning method and ii) an evaluation and study using a large real-world query log and dataset. Specifically, we investigate the impact of various method-specific parameters and query log sizes, comparing the performance of our method with traditional partitioning approaches.
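The objective behind this assumption can be made concrete: given a query workload (expressed as join pairs of triple patterns) and a placement of patterns onto nodes, count the joins that must execute across nodes. A minimal sketch with hypothetical inputs, not taken from the poster:

    def count_distributed_joins(queries, placement):
        """Count joins whose two operand patterns are answered on different
        nodes. `placement` maps a triple pattern to a node id; `queries` is
        a list of queries, each a list of joined pattern pairs (p1, p2)."""
        distributed = 0
        for query in queries:
            for p1, p2 in query:
                if placement[p1] != placement[p2]:
                    distributed += 1
        return distributed

A workload-driven partitioner would choose `placement` so that this count is minimized for the observed query log, subject to balancing data across nodes.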
Lorenz Fischer, Thomas Scharrenbach, Abraham Bernstein, Scalable linked data stream processing via network-aware workload scheduling, In: 9th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2013-10-21. (Conference or Workshop Paper published in Proceedings)
 
In order to cope with the ever-increasing data volume, distributed stream processing systems have been proposed. To ensure scalability most distributed systems partition the data and distribute the workload among multiple machines. This approach does, however, raise the question of how the data and the workload should be partitioned and distributed. A uniform scheduling strategy — a uniform distribution of computation load among available machines — typically used by stream processing systems, disregards network load as one of the major bottlenecks for throughput, resulting in an immense load in terms of inter-machine communication. In this paper we propose a graph-partitioning-based approach for workload scheduling within stream processing systems. We implemented a distributed triple-stream processing engine on top of the Storm realtime computation framework and evaluate its communication behavior using two real-world datasets. We show that the application of graph partitioning algorithms can decrease inter-machine communication substantially (by 40% to 99%) whilst maintaining an even workload distribution, even using very limited data statistics. We also find that processing RDF data as single triples at a time rather than as graph fragments (containing multiple triples) may decrease throughput, indicating the usefulness of semantics.
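A minimal sketch of the quantity such a scheduler minimizes, assuming per-edge message counts and a vertex-to-machine assignment; all names and numbers are illustrative:

    def inter_machine_traffic(edges, assignment):
        """Total message volume crossing machine boundaries.
        edges: iterable of (src_vertex, dst_vertex, n_messages);
        assignment: map from vertex to machine id."""
        return sum(n for (u, v, n) in edges if assignment[u] != assignment[v])

    # A toy graph with one heavy edge inside each cluster and a light bridge:
    edges = [("a", "b", 100), ("b", "c", 5), ("c", "d", 100)]
    uniform = {"a": 0, "b": 1, "c": 0, "d": 1}        # load-balanced, cut-oblivious
    partitioned = {"a": 0, "b": 0, "c": 1, "d": 1}    # balanced AND cut-aware
    print(inter_machine_traffic(edges, uniform))      # 205
    print(inter_machine_traffic(edges, partitioned))  # 5

A uniform scheduler ignores this quantity; a graph-partitioning scheduler picks the assignment that minimizes it while keeping the per-machine load even.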
Philip Stutz, Coralia-Mihaela Verman, Lorenz Fischer, Abraham Bernstein, TripleRush: a fast and scalable triple store, In: 9th International Workshop on Scalable Semantic Web Knowledge Base Systems, CEUR Workshop Proceedings, http://ceur-ws.org, Aachen, Germany, 2013-10-21. (Conference or Workshop Paper published in Proceedings)
 
TripleRush is a parallel in-memory triple store designed to address the need for efficient graph stores that answer queries over large-scale graph data quickly. To that end it leverages a novel, graph-based architecture. Specifically, TripleRush is built on our parallel and distributed graph processing framework Signal/Collect. The index structure is represented as a graph where each index vertex corresponds to a triple pattern. Partially matched copies of a query are routed in parallel along different paths of this index structure. We show experimentally that TripleRush takes less than a third of the time to answer queries compared to the fastest of three state-of-the-art triple stores, when measuring time as the geometric mean of all queries for two benchmarks. On individual queries, TripleRush is up to three orders of magnitude faster than other triple stores.
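A conceptual sketch of the matching semantics, using plain dictionaries instead of the actual Signal/Collect index vertices and message routing; all names are illustrative:

    def match(pattern, triples):
        """Yield variable bindings for one triple pattern (variables start with '?')."""
        for triple in triples:
            binding = {}
            for q, t in zip(pattern, triple):
                if q.startswith("?"):
                    if binding.setdefault(q, t) != t:
                        break
                elif q != t:
                    break
            else:
                yield binding

    def answer(query, triples):
        """Match patterns left to right, forking one partial query copy per
        match, in the spirit of TripleRush's parallel routing of query copies."""
        partials = [{}]
        for pattern in query:
            nxt = []
            for partial in partials:
                grounded = tuple(partial.get(x, x) for x in pattern)
                for binding in match(grounded, triples):
                    nxt.append({**partial, **binding})
            partials = nxt
        return partials

    data = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
    print(answer([("?x", "knows", "?y"), ("?y", "knows", "?z")], data))
    # [{'?x': 'alice', '?y': 'bob', '?z': 'carol'}]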
Genc Mazlami, Scaling message passing algorithms for Distributed Constraint Optimization Problems in Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
The concept of Distributed Constraint Optimization Problems (DCOPs) is becoming more and more relevant to research in fields such as computer science, engineering, game theory and others. Many real-world problems, such as congestion management in data communication or traffic, and applications on sensor networks, are potential application fields for DCOPs. Hence, there is a need for research on different algorithms and approaches to this class of problems.
This thesis considers the evaluation and distribution of the Max-Sum algorithm. Specifically, the thesis first illustrates a detailed example computation of the algorithm in order to contribute to the understanding of the algorithm. The main contribution of the thesis is the implementation of the Max-Sum algorithm in the novel graph processing framework Signal/Collect. Also, a theoretical complexity analysis of said implementation is performed. Based on the implementation, a second contribution of the thesis follows: the benchmarking of the Max-Sum algorithm and its comparison to the DSA-A, DSA-B and Best-Response algorithms. The benchmarks first try to reproduce the results found in [Farinelli et al., 2008] by analyzing the conflicts over the execution cycles and the cycles until convergence. Then, the thesis contributes new empirical results by evaluating and comparing synchronous and asynchronous Max-Sum with respect to conflicts over time and time to convergence. The analysis of the relation between execution cycles and execution time is also part of this novel contribution.
Another main contribution of the thesis is the distributed evaluation of the algorithm on a multi-machine cluster. The benchmarks on multiple machines first compare the solution quality of asynchronous and synchronous Max-Sum on multiple machines. This is followed by an analysis of how the number of machines used in the execution impacts the results for the conflicts over time. The thesis also addresses performance questions raised by the theoretical complexity analysis by analyzing the influence of the average vertex degree on the solution quality.
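A minimal sketch of the factor-to-variable update at the heart of Max-Sum, assuming an explicit utility function per factor; the names and encodings are illustrative, not the thesis's Signal/Collect implementation:

    import itertools

    def factor_to_variable(variables, utility, var, domains, incoming):
        """Max-Sum factor-to-variable message r_{f->x}(v): maximize the factor's
        utility plus incoming variable-to-factor messages over all assignments
        of the factor's other variables. incoming: {other_var: {value: msg}}."""
        others = [w for w in variables if w != var]
        message = {}
        for v in domains[var]:
            best = float("-inf")
            for combo in itertools.product(*(domains[w] for w in others)):
                assignment = dict(zip(others, combo))
                assignment[var] = v
                score = utility(assignment) + sum(incoming[w][assignment[w]] for w in others)
                best = max(best, score)
            message[v] = best
        return message

    # A graph-coloring style constraint between two variables: utility 0 if
    # the colors differ, -1 if they clash. With zero incoming messages:
    domains = {"x": [0, 1], "y": [0, 1]}
    clash = lambda a: 0 if a["x"] != a["y"] else -1
    print(factor_to_variable(("x", "y"), clash, "x", domains, {"y": {0: 0.0, 1: 0.0}}))
    # {0: 0.0, 1: 0.0}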
Markus Christen, Florian Faller, Ulrich Götz, Cornelius Müller, Serious Moral Games in Bioethics, In: Workshop on “Ubiquitous games and gamification for promoting behavior change and wellbeing”, 2013-09-16. (Conference or Workshop Paper published in Proceedings)
 
Damian Schärli, Coordinating large crowds in a complex language-translation task, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
The number of Human Computation Systems (HCS), in which people and computers cooperate in complex solution processes, has increased massively in recent years and opens up great opportunities. Research and practical work in recent years confirm the benefits of this collaboration, which aims at solving difficult problems with the support of people. In this work, we show how the execution of such processes is simplified by a workflow engine that coordinates the processes. An example demonstrates the simplicity and strength of this framework. The evaluation in this thesis shows that the framework is fault-tolerant and scalable and can be used to implement processes with little effort.
Livio Hobi, Benchmarking Big Data Streams: Joins in Informationsstrom-Verarbeitungssystemen, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
This thesis investigates an approach for systematically stress-testing information flow processing (IFP) systems with sequential joins. The main contribution of this work is not only the evaluation of a five-way sequential join but also the provided methodical approach, which covers the properties and challenges of joins in the context of IFP systems, the data preparation, and the statistical evaluation based on defined key performance indicators. This approach can simplify future work. Furthermore, this work emphasizes the immense importance of knowing and selecting the dataset for a benchmark. The developer of a benchmark needs to know some statistics about the dataset and then choose the parameters that fit the defined requirements.
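A minimal sketch of the kind of operator under test, assuming one interleaved, time-ordered stream; chaining four such operators yields a five-way sequential join. The event encoding is an illustrative assumption, not the thesis's setup:

    from collections import deque

    def windowed_join(events, key_left, key_right, window):
        """Symmetric windowed join over a time-ordered stream of
        (side, timestamp, record) events, side in {'L', 'R'}. Emits a pair
        whenever a new event joins a buffered event from the other side
        that arrived within `window` time units."""
        buffers = {"L": deque(), "R": deque()}
        keys = {"L": key_left, "R": key_right}
        for side, ts, rec in events:
            other = "R" if side == "L" else "L"
            while buffers[other] and buffers[other][0][0] < ts - window:
                buffers[other].popleft()  # evict expired tuples
            for _, other_rec in buffers[other]:
                if keys[side](rec) == keys[other](other_rec):
                    yield (rec, other_rec) if side == "L" else (other_rec, rec)
            buffers[side].append((ts, rec))

    # Joining purchases to clicks on user id within 10 time units:
    events = [("L", 1, ("u1", "click")), ("R", 5, ("u1", "buy")), ("L", 20, ("u1", "click"))]
    print(list(windowed_join(events, lambda r: r[0], lambda r: r[0], window=10)))
    # [(('u1', 'click'), ('u1', 'buy'))]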
Markus Christen, The neuroethical challenges of brain simulations, In: Meeting of the International Association for Computing and Philosophy, 2013-07-15. (Conference or Workshop Paper published in Proceedings)
 
Thomas Keller, Graph partitioning for Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
Signal/Collect is a vertex-centric programming model and framework for graph processing. The communication between the vertices of a graph impairs the performance of distributed framework executions due to message serialization and network limitations. In this thesis, Signal/Collect is extended to support algorithms that reduce the number of remote messages by moving vertices between compute nodes during the computation. Several algorithms are evaluated and the best-performing candidate is discussed in detail. The evaluation results indicate an improvement of the runtime performance in one of two cases. However, the performed evaluations are not sufficient to draw final conclusions about the implemented approach.
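One plausible flavor of such an algorithm, sketched minimally: move a vertex to the machine that hosts most of its communication volume. This is an illustrative heuristic under assumed inputs, not necessarily the thesis's best-performing candidate:

    def migration_target(vertex, neighbors, location, traffic):
        """Return the machine hosting the largest share of `vertex`'s message
        traffic; migrating there removes those messages from the network.
        location: vertex -> machine id; traffic: (vertex, neighbor) -> messages.
        A real implementation would also respect per-machine load limits."""
        volume = {}
        for n in neighbors:
            volume[location[n]] = volume.get(location[n], 0) + traffic[(vertex, n)]
        return max(volume, key=volume.get)

    # Vertex v talks mostly to vertices on machine 1, so it should move there:
    location = {"a": 0, "b": 1, "c": 1}
    traffic = {("v", "a"): 2, ("v", "b"): 10, ("v", "c"): 7}
    print(migration_target("v", ["a", "b", "c"], location, traffic))  # 1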
Robin Hafen, Benchmarking algorithms for distributed constraint optimization problems in Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
Many real-world problems, such as network congestion control, can be mapped to the concept of a distributed constraint optimization problem (DCOP). By analyzing a class of DCOP algorithms known as local iterative approximate best-response (LIBR) algorithms, [Chapman et al., 2011b] constructed a framework enabling the study and modular design of new hybrid algorithms. In [Chapman et al., 2011a], several classical as well as new hybrid algorithms were benchmarked in a series of graph coloring experiments. It was found that the modular approach to algorithm design allowed the creation of new, better-performing algorithms.
In this thesis a similar approach was taken: selected existing LIBR algorithms, such as the distributed stochastic algorithm and distributed simulated annealing, were implemented and benchmarked using the graph processing framework Signal/Collect [Stutz et al., 2010], with which no such benchmark had been conducted before. As a further contribution, an existing, non-distributed algorithm from the computer science literature called tabu search [Nurmela, 1993] was modularized and distributed in the same manner.
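A minimal sketch of one round of the distributed stochastic algorithm from a single agent's perspective; the DSA-A/DSA-B variants differ in how ties and zero-conflict states are treated, a detail this sketch glosses over:

    import random

    def dsa_step(current, neighbor_values, domain, conflicts, p):
        """One DSA round for a single agent: compute a best response to the
        neighbors' current values and adopt it with activation probability p
        if it does not increase conflicts.
        conflicts(value, neighbor_values) counts violated constraints."""
        best = min(domain, key=lambda v: conflicts(v, neighbor_values))
        if conflicts(best, neighbor_values) <= conflicts(current, neighbor_values) \
                and random.random() < p:
            return best
        return current

    # Graph coloring: a conflict is a neighbor holding the same color.
    same_color = lambda v, ns: sum(1 for n in ns if n == v)
    print(dsa_step(current=0, neighbor_values=[0, 1], domain=[0, 1, 2],
                   conflicts=same_color, p=1.0))  # 2 (a conflict-free color)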
Markus Christen, Deborah A Vitacco, Lara Huber, Julie Harboe, Sara I Fabrikant, Peter Brugger, Colorful brains: 14 years of display practice in functional neuroimaging, NeuroImage, Vol. 73, 2013. (Journal Article)
 
Neuroimaging results are typically graphically rendered and color-coded, which influences the process of knowledge generation within neuroscience as well as the public perception of brain research. Analyzing these issues requires empirical information on the display practice in neuroimaging. In our study we evaluated more than 9000 functional images (fMRI and PET) published between 1996 and 2009 with respect to the use of color, image structure, image production software and other factors that may determine the display practice. We demonstrate a variety of display styles despite a remarkable dominance of a few image production sites and software systems, outline some tendencies of standardization, and identify shortcomings with respect to color scale explication in neuroimages. We discuss the importance of these findings for knowledge production in neuroimaging, and we make suggestions to improve the display practice in neuroimaging, especially regarding regimes of color coding.
Mengia Zollinger, Cosmin Basca, Abraham Bernstein, Market-based SPARQL brokerage: Towards Economic Incentives for Linked Data Growth, In: Extended Semantic Web Conference 2013 (ESWC2013). 2013. (Conference Presentation)
 
The growth of the Web of Data (WoD) has primarily been funded by subsidies. New datasets are financed via public funding or research programs. This may eventually limit the growth and could hamper data quality for lack of clear incentives. We propose MaTriX, a market-based SPARQL broker over the WoD, as an economically viable growth option. Similar to others, queries are associated with a budget and a minimal result-set quality. The broker then employs auction mechanisms to find a set of data providers that jointly deliver the results. Preliminary results show that mixing free and commercial providers exhibits superior consumer surplus, producer profit, total welfare, and recall.
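One way such a broker could select providers, sketched as a toy greedy heuristic; the actual auction mechanisms of MaTriX are not specified here, and all names and numbers are made up:

    def select_providers(bids, budget, min_quality):
        """Toy 'reverse auction': add providers in order of price per unit of
        advertised quality until the minimal result-set quality is reached,
        without exceeding the query's budget. Returns (providers, cost) or
        None if infeasible under this heuristic."""
        chosen, cost, quality = [], 0.0, 0.0
        for provider, price, q in sorted(bids, key=lambda b: b[1] / max(b[2], 1e-9)):
            if cost + price <= budget:
                chosen.append(provider)
                cost += price
                quality += q
                if quality >= min_quality:
                    return chosen, cost
        return None

    bids = [("free-endpoint", 0.0, 0.4), ("commercial-a", 2.0, 0.5), ("commercial-b", 5.0, 0.6)]
    print(select_providers(bids, budget=3.0, min_quality=0.8))
    # (['free-endpoint', 'commercial-a'], 2.0) -- a free/commercial mix wins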
Thomas Scharrenbach, Jacopo Urbani, Alessandro Margara, Emanuele Della Valle, Abraham Bernstein, Seven commandments for benchmarking semantic flow processing systems, In: The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Springer, Berlin-Heidelberg, 2013-05-26. (Conference or Workshop Paper published in Proceedings)
 
Over the last few years, the processing of dynamic data has gained increasing attention in the Semantic Web community. This led to the development of several stream reasoning systems that enable on-the-fly processing of semantically annotated data that changes over time. Due to their streaming nature, analyzing such systems is extremely difficult. Currently, their evaluation is conducted under heterogeneous scenarios, which makes it hard to clearly compare them and to understand the benefits and limitations of each. In this paper, we strive for a better understanding of the key challenges that these systems must face and define a generic methodology to evaluate their performance. Specifically, we identify three Key Performance Indicators (KPIs) and seven commandments that specify how to design the stress tests for system evaluation.
Alessandro Rigamonti, Extending Rdfbox with named graphs support, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
The Semantic Web is a concept for extending the World Wide Web with semantic information in a way that computers are able to "understand" it. A technical realization is provided by the Resource Description Framework (RDF), in which information is stored uniformly in the form of triples (subject, predicate, object). Such data can be queried using the SPARQL Protocol And RDF Query Language (SPARQL). There are a number of native and relational-based indexing schemes for RDF. A read-optimized scheme is Hexastore, in which a separate index is created for each of the six triple permutations. Rdfbox is a tool for querying and storing RDF data. It extends the Hexastore indexing paradigm with update/write support by relying on B+Trees as the main physical index storage. SPARQL's GRAPH clause allows for querying the source of RDF data. The goal of this thesis is to extend Rdfbox with support for that GRAPH clause.
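The read path of such a six-permutation index can be sketched minimally: the bound positions of a pattern determine which permutation answers it with a single prefix scan. This is an illustrative simplification, not Rdfbox's actual internals:

    def choose_index(pattern):
        """Pick the Hexastore permutation whose key prefix covers the bound
        positions of a triple pattern (None marks an unbound position), so
        the pattern is answered by one prefix/range scan."""
        bound = "".join(n for n, v in zip("spo", pattern) if v is not None)
        free = "".join(n for n, v in zip("spo", pattern) if v is None)
        return bound + free

    print(choose_index(("alice", None, None)))   # 'spo'
    print(choose_index((None, "knows", None)))   # 'pso'
    print(choose_index((None, "knows", "bob")))  # 'pos'

Supporting the GRAPH clause then amounts to carrying a fourth (graph) component through the keys and the index selection.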
Daniel Spicar, Extending Rdfbox with distributed RDF management: efficient RDF indexing and loading, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
The Semantic Web is growing at a fast pace and single machines can reach the limits of their capabilities when trying to manage RDF data sets that contain billions of triples. In order to tackle this problem, this thesis extends Rdfbox, a Hexastore-based semantic database management system, with support for distributed index backends, a parallelised query execution engine and a distributed index loader. Evaluations show that the distributed index loader outperforms previous approaches and that the improved query execution engine is performance-critical with distributed indices. However, limits to the efficiency of distributed indices in Rdfbox are discovered; overcoming them may require fundamental changes in the approach to query resolution in order to achieve a significant performance increase.
Katharina Reinecke, Minh Khoa Nguyen, Abraham Bernstein, Michael Näf, Krzysztof Z Gajos, Doodle around the world: Online scheduling behavior reflects cultural differences in time perception and group decision-making, In: 2013 ACM Conference on Computer Supported Cooperative Work (CSCW 2013), ACM, New York, NY, 2013-02-23. (Conference or Workshop Paper published in Proceedings)
 
Event scheduling is a group decision-making process in which social dynamics influence people’s choices and the overall outcome. As a result, scheduling is not simply a matter of finding a mutually agreeable time, but a process that is shaped by social norms and values, which can vary highly between countries. To investigate the influence of national culture on people’s scheduling behavior we analyzed more than 1.5 million Doodle date/time polls from 211 countries. We found strong correlations between characteristics of national culture and several behavioral phenomena, such as poll participants from collectivist countries responding earlier, agreeing to fewer options, but finding more consensus than participants from predominantly individualist societies. Our study provides empirical evidence of behavioral differences in group decision-making and time perception with implications for cross-cultural collaborative work.
Nico Rutishauser, Real-time crowd-based subtitle translation, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
Patrick Winzenried, Evaluation of quality assurance mechanisms in paid crowdsourcing, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
 
Computer systems still struggle with tasks such as categorizing photos, translating phrases or verifying collected data. Crowdsourcing platforms like Amazon’s Mechanical Turk are an ideal place to solve such challenging assignments by using the collective power of human beings who are willing to work for a small amount of money. However, not every individual provides perfect answers: some try to maximize their income by cheating, while others may misunderstand the task or lack sufficient skills to solve it properly. In order to maintain a certain level of quality, it is necessary to apply some sort of quality assurance mechanism. This thesis compares two such mechanisms by applying them to three different use cases and presents results about their performance in terms of quality, time and costs.
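A minimal sketch of one generic quality assurance mechanism, majority voting over redundant worker answers; the abstract does not name the two mechanisms actually compared, so this is purely illustrative:

    from collections import Counter

    def majority_vote(answers, min_agreement=0.5):
        """Accept the most frequent worker answer if its share of all answers
        exceeds min_agreement; otherwise signal that the task needs more work
        (e.g., reissuing it to further workers)."""
        answer, votes = Counter(answers).most_common(1)[0]
        return answer if votes / len(answers) > min_agreement else None

    print(majority_vote(["cat", "cat", "dog"]))  # 'cat' (2/3 agreement)
    print(majority_vote(["cat", "dog"]))         # None (no majority)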
Thomas Ritter, Extending Rdfbox with centralized RDF management: efficient RDF indexing and loading, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Master's Thesis)
 
Rdfbox is a native triple store and implements the Hexastore indexing concept. Hexastore permutes the RDF triple elements (subject, predicate, object) to build six indexes. Rdfbox takes this indexing scheme and maps it to key-value storage. Prior to this work, it used the Tokyo Cabinet key-value store as its indexing backend. This work extends it with Kyoto Cabinet, LevelDB and Redis backends and makes it possible to exchange them easily, even on a per-index level. Further, Rdfbox's triple loading is analyzed, improved and adapted to the added backends. The backends' performance is then compared and analyzed with synthetic and real-world data sets and queries.
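The write path can be sketched minimally: one triple fans out into six permutation keys, each of which may go to a different key-value backend. The tab-joined key encoding is an illustrative assumption, not Rdfbox's actual layout:

    def index_entries(triple):
        """Map one RDF triple to its six Hexastore permutation keys; each
        entry is destined for a (possibly different) key-value backend."""
        s, p, o = triple
        permutations = {"spo": (s, p, o), "sop": (s, o, p), "pso": (p, s, o),
                        "pos": (p, o, s), "osp": (o, s, p), "ops": (o, p, s)}
        return {index: "\t".join(key) for index, key in permutations.items()}

    # One backend per index, as the per-index exchangeability allows, e.g.:
    # backends = {"spo": leveldb_store, "pos": redis_store, ...}
    for index, key in index_entries(("alice", "knows", "bob")).items():
        print(index, key)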