Philip Stutz, Daniel Strebel, Abraham Bernstein, Signal/Collect: processing large graphs in seconds, Semantic Web, Vol. 7 (2), 2016. (Journal Article)
 
Both researchers and industry are confronted with the need to process increasingly large amounts of data, much of which has a natural graph representation. Some use MapReduce for scalable processing, but this abstraction is not designed for graphs and has shortcomings when it comes to both iterative and asynchronous processing, which are particularly important for graph algorithms. This paper presents the Signal/Collect programming model for scalable synchronous and asynchronous graph processing. We show that this abstraction can capture the essence of many algorithms on graphs in a concise and elegant way by giving Signal/Collect adaptations of algorithms that solve tasks as varied as clustering, inferencing, ranking, classification, constraint optimisation, and even query processing. Furthermore, we built and evaluated a parallel and distributed framework that executes algorithms in our programming model. We empirically show that our framework efficiently and scalably parallelises and distributes algorithms that are expressed in the programming model. We also show that asynchronicity can speed up execution times. Our framework can compute a PageRank on a large (>1.4 billion vertices, >6.6 billion edges) real-world graph in 112 seconds on eight machines, which is competitive with other graph processing approaches.
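The vertex-centric model the abstract describes can be illustrated with a minimal, single-threaded PageRank sketch (an illustration only, not the authors' distributed Scala framework; the example graph, damping factor, and iteration count are invented):

```python
# Minimal single-threaded sketch of the Signal/Collect model for PageRank.
# Each vertex "signals" rank mass along its outgoing edges and "collects"
# incoming signals into a new state.

def pagerank_signal_collect(edges, damping=0.85, iterations=50):
    vertices = {v for edge in edges for v in edge}
    out = {v: [] for v in vertices}
    for src, dst in edges:
        out[src].append(dst)
    state = {v: 1.0 - damping for v in vertices}   # baseline rank

    for _ in range(iterations):
        # Signal phase: split each vertex's rank over its outgoing edges.
        inbox = {v: 0.0 for v in vertices}
        for v in vertices:
            for w in out[v]:
                inbox[w] += damping * state[v] / len(out[v])
        # Collect phase: aggregate incoming signals into the new state.
        state = {v: (1.0 - damping) + inbox[v] for v in vertices}
    return state

ranks = pagerank_signal_collect([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

At a fixed point this satisfies the usual unnormalised PageRank recurrence, rank(v) = 0.15 + 0.85 · Σ rank(u)/outdegree(u); the signal and collect phases are the two halves of that update.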
Matthias Klusch, Patrick Kapahnke, Stefan Schulte, Freddy Lecue, Abraham Bernstein, Semantic web service search: a brief survey, Künstliche Intelligenz (KI), Vol. 30 (2), 2016. (Journal Article)
 
Scalable means for the search of relevant web services are essential for the development of intelligent service-based applications in the future Internet. The key idea of semantic web services is to enable such applications to perform high-precision search and automated composition of services based on formal, ontology-based representations of service semantics. In this paper, we briefly survey the state of the art of semantic web service search.
Inhalt. Perspektiven einer categoria non grata im philologischen Diskurs, Edited by: Christoph Steier, Daniel Alder, Markus Christen, Jeannine Hauser, Königshausen und Neumann, Würzburg, 2015-12-19. (Edited Scientific Work)

Markus Christen, The Ethics of Neuromodulation-Induced Behavior Changes, University of Zurich, Faculty of Economics, 2015. (Habilitation)
 
Michael Feldman, Massively Collaborative Complex Work — Exploring the Frontiers of Crowdsourcing, In: Doctoral Consortium of the 36th International Conference on Information Systems (ICIS). Fort Worth, US., 2015. (Conference or Workshop Paper)

Cristian Anastasiu, Collaborative Data Analysis in a Crowdsourcing Environment Using Jupyter Notebook, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Master's Thesis)
 
The availability of data is growing faster than the availability of experts with the skill set needed to interpret it. Finding competent experts for data analysis tasks is becoming increasingly challenging due to the variety of required skills. It is well known that data preparation and filtering steps take a considerable amount of processing time in machine-learning problems [Kotsiantis et al., 2006]. Business and academic settings expect analysts to be proficient not only in their domain of interest, but also in core analysis disciplines such as statistics, computing, software engineering, and algorithms. Data analysis routines in these domains span multiple disciplines, and the individuals carrying them out are subject to many biases stemming from their personal traits and backgrounds, which may cause errors.
This thesis proposes a collaborative data analysis framework based on Jupyter Notebook that allows structured data analysis tasks to be distributed as a collaborative process to a group of people with diverse abilities and knowledge. Our evaluations showed that data analysis tasks, especially the pre-processing steps, can be distributed to non-expert workers: every member contributes a small fragment of the required knowledge and, taken together, the group can apply its collective intelligence to successful data analytics. Specifically, the goal of this work is to contribute to this field by discussing and implementing a framework that structures data analysis as a collaborative, distributed process accessible to people with a diverse set of skills.
Lorenz Fischer, Abraham Bernstein, Workload Scheduling in Distributed Stream Processors using Graph Partitioning, In: 2015 IEEE International Conference on Big Data (IEEE BigData 2015), IEEE Computer Society, 2015-10-29. (Conference or Workshop Paper published in Proceedings)
 
With ever increasing data volumes, large compute clusters that process data in a distributed manner have become prevalent in industry. For distributed stream processing platforms (such as Storm), the question of how to distribute workload to the available machines has important implications for the overall performance of the system.
We present a workload scheduling strategy that is based on a graph partitioning algorithm. The scheduler is application agnostic: it collects the communication behavior of running applications and creates the schedules by partitioning the resulting communication graph using the METIS graph partitioning software. As we build upon graph partitioning algorithms that have been shown to scale to very large graphs, our approach can cope with topologies with millions of tasks. While the experiments in this paper assume static data loads, our approach could also be used in a dynamic setting.
We implemented our proposed algorithm for the Storm stream processing system and evaluated it on a commodity cluster with up to 80 machines. The evaluation was conducted on four different use cases – three using synthetic data loads and one application that processes real data.
We compared our algorithm against two state-of-the-art scheduler implementations and show that our approach offers significant improvements in terms of resource utilization, enabling higher throughput at reduced network loads. We show that these improvements can be achieved while maintaining a balanced workload in terms of CPU usage and bandwidth consumption across the cluster. We also found that the performance advantage increases with message size, providing an important insight for stream-processing approaches based on micro-batching.
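The core idea of scheduling-by-partitioning can be sketched as follows. This is not the paper's METIS-based implementation: it uses a naive greedy-swap heuristic on an invented three-edge communication graph, and it only bisects the tasks onto two machines.

```python
# Illustrative sketch of scheduling by graph partitioning (the paper uses
# METIS; this toy version does greedy balanced swaps instead). The
# communication graph maps task pairs to observed message counts, and the
# goal is to co-locate heavily communicating tasks.

def schedule(comm, tasks, passes=10):
    # Start from a balanced round-robin assignment to two machines.
    assign = {t: i % 2 for i, t in enumerate(tasks)}

    def cut(a):
        # Total traffic crossing the machine boundary under assignment a.
        return sum(w for (u, v), w in comm.items() if a[u] != a[v])

    # Greedy local search: swap a pair of tasks on different machines
    # whenever the swap reduces the cut; swapping keeps the partition balanced.
    for _ in range(passes):
        improved = False
        for u in tasks:
            for v in tasks:
                if assign[u] != assign[v]:
                    trial = dict(assign)
                    trial[u], trial[v] = trial[v], trial[u]
                    if cut(trial) < cut(assign):
                        assign, improved = trial, True
        if not improved:
            break
    return assign

# Invented topology: spout -> parse -> count -> sink with message counts.
comm = {("spout", "parse"): 100, ("parse", "count"): 90, ("count", "sink"): 5}
plan = schedule(comm, ["spout", "parse", "count", "sink"])
```

On this toy graph the heuristic co-locates `spout` with `parse` and `count` with `sink`, cutting only the cheapest edge; METIS performs the same kind of cut minimisation at the scale of millions of tasks.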
Christian Ineichen, Markus Christen, Analyzing 7000 texts on Deep Brain Stimulation: what do they tell us?, Frontiers in Integrative Neuroscience, Vol. 9 (52), 2015. (Journal Article)
 
The enormous increase in the number of scientific publications in recent decades requires quantitative methods for obtaining a better understanding of topics and developments in various fields. In this exploratory study, we investigate the emergence, trends, and connections of topics within the whole text corpus of the deep brain stimulation (DBS) literature, based on more than 7000 papers (titles and abstracts) published between 1991 and 2014, using a network approach. Taking the co-occurrence of basic terms that represent important topics within DBS as a starting point, we outline the statistics of interconnections between DBS indications, anatomical targets, positive and negative effects, as well as methodological, technological, and economic issues. This quantitative approach confirms known trends within the literature (e.g., regarding the emergence of psychiatric indications). The data also reflect an increased discussion about complex issues such as personality, tied tightly to the ethical context, as well as an apparent focus on depression as an important DBS indication, where the co-occurrence of terms related to negative effects is low both for the indication and for the related anatomical targets. We also discuss consequences of the analysis from a bioethical perspective, i.e., how such a quantitative analysis could uncover hidden subject matters that have ethical relevance. For example, we find that hardware-related issues in DBS are far more robustly connected to an ethical context than impulsivity, concrete side effects, or death/suicide. Our contribution also outlines the methodology of quantitative text analysis that combines statistical approaches with expert knowledge. It thus serves as an example of how innovative quantitative tools can be made useful for gaining a better understanding in the field of DBS.
Shen Gao, Thomas Scharrenbach, Jörg-Uwe Kietz, Abraham Bernstein, Running out of Bindings? Integrating Facts and Events in Linked Data Stream Processing, In: 4th International Workshop on Ordering and Reasoning, s.n., Aachen, Germany, 2015-10-11. (Conference or Workshop Paper published in Proceedings)
 
Processing streams of linked data has gained importance over the past years. In many cases the streams contain events generated by sensors, such as traffic control systems, or news releases. In response to this need, a number of languages and systems have been developed for processing linked data streams. These systems and languages follow one of two pertinent traditions: they perform either complex event processing or stream reasoning. However, both kinds of systems only support simulating system state as a sequence of events.
This paper proposes to model a new kind of data: Facts. Facts are temporal states, stored in the system, that aggregate events. Essentially, they trade space complexity for time complexity and reduce the number of intermediate variable bindings compared to other approaches. They also have the advantage of keeping queries relatively simple. In our evaluation, we compile queries for typical sensor-based use cases, expressed in TEF-SPARQL (our SPARQL extension supporting Facts), C-SPARQL, and EP-SPARQL, to the well-established Event Processing Language (EPL) running on the Esper complex event processing engine. We show that, compared to simulating Facts with events, modeling Facts directly creates less than 1% of the intermediate bindings and improves throughput by up to 4 times.
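The space-for-time trade-off between Facts and event sequences can be sketched as follows (a toy illustration, not TEF-SPARQL; the sensor name and reading loop are invented):

```python
# Illustrative contrast: a "fact" store keeps only the current temporal
# state per sensor, while an event log keeps every reading ever seen.

class FactStore:
    def __init__(self):
        self.current = {}   # sensor -> latest value (constant space per sensor)

    def on_event(self, sensor, value):
        self.current[sensor] = value  # the new reading supersedes the old fact

class EventLog:
    def __init__(self):
        self.events = []    # grows with every reading

    def on_event(self, sensor, value):
        self.events.append((sensor, value))

facts, log = FactStore(), EventLog()
for i in range(1000):
    facts.on_event("temp", i)
    log.on_event("temp", i)
```

A query joining against the fact store touches one binding per sensor, whereas simulating the same state over the event log produces one candidate binding per retained event, which is the source of the binding blow-up the abstract measures.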
Lorenz Fischer, Roi Blanco, Peter Mika, Abraham Bernstein, Timely Semantics: A Study of a Stream-based Ranking System for Entity Relationships, In: The 14th International Semantic Web Conference, Heidelberg, Germany, 2015-10-11. (Conference or Workshop Paper published in Proceedings)
 
In recent years, search engines have started presenting semantically relevant entity information together with document search results. Entity ranking systems are used to compute recommendations for related entities that a user might also be interested in exploring. Typically, this is done by ranking relationships between entities in a semantic knowledge graph using signals found in a data source as well as type annotations on the nodes and links of the graph. However, the process of producing these rankings can take a substantial amount of time. As a result, entity ranking systems typically lag behind real-world events and present relevant entities with outdated relationships to the search term, or even outdated entities that should be replaced with more recent relations or entities.
This paper presents a study using a real-world, stream-processing-based implementation of an entity ranking system to understand the effect of data timeliness on entity rankings. We describe the system and the data it processes in detail. Using a longitudinal case study, we demonstrate (i) that low-latency, large-scale entity relationship ranking is feasible using moderate resources and (ii) that stream-based entity ranking improves the freshness of related entities while maintaining relevance.
Markus Christen, Thema im Fokus „Urteilsfähigkeit“ – Ethische Kernfragen, Thema im Fokus : die Zeitschrift von Dialog Ethik, Vol. 2015 (Oktober), 2015. (Journal Article)
 
Markus Christen, Thomas Niederberger, Thomas Ott, Suleiman Aryobsei, Reto Hofstetter, Micro-text classification between small and big data, Nonlinear Theory and Its Applications, Vol. 6 (4), 2015. (Journal Article)
 
Micro-texts emerging from social media platforms have become an important source for research. Automated classification and interpretation of such micro-texts is challenging. The problem is exacerbated if the number of texts is at a medium level: too small for effective machine learning, but too big to be analyzed efficiently by humans alone. We present a semi-supervised learning system for micro-text classification that combines machine learning techniques with the unmatched human ability to make demanding, i.e., nonlinear, decisions based on sparse data. We compare our system with human performance and a predefined optimal classifier using a validated benchmark dataset.
Thomas Brenner, Modeling of User Preferences using graph-based Recommender Systems, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Master's Thesis)
 
Recommender systems have become an important tool for conquering the immense flood of information on the Internet. In recent years, the focus has shifted from merely increasing accuracy to improving user satisfaction by producing more diverse recommendations. This thesis seeks a deeper understanding of diversity and how users approach it. Users are assigned to two groups, diversity-seeking and non-diversity-seeking, and different ways of separating the groups are explained. In a second part, alterations to graph-based recommender systems are discussed, i.e., applying the tf-idf scheme and employing users' neighborhood relations. The separation of users into groups and the recommender system variations are evaluated, and a useful combination for optimizing the results according to a user's preferences is proposed. These new variations of recommender systems succeed in providing more accurate and, at the same time, more diverse recommendations for certain groups of users compared to state-of-the-art recommender systems.
David Arpad Pinezich, Crowdsourced recognition of recoil black holes, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Bachelor's Thesis)
 
This thesis focuses on the "Blackhole Chaser", a crowdsourcing platform prototype that helps to find new recoil black holes. First, the thesis gives a general overview of crowdsourcing and related research fields. It then explains how the "Blackhole Chaser" was built and how its architecture was planned. After that, it describes how different user groups (paid crowd workers, IT specialists, and professionals) act on the platform and whether their classifications on it coincide, given their individual cognitive skills, which were tested prior to using the platform with the renowned ETS testing framework. The thesis closes with a conclusion and a list of future tasks.
Sofia Orlova, Interactive Advertising Analytics, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Bachelor's Thesis)
 
Advertising is everywhere. Whether we are aware of it or not, we are exposed to between 2,500 and 10,000 ads daily. We are used to customized ads from Google Search, YouTube, and several other big players, but not every industry is that far yet. In fact, one of the oldest communication media, television, does not show customized ads yet: the demographic information needed for such customization simply was not available. Online television portals are gaining more and more users, who register with the necessary data, which makes their habits traceable. Given this development, tools that recognize the ads in live streams are required in order to use the demographic information of the users and propose customized ads to them. This thesis describes how such a tool was built; it compares video streams to ads based on their colour distributions. This comparison mechanism can be extended with further recognition features and combined with the extraction of demographic data.
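The colour-distribution comparison the abstract mentions can be sketched as follows (a generic illustration of the technique, not the thesis code; frames are represented as plain pixel lists and compared by histogram intersection):

```python
# Sketch of matching a stream frame against a known ad by comparing coarse
# colour histograms. A frame is a list of (r, g, b) pixels; channel values
# are binned so that the histogram stays small.

def colour_histogram(frame, bins=4):
    hist = {}
    step = 256 // bins
    for r, g, b in frame:
        key = (r // step, g // step, b // step)
        hist[key] = hist.get(key, 0) + 1
    total = len(frame)
    return {k: v / total for k, v in hist.items()}  # normalise to proportions

def similarity(h1, h2):
    # Histogram intersection: 1.0 means identical colour distributions.
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))

# Invented frames: a mostly red ad, a near-identical stream frame, and
# a completely different blue frame.
ad = [(200, 30, 30)] * 90 + [(250, 250, 250)] * 10
stream_hit = [(205, 35, 25)] * 85 + [(255, 255, 255)] * 15
stream_miss = [(20, 30, 200)] * 100

h_ad = colour_histogram(ad)
sim_hit = similarity(h_ad, colour_histogram(stream_hit))
sim_miss = similarity(h_ad, colour_histogram(stream_miss))
```

Coarse binning makes the comparison robust to small per-pixel variations (compression noise, minor colour shifts) while still separating frames with genuinely different colour distributions.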
Fabian Christoffel, Bibek Paudel, Chris Newell, Abraham Bernstein, Blockbusters and Wallflowers: Speeding up Diverse and Accurate Recommendations with Random Walks, In: 9th ACM Conference on Recommender Systems RecSys 2015, ACM Press, New York, NY, USA, 2015-09-16. (Conference or Workshop Paper published in Proceedings)
 
User satisfaction often depends on providing accurate and diverse recommendations. In this paper, we explore algorithms that exploit random walks as a sampling technique to obtain diverse recommendations without compromising on efficiency and accuracy. Specifically, we present a novel graph vertex ranking recommendation algorithm called RP3β that re-ranks items based on 3-hop random walk transition probabilities. We show empirically that RP3β provides accurate recommendations with high long-tail item frequency at the top of the recommendation list. We also present approximate versions of RP3β and of the two most accurate previously published vertex ranking algorithms based on random walk transition probabilities, and show that these approximations converge with an increasing number of samples.
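The 3-hop random-walk ranking behind RP3β can be sketched in a few lines (a simplified, exhaustive-enumeration illustration, not the authors' sampled implementation; the interaction data and β value are invented):

```python
# Sketch of the RP^3_beta idea: rank items for a user by the probability of
# 3-hop random walks user -> item -> user -> item on the bipartite
# interaction graph, then down-weight each item by popularity**beta so that
# long-tail items rise in the ranking.

from collections import defaultdict

def rp3beta(interactions, user, beta=0.6):
    users = defaultdict(list)   # user -> interacted items
    items = defaultdict(list)   # item -> interacting users
    for u, i in interactions:
        users[u].append(i)
        items[i].append(u)

    # Enumerate all 3-hop walks and accumulate their probabilities
    # (uniform transitions at every hop).
    scores = defaultdict(float)
    for i1 in users[user]:
        p1 = 1.0 / len(users[user])
        for u2 in items[i1]:
            p2 = p1 / len(items[i1])
            for i2 in users[u2]:
                scores[i2] += p2 / len(users[u2])

    # Penalise popular items and drop items the user already knows.
    seen = set(users[user])
    ranked = [(i, s / len(items[i]) ** beta) for i, s in scores.items()
              if i not in seen]
    return sorted(ranked, key=lambda pair: -pair[1])

recs = rp3beta([("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"),
                ("u3", "b"), ("u3", "c"), ("u3", "d")], "u1")
```

The exhaustive triple loop is cubic and only workable on toy data; the paper's approximate versions replace it with sampled random walks, which is what makes the approach fast on real graphs.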
Dmitry Moor, Tobias Grubenmann, Sven Seuken, Abraham Bernstein, A Double Auction for Querying the Web of Data, In: The Third Conference on Auctions, Market Mechanisms and Their Applications, ACM, New York, USA, 2015-09-08. (Conference or Workshop Paper published in Proceedings)
 
Lorenz Fischer, Shen Gao, Abraham Bernstein, Machines Tuning Machines: Configuring Distributed Stream Processors with Bayesian Optimization, In: 2015 IEEE International Conference on Cluster Computing (CLUSTER 2015), IEEE Computer Society, 2015-09-08. (Conference or Workshop Paper published in Proceedings)
 
Modern distributed computing frameworks such as Apache Hadoop, Spark, or Storm distribute the workload of applications across a large number of machines. Whilst they abstract away the details of distribution, they require the programmer to set a number of configuration parameters before deployment. These parameter settings usually have a substantial impact on execution efficiency. Finding the right values for these parameters is considered a difficult task and requires domain, application, and framework expertise.
In this paper, we propose a machine learning approach to the problem of configuring a distributed computing framework. Specifically, we propose using Bayesian Optimization to find good parameter settings. In an extensive empirical evaluation, we show that Bayesian Optimization can effectively find good parameter settings for four different stream processing topologies implemented in Apache Storm resulting in significant gains over a parallel linear approach.
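The Bayesian Optimization loop the paper applies can be sketched as follows (a toy, pure-Python illustration, not the paper's system: it fits a tiny Gaussian-process surrogate and maximizes an upper-confidence-bound acquisition over a one-dimensional, invented "throughput" objective):

```python
import math
import random

def rbf(a, b, ls=2.0):
    # Squared-exponential kernel on a one-dimensional parameter.
    return math.exp(-((a - b) ** 2) / (2 * ls * ls))

def solve(A, y):
    # Gauss-Jordan elimination with partial pivoting (small systems only).
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [M[r][k] - f * M[c][k] for k in range(n + 1)]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_posterior(xs, ys, x):
    # Gaussian-process posterior mean and variance at x (unit prior variance,
    # tiny diagonal jitter for numerical stability).
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (1e-6 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k = [rbf(xi, x) for xi in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    var = max(1e-9, 1.0 - sum(ki * bi for ki, bi in zip(k, solve(K, k))))
    return mean, var

def bayes_opt(objective, candidates, n_iter=15, kappa=2.0):
    xs = [random.choice(candidates)]
    ys = [objective(xs[0])]
    for _ in range(n_iter):
        # Upper-confidence-bound acquisition over the candidate grid:
        # prefer points with high predicted value or high uncertainty.
        def ucb(c):
            m, v = gp_posterior(xs, ys, c)
            return m + kappa * math.sqrt(v)
        x = max(candidates, key=ucb)
        xs.append(x)
        ys.append(objective(x))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

random.seed(0)
# Invented objective: "throughput" peaks when the parameter is set to 6.
best_x, best_y = bayes_opt(lambda p: -(p - 6) ** 2, list(range(1, 13)))
```

In the paper's setting, evaluating the objective means deploying a Storm topology with a candidate configuration and measuring its performance, so each evaluation is expensive; the surrogate lets the optimizer spend those evaluations only on promising settings.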
Basil Philipp, A Flexible Viewership Analytics System for Online TV, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Master's Thesis)
 
The technologies used by online television providers make it possible to collect significantly more information on viewer behaviour than traditional, panel-based measurements allow. The fragmented market and the large data volumes call for novel approaches to handle this data and turn it into valuable insights. We propose a system that can deal with multiple data sources and offers advanced analyses of the data. We demonstrate its capabilities with an exemplary market analysis, an audience-flow analysis, and a viewership prediction.
András Heé, Large-Scale Social Network Analysis with the igraph Toolbox and Signal/Collect, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Master's Thesis)
 
In recent years, processing huge graphs with millions and billions of vertices and edges has become feasible thanks to highly scalable distributed frameworks. However, current systems struggle to provide a high-level language abstraction that allows data scientists to express large-scale data analysis tasks. Our contribution has two main goals. Firstly, we build a generic network analysis toolbox (NAT) on top of Signal/Collect, a vertex-centric graph processing framework, to support integration into existing statistical and scientific programming environments; we deliver an interface to the popular network analysis tool igraph. Secondly, we address the challenge of porting social network analysis and graph exploration algorithms to the vertex-centric programming model, finding implementations that neither operate on adjacency-matrix representations of the graphs nor rely on global state.