Andrej Taliun, Michael Böhlen, Annelies Bracher, Francesco Cafagna, A GIS-based data analysis platform for analyzing the time-varying quality of animal feed and its impact on the environment, In: 6th International Congress on Environmental Modelling and Software, iEMSs 2012, 2012-07-01. (Conference or Workshop Paper published in Proceedings)
The Swiss Feed Data Warehouse is a public service for companies, farmers and research institutions that provides detailed and up-to-date information about the concentration of nutrients in animal feed from all across Switzerland. The core of the Swiss Feed Data Warehouse is a carefully curated data warehouse with more than 2 million chemical analyses of 600 feed types and 400 nutrients. The nutrient measurements are enriched with geographical (such as postal code and altitude), temporal (such as harvesting and analysis dates), biological (such as variety and botanical composition) and technical information (such as conservation and production methods). We propose a solution that offers users a fast, effective and intuitive approach to querying and analyzing large amounts of high-dimensional feed data. An interactive web application enables dynamic query construction and offers multiple charts to visualize feed data. Techniques such as radius search and charts allow non-expert users to detect local correlations and trends in the feed data. Historical information makes it possible for advanced users to analyze changes in feed quality and determine the balance of nutrients in feed mixes that minimizes the discharge of harmful nutrients from animal excreta into ecosystems. |
|
Luis Schüller, Simultaneous execution of single- and multi-version protocols with the oshiya debugger and analyzer, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Bachelor's Thesis)
The Oshiya Debugger and Analyzer (ODA) is a tool for debugging, visualizing and comparing concurrency control techniques. Prior to this work, ODA was limited to executing one single-version protocol at a time. The goal of this thesis was to extend ODA to support multi-version protocols as well as simultaneous protocol execution. Simultaneous protocol execution allows the user to directly compare different protocols with regard to performance and correctness criteria. ODA simulates the execution of protocols over user-provided workloads. Additional workload features are introduced to model sophisticated client behavior. |
|
Patrick Leibundgut, Design and implementation of a statistics and visualization engine for the Oshiya debugger and analyzer, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Master's Thesis)
This thesis extends the Oshiya Debugger and Analyzer (ODA) with statistics and visualizations. ODA is an application for comparing and analyzing scheduling protocols. The implemented statistics collect statistical measures that can be displayed in charts and tables. Users can extend these measures by defining their own statistic queries. In addition to the statistics, ODA has been extended to support breakpoints. With the concept of break and analyze queries, a user can set up breakpoints and further analyze the results. |
|
Johann Gamper, Michael Böhlen, Markus Innerebner, Scalable computation of Isochrones with network expiration, In: 24th International Conference on Scientific and Statistical Database Management, SSDBM 2012, Springer, Heidelberg, 2012-06-25. (Conference or Workshop Paper published in Proceedings)
An isochrone in a spatial network is the possibly disconnected set of all locations from where a query point is reachable within a given time span and by a given arrival time. In this paper we propose an efficient and scalable evaluation algorithm, termed MINEX, for the computation of isochrones in multimodal spatial networks with different transportation modes. The space complexity of MINEX is independent of the network size and its runtime is determined by the incremental loading of the relevant network portions. We show that MINEX is optimal in the sense that only those network portions are loaded that eventually will be part of the isochrone. To keep the memory requirements low, we eagerly expire the isochrone and only keep in memory the minimal set of expanded vertices that is necessary to avoid cyclic expansions. The concept of expired vertices reduces MINEX's memory requirements from O(|Viso|) to O(√|Viso|) for grid and O(1) for spider networks, respectively. We show that an isochrone does not contain sufficient information to identify expired vertices, and propose an efficient solution that counts for each vertex the outgoing edges that have not yet been traversed. A detailed empirical study confirms the analytical results on synthetic data and shows that for real-world data the memory requirements are very small indeed, which makes the algorithm scalable for large networks and isochrones. |
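The expiration idea from the abstract can be sketched in a few lines. The following is a simplified forward-search analogue, not the paper's MINEX algorithm (which expands multimodal networks and counts outgoing edges in its own traversal direction): a budget-bounded Dijkstra expansion in which an expanded vertex is dropped from memory as soon as all edges leading to it have been traversed, since no later expansion step can reach it again. All names and the graph encoding are illustrative assumptions.

```python
import heapq

def isochrone_with_expiration(graph, source, budget):
    """Bounded Dijkstra from `source` with vertex expiration: an
    expanded vertex leaves the in-memory set `closed` once all its
    incoming edges have been traversed, since nothing can re-reach it.
    graph: dict mapping vertex -> list of (neighbor, cost) pairs.
    Returns (reachable vertex set, peak size of the in-memory set)."""
    remaining = {v: 0 for v in graph}          # untraversed incoming edges
    for u in graph:
        for v, _ in graph[u]:
            remaining[v] = remaining.get(v, 0) + 1
    dist = {source: 0}
    iso = set()        # result: all vertices reachable within budget
    closed = set()     # memory-resident expanded vertices
    peak = 0
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u in iso:                           # stale queue entry
            continue
        iso.add(u)
        closed.add(u)
        peak = max(peak, len(closed))
        for v, w in graph.get(u, []):
            remaining[v] -= 1                  # edge (u, v) now traversed
            nd = d + w
            if nd <= budget and nd < dist.get(v, float("inf")) and v not in iso:
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
            if remaining[v] == 0 and v in closed:
                closed.discard(v)              # expire v
        if remaining.get(u, 0) == 0:
            closed.discard(u)                  # all edges into u traversed
    return iso, peak
```

On a small acyclic graph the in-memory set stays tiny even though the full isochrone grows, which is the effect the paper quantifies for grid and spider networks.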
|
Anton Dignös, Michael Hanspeter Böhlen, Johann Gamper, Temporal alignment, In: ACM SIGMOD 2012 international conference on Management of Data, ACM, New York, NY, USA, 2012-05-20. (Conference or Workshop Paper published in Proceedings)
In order to process interval timestamped data, the sequenced semantics has been proposed. This paper presents a relational algebra solution that provides native support for the three properties of the sequenced semantics: snapshot reducibility, extended snapshot reducibility, and change preservation. We introduce two temporal primitives, temporal splitter and temporal aligner, and define rules that use these primitives to reduce the operators of a temporal algebra to their nontemporal counterparts. Our solution supports the three properties of the sequenced semantics through interval adjustment and timestamp propagation. We have implemented the temporal primitives and reduction rules in the kernel of PostgreSQL to get native database support for processing interval timestamped data. The support is comprehensive and includes outer joins, antijoins, and aggregations with predicates and functions over the time intervals of argument relations. The implementation and empirical evaluation confirm the effectiveness and scalability of our solution, which leverages existing database query optimization techniques. |
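As an illustration of what a temporal splitter does, the following hedged sketch splits one tuple's interval at the boundaries of all overlapping intervals from another relation, so that the resulting sub-intervals can be processed by a nontemporal operator. It uses half-open intervals and is not the PostgreSQL kernel implementation described in the paper; the function name and encoding are assumptions.

```python
def split_tuple(interval, others):
    """Split `interval` = (ts, te) at the boundaries of every interval
    in `others` that overlaps it, returning the resulting sub-intervals
    in order. Intervals are half-open: [ts, te)."""
    ts, te = interval
    points = {ts, te}
    for ots, ote in others:
        if ots < te and ote > ts:                     # overlap test
            points.update(p for p in (ots, ote) if ts < p < te)
    pts = sorted(points)
    return [(a, b) for a, b in zip(pts, pts[1:])]
```

For example, splitting [1, 10) against [3, 5) and [8, 12) yields the four adjusted sub-intervals [1, 3), [3, 5), [5, 8), [8, 10), which a nontemporal join or aggregation can then match exactly.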
|
Claudio Jossen, Lukas Blunschi, Magdalini Mori, Donald Kossmann, Kurt Stockinger, The Credit Suisse meta-data warehouse, In: 28th IEEE International Conference on Data Engineering (ICDE), 2012-04-01. (Conference or Workshop Paper published in Proceedings)
This paper describes the meta-data warehouse of Credit Suisse, which has been in production since 2009. Like most other large organizations, Credit Suisse has a complex application landscape and several data warehouses in order to meet the information needs of its users. The problem addressed by the meta-data warehouse is to increase the agility and flexibility of the organization with regard to changes such as the development of a new business process, a new business analytics report, or the implementation of a new regulatory requirement. The meta-data warehouse supports these changes by providing services to search for information items in the data warehouses and to extract the lineage of information items. One difficulty in the design of such a meta-data warehouse is that there is no standard or well-known meta-data model that can be used to support such search services. Instead, the meta-data structures need to be flexible themselves and evolve with the changing IT landscape. This paper describes the current data structures and implementation of the Credit Suisse meta-data warehouse and shows how its services help to increase the flexibility of the whole organization. A series of example meta-data structures, use cases, and screenshots are given in order to illustrate the concepts used and the lessons learned based on feedback of real business and IT users within Credit Suisse. |
|
Christian Tilgner, Domain-specific scheduling protocols, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Dissertation)
A database scheduler takes requests from transactions and generates a request order that fulfills the constraints of a scheduling protocol, e.g., correctness criteria. The goal of this thesis is to provide a new method for the development of domain-specific protocols for scheduling database requests. Scheduling concurrent requests is a ubiquitous problem in modern server systems based on, e.g., Web Services that handle large numbers of concurrent requests. These systems require user scalability, performance predictability, and flexibility, i.e., the ability to adapt to domain-specific needs such as relaxed correctness criteria or service-level agreements (SLAs). The traditional approach of outsourcing scheduling to database management systems (DBMSs) is of limited applicability for these systems, because DBMSs provide only a limited number of predefined consistency levels and limited user scalability. The state of the art is to develop application-specific schedulers for a given application from scratch, which yields fine-tuned schedulers satisfying the application's scheduling constraints, albeit at great cost and with long development times. Imperative implementations of schedulers can be complex and hard to verify, and adapting such schedulers results in expensive and error-prone re-implementations. The solution we propose for the development of domain-specific scheduling protocols is a generic scheduling model called Oshiya. The main idea of this model is to treat requests as data and employ database query processing techniques to produce request schedules. Oshiya can express traditional and domain-specific scheduling protocols. We introduce an Oshiya implementation of the traditional strong two-phase locking protocol and leverage the conciseness of Oshiya protocol implementations to prove its correctness. Our experiments show that for large numbers of concurrent requests our approach provides better performance than a native database scheduler.
Oshiya protocol implementations can easily be adapted to modified scheduling constraints. We leverage this advantage and develop the Declarative Serializable Snapshot Isolation protocol, a modified version of the Snapshot Isolation (SI) protocol, and prove that it produces serializable histories. We propose the resource acquisition protocol (RAP), a domain-specific protocol for scheduling transactions that compete for resources that are available in limited quantity, which is a typical usage pattern in booking, reservation, and webshop systems. We prove that RAP is deadlock-free and that it produces fewer aborts due to insufficient resource availability than SI. Our experimental results confirm that RAP performs better than SS2PL and SI with respect to aborts and throughput. We present the Oshiya Debugger and Analyzer (ODA), a novel system for debugging, visualizing, and comparing scheduling protocols developed using Oshiya. ODA supports the simultaneous execution of single- and multi-version protocols, breakpoints, backward and forward debugging, as well as statistical and visual protocol analysis.
A database scheduler has the task of producing, for the requests of transactions, an execution order that satisfies the conditions of the protocol in use, e.g., correctness criteria. This process is called scheduling. The goal of this dissertation is to develop a new method for building domain-specific scheduling protocols for database requests. Scheduling concurrent requests is a prevalent problem in modern (web) server systems that must process many concurrent requests. Traditionally, the database management system performs the scheduling. For these systems, however, this is not always possible, because database management systems provide only a limited number of predefined consistency levels and limited user scalability. Therefore, application-specific schedulers are nowadays developed from scratch. The result is specialized schedulers that satisfy the application's constraints, but at high cost and with long development times. These imperative scheduler implementations can moreover be complex and hard to verify, and adapting such schedulers results in expensive and error-prone re-implementations. For the development of domain-specific protocols we present a generic scheduling model called Oshiya. The basic idea of this model is to treat requests as data and to use database query processing to generate an execution order (schedule) for the requests. Oshiya permits the declarative specification of traditional and domain-specific protocols. We present an Oshiya implementation of the strong two-phase locking protocol (SS2PL) and prove its correctness. As we show experimentally, our approach achieves better performance than a native database scheduler for large numbers of concurrent requests.
Oshiya protocol implementations can easily be adapted to modified constraints. We develop a modified version of Snapshot Isolation (SI) called Declarative Serializable Snapshot Isolation and prove that this protocol generates serializable schedules. We present the Resource Acquisition Protocol (RAP), a domain-specific protocol for scheduling transactions that compete for resources of limited availability, which is a typical access pattern in booking, reservation, and webshop systems. We prove that RAP avoids deadlocks and produces fewer aborts due to insufficient availability than Snapshot Isolation. Our experiments confirm that RAP produces fewer aborts and higher throughput than SS2PL and SI. We present the Oshiya Debugger and Analyzer (ODA), a novel system for debugging, visualizing, and comparing Oshiya protocol implementations. ODA supports the simultaneous execution of single- and multi-version protocols, breakpoints, forward and backward debugging, as well as statistical and visual protocol analysis. |
|
Michael Hanspeter Böhlen, Monika Boltshauser, Feeding time! Time for Tameus, International Innovation, 2012. (Journal Article)
Seeking to build on the success of Agroscope's existing Swiss Feed Database, the collaborative Tameus project will manage time-varying measurement sets to establish an improved database for current and historical farm animal feed. |
|
Monika Boltshauser, Annelies Bracher, Michael Böhlen, Francesco Cafagna, Andrej Taliun, Die Schweizerische Futtermitteldatenbank www.feedbase.ch, Agrarforschung Schweiz, Vol. 3 (2), 2012. (Journal Article)
|
|
Lukas Blunschi, Claudio Jossen, Donald Kossmann, Magdalini Mori, Kurt Stockinger, SODA: Generating SQL for business users, Proceedings of the VLDB Endowment, Vol. 5 (10), 2012. (Journal Article)
The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to the data warehouses and their schemas have become increasingly complex. These systems still work great in order to generate pre-canned reports. However, with their current complexity, they tend to be a poor match for non tech-savvy business analysts who need answers to ad-hoc queries that were not anticipated. This paper describes the design, implementation, and experience of the SODA system (Search over DAta Warehouse). SODA bridges the gap between the business needs of analysts and the technical complexity of current data warehouses. SODA enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL. The key idea is to use a graph pattern matching algorithm that uses the metadata model of the data warehouse. Our results with real data from a global player in the financial services industry show that SODA produces queries with high precision and recall, and makes it much easier for business users to interactively explore highly-complex data warehouses. |
|
Mourad Khayati, Jay Martin Anderson, Michael Hanspeter Böhlen, Johann Gamper, Manfred Mitterer, Similarity of chemotherapy histories based on imputed values, International Journal of Medical Engineering and Informatics, Vol. 4 (3), 2012. (Journal Article)
The comparison of time series of multivariate data is a long-standing problem in many applications in the clinical domain. We propose two approaches to retrieve from a hospital data warehouse the k patients P1, ..., Pk with a chemotherapy history that is most similar to that of patient Q: the first is based on warping distance, together with an initial alignment using imputed values; the second is based on the volume of the Kiviat tube. In implementing the Euclidean distance, we investigate the addition of null events to achieve equal cardinality, and we employ dynamic time warping, a widely used technique for the comparison of time series data. The investigations are based on a real-world clinical database. |
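Dynamic time warping, mentioned above, can be stated compactly as a dynamic program over the two sequences. The following is the textbook formulation, not the paper's specific variant with imputed-value alignment:

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between sequences `a` and
    `b`, computed with an O(len(a) * len(b)) dynamic program. D[i][j]
    holds the cost of the best warping path aligning a[:i] with b[:j]."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j],       # a[i-1] repeats a match
                              D[i][j - 1],       # b[j-1] repeats a match
                              D[i - 1][j - 1])   # match both elements
    return D[n][m]
```

Because the recurrence allows one sequence to "stall" while the other advances, histories of different lengths or with locally shifted events can still align at low cost, which is exactly why warping distances suit chemotherapy histories with irregular event timing.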
|
Sven Helmer, Nikolaus Augsten, Michael Böhlen, Measuring structural similarity of semistructured data based on information-theoretic approaches, VLDB Journal, Vol. 21 (5), 2012. (Journal Article)
We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches. |
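A common concrete instantiation of a Kolmogorov-complexity-based similarity measure is the normalized compression distance, in which a real compressor stands in for the uncomputable Kolmogorov complexity. The sketch below uses zlib as the compressor; in the setting of the paper the inputs would be the linearized structural information of two documents (e.g. tag sequences), and the specific compressor choice here is an assumption:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between byte strings x and y:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the
    compressed length. Values near 0 mean highly similar inputs."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Since compression runs in time linear in the input length, a measure of this shape is consistent with the linear run-time complexity the abstract claims for the overall approach.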
|
Juozas Gordevicius, Johann Gamper, Michael Böhlen, Parsimonious temporal aggregation, VLDB Journal, Vol. 21 (3), 2012. (Journal Article)
Temporal aggregation is an important operation in temporal databases, and different variants thereof have been proposed. In this paper, we introduce a novel temporal aggregation operator, termed parsimonious temporal aggregation (PTA), that overcomes major limitations of existing approaches. PTA takes the result of instant temporal aggregation (ITA) of size n, which might be up to twice as large as the argument relation, and merges similar tuples until a given error (ε) or size (c) bound is reached. The new operator is data-adaptive and allows the user to control the trade-off between the result size and the error introduced by merging. For the precise evaluation of PTA queries, we propose two dynamic programming-based algorithms for size- and error-bounded queries, respectively, with a worst-case complexity that is quadratic in n. We present two optimizations that take advantage of temporal gaps and different aggregation groups and achieve a linear runtime in experiments with real-world data. For the quick computation of an approximate PTA answer, we propose an efficient greedy merging strategy with a precision that is upper bounded by O(log n). We present two algorithms that implement this strategy and begin to merge as ITA tuples are produced. They require O(n log(c + β)) time and O(c + β) space, where β is the size of a read-ahead buffer and is typically very small. An empirical evaluation on real-world and synthetic data shows that PTA considerably reduces the size of the aggregation result, yet introduces only small errors. The greedy algorithms are scalable for large data sets and introduce less error than other approximation techniques. |
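The flavor of greedy merging can be conveyed with a deliberately simplified sketch: scan the ITA result once and merge each tuple into its predecessor when their intervals meet and their aggregate values differ by at most a threshold, taking the duration-weighted average as the merged value. This is illustrative only; it omits the read-ahead buffer, the size bound, and the error model of the paper's algorithms, and all names are assumptions.

```python
def greedy_merge(ita, eps):
    """Merge adjacent ITA tuples (ts, te, value) whose intervals meet
    and whose values differ by at most `eps`; the merged value is the
    duration-weighted average of the two inputs."""
    out = []
    for ts, te, v in ita:
        if out:
            pts, pte, pv = out[-1]
            if pte == ts and abs(pv - v) <= eps:      # adjacent + similar
                d1, d2 = pte - pts, te - ts
                out[-1] = (pts, te, (pv * d1 + v * d2) / (d1 + d2))
                continue
        out.append((ts, te, v))
    return out
```

The single pass makes the trade-off visible: a larger eps yields a smaller, coarser result, while eps = 0 reproduces the ITA result except for exactly equal neighbors.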
|
Robert Dewor, Design and implementation of a workload generator for the oshiya demo application, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Bachelor's Thesis)
The goal of this bachelor's thesis was to design and implement a workload generator for the Oshiya demo application. The demo application demonstrates the functionality of the Oshiya scheduling model and allows the user to create traditional and domain-specific scheduling protocols. In order to develop protocols, the user has to be able to analyse and test them for specific properties or behaviour. The workload generator enables the user to create the customized transactions needed to do so. This thesis presents and discusses the concept and design of the workload generator. The implemented solution allows the user to create customized transactions in a flexible manner. |
|
Jonas Schmid, Implementation and evaluation of a key-value store for flash-based storage, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Bachelor's Thesis)
In the last decade solid state drives (SSDs) have gained more and more importance in the field of databases, due to their fast access times compared to traditional hard disk drives (HDDs). The flash translation layer (FTL), an abstraction layer of SSDs, provides the same API as traditional HDDs and makes their use transparent. On top of the FTL, traditional access methods and algorithms operate acceptably without any modification. However, the asymmetry between read and write access times and the requirement to perform an expensive erase operation prior to an in-place update raise the need for specialized access methods. This thesis presents an implementation and evaluation of a key-value store approach called in-page logging, which reduces the number of erase operations caused by in-place updates on data pages by using logs. |
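The in-page logging idea can be modeled with a toy key-value store: updates are appended to a small per-page log region, and the page itself is rewritten (the operation that would trigger a flash erase) only when the log overflows. This is a sketch of the concept under assumed names, not the thesis implementation:

```python
class InPageLogStore:
    """Toy model of in-page logging: each page has a small log region
    for updates; the base page is rewritten (counted as an erase) only
    when the log overflows, instead of on every in-place update."""

    def __init__(self, log_capacity=4):
        self.pages = {}            # page_id -> dict of base records
        self.logs = {}             # page_id -> list of (key, value) updates
        self.log_capacity = log_capacity
        self.erases = 0            # number of simulated erase operations

    def put(self, page_id, key, value):
        log = self.logs.setdefault(page_id, [])
        log.append((key, value))
        if len(log) > self.log_capacity:       # log full: merge and erase
            page = self.pages.setdefault(page_id, {})
            page.update(dict(log))             # later entries win
            log.clear()
            self.erases += 1

    def get(self, page_id, key):
        # The newest log entry shadows the base page record.
        for k, v in reversed(self.logs.get(page_id, [])):
            if k == key:
                return v
        return self.pages.get(page_id, {}).get(key)
```

With a log capacity of c, up to c updates to a page cost only log appends; the expensive rewrite is amortized over those updates, which is the effect the thesis evaluates.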
|
Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, Johann Gamper, Windowed pq-grams for approximate joins of data-centric XML, VLDB Journal, Vol. 21 (4), 2012. (Journal Article)
In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly differently in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution is windowed pq-grams, which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-gram distance between two trees is the number of pq-grams that are in one tree decomposition only. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real-world data confirm the analytic results and show the effectiveness and efficiency of our technique. |
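Given two precomputed windowed pq-gram profiles (the output of the sort, extend, and decompose steps, which are omitted here), the distance described in the abstract, i.e. the number of pq-grams occurring in only one of the two decompositions, is a multiset symmetric difference. A minimal sketch, with the profile encoding assumed to be an iterable of hashable pq-gram labels:

```python
from collections import Counter

def pq_gram_distance(grams1, grams2):
    """Number of pq-grams occurring in only one of two tree
    decompositions, i.e. the size of the multiset symmetric difference
    of the two pq-gram profiles."""
    c1, c2 = Counter(grams1), Counter(grams2)
    # Counter subtraction drops negative counts, so (c1 - c2) + (c2 - c1)
    # is exactly the multiset symmetric difference.
    return sum(((c1 - c2) + (c2 - c1)).values())
```

Serializing each pq-gram to a string makes this computable as an equality join on the profile tables, which is the efficiency argument made in the abstract.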
|
Johann Gamper, Michael Böhlen, Willi Cometti, Markus Innerebner, Defining isochrones in multimodal spatial networks, In: 20th ACM Conference on Information and Knowledge Management, Association for Computing Machinery, New York, USA, 2011-10-24. (Conference or Workshop Paper published in Proceedings)
An isochrone in a spatial network is the minimal, possibly disconnected subgraph that covers all locations from where a query point is reachable within a given time span and by a given arrival time. In this paper we formally define isochrones for multimodal spatial networks with different transportation modes that can be discrete or continuous in, respectively, space and time. For the computation of isochrones we propose the multimodal incremental network expansion (MINE) algorithm, which is independent of the actual network size and depends only on the size of the isochrone. An empirical study using real-world data confirms the analytical results. |
|
Dietrich Christopeit, Michael Böhlen, Carl-Christian Kanne, Arturas Mazeika, Querying versioned software repositories, In: 15th international conference on Advances in databases and information systems, Springer, 2011-09-20. (Conference or Workshop Paper published in Proceedings)
Large parts of today's data are stored in text documents that undergo a series of changes during their lifetime. For instance, during the development of a software product the source code changes frequently. Currently, managing such data relies on version control systems (VCSs). Extracting information from large documents and their different versions is a manual and tedious process. We present Qvestor, a system that allows documents to be queried declaratively. It leverages information about the structure of a document that is available as a context-free grammar and allows document versions to be queried declaratively through a grammar annotated with relational algebra expressions. We define and illustrate the annotation of grammars with relational algebra expressions and show how to translate the annotations into easy-to-use SQL views. |
|
Christian Tilgner, Boris Glavic, Michael Böhlen, Carl-Christian Kanne, Declarative serializable snapshot isolation, In: 15th International Conference in Advances in Databases and Information Systems, Springer, 2011-09-19. (Conference or Workshop Paper published in Proceedings)
Snapshot isolation (SI) is a popular concurrency control protocol, but it permits non-serializable schedules that violate database integrity. The Serializable Snapshot Isolation (SSI) protocol ensures (view) serializability by preventing pivot structures in SI schedules. In this paper, we leverage the SSI approach and develop the Declarative Serializable Snapshot Isolation (DSSI) protocol, an SI protocol that guarantees serializable schedules. Our approach requires no analysis of application programs or changes to the underlying DBMS. We present an implementation and prove that it ensures serializability. |
|
Christian Tilgner, Boris Glavic, Michael Böhlen, Carl-Christian Kanne, Smile: Enabling easy and fast development of domain-specific scheduling protocols, In: 28th British National Conference on Databases, Springer, Berlin / Heidelberg, 2011-07-12. (Conference or Workshop Paper)
Modern server systems schedule large amounts of concurrent requests constrained by, e.g., correctness criteria and service-level agreements. Since standard database management systems provide only limited consistency levels, the state of the art is to develop schedulers imperatively which is time-consuming and error-prone. In this poster, we present Smile (declarative Scheduling MIddLEware), a tool for developing domain-specific scheduling protocols declaratively. Smile decreases the effort to implement and adapt such protocols because it abstracts from low level scheduling details allowing developers to focus on the protocol implementation. We demonstrate the advantages of our approach by implementing a domain-specific use case protocol. |
|