Bruno Cadonna, Johann Gamper, Michael Hanspeter Böhlen, Sequenced event set pattern matching, In: 14th International Conference on Extending Database Technology, Association for Computing Machinery, New York, USA, 2011-03-22. (Conference or Workshop Paper published in Proceedings)
Event pattern matching is a query technique where a sequence of input events is matched against a complex pattern that specifies constraints on extent, order, values, and quantification of matching events. The increasing importance of such query techniques is underpinned by a significant amount of research work, the availability of commercial products, and a recent proposal to extend SQL for event pattern matching. The proposed SQL extension includes an operator PERMUTE, which makes it possible to express patterns that match any permutation of a set of events. No implementation of this operator is known to the authors. In this paper, we study the sequenced event set pattern matching problem, which is the problem of matching a sequence of input events against a complex pattern that specifies a sequence of sets of events rather than a sequence of single events. Similar to the PERMUTE operator, events that match a set specified in the pattern can occur in any permutation, whereas events that match different sets have to be strictly consecutive, following the order of the sets in the pattern specification. We formally define the problem of sequenced event set pattern matching, propose an automaton-based evaluation algorithm, and provide a detailed analysis of its runtime complexity. An empirical evaluation with real-world data shows that our algorithm outperforms a brute force approach that uses existing techniques to solve the sequenced event set pattern matching problem, and it validates the results from our complexity analysis.
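To make the matching semantics concrete, here is a minimal brute-force sketch in Python (an illustration of the semantics, not the automaton-based algorithm of the paper): the pattern is a list of event-type sets, events matching one set may occur in any order, and consecutive sets must match strictly consecutive runs of input events.

    from collections import Counter

    def find_matches(events, pattern):
        """Naive sequenced event set matcher. pattern is a list of
        event-type multisets; within a set any permutation matches,
        across sets the runs must be strictly consecutive."""
        sets = [Counter(s) for s in pattern]
        total = sum(sum(c.values()) for c in sets)
        matches = []
        for start in range(len(events) - total + 1):
            pos, ok = start, True
            for c in sets:
                n = sum(c.values())
                if Counter(events[pos:pos + n]) != c:  # any permutation of the set
                    ok = False
                    break
                pos += n
            if ok:
                matches.append(start)
        return matches

    # 'B' and 'A' may be swapped within the first set; 'C' must follow directly.
    print(find_matches(list("XBACY"), [["A", "B"], ["C"]]))  # -> [1]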
Ekaterina Kuleshova, Provenance in temporal databases: Facharbeit, 2011. (Other Publication)
The purpose of this paper is to develop lineage and provenance tracing techniques for temporal databases. Using the snapshot reducibility property of temporal databases, we define pointwise lineage traceability for temporal databases. Merging time points with the same lineage in the result of temporal operators yields an interval-based model while still allowing lineage traceability. We illustrate the algebra and its lineage with examples. To trace lineage we need to materialize intermediate results. Moreover, lineage only tells us which tuples contribute to the query result, not how they contribute. That is why we further define relations annotated with provenance semirings. To be able to perform queries on such relations, we generalize the algebra to operate on them, so that query execution propagates provenance information. Finally, we define a positive relational algebra that propagates how-provenance.
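The semiring idea can be made concrete with a toy encoding (my own illustration of the standard provenance-polynomial construction, not the paper's implementation): every tuple carries an annotation, union adds annotations and join multiplies them, so the resulting expression records how source tuples combined.

    # Tuples map to provenance annotations (here plain strings).
    def union(r, s):
        out = dict(r)
        for t, a in s.items():
            out[t] = f"({out[t]} + {a})" if t in out else a
        return out

    def join(r, s):
        # natural join on the first attribute; annotations multiply
        out = {}
        for (k1, v1), a1 in r.items():
            for (k2, v2), a2 in s.items():
                if k1 == k2:
                    out[(k1, v1, v2)] = f"({a1} * {a2})"
        return out

    r1 = {("x", 1): "a"}
    r2 = {("x", 1): "b", ("x", 2): "c"}
    t  = {("x", 9): "d"}
    print(join(union(r1, r2), t))
    # {('x', 1, 9): '((a + b) * d)', ('x', 2, 9): '(c * d)'}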
Hannes Tresch, Managing and querying derived nutrient parameters in the Swiss Feed database, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2011. (Bachelor's Thesis)
The aim of this thesis is to integrate the computation of derived nutrients into the Swiss animal feed database. Derived nutrients are parameters calculated by a formula that involves other nutrient parameters. The computation of these parameters replaces expensive chemical lab analyses. In order to get a meaningful temporal distribution of derived nutrients, an efficient method that supports time-varying regressions must be found.
To that end, different SQL queries, SQL views, and PL/pgSQL functions are implemented and presented in this document. The most efficient solution is then used to implement an extension to the current web application, so that all the functionality that currently exists for non-derived nutrients is supported for derived nutrients as well.
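As a sketch of the idea (with a made-up formula; the real nutrient formulas and the PL/pgSQL implementation are the subject of the thesis), a derived nutrient is simply a function of other, measured nutrient parameters:

    # Hypothetical nutrient names and formula, for illustration only.
    measured = {"crude_protein": 182.0, "crude_fat": 41.5, "crude_fiber": 210.0}

    derived_formulas = {
        "energy_index": lambda n: 0.02 * n["crude_protein"] + 0.04 * n["crude_fat"],
    }

    def derive(name, nutrients):
        # Computing the value replaces an expensive chemical lab analysis.
        return derived_formulas[name](nutrients)

    print(derive("energy_index", measured))  # 0.02*182.0 + 0.04*41.5 = 5.3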
Lukas Blunschi, Claudio Jossen, Donald Kossmann, Magdalini Mori, Kurt Stockinger, Data-thirsty business analysts need SODA: search over data warehouse, In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 2011. (Conference or Workshop Paper published in Proceedings)
Advances in Databases - 28th British National Conference on Databases, BNCOD 28, Manchester, UK, July 12-14, 2011, Revised Selected Papers, Edited by: Christian Tilgner, Boris Glavic, Michael Hanspeter Böhlen, Carl-Christian Kanne, Springer Verlag, Heidelberg, DE, 2011. (Proceedings)
Modern server systems schedule large numbers of concurrent requests constrained by, e.g., correctness criteria and service-level agreements. Since standard database management systems provide only limited consistency levels, the state of the art is to develop schedulers imperatively, which is time-consuming and error-prone. In this poster, we present Smile (declarative Scheduling MIddLEware), a tool for developing domain-specific scheduling protocols declaratively. Smile decreases the effort to implement and adapt such protocols because it abstracts from low-level scheduling details, allowing developers to focus on the protocol implementation. We demonstrate the advantages of our approach by implementing a domain-specific use case protocol.
Nikolaus Augsten, Denilson Barbosa, Michael Böhlen, Themis Palpanas, Efficient top-k approximate subtree matching in small memory, IEEE Transactions on Knowledge & Data Engineering, Vol. 23 (8), 2011. (Journal Article)
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.
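The pruning idea can be sketched as follows (a recursive in-memory stand-in for the paper's single-pass prefix-ring-buffer algorithm, which works on a postorder stream): keep exactly the maximal subtrees whose size does not exceed the bound tau, so the edit distance is never evaluated on anything larger.

    def candidates(tree, tau):
        """tree = (label, [children]); returns (size, maximal subtrees <= tau)."""
        label, children = tree
        results = [candidates(c, tau) for c in children]
        size = 1 + sum(s for s, _ in results)
        if size <= tau:
            return size, [tree]      # the whole subtree fits the bound
        pruned = [cand for _, cands in results for cand in cands]
        return size, pruned          # keep only the maximal small subtrees

    doc = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
    print(candidates(doc, 3)[1])  # [('b', [('c', []), ('d', [])]), ('e', [])]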
Igor Timko, Michael Böhlen, Johann Gamper, Sequenced spatiotemporal aggregation for coarse query granularities, VLDB Journal, Vol. 20 (5), 2011. (Journal Article)
Sequenced spatiotemporal aggregation (SSTA) is an important query for many applications of spatiotemporal databases, such as traffic analysis. Conceptually, an SSTA query returns one aggregate value for each individual spatiotemporal granule. While the data is typically recorded at a fine granularity, at query time a coarser granularity is common. This calls for efficient evaluation strategies that are granularity aware. In this paper, we formally define an SSTA operator that includes a data-to-query granularity conversion. Based on a discrete time model and a discrete 1.5-dimensional space model, we generalize the concept of time constant intervals to constant rectangles, which represent maximal rectangles in the spatiotemporal domain over which an aggregation result is constant. We propose an efficient evaluation algorithm for SSTA queries that takes advantage of a coarse query granularity. The algorithm is based on the plane sweep paradigm, and we propose a granularity-aware event point schedule, termed gaEPS, and a granularity-aware sweep line status, termed gaSLS. These data structures store space and time points from the input relation in a compressed form using a minimal set of counters. In extensive experiments, we show that for coarse query granularities gaEPS significantly outperforms a basic EPS that is based on an extension of previous work, both in terms of memory usage and runtime.
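A minimal temporal analogue of the underlying idea (only the classic one-dimensional constant intervals, not the paper's constant rectangles or granularity-aware sweep structures): split time at the interval endpoints and compute one aggregate value per resulting constant interval.

    def constant_intervals(tuples):
        """Sequenced COUNT over half-open time intervals [start, end)."""
        points = sorted({p for s, e in tuples for p in (s, e)})
        return [((lo, hi), sum(1 for s, e in tuples if s <= lo and hi <= e))
                for lo, hi in zip(points, points[1:])]

    print(constant_intervals([(1, 5), (3, 8)]))
    # [((1, 3), 1), ((3, 5), 2), ((5, 8), 1)]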
Thomas Keller, The implementation of a DRC into SQL Translator, 2011. (Other Publication)
The manual translation of Domain Relational Calculus (DRC) queries into equivalent Structured Query Language (SQL) queries is cumbersome, so automating this process is desirable. This paper describes the implementation of an automatic DRC-to-SQL translator called DRC2SQL. The program requirements and the program design are described extensively, and DRC2SQL is critically discussed. DRC2SQL offers a graphical user interface and correctly translates 14 out of 15 previously defined DRC queries.
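The flavor of the translation can be illustrated with a toy sketch for a single-predicate conjunctive query (my own mini-encoding, not DRC2SQL's actual design): free variables become the SELECT list, the predicate becomes the FROM clause, and comparison atoms become WHERE conditions.

    def drc_to_sql(free_vars, relation, comparisons):
        # Assumes the DRC variables are named after the relation's columns.
        sql = f"SELECT DISTINCT {', '.join(free_vars)} FROM {relation}"
        if comparisons:
            sql += f" WHERE {' AND '.join(comparisons)}"
        return sql

    # {name, salary | Emp(name, salary) AND salary > 50000}
    print(drc_to_sql(["name", "salary"], "Emp", ["salary > 50000"]))
    # SELECT DISTINCT name, salary FROM Emp WHERE salary > 50000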
Samuele Zoppi, Online computation of up-to-date summaries in the Swiss feed database, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2011. (Bachelor's Thesis)
This thesis develops the up-to-date summary, an automatic approach to aggregating histories of nutrient measurements in the Swiss Feed Database. Since measurements are taken irregularly and are sparse in time, a simple aggregation over the entire history is not representative of the real-world state. We address this challenge by detecting trends in the history of measurements with a set of data fitting functions: uniform fitting, linear regression, and kernel regression. The experimental evaluation demonstrates the scalability of our approach when aggregating measurements of both good and bad data quality. Further, this thesis contributes to the development of the Feed Database by integrating the up-to-date summaries into the web application and by importing the raw temporal data.
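For the kernel-regression variant, the summary can be sketched as a Nadaraya-Watson estimate at the current time, so recent measurements weigh more than old ones (the Gaussian kernel and the bandwidth below are illustrative assumptions, not the thesis' tuned choices):

    import math

    def up_to_date_summary(times, values, now, bandwidth=365.0):
        # Gaussian weights that decay with the age of a measurement.
        weights = [math.exp(-0.5 * ((now - t) / bandwidth) ** 2) for t in times]
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)

    # sparse, irregular history (days) of one nutrient
    times = [0.0, 400.0, 950.0, 980.0]
    values = [20.1, 22.4, 25.0, 25.6]
    print(up_to_date_summary(times, values, now=1000.0))  # ~24.9, near recent values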
Kristin Kruse, Development of a database system based on geographical information: Facharbeit, 2011. (Other Publication)
This thesis documents the development of a database-backed web application named Swiss Feed Database System, Version 2.0. A major point is the inclusion of geographical information such as address specifications and the procedure for usefully displaying the resulting locations on a map. Besides a short introduction to the topic of Geographical Information Systems (GIS), a tutorial on embedding Google Maps covers most of that theme. Furthermore, the construction of an interactive web page using a PostgreSQL database, PHP server scripts, and JavaScript together with its Asynchronous JavaScript and XML (Ajax) object is part of this documentation.
Michael Hartmann, Design and implementation of a demo application for Oshiya, 2011. (Other Publication)
Modern web services have to schedule large numbers of concurrent requests according to different criteria (e.g., classical serializability or service-level agreements). The state of the art is to develop domain-specific scheduling protocols by hand, which is laborious and error-prone. Oshiya is a generic scheduling model that allows scheduling protocols to be implemented declaratively in a flexible manner. This Facharbeit presents a demo application that illustrates the functionality of Oshiya. It provides a debug-like mode that helps developers test new scheduling protocols graphically.
Michael Akinde, Michael Hanspeter Böhlen, Damianos Chatziantoniou, Johann Gamper, theta-Constrained multi-dimensional aggregation, Information Systems, Vol. 36 (2), 2011. (Journal Article)
The SQL:2003 standard introduced window functions to enhance the analytical processing capabilities of SQL. The key concept of window functions is to sort the input relation and to compute the aggregate results during a scan of the sorted relation. For multi-dimensional OLAP queries with aggregation groups defined by a general θ condition an appropriate ordering does not exist, though, and hence expensive join-based solutions are required. In this paper we introduce θ-constrained multi-dimensional aggregation (θ-MDA), which supports multi-dimensional OLAP queries with aggregation groups defined by inequalities. θ-MDA is not based on an ordering of the data relation. Instead, the tuples that shall be considered for computing an aggregate value can be determined by a general θ condition. This facilitates the formulation of complex queries, such as multi-dimensional cumulative aggregates, which are difficult to express in SQL because no appropriate ordering exists. We present algebraic transformation rules that demonstrate how the θ-MDA interacts with other operators of a multi-set algebra. Various techniques for achieving an efficient evaluation of the θ-MDA are investigated, and we integrate them into concrete evaluation algorithms and provide cost formulas. An empirical evaluation with data from the TPC-H benchmark confirms the scalability of the θ-MDA operator and shows performance improvements of up to one order of magnitude over equivalent SQL implementations.
Keywords: OLAP; SQL/OLAP; window functions; multi-dimensional aggregation
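The operator's semantics are easy to state: one output row per group tuple, aggregating over all detail tuples that satisfy a general θ condition. A naive Python sketch of these semantics (quadratic nested loops; the paper's contribution is precisely the efficient evaluation and the algebraic rules, not shown here):

    def theta_mda(groups, details, theta, agg):
        # one aggregate per group tuple b, over all details satisfying theta(b, r)
        return [(b, agg([r for r in details if theta(b, r)])) for b in groups]

    sales = [(1, 10.0), (2, 40.0), (3, 25.0)]   # (month, amount)
    cumulative = theta_mda(
        [1, 2, 3], sales,
        theta=lambda b, r: r[0] <= b,           # inequality group: months up to b
        agg=lambda rs: sum(a for _, a in rs),
    )
    print(cumulative)  # [(1, 10.0), (2, 50.0), (3, 75.0)] -- a cumulative aggregate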
B Glavic, G Alonso, R J Miller, L M Haas, TRAMP: understanding the behavior of schema mappings through provenance, In: VLDB '10: 36th International Conference on Very Large Data Bases, 2010-09-13. (Conference or Workshop Paper published in Proceedings)
Though partially automated, developing schema mappings remains a complex and potentially error-prone task. In this paper, we present TRAMP (TRAnsformation Mapping Provenance), an extensive suite of tools supporting the debugging and tracing of schema mappings and transformation queries. TRAMP combines and extends data provenance with two novel notions, transformation provenance and mapping provenance, to explain the relationship between transformed data and those transformations and mappings that produced that data. In addition we provide meta-querying to support querying of transformations, data, and all forms of provenance. We formally define transformation and mapping provenance, present an efficient implementation of both forms of provenance along with meta-querying, and evaluate the resulting system.
Michael Hanspeter Böhlen, R Kasperovics, J Gamper, On the efficient construction of multislices from recurrences, In: SSDBM: Scientific and Statistical Database Management, 22nd International Conference, SSDBM 2010, Heidelberg, Germany, June 30 - July 2, 2010-06-30. (Conference or Workshop Paper published in Proceedings)
Recurrences are defined as sets of time instants associated with events and they are present in many application domains, including public transport schedules and personal calendars. Because of their large size, recurrences are rarely stored explicitly, but some form of compact representation is used. Multislices are a compact representation that is well suited for storage in relational databases. A multislice is a set of time slices where each slice employs a hierarchy of time granularities to compactly represent multiple recurrences. In this paper we investigate the construction of multislices from recurrences. We define the compression ratio of a multislice, show that different construction strategies produce multislices with different compression ratios, and prove that the construction of minimal multislices, i.e., multislices with a maximal compression ratio, is an NP-hard problem. We propose a scalable algorithm, termed LMerge, for the construction of multislices from recurrences. Experiments with real-world recurrences from public transport schedules confirm the scalability and usefulness of LMerge: the generated multislices are very close to minimal multislices, achieving an average compression ratio of approx. 99%. A comparison with a baseline algorithm that iteratively merges pairs of mergeable slices shows significant improvements of LMerge over the baseline approach.
Márton Takács, Analyse des Verhaltens von Microsoft SQL Server unter verschiedenen Workloads, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2010. (Bachelor's Thesis)
The work presented here describes measurements of the transaction throughput of Microsoft SQL Server 2008, with a focus on concurrency control. Two types of self-defined workloads and a TPC benchmark were used for the measurements. The parameters isolation level, number of operations per transaction, size of the database, and ratio between read and write operations were varied to determine their effect on the throughput, the overhead of multi-user synchronization, and the number of aborts, while the number of simulated users was increased. The results of the measurements are explained based on theoretical principles: as locking overhead and database size grow, transaction throughput and multi-user synchronization overhead change roughly proportionally up to a certain number of users and degrade beyond that point.
Christian Tilgner, Declarative Scheduling in Highly Scalable Systems, In: EDBT '10: Proceedings of the 2010 EDBT/ICDT Workshops, 2010-03-22. (Conference or Workshop Paper published in Proceedings)
In modern architectures based on Web Services or Cloud Computing, a very large number of user requests arrive concurrently and have to be scheduled for execution constrained by correctness criteria, service-level agreements, etc. The state of the art is to develop hand-coded schedulers, though this entails high costs, long development times, reduced developer productivity, and inflexibility in accommodating frequently changing requirements. In this paper, we present our approach for a scheduler component that can be programmed using declarative rules. Instead of handling one request at a time, we propose to treat sets of requests as data collections and to employ database query processing techniques to produce high-quality schedules in an efficient manner. Our declarative scheduler will allow for a more flexible and productive way to define existing scheduling protocols, service-level agreements, and novel application-specific consistency protocols. The first results presented here are encouraging and motivate further investigation.
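The requests-as-data idea can be sketched as follows (the conflict test and the filter-and-sort "rule" are illustrative stand-ins, not the paper's actual rule language): the pending queue is an ordinary collection, and one scheduling step is a query over it, evaluated the way a database engine would evaluate it.

    pending = [
        {"id": 1, "item": "A", "op": "write", "deadline": 9},
        {"id": 2, "item": "A", "op": "read",  "deadline": 3},
        {"id": 3, "item": "B", "op": "read",  "deadline": 5},
    ]
    running = [{"id": 0, "item": "A", "op": "write", "deadline": 1}]

    def conflicts(r, s):
        # read/write and write/write on the same item conflict
        return r["item"] == s["item"] and "write" in (r["op"], s["op"])

    # declarative step: admissible requests, most urgent deadline first
    schedulable = sorted(
        (r for r in pending if not any(conflicts(r, s) for s in running)),
        key=lambda r: r["deadline"],
    )
    print([r["id"] for r in schedulable])  # [3] -- item A is held by a running write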
Lukas Michael Scheuner, Entwurf und Erstellung eines Data Warehouse für die Schweizerische Futtermitteldatenbank, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2010. (Master's Thesis)
The current Swiss animal feed database stores aggregated values of feed nutrient substances. Single values and their additional qualitative and temporal information are not stored in the database. Therefore, the data is not sufficient with regard to traceability, trend analysis, statistical computation, and visualization. This work presents a data warehouse (DWH) model designed to store single values and their additional qualitative and temporal information in a multidimensional data model. With this data model, computations based on single values are possible. Additionally, the temporal and qualitative information allows advanced trend analysis and visualizations. Furthermore, the integrated fuzzy logic methods make fuzzy queries possible by classifying the measurement values.
N Augsten, Michael Hanspeter Böhlen, D Barbosa, T Palpanas, TASM: Top-k Approximate Subtree Matching, In: IEEE 26th International Conference on Data Engineering (ICDE), 2010, 2010-03-01. (Conference or Workshop Paper)
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.
Boris Glavic, Formal Foundation of Contribution Semantics and Provenance Computation through Query Rewrite in TRAMP, February 2010. (Other Publication)
In this report we present the theoretical foundation of TRAMP. TRAMP is a schema mapping debugging system that provides provenance and query support as debugging functionality for schema mapping scenarios. TRAMP is an extension of Perm, a relational provenance management system developed at the University of Zurich. In this report we do not focus on the debugging functionality added by TRAMP, but instead on the theoretical foundation of the provenance types provided by the system. In chapter 2 we present the contribution semantics for data provenance, transformation provenance, and mapping provenance used by TRAMP. Contribution semantics define which parts of the input (in the case of data provenance) and which operators of a transformation (in the case of transformation provenance) belong to the provenance of an output of a transformation. Thus, contribution semantics define "what provenance actually is". Based on the presented contribution semantics, we demonstrate in chapter 3 how provenance according to these provenance types can be computed using algebraic rewrite techniques, and we prove the correctness and completeness of the algorithms used to compute provenance.
N Augsten, Michael Hanspeter Böhlen, J Gamper, The pq-gram distance between ordered labeled trees, ACM Transactions on Database Systems (TODS), Vol. 35 (1), 2010. (Journal Article)
When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
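The construction is compact enough to sketch (here for p = q = 2, trees as (label, children) tuples; '*' marks the dummy nodes of the paper's extended tree, and the normalized triangle-inequality variant is omitted): each pq-gram is a stem of p node labels plus q consecutive children, and the distance compares the resulting bags.

    from collections import Counter

    def pqgrams(tree, p=2, q=2, stem=None, out=None):
        out = [] if out is None else out
        stem = (["*"] * (p - 1)) if stem is None else stem
        stem = (stem + [tree[0]])[-p:]            # anchor plus p-1 ancestors
        children = tree[1]
        if not children:
            out.append(tuple(stem) + ("*",) * q)  # leaf: q dummy children
        else:
            labels = ["*"] * (q - 1) + [c[0] for c in children] + ["*"] * (q - 1)
            for i in range(len(children) + q - 1):  # all q-windows of children
                out.append(tuple(stem) + tuple(labels[i:i + q]))
            for c in children:
                pqgrams(c, p, q, stem, out)
        return out

    def pqgram_distance(t1, t2):
        i1, i2 = Counter(pqgrams(t1)), Counter(pqgrams(t2))
        shared = sum((i1 & i2).values())          # bag intersection
        return 1 - 2 * shared / (sum(i1.values()) + sum(i2.values()))

    t1 = ("a", [("b", []), ("c", [])])
    t2 = ("a", [("b", []), ("x", [])])
    print(pqgram_distance(t1, t2))  # 0.6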