Karin Bühler, Performance analysis of provenance computation using query rewrite techniques on a commercial database system, 2010. (Other Publication)
The aim of this work is to compare the performance of provenance computation in Perm (Provenance Extension of the Relational Model), a relational provenance management system built on the open-source DBMS PostgreSQL, with the performance of provenance computation on the commercial database system DB2. To use DB2 for provenance computation, the Perm query rewrite techniques are applied to generate SQL queries that compute provenance information. The performance of both systems is evaluated using TPC-H, a decision support benchmark. The results of the experimental evaluation indicate that Perm outperforms DB2 on this workload.
Alessandro Vagliardo, MidPerm - Realisierung einer Provenance SQL-Erweiterung als Middleware Lösung, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2010. (Master's Thesis)
Suppose I buy a bottle of milk and want to know from which farm and cows the milk came. Such questions can be answered by provenance applications, yet there are only a few general solutions that every database can use. MidPerm is a first attempt at such a solution. It decouples the Perm module, which is integrated into the PostgreSQL database and computes the provenance of SQL queries, and offers it as middleware behind a JDBC interface. Through this interface, MidPerm allows any database with a JDBC driver to use the provenance computation.
Christian Tilgner, Boris Glavic, Michael Hanspeter Böhlen, Carl-Christian Kanne, Correctness proof of the declarative SS2PL protocol implementation, No. IFI-2010.0008, Version: 1, 2010. (Technical Report)
B Glavic, Perm: efficient provenance support for relational databases, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2010. (Dissertation)
Stephan Jakob Benedikt Heuscher, Information system agnostic ancestry for digital objects, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2010. (Dissertation)
Romans Kasperovics, Michael Hanspeter Böhlen, Johann Gamper, Evaluating Exceptions on Time Slices, In: ER 2009: 28th International Conference on Conceptual Modeling, Springer, 2009-11-09. (Conference or Workshop Paper published in Proceedings)
Public transport schedules contain temporal data with many regular patterns that can be represented compactly. Exceptions come as modifications of the initial schedule and break the regular patterns, increasing the size of the representation. A typical strategy to preserve the compactness of schedules is to keep exceptions separately. This, however, complicates the automated processing of schedules and imposes a more complex model on applications. In this paper we evaluate exceptions by incorporating them into the patterns that define schedules. We employ sets of time slices, termed multislices, as a representation formalism for schedules and exceptions. The difference of multislices corresponds to the evaluation of exceptions and produces an updated schedule in terms of a multislice. We propose a relational model for multislices, provide an algorithm for efficiently evaluating the difference of multislices, and show analytically and experimentally that the evaluation of exceptions is a feasible strategy for realistic schedules.
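The difference-of-multislices idea can be illustrated with a toy sketch. This is not the paper's relational multislice model: a compact weekly pattern is expanded into a set of service dates, and the exceptions are evaluated as a plain set difference. The dates and the weekday pattern are made-up assumptions.

```python
from datetime import date, timedelta

def expand(start, end, weekdays):
    """All dates in [start, end] whose weekday is in `weekdays` (0 = Monday)."""
    d, out = start, set()
    while d <= end:
        if d.weekday() in weekdays:
            out.add(d)
        d += timedelta(days=1)
    return out

# Compact pattern: a Mon/Wed/Fri service throughout January 2009.
schedule = expand(date(2009, 1, 1), date(2009, 1, 31), {0, 2, 4})
# Exceptions: two cancelled runs, kept as individual time points.
exceptions = {date(2009, 1, 5), date(2009, 1, 7)}
# Evaluating the exceptions is a difference of the two sets.
updated = schedule - exceptions
print(len(schedule), len(updated))  # 13 scheduled runs, 11 after exceptions
```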
Samuel Mezger, Untersuchung der Skalierbarkeit verschiedener Datenbankmanagementsysteme unter hohen Nutzerzahlen, August 2009. (Other Publication)
The work presented here describes measurements of transaction throughput for different database management systems, with a focus on concurrency control. The measurements were taken for IBM DB2 9.5, PostgreSQL 8.3 and Microsoft SQL Server 2008. During the measurements, the following parameters were varied to determine their effect on throughput: the isolation level, the amount of memory available for the database's buffer pool, the database's cardinality and the number of operations per transaction. When relating the measured results to expectations based on theoretical principles, it is found that while some effects show up as expected, many phenomena have to be attributed to specific implementations of the different database management systems. Expected results such as lock thrashing and throughput limitations due to I/O performance or the CPU's processing speed are apparent. Unexpectedly, I/O performance is a limiting factor not only when small database buffer pools are used, and lock thrashing affects each database management system in a different way. Furthermore, it is found that all of the systems under test can reach higher throughput at higher isolation levels. DB2 is noted to break connections when the database's buffer pool is chosen too large, while SQL Server does the same when the buffer pool is chosen too small. For PostgreSQL, transaction throughput is reduced when the level of concurrency is increased. This happens due to the multi-versioning protocol used by PostgreSQL, which leads to increased memory consumption under these conditions.
C Sturm, E Hunt, M H Scholl, Distributed privilege enforcement in PACS, In: 23rd Annual IFIP WG 11.3 Working Conference on Data and Applications Security (DBSec 2009), Springer, Berlin / Heidelberg, 2009-07-12. (Conference or Workshop Paper published in Proceedings)
We present a new access control mechanism for P2P networks with distributed enforcement, called P2P Access Control System (PACS). PACS enforces powerful access control models like RBAC with administrative delegation inside a P2P network in a pure P2P manner, which is not possible in any of the currently used P2P access control mechanisms. PACS uses client-side enforcement to support the replication of confidential data. To avoid a single point of failure at the time of privilege enforcement, we use threshold cryptography to distribute the enforcement among the participants. Our analysis of the expected number of messages and the computational effort needed in PACS shows that its increased flexibility comes with an acceptable additional overhead.
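The threshold idea above (distributing enforcement so that no single peer is a single point of failure) can be illustrated with a generic Shamir secret-sharing sketch. This is not PACS's actual protocol; the prime field, the threshold t, and the peer count n are illustrative assumptions.

```python
import random

P = 2**61 - 1  # a Mersenne prime; all arithmetic is in the field GF(P)

def make_shares(secret, t, n):
    """Split `secret` into n shares such that any t of them reconstruct it."""
    # Random polynomial of degree t-1 with the secret as constant term.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat's little theorem).
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(42, t=3, n=5)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares recover 42
```

Fewer than t shares reveal nothing about the secret, which is what makes threshold schemes suitable for distributing an enforcement capability among peers.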
B Glavic, G Alonso, The perm provenance management system in action, In: 35th SIGMOD international conference on Management of data, ACM Press, New York, 2009-06-29. (Conference or Workshop Paper published in Proceedings)
In this demonstration we present the Perm provenance management system (PMS). Perm is capable of computing, storing and querying provenance information for the relational data model. Provenance is computed by using query rewriting techniques to annotate tuples with provenance information. Thus, provenance data and provenance computations are represented as relational data and queries and, hence, can be queried, stored and optimized using standard relational database techniques. This demo shows the complete Perm system and lets attendees examine in detail the process of query rewriting and provenance retrieval in Perm, the most complete data provenance system available today. For example, Perm supports lazy and eager provenance computation, external provenance and various contribution semantics.
Andrej Taliun, Michael Hanspeter Böhlen, Arturas Mazeika, CORE: Nonparametric Clustering of Large Numeric Databases, In: SDM 2009: Proceedings of the SIAM International Conference on Data Mining, SIAM (Society for Industrial and Applied Mathematics), 2009-04-30. (Conference or Workshop Paper published in Proceedings)
Current clustering techniques are able to identify arbitrarily shaped clusters in the presence of noise, but depend on carefully chosen model parameters. The choice of model parameters is difficult: it depends on the data and the clustering technique at hand, and finding good model parameters often requires time-consuming human interaction. In this paper we propose CORE, a new nonparametric clustering technique that explicitly computes the local maxima of the density and represents them with cores. CORE employs an adaptive grid and gradients to define and compute the cores of clusters. The incrementally constructed adaptive grid and the gradients make the identification of cores robust, scalable, and independent of small density fluctuations. Our experimental studies show that CORE, without any carefully chosen model parameters, produces better quality clustering than related techniques and is efficient for large datasets.
B Glavic, G Alonso, Perm: Processing provenance and data on the same data model through query rewriting, In: 25th International Conference on Data Engineering, IEEE Computer Society, Shanghai, China., 2009-03-29. (Conference or Workshop Paper published in Proceedings)
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information inducing only a small overhead on normal operations.
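The rewriting idea can be sketched in a few lines. The following is a minimal illustration and not Perm's actual rewrite rules: an aggregation query is turned into plain SQL that annotates each result tuple with a contributing source tuple, so provenance stays inside the relational model. The `emp` table and its columns are made-up assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp(name TEXT, dept TEXT, salary INT);
    INSERT INTO emp VALUES ('ann','sales',100), ('bob','sales',150), ('eve','hr',120);
""")

original = "SELECT dept, SUM(salary) AS total FROM emp GROUP BY dept"

# Rewritten query: join the aggregate result back to the base relation, so
# every output row carries one of the source tuples that contributed to it.
rewritten = f"""
    SELECT q.dept, q.total, p.name AS prov_name, p.salary AS prov_salary
    FROM ({original}) q JOIN emp p ON p.dept = q.dept
    ORDER BY q.dept, p.name
"""
for row in conn.execute(rewritten):
    print(row)
```

Because the rewritten query is ordinary SQL, its result can itself be filtered, stored, or optimized like any other relation, which is the core point of the approach.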
Igor Timko, Michael Hanspeter Böhlen, Johann Gamper, Sequenced spatio-temporal aggregation in road networks, In: EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology, ACM, New York, USA, 2009-03-24. (Conference or Workshop Paper published in Proceedings)
Many applications of spatio-temporal databases require support for sequenced spatio-temporal (SST) aggregation, e.g., when analyzing traffic density in a city. Conceptually, an SST aggregation produces one aggregate value for each point in time and space. This paper is the first to propose a method to efficiently evaluate SST aggregation queries for the COUNT, SUM, and AVG aggregation functions. Based on a discrete time model and a discrete, 1.5-dimensional space model that represents a road network, we generalize the concept of (temporal) constant intervals towards constant rectangles that represent maximal rectangles in the space-time domain over which the aggregation result is constant. We propose a new data structure, termed SST-tree, which extends the Balanced Tree for one-dimensional temporal aggregation towards the support of two-dimensional, spatio-temporal aggregation. The main feature of the Balanced Tree, the compact storage of constant intervals using two counters, is extended towards a compact representation of constant rectangles in the space-time domain. We propose and evaluate two variants of the SST-tree: the SSTT-tree and the SSTH-tree use trees and hashmaps to manage spacestamps, respectively. Our experiments show that both solutions outperform a brute-force approach in terms of memory and time. The SSTH-tree is more efficient in terms of memory, whereas the SSTT-tree is more efficient in terms of time.
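The temporal constant intervals that the SST-tree generalizes to space-time rectangles can be sketched in one dimension. This is a plain event sweep, not the Balanced Tree or the SST-tree itself, and the tuple timestamps are made up: +1/-1 events at tuple start and end points yield the maximal intervals over which COUNT is constant.

```python
from collections import defaultdict

def constant_intervals(periods):
    """periods: list of (start, end), end exclusive -> [(start, end, count)]."""
    events = defaultdict(int)
    for s, e in periods:
        events[s] += 1   # tuple becomes valid
        events[e] -= 1   # tuple ceases to be valid
    points = sorted(events)
    out, count = [], 0
    for p, nxt in zip(points, points[1:] + [None]):
        count += events[p]
        if nxt is not None and count > 0:
            out.append((p, nxt, count))  # COUNT is constant on [p, nxt)
    return out

print(constant_intervals([(1, 5), (3, 8)]))  # [(1, 3, 1), (3, 5, 2), (5, 8, 1)]
```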
B Glavic, G Alonso, Provenance for Nested Subqueries, In: 12th International Conference on Extending Database Technology, ACM, Saint Petersburg, Russia, 2009-03-24. (Conference or Workshop Paper published in Proceedings)
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
Juozas Gordevicius, Johann Gamper, Michael Böhlen, Parsimonious temporal aggregation, In: EDBT/ICDT 2009 Joint Conference, 2009-03-23. (Conference or Workshop Paper published in Proceedings)
Temporal aggregation is a crucial operator in temporal databases and has been studied in various flavors, including instant temporal aggregation (ITA) and span temporal aggregation (STA), each having its strengths and weaknesses. In this paper we define a new temporal aggregation operator, called parsimonious temporal aggregation (PTA), which comprises two main steps: (i) it computes the ITA result over the input relation and (ii) it compresses this intermediate result to a user-specified size c by merging adjacent tuples while keeping the induced total error minimal; the compressed ITA result is returned as the final result. By considering the distribution of the input data and allowing the user to control the result size, PTA combines the best features of ITA and STA. We provide two evaluation algorithms for PTA queries. First, the oPTA algorithm computes an exact solution by applying dynamic programming to explore all possibilities to compress the ITA result, selecting the compression with the minimal total error. It runs in O(n²pc) time and O(n²) space, where n is the size of the input relation and p is the number of aggregation functions in the query. Second, the more efficient gPTA algorithm computes an approximate solution by greedily merging the most similar ITA result tuples, which, however, does not guarantee a compression with minimal total error. gPTA intermingles the two steps of PTA and avoids large intermediate results. The compression step of gPTA runs in O(np log(c + δ)) time and O(c + δ) space, where δ is a small buffer for "look ahead". An empirical evaluation shows good results: considerable reductions of the result size introduce only small errors, and gPTA scales to large data sets and is only slightly worse than the exact solution of PTA.
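The greedy compression step can be sketched as follows. This is an assumed simplification, not the gPTA implementation: it rescans all adjacent pairs in every step instead of using gPTA's buffered one-pass strategy, and it uses a simple value-difference heuristic as the merge error. Each ITA result tuple is modeled as (start, end, value); the merged value is the duration-weighted average.

```python
def greedy_compress(tuples, c):
    """Merge adjacent (start, end, value) tuples until only c remain."""
    tuples = list(tuples)
    while len(tuples) > c:
        # Pick the adjacent pair whose values differ the least (merge error heuristic).
        best_i = min(range(len(tuples) - 1),
                     key=lambda i: abs(tuples[i][2] - tuples[i + 1][2]))
        (s1, e1, v1), (s2, e2, v2) = tuples[best_i], tuples[best_i + 1]
        w1, w2 = e1 - s1, e2 - s2
        # Replace the pair by one tuple with the duration-weighted average value.
        tuples[best_i:best_i + 2] = [(s1, e2, (v1 * w1 + v2 * w2) / (w1 + w2))]
    return tuples

# Three ITA tuples compressed to c = 2: the two similar ones are merged.
print(greedy_compress([(0, 2, 10.0), (2, 4, 11.0), (4, 6, 30.0)], 2))
```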
Olivier Wirz, Entwurf und Implementierung eines SQL-DDL-Präprozessors zur Unterstützung von Datenbankentwurfsmustern, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2009. (Master's Thesis)
Design patterns have become essential in software development. They identify good design and define a vocabulary among developers. Recurring patterns also exist in database schema design but are not as well documented as design patterns in other areas of software development, such as object-oriented development. This diploma thesis provides examples of design patterns for schema creation for relational databases. We refer to this type of pattern as Database Design Patterns. We present SQLPP, an SQL preprocessor for macro processing. SQLPP enables users to store and reuse design patterns for schema creation, where the design patterns are defined as macros.
Michael Hanspeter Böhlen, Christian S Jensen, Richard Snodgrass, Current Semantics, In: Encyclopedia of Database Systems, Springer, Berlin / Heidelberg, p. 544 - 545, 2009. (Book Chapter)
Michael Hanspeter Böhlen, Temporal Query Processing, In: Encyclopedia of Database Systems, Springer, Berlin / Heidelberg, p. 3012 - 3015, 2009. (Book Chapter)
Michael Hanspeter Böhlen, Christian Jensen, Richard Snodgrass, Temporal Compatibility, In: Encyclopedia of Database Systems, Springer, Berlin / Heidelberg, p. 2936 - 2939, 2009. (Book Chapter)
Michael Hanspeter Böhlen, Temporal Coalescing, In: Encyclopedia of Database Systems, Springer, Berlin / Heidelberg, p. 2932 - 2936, 2009. (Book Chapter)
Johann Gamper, Michael Hanspeter Böhlen, Christian S Jensen, Temporal Aggregation, In: Encyclopedia of Database Systems, Springer, Berlin / Heidelberg, p. 2924 - 2929, 2009. (Book Chapter)