Kevin Wellenzohn, Michael Hanspeter Böhlen, Sven Helmer, Dynamic Interleaving of Content and Structure for Robust Indexing of Semi-Structured Hierarchical Data (Extended Version), In: ArXiv.org, No. 05134, 2020. (Working Paper)
We propose a robust index for semi-structured hierarchical data that supports content-and-structure (CAS) queries specified by path and value predicates. At the heart of our approach is a novel dynamic interleaving scheme that merges the path and value dimensions of composite keys in a balanced way. We store these keys in our trie-based Robust Content-And-Structure index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. Additionally, we show important properties of our scheme, such as robustness against varying selectivities, and demonstrate improvements of up to two orders of magnitude over existing approaches in our experimental evaluation.
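The merging of the two key dimensions can be pictured with a simple byte-wise interleaving; the fixed alternation below is only an illustration, since the paper's dynamic interleaving chooses the merge points adaptively per key rather than by a static pattern:

    def interleave(path_key: bytes, value_key: bytes) -> bytes:
        """Alternate the bytes of the path and value dimensions of a key.
        Illustrative only: the paper's *dynamic* interleaving picks the
        merge points based on the data, not a fixed alternation."""
        merged = bytearray()
        for p, v in zip(path_key, value_key):
            merged.append(p)
            merged.append(v)
        # Append the tail of whichever key is longer.
        merged.extend(path_key[len(value_key):] or value_key[len(path_key):])
        return bytes(merged)

    # Composite key for a CAS query: path "/a/b/c", 4-byte big-endian value.
    print(interleave(b"/a/b/c", (42).to_bytes(4, "big")).hex())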
Michael Hanspeter Böhlen, Muhammad Saad, Computing the Fourier Transformation over Temporal Data Streams (Invited Talk), In: 26th International Symposium on Temporal Representation and Reasoning, TIME 2019, Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik, 2019-10-16. (Conference or Workshop Paper)
In radio astronomy the sky is continuously scanned to collect frequency information about celestial objects. The inverse 2D Fourier transformation is used to generate images of the sky from the collected frequency information. We propose an algorithm that incrementally refines images by processing frequency information as it arrives in a temporal data stream. A direct implementation of the refinement with the discrete Fourier transformation requires O(N^2) complex multiplications to process an element of the stream. We propose a new algorithm that avoids recomputations and only requires O(N) complex multiplications.
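The complexity argument can be made concrete with a one-dimensional sketch: by linearity of the DFT, a newly arriving stream element can be folded into all N output bins with one complex multiplication per bin, i.e., O(N) work instead of an O(N^2) recomputation. The paper targets the inverse 2D transform; this simplified 1D version and its variable names are ours:

    import cmath

    N = 8            # number of output bins
    bins = [0j] * N  # running DFT of the stream seen so far
    n = 0            # index of the next stream element

    def process(x: complex) -> None:
        """Fold one stream element into all N bins: O(N) multiplications."""
        global n
        for k in range(N):
            bins[k] += x * cmath.exp(-2j * cmath.pi * k * n / N)
        n += 1

    for sample in [1, 2, 3, 4]:
        process(sample)
    # bins now hold the 8-point DFT of the zero-padded prefix [1, 2, 3, 4].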
Nicolas Spielmann, Implementation of Single-Point Discrete Fourier Transform on two dimensional data, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
The discrete Fourier transform is an algorithm that transforms the time representation of a function into its frequency components, under the assumption that the function is periodic. It is the basis of many modern signal-processing applications. Currently, the fast Fourier transform, which reduces the run time of the discrete Fourier transform from O(n^2) to O(n log n), is the fastest way to compute the Fourier transform.
To calculate the Fourier transform of two-dimensional data, the one-dimensional algorithm is simply applied to all rows and then to all columns. One use case of the two-dimensional algorithm is to generate dirty sky images from radio astronomy data.
In this thesis, a faster algorithm to compute the single-point two-dimensional Fourier transform is described and proved mathematically. The new algorithm describes all fields of the transformed grid using only its first row or column.
Furthermore, the importance of the Fourier transform in stream pipelines is considered, and the new algorithm is implemented in the stream-processing environment Apache Flink. After the initial implementation, several stream-specific optimizations were made: while computing the Fourier transform, many computations are performed multiple times, and this fact is exploited to further improve the algorithm.
The different algorithms are then evaluated on artificially generated source files to determine their run time. The evaluation shows that, for the biggest grid tested, the new algorithm is 60 times faster than the standard FFT implementation; with the stream-specific enhancements, it is up to 120 times faster.
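For reference, a single output point of the two-dimensional discrete Fourier transform can be evaluated directly from its definition, which is the baseline the thesis improves on (a plain sketch of the textbook definition, not the thesis's single-point algorithm):

    import cmath

    def dft2_point(grid, u, v):
        """Evaluate one output point F[u, v] of the 2D DFT directly:
        F[u,v] = sum_m sum_n grid[m][n] * exp(-2*pi*i*(u*m/M + v*n/N))."""
        M, N = len(grid), len(grid[0])
        return sum(grid[m][n] * cmath.exp(-2j * cmath.pi * (u * m / M + v * n / N))
                   for m in range(M) for n in range(N))

    print(dft2_point([[1, 2], [3, 4]], 0, 0))   # (10+0j): the DC component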
Katerina Papaioannou, Martin Theobald, Michael Böhlen, Outer and Anti Joins in Temporal-Probabilistic Databases, In: 35th IEEE International Conference on Data Engineering, ICDE 2019, IEEE, 2019-04-08. (Conference or Workshop Paper published in Proceedings)
The result of a temporal-probabilistic (TP) join with negation includes, at each time point, the probability with which a tuple of a positive relation p matches none of the tuples in a negative relation n, for a given join condition θ. For the computation of TP joins with negation, we introduce generalized lineage-aware temporal windows, a mechanism that binds an interval to the lineages of all the matching valid tuples of each input relation. We compute these windows in an incremental manner, and we show that pipelined computations allow for the direct integration of our approach into PostgreSQL. We thereby alleviate the prevalent redundancies in the interval computations of existing approaches, as demonstrated by an extensive experimental evaluation with real-world datasets.
Anton Dignös, Boris Glavic, Xing Niu, Michael Böhlen, Johann Gamper, Snapshot Semantics for Temporal Multiset Relations, Proceedings of the VLDB Endowment, Vol. 12 (6), 2019. (Journal Article)
Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations.
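The evaluation model can be made concrete with a small sketch: under snapshot semantics, a query, here a count aggregate, is applied independently to the snapshot at each time point (the relation and the integer time domain are illustrative):

    # Tuples are (value, start, end) with half-open validity [start, end).
    r = [("a", 1, 5), ("a", 2, 4), ("b", 3, 6)]

    def snapshot(rel, t):
        """The multiset of tuples valid at time point t."""
        return [v for (v, s, e) in rel if s <= t < e]

    # Evaluate a count aggregate at each snapshot of the time domain.
    for t in range(1, 6):
        print(t, len(snapshot(r, t)))
    # 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 2, 5 -> 1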
Aikaterini Papaioannou, Martin Theobald, Michael Hanspeter Böhlen, Lineage-Aware Temporal Windows: Supporting Set Operations in Temporal-Probabilistic Databases, CoRR, Vol. abs/1910.00474, 2019. (Journal Article)
In temporal-probabilistic (TP) databases, the combination of the temporal and the probabilistic dimension adds significant overhead to the computation of set operations. Although set queries are guaranteed to yield linearly sized output relations, existing solutions exhibit quadratic runtime complexity. They suffer from redundant interval comparisons and additional joins for the formation of lineage expressions. In this paper, we formally define the semantics of set operations in TP databases and study their properties. For their efficient computation, we introduce the lineage-aware temporal window, a mechanism that directly binds intervals with lineage expressions. We suggest the lineage-aware window advancer (LAWA) for producing the windows of two TP relations in linearithmic time, and we implement all TP set operations based on LAWA. By exploiting the flexibility of lineage-aware temporal windows, we perform direct filtering of irrelevant intervals and finalization of output lineage expressions and thus guarantee that no additional computational cost or buffer space is needed. A series of experiments over both synthetic and real-world datasets show that (a) our approach has predictable performance, depending only on the input size and not on the number of time intervals per fact or their overlap, and that (b) it outperforms state-of-the-art approaches in both temporal and probabilistic databases.
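The core idea of binding intervals to lineage expressions can be sketched with a plain endpoint sweep that splits time into maximal windows over which the set of valid tuples, and hence the lineage, is constant. Note that this naive sketch rescans the inputs for every window; the paper's LAWA advancer produces the windows in linearithmic time:

    def windows(r, s):
        """Split time into maximal windows with constant lineage.
        r and s are lists of (lineage_var, start, end), validity [start, end).
        Returns (win_start, win_end, valid_r_vars, valid_s_vars) tuples."""
        points = sorted({p for (_, a, b) in r + s for p in (a, b)})
        out = []
        for lo, hi in zip(points, points[1:]):
            vr = [x for (x, a, b) in r if a <= lo < b]
            vs = [x for (x, a, b) in s if a <= lo < b]
            out.append((lo, hi, vr, vs))
        return out

    for w in windows([("x1", 1, 6)], [("y1", 2, 4), ("y2", 3, 8)]):
        print(w)
    # (1, 2, ['x1'], []), (2, 3, ['x1'], ['y1']), (3, 4, ['x1'], ['y1', 'y2']), ...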
Anton Dignös, Boris Glavic, Xing Niu, Michael Hanspeter Böhlen, Snapshot Semantics for Temporal Multiset Relations (Extended Version), 2019. (Conference or Workshop Paper)
Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations and significantly outperforms such implementations on queries that involve aggregation.
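K-relations annotate each tuple with an element of a commutative semiring; taking the natural numbers with addition and multiplication yields multiset (bag) semantics, where the annotation records a tuple's multiplicity. A minimal sketch of this annotation model, independent of the paper's temporal extension:

    from collections import Counter

    # An N-relation: tuple -> multiplicity in the natural-numbers semiring.
    r = Counter({("a",): 2, ("b",): 1})
    s = Counter({("a",): 3})

    # Bag union adds annotations; joining matching tuples multiplies them.
    union = r + s                                           # ("a",): 5, ("b",): 1
    join = Counter({t: r[t] * s[t] for t in r if t in s})   # ("a",): 6
    print(union, join)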
Aikaterini Papaioannou, Martin Theobald, Michael Hanspeter Böhlen, Generalized Lineage-Aware Temporal Windows: Supporting Outer and Anti Joins in Temporal-Probabilistic Databases, CoRR, Vol. abs/1902.04379, 2019. (Journal Article)
The result of a temporal-probabilistic (TP) join with negation includes, at each time point, the probability with which a tuple of a positive relation p matches none of the tuples in a negative relation n, for a given join condition θ. TP outer and anti joins thus retain the characteristics of relational outer and anti joins even when there exist time points at which input tuples from p have non-zero probabilities to be true and input tuples from n have non-zero probabilities to be false, respectively. For the computation of TP joins with negation, we introduce generalized lineage-aware temporal windows, a mechanism that binds an output interval to the lineages of all the matching valid tuples of each input relation. We group the windows of two TP relations into three disjoint sets based on the way attributes, lineage expressions and intervals are produced. We compute all windows in an incremental manner, and we show that pipelined computations allow for the direct integration of our approach into PostgreSQL. We thereby alleviate the prevalent redundancies in the interval computations of existing approaches, as demonstrated by an extensive experimental evaluation with real-world datasets.
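At a single time point, the probability computed by a TP anti join can be illustrated under the simplifying assumption of independent tuples; the paper itself derives probabilities from lineage expressions rather than assuming independence:

    def anti_join_prob(p_prob, matching_n_probs):
        """Probability that the p-tuple is true and no matching n-tuple is,
        at one time point, assuming independent tuples (illustrative only)."""
        q = p_prob
        for np in matching_n_probs:
            q *= (1.0 - np)
        return q

    # p is true with 0.9; two matching n-tuples are valid at this time point.
    print(anti_join_prob(0.9, [0.5, 0.2]))   # 0.9 * 0.5 * 0.8 = 0.36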
Michael Hanspeter Böhlen, Anton Dignös, Johann Gamper, Christian Jensen, Database Technology for Processing Temporal Data (Invited Paper), In: 25th International Symposium on Temporal Representation and Reasoning, TIME 2018, Leibniz-Zentrum für Informatik, 2018-10-15. (Conference or Workshop Paper published in Proceedings)
Despite the ubiquity of temporal data and considerable research on processing such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in processing historical or temporal data. The SQL:2011 standard introduced some temporal features, and commercial database management systems have started to offer temporal functionalities in a step-by-step manner. There has also been a proposal for a more fundamental and comprehensive solution for sequenced temporal queries, which allows a tight integration into relational database systems, thereby taking advantage of existing query optimization and evaluation technologies. New challenges for processing temporal data arise with multiple dimensions of time and the increasing amounts of data, including time series data that represent a special kind of temporal data.
Timothy Pescatore, Integration of Ongoing Integers into PostgreSQL, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
Data associated with valid-time intervals are common in today's databases. Such databases often store the time point now, which represents the current time and therefore changes its value as time passes. With ongoing time points like now we face the problem that the result of a query changes as time progresses. A recent ongoing approach aims to compute query results that remain valid over time. This is achieved with a new type of date, called an ongoing date, that stays uninstantiated during query evaluation. With ongoing dates we can have a time interval whose duration changes as time passes. If we do not want to refresh duration results continuously, we need an ongoing integer, which holds different values at different time points to describe such a duration. In this thesis, ongoing integers are integrated into the widely used PostgreSQL database system via a new data type called ogint. To integrate the new data type into the ongoing approach, we implemented a function that measures the duration of an ongoing interval, as well as an addition and a maximum function. With ogint we achieved a runtime for the duration function that matches a bind approach, in which ongoing dates must first be converted to fixed dates. The major advantage over the bind approach is that we do not need to reevaluate our queries as time passes.
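The idea of a value that stays uninstantiated until it is bound at a reference time point can be sketched as follows; the class and its methods are illustrative stand-ins, not the thesis's ogint implementation:

    class OngoingInt:
        """Duration of an ongoing interval [start, now): grows as time passes."""
        def __init__(self, offset, grows=True):
            self.offset = offset    # fixed part of the value
            self.grows = grows      # whether the value depends on `now`

        def at(self, now):
            """Bind the ongoing integer at reference time point `now`."""
            return now - self.offset if self.grows else self.offset

    # Duration of the interval [10, now), evaluated at two time points.
    d = OngoingInt(10)
    print(d.at(15), d.at(20))   # 5 10 -- same object, no query re-evaluation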
Alphonse Mariyagnanaseelan, Optimization of Mixed Queries in MonetDB System, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
The current approach to analyzing scientific data stored in a DBMS is to export the data to statistics software such as MATLAB or R, perform the analysis, and then import the results back into the DBMS in case they have to be stored or further manipulated with relational operations such as selections or joins. These steps are necessary because DBMSs currently support only simple aggregation and vector operations on the attributes of a relation; more sophisticated linear algebra operations are not yet available. We implement the matrix addition operation in MonetDB, a column-store DBMS. Further, we extend the query optimizer of MonetDB to optimize mixed queries, which consist of both matrix-related and relational operations. We focus on the optimization of matrix addition in combination with selections: we investigate the properties of such an optimization, define equivalence rules for rewriting such mixed queries, and discuss the trade-off involved. The experimental evaluation shows that our optimization can reduce the query time by up to 95%; however, the mentioned trade-off has a measurable and counterproductive effect in some cases. Since the optimization is done by the query optimizer, the user can simply declare the matrix operation, similar to join operations, and perform other relational operations on the result.
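The flavor of equivalence rule such a rewrite relies on can be illustrated as follows: when matrices are stored as relations keyed by a row identifier, a selection on that identifier commutes with matrix addition and can be pushed below it (the encoding and predicate are illustrative, not MonetDB internals):

    # Matrices as relations: row id -> vector of cell values.
    A = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [5.0, 6.0]}
    B = {1: [0.5, 0.5], 2: [1.0, 1.0], 3: [2.0, 2.0]}

    def mat_add(x, y):
        return {i: [a + b for a, b in zip(x[i], y[i])] for i in x if i in y}

    def select(x, pred):
        return {i: v for i, v in x.items() if pred(i)}

    pred = lambda i: i >= 2
    # sigma(A + B) == sigma(A) + sigma(B): pushing the selection down
    # avoids adding rows that the selection would discard anyway.
    assert select(mat_add(A, B), pred) == mat_add(select(A, pred), select(B, pred))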
Claudio Brasser, Implementation and Performance Evaluation of multi dimensional Fast Fourier Transform Algorithms on Streaming Data using Apache Flink, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
In this thesis, different versions of the Fast Fourier Transform (FFT) algorithm are implemented in order to evaluate their performance in a streaming environment. For this purpose, we built a streaming pipeline for the sampled input data. The dataset originates from the domain of radio astronomy, where the Fourier transform is a commonly used algorithm in the process of computing images from raw data. Cooley-Tukey FFT algorithms are a subset of FFT algorithms that use a divide-and-conquer approach to improve the runtime complexity. The discussed variations of the Cooley-Tukey FFT algorithm are Radix-2, Radix-4, and Split-Radix. For each algorithm, its explicit pseudo-code definition, the corresponding mathematical equations, and a butterfly flow diagram are presented. The Split-Radix FFT algorithm performed best across all parameter settings, followed by Radix-4 and Radix-2.
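The shared divide-and-conquer structure of these variants is captured by the recursive radix-2 case, sketched below (Radix-4 and Split-Radix refine the recursion but are not reproduced here):

    import cmath

    def fft(x):
        """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
        n = len(x)
        if n == 1:
            return x
        # Divide: transform even- and odd-indexed halves recursively.
        even, odd = fft(x[0::2]), fft(x[1::2])
        # Conquer: combine with twiddle factors (the butterfly step).
        tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
        return ([even[k] + tw[k] for k in range(n // 2)] +
                [even[k] - tw[k] for k in range(n // 2)])

    print([round(abs(v), 6) for v in fft([1, 1, 1, 1])])   # [4.0, 0.0, 0.0, 0.0]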
Valentin Weiss, Development of a Dynamic Web Application: The Swiss Feed Database, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
This thesis explores data summarization techniques for the feed data of the Swiss Feed Database. It proposes and evaluates solutions that reduce the client-side load by reducing the number of tuples passed to the client. Most of the work is shifted away from the client to the database layer, following the thin-client paradigm, with the aim of creating a stable and scalable solution. The summarization techniques are optimized to best suit the data visualization at hand. They include providing only a partial view of the data, aggregating the data set, and leveraging properties of the spatial distribution of data points to abstract groups of data points into geometric shapes.
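The spatial technique can be pictured as follows, with axis-aligned bounding boxes standing in for whatever geometric shapes the thesis derives (coordinates and grouping are illustrative):

    # Points grouped by region; each group is abstracted into a bounding box
    # so the client renders one shape instead of many individual points.
    points = {"east": [(8.5, 47.4), (8.6, 47.3), (8.7, 47.5)],
              "west": [(6.1, 46.2), (6.3, 46.3)]}

    def bounding_box(pts):
        xs, ys = zip(*pts)
        return (min(xs), min(ys)), (max(xs), max(ys))

    shapes = {region: bounding_box(pts) for region, pts in points.items()}
    print(shapes)
    # {'east': ((8.5, 47.3), (8.7, 47.5)), 'west': ((6.1, 46.2), (6.3, 46.3))}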
Katerina Papaioannou, Martin Theobald, Michael Böhlen, Supporting Set Operations in Temporal-Probabilistic Databases, In: 34th IEEE International Conference on Data Engineering, ICDE 2018, IEEE, 2018-04-16. (Conference or Workshop Paper published in Proceedings)
In temporal-probabilistic (TP) databases, the combination of the temporal and the probabilistic dimension adds significant overhead to the computation of set operations. Although set queries are guaranteed to yield linearly sized output relations, all of the existing solutions exhibit a quadratic runtime complexity. They suffer from redundant interval comparisons and additional joins for the formation of lineage expressions. In this paper, we formally define TP set operations and study their properties. For their efficient computation, we introduce the lineage-aware temporal window, a mechanism that binds intervals with lineage expressions. We suggest the lineage-aware window advancer (LAWA) for producing lineage-aware temporal windows, which enable direct filtering of irrelevant intervals and finalization of output lineage expressions. This way, we compute TP set operations in linearithmic time. A series of experiments over both synthetic and real-world datasets show that (a) our approach has predictable performance, which depends only on the size of the input relations and not on the number of time intervals per fact or the overlap of the time intervals, and that (b) it outperforms state-of-the-art approaches.
Rafael Kallis, An Adaptive Index for Hierarchical Database Systems, University of Zurich, Faculty of Business, Economics and Informatics, 2018. (Bachelor's Thesis)
The workload-aware property index is a hierarchical index that adapts to the database's recent transactional workload: to increase update performance, it does not prune volatile index nodes, i.e., nodes that are frequently inserted or deleted. When the workload changes, these nodes cease to be volatile and become unproductive if neither they nor their descendants contribute to a query match or remain volatile.
Unproductive nodes in hierarchical indexes waste space and slow down queries. We propose periodic Garbage Collection and Query-Time Pruning to clean unproductive nodes from the index. We implement both techniques in Apache Jackrabbit Oak and provide an extensive experimental evaluation that stress-tests the algorithms and shows that database throughput increases considerably when periodic Garbage Collection or Query-Time Pruning is applied.
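The pruning rule itself can be sketched as a recursive sweep that keeps a node only if it matches, is volatile, or retains a surviving descendant (the node structure is illustrative, not Jackrabbit Oak's):

    class Node:
        def __init__(self, matches=False, volatile=False, children=()):
            self.matches = matches      # contributes to a query match
            self.volatile = volatile    # frequently inserted or deleted
            self.children = list(children)

    def prune(node):
        """Drop unproductive subtrees; return True if `node` must be kept."""
        node.children = [c for c in node.children if prune(c)]
        return node.matches or node.volatile or bool(node.children)

    root = Node(matches=True, children=[Node(), Node(volatile=True)])
    prune(root)
    print(len(root.children))   # 1 -- the unproductive child was collected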
Johann Gamper, Michael Hanspeter Böhlen, Christian Jensen, Temporal Aggregation, In: Encyclopedia of Database Systems, Second Edition, Springer, New York, NY, USA, p. 58, 2018. (Book Chapter)
In database management, aggregation denotes the process of consolidating or summarizing a database instance; this is typically done by creating so-called aggregation groups of elements in the argument database instance and then applying an aggregate function to each group, thus obtaining an aggregate value for each group that is then associated with each element in the group. In a relational database context, the instances are relations and the elements are tuples. Aggregation groups are then typically formed by partitioning the tuples based on the values of one or more attributes so that tuples with identical values for these attributes are assigned to the same group. An aggregate function, e.g., sum, avg, or min, is then applied to another attribute to obtain a single value for each group that is assigned to each tuple in the group as a value of a new attribute. Relational projection is used for eliminating detail from aggregation results.
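As a minimal illustration of this grouping-then-aggregation process (attribute names are ours):

    from collections import defaultdict

    # Relation with attributes (dept, salary); group by dept, then sum salary.
    tuples = [("sales", 100), ("sales", 150), ("hr", 90)]

    groups = defaultdict(list)
    for dept, salary in tuples:
        groups[dept].append(salary)       # partition on the dept attribute

    # The aggregate value is associated with every tuple in its group.
    result = [(dept, salary, sum(groups[dept])) for dept, salary in tuples]
    print(result)  # [('sales', 100, 250), ('sales', 150, 250), ('hr', 90, 90)]

    # Projection then eliminates the detail, leaving one value per group.
    print(sorted(set((d, total) for d, _, total in result)))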
Michael Hanspeter Böhlen, Christian Jensen, Richard Snodgrass, Temporal Compatibility, In: Encyclopedia of Database Systems, Second Edition, Springer, New York, NY, USA, p. 59, 2018. (Book Chapter)
Temporal compatibility captures properties of temporal languages with respect to the nontemporal languages that they extend. Temporal compatibility, when satisfied, ensures a smooth migration of legacy applications from a nontemporal system to a temporal system. Temporal compatibility dictates the semantics of legacy statements and constrains the semantics of temporal extensions to these statements, as well as the language design.
Michael Hanspeter Böhlen, Christian Jensen, Richard Snodgrass, Nonsequenced Semantics, In: Encyclopedia of Database Systems, Second Edition, Springer, New York, NY, USA, p. 38, 2018. (Book Chapter)
Nonsequenced semantics guarantees that query language statements can reference and manipulate the timestamps that capture the valid and transaction time of data in temporal databases as regular attribute values, with no built-in temporal semantics being enforced by the query language.
Michael Hanspeter Böhlen, Christian Jensen, Richard Snodgrass, Current Semantics, In: Encyclopedia of Database Systems, Second Edition, Springer, New York, NY, USA, p. 12, 2018. (Book Chapter)
Current semantics constrains the semantics of non-temporal statements applied to temporal databases. Specifically, current semantics requires that non-temporal statements behave as if applied to the non-temporal database that is the result of taking the timeslice of the temporal database at the current time.
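A minimal sketch of the timeslice operation on which current semantics is based (the tuple encoding is illustrative):

    from datetime import date

    # Valid-time relation: (value, start, end) with half-open intervals.
    emp = [("Ann", date(2015, 1, 1), date(2017, 1, 1)),
           ("Ben", date(2016, 1, 1), date(9999, 12, 31))]

    def timeslice(rel, t):
        """Non-temporal snapshot at time t: drop intervals, keep valid tuples."""
        return [v for (v, s, e) in rel if s <= t < e]

    # A legacy (non-temporal) query under current semantics sees only this:
    print(timeslice(emp, date.today()))   # ['Ben']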
Michael Hanspeter Böhlen, Christian Jensen, Sequenced Semantics, In: Encyclopedia of Database Systems, Second Edition, Springer, New York, NY, USA, p. 52, 2018. (Book Chapter)
Sequenced semantics makes it possible to generalize a query language statement on a nontemporal database to a temporal query on a corresponding temporal, interval time-stamped database by applying minor syntactic modifications to the statement that are independent of the particular statement. The semantics of such a generalized statement is consistent with considering the temporal database as being composed of a sequence of nontemporal database states. Sequenced semantics takes into account the interval timestamps of the argument tuples when forming the interval timestamps associated with result tuples, and permits the use of additional timestamp-related predicates in statements.