Francesco Cafagna, Leveraging sort-merge for processing temporal data, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2016. (Dissertation)
Sorting is, together with partitioning and indexing, one of the core paradigms on which current Database Management System implementations base their query processing. It can be applied to efficiently compute joins, anti-joins, nearest neighbour joins (NNJs), aggregations, etc. It is efficient since, after the sorting, it makes one sequential scan of both inputs and does not redundantly fetch tuples that do not appear in the result.
However, sort-based approaches lose their efficiency in the presence of temporal data: i) when dealing with time intervals, backtracking to previously scanned tuples that are still valid also refetches, in vain, tuples that are no longer valid and will not appear in the result; ii) when dealing with timestamps, computing NNJs with grouping attributes refetches blocks storing tuples of different groups multiple times. The goal of this thesis is to provide database systems with support for performing efficient sort-merge computations in the above cases.
We first introduce a new operator for computing NNJ queries with integrated support for grouping attributes and selection predicates. Its evaluation tree avoids false hits and redundant fetches, which are major performance bottlenecks in current NNJ solutions. We then show that, in contrast to current solutions that are not group- and selection-enabled, our approach does not constrain the scope of the query optimizer: query trees using our solution can take advantage of any optimization based on the groups and any optimization on the selection predicates. For example, with our approach the Database Management System can use a sorted index scan to fetch at once all the blocks of the fact table storing tuples with the groups of the outer relation and, thus, reduce the number of tuples to sort. With Lateral NNJs, instead, groups are processed individually, and blocks storing tuples of different groups are fetched multiple times. With our approach the selection can be pushed down before the join if it is selective, or evaluated on the fly while computing the join if it is not. With an indexed NNJ, instead, selection push-down causes a nested loop, which makes the NNJ inefficient due to the quadratic number of pairs checked. We applied our findings and implemented our approach in the kernel of the open-source database system PostgreSQL.
We then introduce a novel partitioning technique, namely Disjoint Interval Partitioning (DIP), for efficiently performing sort-merge computations on interval data. While current partitioning techniques try to place tuples with similar intervals into the same partitions, DIP does exactly the opposite: it puts tuples that do not overlap into the same partitions. This yields more merge steps between partitions, but each of them no longer requires a nested loop and can instead be performed efficiently using sort-merge. Since DIP outputs the partitions with their elements already sorted, applying a temporal operator to two DIP partitions takes linear time, in contrast to the quadratic time of state-of-the-art solutions. We illustrate the generality of our approach by describing the implementation of three basic database operators: join, anti-join, and aggregation.
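A minimal Python sketch of the DIP idea (an illustration under assumed half-open intervals, not the thesis's actual algorithm): tuples are assigned greedily so that no two intervals within a partition overlap, and sorting the input by start time makes every partition come out sorted as well.

    # Hedged sketch of Disjoint Interval Partitioning (DIP): tuples that do NOT
    # overlap are placed in the same partition, so every partition is a sequence
    # of disjoint intervals that is already sorted by start time.
    def dip_partition(tuples):
        """tuples: iterable of (start, end, payload) with half-open intervals [start, end)."""
        partitions = []     # each partition is a list of tuples, sorted by start
        last_end = []       # last_end[i] = end of the last interval in partitions[i]
        for t in sorted(tuples, key=lambda x: x[0]):
            start, end, _ = t
            # place t into the first partition whose last interval ends before t starts
            for i, e in enumerate(last_end):
                if e <= start:
                    partitions[i].append(t)
                    last_end[i] = end
                    break
            else:
                partitions.append([t])   # no disjoint partition found: open a new one
                last_end.append(end)
        return partitions

Two such partitions can then be combined with a single merge pass, which is what makes the per-pair cost linear.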
Extensive analytical evaluations confirm the efficiency of the solutions presented in this thesis. We experimentally compare our solutions to state-of-the-art approaches using real-world and synthetic temporal data.
Anton Dignös, Michael Hanspeter Böhlen, Johann Gamper, Christian S Jensen, Extending the kernel of a relational dbms with comprehensive support for sequenced temporal queries, ACM Transactions on Database Systems, Vol. 41 (4), 2016. (Journal Article)
Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offers ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data, where queries are generally more complex, exacerbates the difficulties and thus increases the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine.
This article demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS offer insights into pertinent design properties of the article's proposal. The new system is available as open-source software.
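A rough Python sketch of the interval adjustment step (names and tuple layout are illustrative; the article implements a temporal normalizer and a temporal aligner inside the PostgreSQL kernel rather than the naive nested scan shown here): each tuple of one relation is split at the boundary points of the overlapping intervals of the other relation, and the original timestamps are propagated with every fragment so that predicates and scaling functions can still refer to them.

    def split_at_boundaries(r, s):
        """r, s: lists of (start, end, value) with half-open intervals [start, end)."""
        out = []
        for (rs, re, rv) in r:
            # boundary points of s that fall strictly inside [rs, re)
            points = sorted({p for (ss, se, _) in s
                             for p in (ss, se) if rs < p < re})
            bounds = [rs] + points + [re]
            for lo, hi in zip(bounds, bounds[1:]):
                # fragment plus the propagated original timestamp (rs, re)
                out.append((lo, hi, rv, (rs, re)))
        return out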
Toni Pesut, QR decomposition integration into PostgreSQL, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Bachelor's Thesis)
There is a great demand for storing and analyzing data inside a database management system (DBMS), as scientists in a wide variety of fields rely on DBMSs as a stable and efficient way of handling data. Currently, most analysis is done by exporting data from the DBMS into a statistics program, e.g., R, and importing the results back. However, the more practical solution would be to have statistical operations available inside the DBMS.
Since QR decomposition is a frequently used operation in statistics, the goal of this bachelor thesis is the integration of QR decomposition into PostgreSQL. It has been implemented using the Gram-Schmidt algorithm and tested to determine its scalability with respect to different table sizes.
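A minimal NumPy sketch of the numerical core (the decomposition itself, leaving out the PostgreSQL integration that the thesis contributes), using a column-wise modified Gram-Schmidt:

    import numpy as np

    def gram_schmidt_qr(A):
        """QR decomposition of an m x n matrix A (m >= n, full column rank) via modified Gram-Schmidt."""
        A = np.array(A, dtype=float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for j in range(n):
            v = A[:, j].copy()
            for i in range(j):
                R[i, j] = Q[:, i] @ v    # component along the i-th orthonormal vector
                v -= R[i, j] * Q[:, i]   # remove it (modified Gram-Schmidt update)
            R[j, j] = np.linalg.norm(v)
            Q[:, j] = v / R[j, j]
        return Q, R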
Francesco Cafagna, Michael Hanspeter Böhlen, Annelies Bracher, Nearest Neighbour Join with Groups and Predicates, In: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP, ACM, 2015-10-19. (Conference or Workshop Paper published in Proceedings)
This paper proposes the nearest neighbour join, r x T [G, Θ] s, with similarity on T and integrated support for grouping attributes G and selection predicates Θ. The corresponding evaluation algorithm, roNNJ, is robust and does not suffer from redundant fetches and index false hits, which are major performance bottlenecks in nearest neighbour joins that do not support grouping attributes and selection predicates. Our solution does not incur redundant fetches since it accesses the fact table only once and uses the groups of the outer relation to limit the fact table to its relevant portions. We experimentally evaluate our solution using a data warehouse that manages analyses of animal feeds, as well as the TPC-H benchmark.
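The intended semantics can be sketched in a few lines of Python (function and tuple layout are hypothetical; the paper's roNNJ operator evaluates this inside the DBMS): the fact table is read once, restricted to tuples satisfying the predicate and to the groups of the outer relation, and each outer tuple is matched with the nearest timestamp of its own group.

    import bisect
    from collections import defaultdict

    def group_nnj(outer, fact, theta=lambda f: True):
        """outer: (group, t, payload) tuples; fact: (group, timestamp, payload) tuples."""
        by_group = defaultdict(list)
        for f in fact:
            if theta(f):                 # selection evaluated while reading the fact table
                by_group[f[0]].append(f)
        for g in by_group:
            by_group[g].sort(key=lambda f: f[1])
        times = {g: [f[1] for f in fs] for g, fs in by_group.items()}
        result = []
        for (g, t, v) in outer:
            fs = by_group.get(g)
            if not fs:
                continue
            i = bisect.bisect_left(times[g], t)
            best = min(fs[max(i - 1, 0):i + 1], key=lambda f: abs(f[1] - t))
            result.append(((g, t, v), best))
        return result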
Martin Spielmann, Computing Apps Frequently Installed Together over Very Large Amounts of Mobile Devices, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Master's Thesis)
Knowing which apps are frequently installed together is valuable for applications in the fields of app recommendation and targeted advertising. In this work, a distributed algorithm based on Apache Spark is developed to find the apps most frequently installed together on a vast number of mobile devices. The algorithm is optimized by understanding Spark's execution model and the characteristics of the input data. An available input dataset of 500'000 devices was examined with respect to the distribution of its apps and app pairs. Furthermore, the algorithm was tested extensively on different cluster configurations and with differently sized input data to investigate both scalability and affordability.
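The pair-counting step at the heart of the task can be sketched as follows (a single-machine Python illustration; the thesis distributes the same computation with Apache Spark and tunes it to the characteristics of the input data):

    from collections import Counter
    from itertools import combinations

    def frequent_app_pairs(devices, top_n=10):
        """devices: iterable of app-id sets, one per device."""
        counts = Counter()
        for apps in devices:
            for pair in combinations(sorted(apps), 2):   # canonical order avoids double counting
                counts[pair] += 1
        return counts.most_common(top_n)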
Mourad Khayati, Michael Hanspeter Böhlen, Philippe Cudré Mauroux, Using Lowly Correlated Time Series to Recover Missing Values in Time Series: A Comparison Between SVD and CD, In: Advances in Spatial and Temporal Databases - 14th International Symposium, SSTD 2015, Hong Kong, China, August 26-28, 2015. Proceedings, 2015. (Conference or Workshop Paper published in Proceedings)
The Singular Value Decomposition (SVD) is a matrix decomposition technique that has been successfully applied for the recovery of blocks of missing values in time series. In order to perform an accurate block recovery, SVD requires the use of highly correlated time series. However, using lowly correlated time series that exhibit shape and/or trend similarities could increase the recovery accuracy. Thus, the latter time series could also be exploited by including them in the recovery process.
In this paper, we compare the accuracy of the Centroid Decomposition (CD) against SVD for the recovery of blocks of missing values using highly and lowly correlated time series. We show that the CD technique better exploits the trend and shape similarities of lowly correlated time series and yields a better recovery accuracy. We run experiments on real-world hydrological and synthetic time series to validate our results.
Gionata Genazzi, Temporal Filtering to Improve Temporal Duplicate Detection, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Bachelor's Thesis)
Duplicate detection studies the problem of identifying records in a given data set that refer to the same real-world entity. A quantitative way of solving duplicate detection is to perform a similarity join. A large collection of algorithms performs the similarity join using the filter-verification framework. Algorithms of this type exploit techniques such as prefix filtering, positional filtering, and suffix filtering to perform the join efficiently. However, implementations of these algorithms ignore the temporal information of records, which can enhance duplicate detection.
In this thesis, we refine the prefix filtering, positional filtering, and suffix filtering techniques in order to perform similarity joins that also utilize the temporal information of records. Specifically, we propose three algorithms that perform temporal similarity joins with a temporal Jaccard similarity threshold. Experimental results show that these algorithms can considerably improve the performance of exact temporal similarity joins with temporal Jaccard when compared to the brute-force approach.
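For reference, a plain (non-temporal) prefix-filtering self-join with a Jaccard threshold can be sketched as below; the thesis extends this filter-verification pattern with positional and suffix filtering and with the temporal information of the records. The token order and helper names are illustrative.

    import math

    def jaccard(r, s):
        r, s = set(r), set(s)
        return len(r & s) / len(r | s)

    def prefix_filter_join(records, t):
        """Self-join of token sets with Jaccard similarity >= t (candidate generation + verification)."""
        norm = [sorted(set(r)) for r in records]   # a global token order; frequency order in practice
        index = {}                                 # token -> ids of already indexed records
        result = []
        for rid, tokens in enumerate(norm):
            plen = len(tokens) - math.ceil(t * len(tokens)) + 1   # prefix length for threshold t
            candidates = set()
            for tok in tokens[:plen]:
                candidates.update(index.get(tok, []))
                index.setdefault(tok, []).append(rid)
            for cid in candidates:                 # verification step
                if jaccard(norm[rid], norm[cid]) >= t:
                    result.append((cid, rid))
        return result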
Pei Li, Xin Luna Dong, Songtao Guo, Andrea Maurino, Divesh Srivastava, Robust Group Linkage, In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, 2015. (Conference or Workshop Paper published in Proceedings)
We study the problem of group linkage: linking records that refer to multiple entities in the same group. Applications for group linkage include finding businesses in the same chain, finding social network users from the same organization, and so on. Group linkage faces new challenges compared to traditional entity resolution. First, although different members in the same group can share some similar global values of an attribute, they represent different entities so can also have distinct local values for the same or different attributes, requiring a high tolerance for value diversity. Second, we need to be able to distinguish local values from erroneous values.
We present a robust two-stage algorithm: the first stage identifies pivots, i.e., maximal sets of records that are very likely to belong to the same group, while being robust to possible erroneous values; the second stage collects strong evidence from the pivots and leverages it for merging more records into the same group, while being tolerant to differences in local values of an attribute. Experimental results show the high effectiveness and efficiency of our algorithm on various real-world data sets.
Mourad Khayati, Recovery of Missing Values using Matrix Decomposition Techniques, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2015. (Dissertation)
Time series data is prominent in many real-world applications, e.g., hydrology or the financial stock market. In many of these applications, time series data is missing in blocks, i.e., multiple consecutive values are missing. For example, in the hydrology field around 20% of the data is missing in blocks. However, many time series analysis tasks, such as prediction, require the existence of complete data. The recovery of blocks of missing values in time series is challenging if the missing block is a peak or a valley. The problem is more challenging in real-world time series because of the irregularity in the data. The state-of-the-art recovery techniques are suitable either for the recovery of single missing values or for the recovery of blocks of missing values in regular time series. The goal of this thesis is to propose an accurate recovery of blocks of missing values in irregular time series. The recovery solution we propose is based on matrix decomposition techniques. The main idea of the recovery is to represent correlated time series as columns of an input matrix where missing values have been initialized and to iteratively apply a matrix decomposition technique to refine the initialized missing values. A key property of our recovery solution is that it learns the shape, the width and the amplitude of the missing blocks from the history of the time series that contains the missing blocks and the history of its correlated time series. Our experiments on real-world hydrological time series show that our approach outperforms the state-of-the-art recovery techniques for the recovery of missing blocks in irregular time series. The recovery solution is implemented as a graphical tool that displays, browses and accurately recovers missing blocks in irregular time series. The proposed approach supports learning from highly and lowly correlated time series. This is important since lowly correlated time series, e.g., shifted time series, that exhibit shape and/or trend similarities are beneficial for the recovery process. We reduce the space complexity of the proposed solution from quadratic to linear. This allows the use of time series with long histories without prior segmentation. We prove the scalability and the correctness of the solution.
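The iterative refinement scheme can be sketched as follows (a hedged sketch: a truncated SVD stands in for the Centroid Decomposition used in the thesis, and missing entries are initialized with column means):

    import numpy as np

    def recover_missing(X, mask, rank=2, iters=50):
        """X: matrix whose columns are correlated time series; mask: True where values are missing."""
        X = np.array(X, dtype=float)
        mask = np.asarray(mask, dtype=bool)
        col_means = np.nanmean(np.where(mask, np.nan, X), axis=0)
        X[mask] = np.take(col_means, np.where(mask)[1])    # initialization of missing entries
        for _ in range(iters):
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            s[rank:] = 0.0                                  # keep only the leading components
            X_low = (U * s) @ Vt
            X[mask] = X_low[mask]                           # refine only the missing entries
        return X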
Michael Hanspeter Böhlen, Christoph Koch, Special issue on best papers of VLDB 2013 - Guest editorial, VLDB Journal, Vol. 24 (5), 2015. (Journal Article)
Peter R. Niederberger, Development of a light-weight mobile application for the Swiss Feed Database, 2015. (Other Publication)
The Swiss Feed Database is a database storing the temporal changes of the nutritive values in animal feeds. An online system (www.feedbase.ch) is used by farmers, scientists, and others to query the database, locate feed samples, compare several nutrients, compute regressions, compute derived nutrients, and so on. The aim of this project is to develop a mobile application using Apache Cordova that implements the functionality of www.feedbase.ch in a lighter way (i.e., suited to mobile devices).
Benedikt Steger, Visualization of nutrient/nutrient relationships, 2015. (Other Publication)
Paolo Bolzoni, Sven Helmer, Kevin Wellenzohn, Johann Gamper, Periklis Andritsos, Efficient itinerary planning with category constraints, In: Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas/Fort Worth, TX, USA, November 4-7, 2014, ACM, New York, NY, USA, 2014-11-04. (Conference or Workshop Paper published in Proceedings)
We propose a more realistic approach to trip planning for tourist applications by adding category information to points of interest (POIs). This makes it easier for tourists to formulate their preferences by stating constraints on categories rather than individual POIs. However, solving this problem is not just a matter of extending existing algorithms. In our approach we exploit the fact that POIs are usually not evenly distributed but tend to appear in clusters. We develop a group of efficient algorithms based on clustering with guaranteed theoretical bounds. We also evaluate our algorithms experimentally, using real-world data sets, showing that in practice the results are better than the theoretical guarantees and very close to the optimal solution.
Kevin Wellenzohn, Hannes Mitterer, Johann Gamper, Michael Hanspeter Böhlen, Mourad Khayati, Missing value imputation in time series using Top-K case matching, In: 26th GI-Workshop Grundlagen von Datenbanken, CEUR-WS, 2014-10-21. (Conference or Workshop Paper published in Proceedings)
In this paper, we present a simple yet effective algorithm, called the Top-k Case Matching algorithm, for the imputation of missing values in streams of time series data that are similar to each other. The key idea of the algorithm is to look for the k situations in the historical data that are most similar to the current situation and to derive the missing value from the measured values at these k time points. To efficiently identify the top-k most similar historical situations, we adopt Fagin's Threshold Algorithm, yielding an algorithm with sub-linear runtime complexity with high probability, and linear complexity in the worst case (excluding the initial sorting of the data, which is done only once). We provide the results of a first experimental evaluation using real-world meteorological data. Our algorithm achieves a high accuracy and is more accurate and efficient than two more complex state-of-the-art solutions.
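A simplified sketch of the imputation step (scanning the entire history for clarity, whereas the paper uses Fagin's Threshold Algorithm to find the k most similar situations without a full scan; data layout and names are illustrative):

    def top_k_impute(history, current, k=3):
        """history: list of (ref_values, target_value) pairs, where ref_values maps reference
        stations to their measurements at that time point; current: ref_values of the time point
        whose target value is missing."""
        def dist(ref_values):
            return sum(abs(ref_values[s] - current[s]) for s in current)
        best = sorted(history, key=lambda h: dist(h[0]))[:k]   # k most similar historical situations
        return sum(target for _, target in best) / len(best)  # impute with their average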
Bruno Cadonna, Johann Gamper, Michael Hanspeter Böhlen, A robust skip-till-next-match selection strategy for event pattern matching, In: Advances in Databases and Information Systems - 18th East European Conference, ADBIS 2014, Springer International Publishing, 2014-09-07. (Conference or Workshop Paper published in Proceedings)
In event pattern matching, various selection strategies have been proposed to impose additional constraints on the events that participate in a match. The skip-till-next-match selection strategy is used in scenarios where some incoming events are noise and therefore should be ignored. Skip-till-next-match is prone to blocking noise, i.e., noise that prevents the detection of matches. In this paper, we propose the robust skip-till-next-match selection strategy, which is robust against noise and finds matches that are missed by skip-till-next-match when blocking noise occurs in the input stream. To implement the new strategy in automaton-based pattern matching algorithms, we propose a backtracking mechanism. Extensive experiments using real-world data and different event pattern matching algorithms show that with skip-till-next-match the number of matches not detected due to blocking noise can be substantial, and that our backtracking mechanism outperforms alternative solutions that first produce a superset of the result followed by a post-processing step to filter out non-compliant matches.
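The difference between the plain and the robust strategy can be sketched with a small recursive matcher (an illustration only; the paper realizes the robust strategy with a backtracking mechanism inside automaton-based algorithms):

    def match(events, pattern, i=0, j=0, robust=True):
        """pattern: list of predicates, one per pattern symbol; returns a match or None."""
        if j == len(pattern):
            return []                                   # all symbols matched
        for k in range(i, len(events)):
            if pattern[j](events[k]):                   # skip-till-next-match: ignore non-matching events
                rest = match(events, pattern, k + 1, j + 1, robust)
                if rest is not None:
                    return [events[k]] + rest
                if not robust:
                    return None                         # plain strategy commits to the first hit
                # robust strategy: the greedy choice was blocking noise, try a later event
        return None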
Martin Leimer, A dynamic website for a Temporal Probabilistic Database Implementation, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2014. (Master's Thesis)
Temporal probabilistic databases are a field of interest in recent research. In this thesis, we introduce a straightforward and user-friendly web application that allows queries to be performed in a temporal probabilistic extension of PostgreSQL. Moreover, we propose algorithms and techniques to develop interesting visualisation tools that allow the user to analyse and understand the results of the algebra operations performed. Tree-like structures illustrate how and from which base tuples a result tuple is derived, while bubble charts describe the impact of each base tuple on a specific result tuple. Predefined datasets and queries are provided to accelerate acquaintance with this web application, whereas interactive features make it suitable for all users, regardless of their level of experience.
Anton Dignös, Michael Hanspeter Böhlen, Johann Gamper, Overlap interval partition join, In: ACM SIGMOD 2014 international conference on Management of Data, ACM, New York, NY, USA, 2014-07-22. (Conference or Workshop Paper published in Proceedings)
Each tuple in a valid-time relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two valid-time relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes long-lived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short- and long-lived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJoin), which uses OIP to partition the input relations on-the-fly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk I/Os. Our experiments confirm the analytical results and show that the OIPJoin outperforms state-of-the-art techniques for the overlap join.
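The partition key and the pruning condition behind OIP can be sketched as follows (illustrative Python; the granule width and the time origin are assumed parameters):

    def oip_key(start, end, t_min, granule):
        """Partition of an interval: the pair (first granule, last granule) it covers."""
        return ((start - t_min) // granule, (end - t_min) // granule)

    def partitions_may_overlap(p, q):
        """Only partitions whose granule ranges intersect can contain overlapping tuples,
        so only such partition pairs need to be joined (false hits remain possible)."""
        return p[0] <= q[1] and q[0] <= p[1]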
Andrin Betschart, Exact and approximate confidence computation in temporal probabilistic databases, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2014. (Master's Thesis)
Probabilistic databases and temporal databases are fields of interest in recent research. In this work, a temporal probabilistic database schema is suggested that combines both aspects into one. To query such databases, we present a full relational algebra, which uses lineage in the form of Boolean formulas to keep track of the origin of derived tuples. Furthermore, we investigate four algorithms to compute the confidence value of derived tuples by using lineage. We implemented all concepts of this work in the PostgreSQL database system, and experiments show the performance of the lineage computation and of the four confidence computation algorithms.
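Exact confidence computation over lineage can be sketched by enumerating the possible worlds of the independent base tuples (exponential in the number of base tuples, which is one reason approximate algorithms are also studied; names are illustrative):

    from itertools import product

    def exact_confidence(lineage, base_probs):
        """lineage: function from an assignment dict to bool; base_probs: tuple id -> probability."""
        ids = list(base_probs)
        conf = 0.0
        for bits in product([False, True], repeat=len(ids)):
            world = dict(zip(ids, bits))
            if lineage(world):
                p = 1.0
                for tid, present in world.items():
                    p *= base_probs[tid] if present else 1.0 - base_probs[tid]
                conf += p
        return conf

    # e.g. lineage (t1 AND t2) OR t3:
    # exact_confidence(lambda w: (w["t1"] and w["t2"]) or w["t3"],
    #                  {"t1": 0.9, "t2": 0.5, "t3": 0.2})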
Mourad Khayati, Michael Hanspeter Böhlen, Johann Gamper, Memory-efficient centroid decomposition for long time series, In: ICDE, IEEE, 2014. (Conference or Workshop Paper published in Proceedings)
Real-world applications that deal with time series data often rely on matrix decomposition techniques, such as the Singular Value Decomposition (SVD). The Centroid Decomposition (CD) approximates the Singular Value Decomposition, but does not scale to long time series because of the quadratic space complexity of the sign vector computation. In this paper, we propose a greedy algorithm, termed Scalable Sign Vector (SSV), to efficiently determine sign vectors for CD applications with long time series, i.e., where the number of rows (observations) is much larger than the number of columns (time series). The SSV algorithm starts with a sign vector consisting of only 1s and iteratively changes the sign of the element that maximizes the benefit. The space complexity of the SSV algorithm is linear in the length of the time series. We provide proofs for the scalability, the termination and the correctness of the SSV algorithm. Experiments with real-world hydrological time series and data sets from the UCR repository validate the analytical results and show the scalability of SSV.
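A simplified greedy variant of the idea can be written with memory linear in the number of rows (a hedged sketch that maximizes the same objective, not necessarily the exact SSV update rule of the paper):

    import numpy as np

    def greedy_sign_vector(X, max_iter=1000):
        """Greedily search for a sign vector z maximizing ||X^T z||^2, the quantity the
        Centroid Decomposition needs; X has n rows (observations) and m << n columns."""
        X = np.asarray(X, dtype=float)
        n, m = X.shape
        z = np.ones(n)
        w = X.T @ z                           # m-dimensional, maintained incrementally
        for _ in range(max_iter):
            # gain of flipping z[i]: ||w - 2*z[i]*X[i]||^2 - ||w||^2
            gains = -4.0 * z * (X @ w) + 4.0 * np.einsum("ij,ij->i", X, X)
            i = int(np.argmax(gains))
            if gains[i] <= 0:
                break                         # no single flip improves the objective
            w -= 2.0 * z[i] * X[i]
            z[i] = -z[i]
        return z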
Anton Dignös, Interval-Dependent Attributes in Relational Database Systems, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2014. (Dissertation)
Data with time intervals is prominently present in finance, accounting, medicine and many other application domains. When querying such data, it is important to perform operations on aligned intervals, i.e., data is processed together only for the common interval where it is valid in the real world. For instance, an employee contributed to a project only for the time period where both the project was running and the employee was employed by the company, i.e., the employee contributed to the project only over their aligned time interval. A temporal join is thus only evaluated over the aligned interval of an employee and a project. The difficulty of performing temporal operations, such as temporal aggregation or temporal joins, on data with time intervals using relational database systems can be attributed to the lack of primitives for the alignment of intervals. Even more challenges arise when the data includes attribute values that are interval-dependent, such as project budgets or cumulative costs, and need to be scaled along with the alignment of intervals during processing. The goal of this thesis is to provide systematic and built-in support for querying data with intervals in relational database systems. The solution we propose uses two temporal primitives, a temporal normalizer and a temporal aligner, for the alignment of intervals. Temporal operators on interval data are defined by reduction rules that map a temporal operator to an operation with a temporal primitive followed by the corresponding traditional non-temporal operator that uses equality on aligned intervals. A key feature of our approach is that operators can access the original time intervals in predicates and functions, such as join conditions and aggregation functions, using timestamp propagation. Our approach, through timestamp propagation, supports the scaling of attribute values that are interval-dependent. When intervals are aligned during query processing, scaling can be performed at query time with the help of user-defined functions. This allows users to choose whether and how attribute values should be scaled. This is necessary since they may be interested in the total value in one query and the scaled value according to days or even working days in another query. We integrated our solution into the kernel of the open-source database system PostgreSQL, which allows us to leverage existing query optimization techniques and algorithms.
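In this approach, scaling is just a user-defined function applied at query time to the propagated original interval and the aligned interval; a proportional, day-based variant might look as follows (a hedged sketch with a hypothetical function name, not the thesis's built-in code):

    from datetime import date

    def scale_value(value, original, aligned):
        """Scale an interval-dependent attribute (e.g., a budget) from its original validity
        interval to an aligned sub-interval, proportionally to the number of covered days.
        Intervals are half-open (start, end) pairs of dates."""
        return value * (aligned[1] - aligned[0]).days / (original[1] - original[0]).days

    # A budget of 600 valid over 30 days, scaled to a 10-day aligned fragment, yields 200:
    # scale_value(600, (date(2014, 1, 1), date(2014, 1, 31)), (date(2014, 1, 1), date(2014, 1, 11)))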