Tobias Egger, Scalable Exploratory Analyses of Feed Data, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
In this thesis, an approach to measuring performance in a multi-tier application is designed and implemented. The results of the performance measurements are used to identify the main limiting factors in the Swiss Feed Database. The problematic parts of the PostgreSQL database tier are subsequently optimized. For this, query plans are analysed in detail and different performance-enhancing features of PostgreSQL are used. Spatial indexing is used to optimize spatial joins. Just-in-time compilation enhances the overall performance of database queries. Furthermore, temporary tables are used as a substitute for common table expressions. Reducing the output size enhances performance as well. Finally, substituting slow string functions further improves performance. The optimizations are evaluated with regard to performance gain and scalability.
Alexander Schindler, Different Approaches to Supporting Connectivity Queries on Undirected Fully-Dynamic Graphs, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis)
Connectivity queries are among the basic queries on a graph, but there is no trivial way to implement them for fully-dynamic graphs. In the last decades this has been a popular problem in theoretical computer science: a number of data structures have been proposed, but many of them have never been implemented and tested. In this thesis three different approaches are explained and discussed: a brute-force approach, an algorithm based on directed rooted trees (IDRFA), and a simpler version of an algorithm suggested by Wulff-Nilsen. For each algorithm we present an implementation and show its performance. The experiments reveal that the performance depends strongly on the graph’s structure and the application’s actual needs in terms of which operations happen at which frequency.
Muhammad Saad, Abraham Bernstein, Michael Hanspeter Böhlen, Daniele Dell'Aglio, Single Point Incremental Fourier Transform on 2D Data Streams, In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), IEEE Xplore, New York, p. 852 - 863, 2021. (Book Chapter)
In radio astronomy, antennas monitor portions of the sky to collect radio signals. The antennas produce data streams of high volume and velocity (2.5 GB/s), and the inverse Fourier transform is used to convert the collected signals into sky images that astrophysicists use to conduct their research. Applying the inverse Fourier transform in a streaming setting, however, is not ideal since its computational complexity is quadratic in the size of the image.
In this article, we propose the Single Point Incremental Fourier Transform (SPIFT), a novel incremental algorithm to produce sequences of sky images. SPIFT computes the Fourier transform for a new signal in a linear number of complex multiplications by exploiting twiddle factors, i.e., multiplicative constant coefficients. We prove that twiddle factors are periodic and show how circular shifts can be exploited to reuse multiplication results. The cost of the additive operations can be curbed by exploiting the embarrassingly parallel nature of the additions, which modern big data streaming frameworks can leverage to compute slices of the image in parallel. Our experiments suggest that SPIFT can efficiently generate sequences of sky images: it computes the complex multiplications 4 to 12 times faster than the Discrete Fourier Transform, and its parallelisation of the additive operations shows linear speedup.
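The core of SPIFT, periodic twiddle factors and shift-based reuse, can be illustrated in a few lines. The following pure-Python sketch is our own minimal reading of the idea, not the paper's implementation: the function names, the integer gridding of signal coordinates, and the small image size are assumptions, and the parallel additive stage is omitted.

```python
import cmath

def make_twiddles(n):
    # Precompute one period of the twiddle factors e^(2*pi*i*k/n).
    # By periodicity, any factor e^(2*pi*i*(a*b)/n) equals base[(a*b) % n].
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

def spift_update(image, base, u, v, value):
    # Add the contribution of one new signal at grid position (u, v) to an
    # n x n image. Only the n products value * base[k] are true complex
    # multiplications; every entry of the contribution matrix is then found
    # by a circular index shift, so the remaining work is additive.
    n = len(image)
    vec = [value * b for b in base]          # n complex multiplications
    for l in range(n):
        shift = (u * l) % n                  # row l is a circular shift
        for m in range(n):
            image[l][m] += vec[(shift + v * m) % n]
```

A direct evaluation of the same contribution would need a quadratic number of complex multiplications per signal; here the quadratic part consists of additions only, which is the part SPIFT parallelises.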
Julian Minder, Enhanced String Similarity Scoring for Company Entities, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
This thesis introduces a method to improve the string similarity of company names by identifying information about semantic substructures in company names. The method consists of two collaborating algorithms that firstly use string frequencies to detect distinguishable tokens and secondly add semantic understanding of company names by identifying industry metadata. Both algorithms are based on an analysis of frequencies in a large corpus of company records. It is demonstrated that the algorithms implemented in the Record Linkage System by Gschwind et al. [1] are able to enhance the similarity function effectively and improve company record matching.
[1] T. Gschwind, C. Miksovic, J. Minder, K. Mirylenka, and P. Scotton. Fast record linkage for company entities. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data 2019), 2019.
Luka Popovic, Supporting Updates in the RCAS Index, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis)
The Robust Content-And-Structure (RCAS) index is a novel index for semi-structured hierarchical data. Unlike pure content indexes or pure structure indexes, the RCAS index combines the content and structure of the data in a single index by interleaving paths and values. At the core of the RCAS index is a new interleaving scheme, called dynamic interleaving, that interleaves paths and values at their discriminative bytes. Currently, the RCAS index only supports a bulk-loading algorithm to create the index from a set of keys. In this work we explore, implement, and evaluate new ways to insert keys into the RCAS index that strike a balance between the update and query performance of the index. We propose two new insert methods and, to further improve the index’s insert and query performance, we utilize an auxiliary RCAS index. Furthermore, we investigate two merge algorithms that aim to efficiently combine the contents of the main index and the newly introduced auxiliary index. Our experiments show that the combination of the proposed insertion methods, indexes, and merge algorithms provides efficient inserts while preserving high query performance.
Bezaye Tesfaye, Nikolaus Augsten, Mateusz Pawlik, Michael Hanspeter Böhlen, Christian S Jensen, An Efficient Index for Reachability Queries in Public Transport Networks, In: 24th European Conference on Advances in Databases and Information Systems, ADBIS 2020, Springer, 2020-08-25. (Conference or Workshop Paper)
Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. Neither of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution.
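The baseline option (1) above can be made concrete with a short sketch. The code below is our own illustration, not the proposed index: it runs a Dijkstra-style expansion with time-dependent edge costs and a cost budget, and the graph representation and all names are assumptions.

```python
import heapq

def reachable_pois(graph, source, start_time, budget, pois):
    # graph: node -> list of (neighbour, cost_fn); cost_fn(t) returns the
    # traversal cost when leaving at time t (e.g. waiting plus riding time).
    # Expands edge by edge and returns every POI whose cheapest arrival
    # cost from `source` at `start_time` stays within the budget.
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue                               # stale queue entry
        for nbr, cost_fn in graph.get(node, []):
            c = cost + cost_fn(start_time + cost)  # time-dependent edge cost
            if c <= budget and c < best.get(nbr, float("inf")):
                best[nbr] = c
                heapq.heappush(heap, (c, nbr))
    return {p for p in pois if best.get(p, float("inf")) <= budget}
```

Because this expands edge by edge from scratch for every query, it is exactly the kind of solution the cell-based index is designed to outperform.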
Michael Hanspeter Böhlen, Oksana Dolmatova, Michael Krauthammer, Alphonse Mariyagnanaseelan, Jonathan Stahl, Timo Surbeck, Iterations for Propensity Score Matching in MonetDB, In: 24th European Conference on Advances in Databases and Information Systems, ADBIS 2020, Springer, 2020-08-25. (Conference or Workshop Paper)
The amount of data that is stored in databases and must be analyzed is growing fast. Many analytical tasks are based on iterative methods that approximate optimal solutions. Propensity score matching is a technique that is used to reduce bias during cohort building. The main step is the propensity score computation, which is usually implemented via iterative methods such as gradient descent. Our goal is to support efficient and scalable propensity score computation over relations in a column-oriented database. To achieve this goal, we introduce shape-preserving iterations that update values in existing tuples until a fixpoint is reached. Shape-preserving iterations enable gradient descent over relations and, thus, propensity score matching. We also show how to create appropriate input relations for shape-preserving iterations with randomly initialized relations. The empirical evaluation compares in-database iterations with the native implementation in MonetDB where iterations are flattened.
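Propensity scores are commonly fitted with logistic regression via gradient descent, the iterative method named above. The sketch below is a plain-Python illustration with a single covariate; it is not the MonetDB implementation, and all names are our own.

```python
import math

def propensity_scores(rows, lr=0.1, tol=1e-8, max_iter=10000):
    # rows: list of (covariate, treated) pairs. Fits
    # P(treated | x) = sigmoid(w0 + w1 * x) by gradient descent, iterating
    # until a fixpoint is reached (weight change below tol), then returns
    # one propensity score per input row.
    w0 = w1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
            g0 += p - y                      # gradient of the log loss
            g1 += (p - y) * x
        nw0 = w0 - lr * g0 / len(rows)
        nw1 = w1 - lr * g1 / len(rows)
        if abs(nw0 - w0) < tol and abs(nw1 - w1) < tol:
            break                            # fixpoint reached
        w0, w1 = nw0, nw1
    return [1.0 / (1.0 + math.exp(-(w0 + w1 * x))) for x, _ in rows]
```

Each pass overwrites existing values instead of producing ever-growing intermediate results, mirroring the shape-preserving iterations described above.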
Dominique Christina Hässig, Development of Adaptive Heatmaps for Interactive Feed Explorations, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
The goal of this thesis was the further development of the visualization of the Swiss Feed Database.
There are multiple ways to present the geographical distribution of data on a map; heatmaps are one of them. The major parameters for a useful presentation are the radius and the colour gradient. A main goal of this thesis was the construction of correct heatmaps at arbitrary zoom levels and with arbitrarily many data points.
Another goal of this thesis was the migration of the Swiss Feed Database from the older web application framework AngularJS to the newer Angular. We explain how the structure of the website appears in the new version and how it works.
Marc Rettenbacher, Integrating the RCAS Index with the Software Heritage Archive, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
The Software Heritage Archive aims to collect and preserve all publicly available software
in the form of source code. It saves the project history and structure in the form of a graph
and makes it publicly available through multiple interfaces. We want to use the data from the
Software Heritage Archive to test the novel Robust Content-And-Structure (RCAS) index on
a larger scale. This is a first step to providing an interface to query the Software Heritage
Archive directly. For this we propose a way to parse the Archive and extract file paths and file
sizes of all projects as sample values for both Content and Structure, as the current interfaces
do not directly offer this functionality.
We implement and use an RCAS index to integrate our parsed data, as the index is made for
semi-structured hierarchical data and thus answers CAS queries efficiently. To measure the
index performance, we run different queries against it, utilizing the descendant axis // and
the wildcard character * in the path part of the query. We found that the placement of the
descendant axis // and the wildcard * has a large impact on query performance.
Johann Schwabe, A GPU-enabled Single-Point Incremental Fourier Transform, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
Although Fourier transforms are widely used, special implementations are still being developed
for high performance applications. For use in a streaming environment, the Single Point
Incremental Fourier Transform (SPIFT) was recently proposed (Saad et al. 2020). In SPIFT
the main computational bottleneck is the incremental addition, where the Single Point Fourier
Transform of each new datapoint is integrated into the previous result. Two key optimizations
to speed up SPIFT were introduced and tested. Firstly, GPUs are used to efficiently sum the
Single Point Fourier Transforms of the individual datapoints to the final result. Secondly, Single
Point Fourier Transform of the different datapoints with the same shift are combined using
an aggregation matrix instead of individually being integrated into the resultant image. Five
different implementations combining these two optimizations were evaluated. Both optimizations
are crucial and are together able to increase the throughput on tested matrix dimensions
by three orders of magnitude. |
Oksana Dolmatova, Nikolaus Augsten, Michael Hanspeter Böhlen, A Relational Matrix Algebra and its Implementation in a Column Store, In: ACM SIGMOD International Conference on Management of Data, ACM, 2020-06-14. (Conference or Workshop Paper published in Proceedings)
Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.
Yvonne Mülle, Michael Hanspeter Böhlen, Query Results over Ongoing Databases that Remain Valid as Time Passes By, In: 36th IEEE International Conference on Data Engineering, ICDE 2020, April 20-24, 2020, IEEE, 2020-04-20. (Conference or Workshop Paper published in Proceedings)
The ongoing time point now is used to state that a tuple is valid from its start point onward. For database systems, ongoing time points have far-reaching implications since they change continuously as time passes by. State-of-the-art approaches deal with ongoing time points by instantiating them to the reference time. The instantiation yields query results that are only valid at the chosen time and get invalidated as time passes by. We propose a solution that keeps ongoing time points uninstantiated during query processing. We do so by evaluating predicates and functions at all possible reference times. This renders query results independent of a specific reference time and yields results that remain valid as time passes by. As query results, we propose ongoing relations that include a reference time attribute. The value of the reference time attribute is restricted by predicates and functions on ongoing attributes. We describe and evaluate an efficient implementation of ongoing data types and operations in PostgreSQL.
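The effect of keeping now uninstantiated can be illustrated with ongoing periods of the form [start, now). The representation below is our own simplification for illustration, not the PostgreSQL data types from the paper.

```python
def intersect_ongoing(s1, s2):
    # Intersect two ongoing periods [s1, now) and [s2, now) without
    # instantiating now: at every reference time rt both periods end at rt,
    # so the intersection is [max(s1, s2), now). The result is again an
    # ongoing period and remains valid as time passes by.
    return (max(s1, s2), "now")

def instantiate(period, rt):
    # Evaluate a period at a concrete reference time rt. Ongoing end points
    # become rt; periods that have not started yet collapse to empty.
    start, end = period
    if end == "now":
        end = rt
    return (start, max(start, end))
```

Instantiating the uninstantiated result at any reference time gives the same answer as instantiating first and intersecting afterwards, which is why the result never needs to be recomputed.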
Pei Li, Jaroslaw Szlichta, Michael Böhlen, Divesh Srivastava, Discovering Band Order Dependencies, In: 36th IEEE International Conference on Data Engineering, ICDE 2020, IEEE, 2020-04-20. (Conference or Workshop Paper published in Proceedings)
We introduce band ODs to model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. To make band ODs relevant to real-world applications, we make them less strict to hold approximately with some exceptions. Since formulating integrity constraints manually is cumbersome, we study the problem of automatic approximate band OD discovery. We devise an algorithm that determines the optimal solution in polynomial time. We perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets.
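Under a simplified reading (ascending order, a fixed band width, and our own function names), a band OD can be checked in a single pass, and the number of violating tuples indicates whether it holds approximately:

```python
def band_od_violations(values, delta):
    # Checks a simplified band OD on a sequence ordered by another
    # attribute: values must be non-decreasing up to a band of width delta,
    # i.e. no value may drop more than delta below the running maximum.
    # Returns the indices of violating tuples; an approximate band OD holds
    # when the violation count stays within the allowed exceptions.
    violations, running_max = [], float("-inf")
    for i, v in enumerate(values):
        if v < running_max - delta:
            violations.append(i)
        running_max = max(running_max, v)
    return violations
```

Small dips within the band are tolerated by design; only drops beyond delta count as violations.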
Oksana Dolmatova, Nikolaus Augsten, Michael Hanspeter Böhlen, Preserving Contextual Information in Relational Matrix Operations, In: 36th IEEE International Conference on Data Engineering, ICDE 2020, IEEE, 2020-04-20. (Conference or Workshop Paper published in Proceedings)
There exist large amounts of numerical data that are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The results of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS.
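The idea of carrying contextual information through a matrix operation can be sketched over plain Python relations. The sketch below is our own illustration, not the MonetDB operator: it assumes the order of rel_a's numeric columns matches the key order of rel_b's rows, and all names are hypothetical.

```python
def relational_matmul(rel_a, a_key, a_cols, rel_b, b_key, b_cols):
    # Multiplies the numeric parts of two relations as matrices while
    # preserving context: the result keeps rel_a's non-numerical key
    # attribute and uses rel_b's numeric column names as its schema.
    # Rows are ordered by sorting on the keys, since relations are
    # unordered but matrix rows are not.
    rows_a = sorted(rel_a, key=lambda r: r[a_key])
    rows_b = sorted(rel_b, key=lambda r: r[b_key])
    assert len(rows_b) == len(a_cols), "inner dimensions must match"
    result = []
    for ra in rows_a:
        out = {a_key: ra[a_key]}
        for c in b_cols:
            out[c] = sum(ra[ac] * rb[c] for ac, rb in zip(a_cols, rows_b))
        result.append(out)
    return result
```

The non-numerical key column travels untouched through the operation, which is exactly the contextual information a plain matrix library would discard.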
Pei Li, Jaroslaw Szlichta, Michael Hanspeter Böhlen, Divesh Srivastava, Discovery of Band Order Dependencies, In: ICDE 2020, arXiv, 2020-04-01. (Conference or Workshop Paper published in Proceedings)
We introduce band ODs to model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. To make band ODs relevant to real-world applications, we make them less strict to hold approximately with some exceptions and conditionally on subsets of the data with a mix of ascending and descending directions. Formulating integrity constraints manually requires domain expertise, is prone to human errors, and is time consuming. Thus, we study the problem of automatic abcOD discovery. We propose an algorithm that utilizes the notion of a longest monotonic band (LMB) to identify longest subsequences of tuples that satisfy a band OD. We formulate the abcOD discovery problem as a constraint optimization problem, and devise a dynamic programming algorithm that determines the optimal solution in polynomial time (super-cubic complexity). To further optimize the performance over large datasets, we adapt the algorithm to consider pieces (contiguous sequences of tuples) in a greedy fashion. This improves the performance by orders of magnitude without sacrificing precision in practice. We show that for unidirectional abcODs, with only ascending or descending orderings, our pieces-based algorithm is guaranteed to find the optimal solution. Finally, we perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets.
Thomas Mannhart, A General-purpose Range Join Algorithm for PostgreSQL, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis)
In this thesis we provide a range join algorithm based on the sort-merge paradigm and its implementation in the open-source RDBMS PostgreSQL. The traditional sort-merge join is an efficient join algorithm for equality constraints, while a range join additionally considers a predicate describing that a value from one relation is in the range between two values of the other relation. PostgreSQL implements the sort-merge join, or Merge Join (MJ), as a state machine adhering to the demand-pull pipeline paradigm. Our range join, or Range Merge Join (RMJ), builds on the existing implementation and extends it with additional conditions to efficiently handle range joins. We describe in detail how we modified the PostgreSQL optimizer and executor to achieve this goal. We provide the implementation of the RMJ algorithm as well as the identification of possible range join predicates and the correct sorting of the input relations. We show the benefits of our implementation in several experiments using real-world and synthetic workloads and datasets. The experiments show a major reduction in execution time in most real-world and all of our synthetic workloads, while only incurring a minor overhead in planning time in a few cases.
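The sort-merge idea behind a range join can be shown in a standalone sketch; this is our own pure-Python illustration of the join semantics, not the PostgreSQL executor code, and all names are assumptions.

```python
def range_merge_join(r_vals, s_ranges):
    # Joins every value v from R with every range (low, high) from S such
    # that low <= v <= high. Both inputs are sorted first; for each value
    # we open ranges whose lower bound has been reached and drop ranges
    # whose upper bound lies behind us, which is safe because the values
    # are processed in ascending order.
    r = sorted(r_vals)
    s = sorted(s_ranges, key=lambda t: t[0])
    out, active, i = [], [], 0
    for v in r:
        while i < len(s) and s[i][0] <= v:
            active.append(s[i])              # range becomes a candidate
            i += 1
        active = [(lo, hi) for lo, hi in active if hi >= v]   # expire ranges
        out.extend((v, rng) for rng in active)
    return out
```

As in a Merge Join state machine, each sorted input is consumed in a single forward pass; only ranges that still overlap the current value are kept as candidates.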
Mourad Khayati, Philippe Cudré-Mauroux, Michael Hanspeter Böhlen, Scalable recovery of missing blocks in time series with high and low cross-correlations, Knowledge and Information Systems (KAIS), Vol. 62 (6), 2020. (Journal Article)
Missing values are very common in real-world data, including time series data. Failures in power, communication or storage can leave occasional blocks of data missing in multiple series, affecting not only real-time monitoring but also compromising the quality of data analysis. Traditional recovery (imputation) techniques often leverage the correlation across time series to recover missing blocks in multiple series. These recovery techniques, however, assume high correlation and fall short in recovering missing blocks when the series exhibit variations in correlation. In this paper, we introduce a novel approach called CDRec to recover large missing blocks in time series with high and low correlations. CDRec relies on the centroid decomposition (CD) technique to recover multiple time series at a time. We also propose and analyze a new algorithm called Incremental Scalable Sign Vector to efficiently compute CD in long time series. We empirically evaluate the accuracy and the efficiency of our recovery technique on several real-world datasets that represent a broad range of applications. The results show that our recovery is orders of magnitude faster than the most accurate algorithm while producing superior results in terms of recovery accuracy.
Kevin Wellenzohn, Michael Hanspeter Böhlen, Sven Helmer, Dynamic Interleaving of Content and Structure for Robust Indexing of Semi-Structured Hierarchical Data, Proceedings of the VLDB Endowment, Vol. 13 (10), 2020. (Journal Article)
We propose a robust index for semi-structured hierarchical data that supports content-and-structure (CAS) queries specified by path and value predicates. At the heart of our approach is a novel dynamic interleaving scheme that merges the path and value dimensions of composite keys in a balanced way. We store these keys in our trie-based Robust Content-And-Structure index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. Additionally, we show important properties of our scheme, such as robustness against varying selectivities, and demonstrate improvements of up to two orders of magnitude over existing approaches in our experimental evaluation.
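The discriminative byte at the heart of dynamic interleaving can be computed with a simple scan. The sketch below is our own simplification over plain byte strings, ignoring the trie layout and the separate path and value dimensions of the actual index:

```python
def discriminative_byte(keys):
    # Returns the position of the first byte in which the given byte
    # strings differ; up to that position all keys share a common prefix.
    # Dynamic interleaving uses this position, computed separately for the
    # path and the value dimension, to decide where to switch between the
    # two dimensions of a composite key.
    if not keys:
        return 0
    shortest = min(len(k) for k in keys)
    first = keys[0]
    for pos in range(shortest):
        if any(k[pos] != first[pos] for k in keys):
            return pos
    return shortest
```

Interleaving at this data-dependent position, rather than at a fixed offset, is what makes the scheme robust against varying selectivities.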
Yvonne Mülle, Michael Hanspeter Böhlen, Query Results over Ongoing Databases that Remain Valid as Time Passes By (Extended Version), In: ArXiv.org, No. 05722, 2020. (Working Paper)
The ongoing time point now is used to state that a tuple is valid from its start point onward. For database systems, ongoing time points have far-reaching implications since they change continuously as time passes by. State-of-the-art approaches deal with ongoing time points by instantiating them to the reference time. The instantiation yields query results that are only valid at the chosen time and get invalidated as time passes by. We propose a solution that keeps ongoing time points uninstantiated during query processing. We do so by evaluating predicates and functions at all possible reference times. This renders query results independent of a specific reference time and yields results that remain valid as time passes by. As query results, we propose ongoing relations that include a reference time attribute. The value of the reference time attribute is restricted by predicates and functions on ongoing attributes. We describe and evaluate an efficient implementation of ongoing data types and operations in PostgreSQL.
Oksana Dolmatova, Nikolaus Augsten, Michael Hanspeter Böhlen, A Relational Matrix Algebra and its Implementation in a Column Store, In: ArXiv.org, No. 05517, 2020. (Working Paper)
Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.