Wenqing Chang, End-to-End Implementation of Pair-Wise Correlation Computation in a Streaming System, University of Zurich, Faculty of Business, Economics and Informatics, 2023. (Master's Thesis)
This thesis addresses a key challenge in time-series data analysis: efficiently identifying correlations between various data streams. As sensor networks become more prominent, the analysis of time-series data has become increasingly crucial. The insights derived from these data streams hold significant value, but extracting them efficiently remains a complex task.
This thesis focuses on dimensionality-reduction-based filter-and-refine techniques, designed to expedite the identification of correlations. Despite their utility, these techniques lack a comprehensive comparative analysis on streaming systems, and this thesis seeks to fill that gap.
The core objective is to implement these techniques within a streaming platform, enabling benchmarking under realistic conditions. The thesis is divided into several chapters that provide a comprehensive overview of the problem, delve into the dimensionality reduction algorithms, and discuss their implementation within a streaming system.
Particular emphasis is placed on the Filter-and-Refine Algorithm on the Kafka Streaming Platform. Two distinct design approaches, one based on Kafka Streams and another simpler Producer-Consumer design, are implemented, compared, and evaluated.
The thesis culminates with an exhaustive series of experiments assessing the performance of the implemented algorithms. The ultimate goal is to provide a framework that not only implements the techniques but also evaluates their performance, aiming to contribute to the field of time-series data analysis. |
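The filter-and-refine pattern described in this abstract can be sketched independently of any streaming platform. The following toy Python sketch prunes stream pairs on dimensionality-reduced series before verifying the correlation exactly (the function names `pearson`, `paa`, and `correlated_pairs` are invented here, and the pruning slack is a heuristic margin, not the thesis's method):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def paa(x, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    w = len(x) // segments
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(segments)]

def correlated_pairs(streams, threshold, segments=4):
    """Filter-and-refine over all stream pairs: a cheap check on the
    PAA-reduced series prunes pairs, and every surviving candidate is
    verified exactly on the full series. The 0.2 slack below is a
    heuristic margin, not a provable bound."""
    ids = list(streams)
    reduced = {i: paa(streams[i], segments) for i in ids}
    result = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            a, b = ids[i], ids[j]
            if pearson(reduced[a], reduced[b]) < threshold - 0.2:
                continue  # filter step: likely uncorrelated, skip exact check
            if pearson(streams[a], streams[b]) >= threshold:
                result.append((a, b))  # refine step: exact verification
    return result
```

In a streaming setting, the same two-phase structure would run over windows of each topic's data; the sketch only shows the per-window computation.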
|
Adrian Zermin, Building a Data Analysis Platform for the EARDREAM Project, University of Zurich, Faculty of Business, Economics and Informatics, 2023. (Master's Thesis)
With an aging population across the world, dementia and neurological diseases, such as Alzheimer's disease (AD), are on the rise. The disproportionate rise in AD cases in developing countries creates a need for a low-cost, robust way to diagnose early-onset AD. The EARDREAM project takes up the fight against AD in these countries using a low-density electroencephalography (EEG) device, with the goal of developing a digital biomarker of early-onset AD.
To make the collected health data accessible and enable large-scale analysis, there is a need for accessible, scalable, secure solutions for EEG data analysis.
This thesis presents the design and implementation of the Wondernap platform, a novel, cloud-based system dedicated to enhancing the current process of interacting with the data generated using the EARDREAM EEG device.
The evolution of the platform through two iterative prototypes is described, highlighting the transition from an initial prototype to a scalable, secure, cloud-based solution.
The initial prototype laid the groundwork for a scalable, modular, and transferable architecture capable of accounting for the unique requirements of the EARDREAM project, employing state-of-the-art technologies for EEG data analysis architectures.
Deploying a Flask backend and Apache HBase for EEG data storage, with MongoDB for patient data, the first iteration validated its usability through a user evaluation, scoring 84/100 on the System Usability Scale.
Building upon user feedback and stakeholder input, the second prototype accentuated the applicability of cloud computing to the current architecture, demonstrating its scalability and portability. Using infrastructure-as-code and incorporating Apache Phoenix, this prototype showcased enhancements in fault tolerance, security, and performance.
In summary, the developed platform offers a fault-tolerant, scalable, secure cloud architecture supporting a user-friendly frontend allowing its users to gain insights into the data generated using the EARDREAM portable EEG device.
|
|
Cyrill Hidber, Implementing an Index for Videos Containing Motion in a Multi-media Retrieval System, University of Zurich, Faculty of Business, Economics and Informatics, 2023. (Bachelor's Thesis)
This work focuses on the analysis and evaluation of methods for efficient indexing and querying of motion videos in multimedia databases. In particular, the software stack vitrivr is considered for the representation of a multimedia database. The emphasis is on investigating the feasibility of implementing an index for motion data from the motion videos in one of the components of the software stack. The index should then enable a more efficient query. In this context, the use of technologies such as OpenPose for extracting motion data from video sequences and Dynamic Time Warping (DTW) for analyzing and categorizing the extracted data is discussed based on previous work. The work further illuminates the application of local segmentation as a means of improving the accuracy and efficiency of a similarity assessment based on DTW. Finally, two possible implementation approaches for the index are proposed and analyzed: one that builds on the existing architecture of the retrieval engine Cineast, and another that suggests a direct integration of DTW functionality into Cottontail DB. |
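The DTW-based similarity assessment mentioned above follows a classic dynamic-programming recurrence. A minimal Python sketch on scalar series (illustrative only; the thesis operates on OpenPose skeleton data inside vitrivr, which this toy does not model):

```python
import math

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j];
    each step may match, stretch a, or stretch b.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

The local-segmentation idea discussed in the work would apply this recurrence to short segments rather than whole sequences.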
|
Loris Keist, Integration of Matrix Transposition into Database Systems, University of Zurich, Faculty of Business, Economics and Informatics, 2023. (Bachelor's Thesis)
Driven by real-world use cases, merging relational database management systems with linear algebra operations has been an ongoing research topic. It allows the analysis of large amounts of data stored in database systems. Multiple approaches have been integrated, but the linear algebra operation matrix transpose remains particularly difficult to implement. This thesis attempts to allow the transposition of relations in database management systems and avoid previous issues encountered with matrix transposition. The solution is based on decoupling the logical and physical levels in a database management system. Decoupling the two levels adds flexibility to the system and can be used to store relations differently than they appear on the logical level. It was possible to directly implement the idea in the database management system MonetDB and evaluate it against a basic version of transpose. The evaluation shows some improvements in performance, and the solution allows the transposition of relations with a large number of tuples, which has been a main issue for matrix transpose in database management systems. |
|
Xinyu Zhu, Efficient in-place iterations in MonetDB, University of Zurich, Faculty of Business, Economics and Informatics, 2023. (Master's Thesis)
The enormous growth of stored data continues to challenge our ability to efficiently process and analyze data. Many numerical computations related to state transitions based on a previous state, e.g., compound interest problems or Newton's method, require a combination of operations from the relational algebra (project, join, select, etc.) and iterations. Moreover, many analytical computations, e.g., Markov chain algorithms or various types of regressions, require a combination of operations from the relational algebra, operations from the linear algebra, and iterations. Plenty of attempts have been made to combine those elements, but time and space consumption still need improvement.
For example, some solutions focus on fetching data from the database, performing linear algebra operations and iterations, and then putting the results back into the database, which lacks the support and optimization of relational algebra. Other solutions focus on linear algebra within the database, but require additional data structures and complicated preset functions, and the correspondence between non-numeric and numerical values is not defined. Moreover, the iterations of some solutions are very space- or time-consuming due to operations involving table unions and table updates. Thus, no existing solution combines all three aspects well.
Our iterations cover all three aspects at the same time, require no additional knowledge of Python, R, Hadoop, or Spark, integrate more tightly with the classic SQL definition, and handle various types of iterations in flexible and explicit ways. We describe the definition of our iterations in MonetDB, explain the major design decisions (motivated by various complicated iterations), and discuss key in-place features and future research directions. Finally, we provide applications and experimental results that show the potential of our solutions. |
|
Lukas Vollenweider, Optimization of Lempel-Ziv Entropy Rate Estimator for Numerical Time series, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis)
The entropy rate is an important tool for reasoning about characteristics of a time series, such as its complexity, its similarity to other time series, or its compressibility. Computing the entropy rate exactly is too complex, which is why approximations are used. One such approximation is based upon Lempel-Ziv coefficients. Calculating those Lempel-Ziv coefficients is, however, computationally expensive. Therefore, this work evaluates a dictionary-based data structure, used within the Dictionary Compression Score (DCS) method, which was devised to compress time series efficiently. The goal of the evaluation is to find out how the dictionary needs to be adapted to extract Lempel-Ziv coefficients. In a second step, the algorithm is improved to be able to compete against the reference implementation. The evaluation of the proposed algorithms is done by benchmarking their running time as well as their memory usage on real-world temperature data with respect to the reference implementation. We find that the improvements can beat the reference implementation in terms of running time, while maintaining a similar memory footprint. |
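As background, Lempel-Ziv coefficients derive from an incremental parsing that counts the distinct phrases in a sequence. A toy Python sketch (the function names are invented here, and this LZ78-style parse is a simplification of the coefficients and dictionary structure used in the thesis):

```python
import math

def lz78_phrase_count(symbols):
    """Count phrases in an LZ78-style incremental parsing: extend the
    current phrase symbol by symbol until it is new, then restart."""
    phrases = set()
    current = ()
    count = 0
    for s in symbols:
        current = current + (s,)
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ()
    if current:  # unfinished phrase at the end of the sequence
        count += 1
    return count

def lz_entropy_rate_estimate(symbols):
    """Crude entropy-rate estimate from the phrase count c: h ~ c*log2(c)/n.
    Real estimators refine this; the sketch only shows the shape."""
    c = lz78_phrase_count(symbols)
    return c * math.log2(c) / len(symbols)
```

A numerical time series would first be discretized into symbols before such a parse applies, which is where the dictionary adaptation studied in the thesis comes in.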
|
Andrin Rehmann, Orthogonal Recursive Bisection on the GPU for Accelerated Load Balancing in Large N-Body Simulations, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Bachelor's Thesis)
Large simulations with billions of particles are used to understand the universe. Commonly, the fast multipole method is employed to improve the runtime, where a space-partitioning binary tree data structure is required. The structure is generated using the Orthogonal Recursive Bisection (ORB) method and at the same time used as a load balancing strategy to leverage the hardware of supercomputers optimally.
In this paper, a GPU-accelerated version of ORB is proposed and implemented in the CUDA programming language. The implementation manages to maintain a consistent runtime during the entire execution of the ORB method, regardless of the increasing complexity as fragmentation grows with the tree depth and the number of leaf cells. Performance measurements showed a speedup by a factor of 5.4 over a fully parallelized and optimized CPU version. |
|
Pei Li, Jessica Jessica, Naida Tania, Michael Böhlen, Divesh Srivastava, Jaroslaw Szlichta, abcOD: Mining Band Order Dependencies, In: 38th IEEE International Conference on Data Engineering, ICDE 2022, IEEE, 2022-05-09. (Conference or Workshop Paper published in Proceedings)
We present the design of and a demonstration plan for abcOD, a tool for efficiently discovering approximate band conditional order dependencies (abcODs) from data. abcOD utilizes a dynamic programming algorithm based on a longest monotonic band. Using real datasets, we demonstrate how the discovered abcODs can help users understand ordered data semantics, identify potential data quality problems, and interactively clean the data. |
|
Laura Toedtli, Exhaustive enumeration of variable-length motifs in time series, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis)
Motifs are recurring patterns of varying lengths in a time series. This thesis presents two problem definitions for finding such relevant motifs in time series: finding the K-most similar motif pairs for a given length range, and finding all strongly correlated motif pairs for a given length range. To solve the posed problems, we present exact algorithms for either problem: three brute-force algorithms that can be adapted to solve either problem, and the exact pruning algorithm motif enumerator (MOEN), which solves the 1-most similar motif pair problem.
In this thesis, we evaluate the pruning algorithm's scalability compared to its fastest brute force counterpart, evaluate how the correlation threshold influences the run time and compare the types of recurring motifs that the different algorithms can discover. |
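A brute-force baseline for the 1-most similar motif pair problem, as a point of reference for the pruning algorithm discussed above, can be sketched as follows (a toy Python version using plain Euclidean distance and a fixed length; the thesis's algorithms, normalization, and length ranges differ):

```python
def most_similar_pair(series, m):
    """Brute-force 1-most-similar motif pair: the two non-overlapping
    length-m subsequences with the smallest Euclidean distance.
    Returns ((i, j), distance) for start offsets i < j."""
    best, best_pair = float("inf"), None
    n = len(series)
    for i in range(n - m + 1):
        for j in range(i + m, n - m + 1):  # j >= i + m enforces non-overlap
            d = sum((series[i + k] - series[j + k]) ** 2 for k in range(m)) ** 0.5
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair, best
```

The quadratic number of candidate pairs per length, multiplied over a length range, is exactly what makes the pruning approach worthwhile.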
|
Nicoletta Farabullini, Indexing Videos Containing Human Motion in the Form of Dance, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis)
|
|
Bezaye Tesfaye, Nikolaus Augsten, Mateusz Pawlik, Michael Hanspeter Böhlen, Christian S Jensen, Speeding Up Reachability Queries in Public Transport Networks Using Graph Partitioning, Information Systems Frontiers, Vol. 24 (1), 2022. (Journal Article)
Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. None of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution. |
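Baseline option (1) above, a budget-bounded Dijkstra expansion with time-dependent edge costs, can be sketched as follows (a toy Python version; the graph layout and function names are invented here and do not reflect the paper's cell-based index):

```python
import heapq

def reachable(graph, source, start_time, budget):
    """Budget-bounded Dijkstra with time-dependent edge costs.

    graph: {node: [(neighbor, cost_fn), ...]} where cost_fn(t) gives the
    traversal cost when leaving the node at time t (e.g., waiting for the
    next departure). Returns the set of nodes reachable within the budget.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue  # stale heap entry
        for nxt, cost_fn in graph.get(node, []):
            c = cost + cost_fn(start_time + cost)  # cost depends on arrival time
            if c <= budget and c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(heap, (c, nxt))
    return set(best)
```

This expands edge by edge from the source, which is exactly the slowness the paper's cell-by-cell expansion avoids.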
|
Anton Dignös, Michael Hanspeter Böhlen, Johann Gamper, Christian S Jensen, Peter Moser, Leveraging range joins for the computation of overlap joins, The VLDB Journal, Vol. 31 (1), 2022. (Journal Article)
Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general purpose joins compared to overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals. |
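The decomposition of an overlap join into a union of range joins can be illustrated with a toy Python sketch (nested loops stand in for the B+-tree-supported range scans, and the exact predicates used in the paper may differ; half-open periods are assumed here):

```python
def overlap_join(R, S):
    """Overlap join computed as a union of two range joins.

    R, S: lists of (id, start, end) with half-open periods [start, end).
    Two periods overlap iff  r.start <= s.start < r.end  OR
                             s.start <  r.start < s.end.
    Each disjunct is a plain range predicate on a start attribute, so each
    pass could be served by a B+-tree range scan instead of these loops.
    """
    result = set()
    for r_id, rs, re in R:               # pass 1: r.start <= s.start < r.end
        for s_id, ss, se in S:
            if rs <= ss < re:
                result.add((r_id, s_id))
    for r_id, rs, re in R:               # pass 2: s.start < r.start < s.end
        for s_id, ss, se in S:
            if ss < rs < se:
                result.add((r_id, s_id))
    return result
```

The two passes partition the overlapping pairs by which period starts first, so their union is exactly the overlap join with no duplicates across passes.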
|
Pei Li, Jaroslaw Szlichta, Michael Böhlen, Divesh Srivastava, ABC of order dependencies, The VLDB Journal, Vol. 31 (5), 2022. (Journal Article)
Band order dependencies (ODs) enhance constraint-based data quality by modeling the semantics of attributes that are monotonically related to small variations without an intrinsic violation of semantics. The class of approximate band conditional ODs (abcODs) generalizes band ODs to make them more relevant to real-world applications by relaxing them to hold approximately with some exceptions (abODs) and conditionally on subsets of the data. We study the automatic dependency discovery of abcODs to avoid human burden. First, we propose a more efficient algorithm to discover abODs than in recent prior work that is based on a new optimization to compute a longest monotonic band via dynamic programming and decreases the runtime from O(n2) to O(nlogn). We then devise a dynamic programming algorithm for abcOD discovery that determines the optimal solution in polynomial time. To optimize the performance (without losing optimality), we adapt the algorithm to cheaply identify consecutive tuples that are guaranteed to belong to the same band. For generality, we extend our algorithms to discover bidirectional abcODs. Finally, we perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets. |
|
Qing Chen, Oded Lachish, Sven Helmer, Michael Hanspeter Böhlen, Dynamic Spanning Trees for Connectivity Queries on Fully-dynamic Undirected Graphs, Proceedings of the VLDB Endowment, Vol. 15 (11), 2022. (Journal Article)
Answering connectivity queries is fundamental to fully dynamic graphs where edges and vertices are inserted and deleted frequently. Existing work proposes data structures and algorithms with worst case guarantees. We propose a new data structure, the dynamic tree (D-tree), together with algorithms to construct and maintain it. The D-tree is the first data structure that scales to fully dynamic graphs with millions of vertices and edges and, on average, answers connectivity queries much faster than data structures with worst case guarantees. |
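The core idea, answering connectivity queries via the roots of a spanning forest, can be sketched in a few lines (a toy Python version that supports only insertions; handling edge deletions efficiently is precisely the hard part the D-tree addresses, and this sketch omits it):

```python
class SpanningForest:
    """Toy spanning-forest connectivity: two vertices are connected iff
    their tree roots coincide."""

    def __init__(self):
        self.parent = {}  # vertex -> parent vertex, or None for a root

    def add_vertex(self, v):
        self.parent.setdefault(v, None)

    def root(self, v):
        while self.parent[v] is not None:
            v = self.parent[v]
        return v

    def insert_edge(self, u, v):
        self.add_vertex(u)
        self.add_vertex(v)
        ru, rv = self.root(u), self.root(v)
        if ru != rv:
            self.parent[rv] = ru  # link the two trees; edge becomes a tree edge

    def connected(self, u, v):
        return self.root(u) == self.root(v)
```

Keeping such trees shallow under insertions and, above all, repairing them cheaply when a tree edge is deleted is what the D-tree's construction and maintenance algorithms are about.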
|
Kevin Wellenzohn, Michael Hanspeter Böhlen, Sven Helmer, Antoine Pietri, Stefano Zacchiroli, Robust and Scalable Content-and-Structure Indexing (Extended Version), CoRR, Vol. abs/2209.05126, 2022. (Journal Article)
Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest, publicly-available source code archive. |
|
Xiaozhe Yao, Implementing Learned Cardinality Estimation in a Database Systems Context, University of Zurich, Faculty of Business, Economics and Informatics, 2022. (Master's Thesis)
In this work, I present a tree-based approach to estimate the cardinality of a query and its sub-queries. The tree-based estimation divides the whole query into smaller sub-queries. The approach then employs a neural network for each sub-query and estimates its cardinality. Our approach can be applied to the planning tree in database systems, addressing the weakness of other approaches that do not provide cardinalities for sub-queries. The evaluation of the approach, based on a real-world dataset, shows that it has the potential to help database systems choose the optimal query plan. |
|
Alphonse Mariyagnanaseelan, Transitioning between Data and Schema Values in MonetDB, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Master's Thesis)
In the past we introduced the relational matrix algebra in order to process and analyze scientific data directly in the DBMS where the data is stored. Some relational matrix operations turn the data values of an input relation into the attribute names of their result relation. An example is the transpose operation, which uses the values of one of its input columns to form the schema of its output relation. The actual data stored in columns becomes available only during query execution. This is an issue, since the number and the names of attributes are essential information when creating query trees in a column-oriented DBMS. In this thesis we suggest a solution that encapsulates these unknown attributes in the query trees. We explain how an attribute specified in a query that refers to one of the unknown attributes can be resolved. Further, we describe the implementation of this concept in MonetDB. We propose functions with which these encapsulated columns can be processed, so that they become compatible with the existing infrastructure. In the evaluation we observe that most relational operations can work with input relations whose schema is unknown when they make use of our extensions. In the experimental evaluation we test the performance of the proposed solution. |
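The data-to-schema behavior of transpose described above can be illustrated with a toy Python sketch over a column-layout relation (the representation, the function name, and the `attribute` output column are all invented here; MonetDB's internals differ):

```python
def transpose(rel, key):
    """Transpose a column-layout relation: the values of column `key`
    become the attribute names of the result, and the remaining attribute
    names become values of a new `attribute` column.

    rel: dict mapping attribute name -> list of values (one list per column).
    """
    others = [a for a in rel if a != key]
    out = {"attribute": others}
    for i, new_attr in enumerate(rel[key]):
        # column i of the input rows becomes a whole output column,
        # named after a *data value* -- unknown until execution time
        out[new_attr] = [rel[a][i] for a in others]
    return out
```

Because `out`'s column names come from data, a query optimizer cannot know the output schema when building the query tree, which is exactly the problem the thesis's encapsulated attributes solve.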
|
Kevin Wellenzohn, Luka Popovic, Michael Hanspeter Böhlen, Sven Helmer, Inserting Keys into the Robust Content-and-Structure (RCAS) Index, In: Advances in Databases and Information Systems - 25th European Conference, ADBIS 2021, Springer, 2021. (Conference or Workshop Paper published in Proceedings)
Semi-structured data is prevalent and typically stored in formats like XML and JSON. The most common type of queries on such data are Content-and-Structure (CAS) queries, and a number of CAS indexes have been developed to speed up these queries. The state-of-the-art is the RCAS index, which properly interleaves content and structure, but does not support insertions of single keys. We propose several insertion techniques that explore the trade-off between insertion and query performance. Our exhaustive experimental evaluation shows that the techniques are efficient and preserve RCAS’s good query performance. |
|
Piero Neri, Implementation and Evaluation of Forecasting using Variants of Exponential Smoothing, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
A time series is a sequence of measurements, typically occurring at equispaced time intervals. Two time series, consisting of temperature and humidity values in the city of Zurich in 2019, were studied in this thesis. Decomposition methods are used to break down a time series into suitable components for exploring its characteristics, which gives insight into selecting an appropriate predictive model and into whether model assumptions are met. A classical decomposition method was implemented, and its application to the data sets revealed the presence of a double seasonal component (i.e., a repetitive oscillatory pattern), corresponding to the alternation of nights and days embedded within a yearly seasonal component.
This thesis focused on the implementation of various exponential smoothing methods in order to make meteorological predictions based on past values. The obtained forecast performance of the exponential smoothing methods was compared to the performance of a
commercial package of ARIMA models. The time series of temperature and humidity values were subdivided into 49 four-week time intervals, each shifted by one week in order to cover the whole 52 weeks of 2019. Within each four-week interval, the first three weeks were used as "training set", in order to identify the exponential smoothing methods and ARIMA models that best fitted the data, respectively. The training sets were also used to optimize parameters and to obtain the best fit for each method. The remaining last week of each four-week interval was
used as "test set" for predictions, based on the selected exponential smoothing methods and ARIMA models. The accuracy of forecasts was measured by comparing predicted temperature and humidity values against the values of the test set. Forecasts were performed for future
values with horizons ranging between 1 and 24 hours. In order to cover the whole test set with predictions, the first hour of the test set was iteratively transferred to the training set, allowing to cover the whole one-week interval with forecasts.
Suitable metrics were used to describe the accuracy of predictions for different time horizons within each interval. Finally, all metrics for all intervals were averaged in order to facilitate a more global analysis of performance. Predictions were particularly good at short-term horizons (i.e., a few hours into the future). The approach of identifying the best-fitting parameters and methods described in the thesis could be applied to many univariate time series of interest. Moreover, the optimization strategy that was used, featuring a mix of grid search and random search, reduced the computation time without sacrificing prediction quality: an important feature for large time series. |
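Simple exponential smoothing and a grid search for its smoothing parameter, the simplest instance of the family of methods benchmarked in the thesis, can be sketched as follows (the function names are invented here, and the seasonal variants the thesis uses are omitted):

```python
def simple_exponential_smoothing(series, alpha):
    """Simple exponential smoothing: level_t = alpha*y_t + (1-alpha)*level_{t-1}.
    Returns the final level, which serves as the one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def fit_alpha(train, alphas):
    """Pick the alpha with the lowest one-step-ahead squared error on the
    training set -- a tiny stand-in for the thesis's grid/random search."""
    def sse(alpha):
        level, err = train[0], 0.0
        for y in train[1:]:
            err += (y - level) ** 2  # forecast error before updating the level
            level = alpha * y + (1 - alpha) * level
        return err
    return min(alphas, key=sse)
```

The three-week training windows described above would play the role of `train`, and the fitted method would then forecast into the held-out week.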
|
Dimitri Degkwitz, Multipoint Incremental Fourier Transformation, University of Zurich, Faculty of Business, Economics and Informatics, 2021. (Bachelor's Thesis)
In radio astronomy the discrete Fourier Transformation is crucial for evaluating telescope data. An increased interest in a streaming approach to the Fourier Transformation has led to the development of the Single Point Incremental Fourier Transformation (SPIFT) by Saad et al. SPIFT only handles one data point at a time. We developed two new algorithms which evaluate data points in batches. The first algorithm groups data points with similar characteristics into a dictionary before SPIFT is applied. The main focus of this report is on the second algorithm, which uses dictionaries and an algorithm we named Doublestep. Doublestep uses symmetries in SPIFT to reduce the necessary number of complex additions. In theory, the Multipoint Incremental Fourier Transformation (MPIFT) with dictionaries and Doublestep reduces the asymptotic complexity of SPIFT by a factor of O(N/log(N)) for smaller batch sizes. Our implementations confirmed that they reduce the complexity significantly. |
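The incremental idea behind SPIFT, that a new sample's contribution can be added to an existing spectrum without recomputing it from scratch, can be sketched as follows (a naive per-bin update in Python; the actual SPIFT shift-vector trick and the Doublestep symmetries are not modeled here):

```python
import cmath

def incremental_dft_update(spectrum, x, p):
    """Add one data point's contribution to an N-bin DFT in place.

    The DFT is a sum over data points, so a new sample x at position p
    simply adds x * exp(-2*pi*i*k*p/N) to every bin k. SPIFT's insight is
    computing these N contributions with cheap shifts instead of N complex
    multiplications; this sketch does the naive version.
    """
    N = len(spectrum)
    for k in range(N):
        spectrum[k] += x * cmath.exp(-2j * cmath.pi * k * p / N)
    return spectrum
```

A batched variant, in the spirit of MPIFT, would group incoming points first and amortize the per-bin work across the batch.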
|