Erik Hasselberg, The Multigranular Temporal Nearest Neighbor Join Operator, 2014. (Other Publication)
This report implements an extension of the nearest neighbor join operator that deals with similarity on timestamps. The TNNJ (temporal nearest neighbor join) operator takes two relations, r with schema R = [E, G, T] and s with schema S = [E, G, T, M], and computes an equijoin on E and G together with a nearest neighbor join on T, grouped by G, for those tuples of the outer relation r without an equijoin match.
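A minimal Python sketch of the operator under one plausible reading of the above; the tuple layouts, the absolute-difference distance on T, and the handling of ties are illustrative assumptions, not the report's implementation:

```python
def tnnj(r, s):
    """r rows: (E, G, T); s rows: (E, G, T, M). Illustrative sketch."""
    s_by_eg, s_by_g = {}, {}
    for e, g, t, m in s:
        s_by_eg.setdefault((e, g), []).append((t, m))
        s_by_g.setdefault(g, []).append((t, m))
    result = []
    for e, g, t in r:
        if (e, g) in s_by_eg:                  # equijoin on E and G
            result.extend((e, g, t, m) for _, m in s_by_eg[(e, g)])
        else:                                  # no equijoin match:
            cands = s_by_g.get(g, [])          # nearest neighbor on T in group G
            if cands:
                _, m = min(cands, key=lambda c: abs(c[0] - t))
                result.append((e, g, t, m))
    return result
```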
Markus Neumann, Multiple linear regression in databases, 2014. (Other Publication)
Matrix operations have many applications in various fields of research and industry, where big datasets and relational database management systems prevail. Performing matrix operations inside a database can be expensive, and therefore efficient algorithms are required. The MAD project provides a recent set of such algorithms, from which I implemented the `ordinary least squares' algorithm that estimates the linear regression coefficients, as a Facharbeit in the course of my minor at UZH. This work presents an introduction to linear regression, the issues with its application in a database system, as well as the implementation details of the algorithm in PostgreSQL.
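For reference, a minimal NumPy sketch of ordinary least squares via the normal equations; the in-memory setting and the names are illustrative, since the report's implementation runs inside PostgreSQL:

```python
import numpy as np

def ols(X, y):
    # Ordinary least squares via the normal equations:
    # beta = (X^T X)^(-1) X^T y, with an intercept column prepended.
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    return np.linalg.solve(X.T @ X, X.T @ np.asarray(y, float))
```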
Sophie Leuenberger, Multiple linear regression in databases, 2014. (Other Publication)
Multidimensional statistical models such as multiple linear regression models are usually computed outside a database management system (DBMS). In this report, we study how a multiple linear regression model can be computed inside a DBMS. The concept of "summary matrices", introduced by Ordonez as a tool to compute statistical models inside a DBMS, is presented and adapted to the case of multiple linear regression. We consider two different implementation approaches, study the performance of both alternatives, and determine which one is better suited in which situation.
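A minimal Python sketch of the summary idea: a single scan accumulates constant-size summaries from which the coefficients are derived, which is what makes the computation suitable for a DBMS aggregate. The exact summary matrices used in the report may differ:

```python
import numpy as np

def summary_regression(rows):
    # One pass over the tuples, accumulating constant-size summaries:
    # XtX = sum of x x^T and Xty = sum of y * x, where x includes the
    # intercept. This per-tuple update is what a DBMS aggregate can do.
    d = len(rows[0]) - 1                      # number of predictors
    XtX = np.zeros((d + 1, d + 1))
    Xty = np.zeros(d + 1)
    for *x, y in rows:
        x = np.array([1.0, *x])               # prepend the intercept
        XtX += np.outer(x, x)
        Xty += y * x
    return np.linalg.solve(XtX, Xty)          # regression coefficients
```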
Urs Vögeli, The aggregate and buffer effect in the Robust Nearest Neighbor Join operator, 2014. (Other Publication)
This report looks at the robust nearest neighbor join. Given an outer table r[G,T,E] and an inner table s[G,T,E,M], the RNNJ operator computes an equijoin on E and a nearest neighbor join on T for the tuples of the outer table without an equijoin match. In the case of a nearest neighbor join, an aggregation function aggregates multiple nearest neighbors. This report presents an extension of the RNNJ operator that does not aggregate the attribute M for the nearest neighbor join but instead returns a tuple with the M value of every nearest neighbor of the outer tuple. The report introduces two ways of implementing the RNNJ operator without aggregation: one implementation uses buffers and aggregation sets, the other uses backtracking and avoids buffers and aggregation sets altogether.
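A minimal sketch of the non-aggregating step in Python; the pair layout and distance are illustrative, and this is neither the buffer-based nor the backtracking implementation from the report:

```python
def nn_all(outer_t, inner):
    # Return the M value of *every* nearest neighbor of outer_t on T,
    # instead of aggregating them into a single value.
    # inner: list of (T, M) pairs.
    best = min(abs(t - outer_t) for t, _ in inner)
    return [m for t, m in inner if abs(t - outer_t) == best]
```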
Tobias Ammann, Applying meta-blocking to improve efficiency in entity resolution, 2014. (Other Publication)
This report compares two implementations of meta-blocking in terms of runtime and memory usage, and measures the accuracy of meta-blocking using a subset of the MusicBrainz database. We find that the implementation using an inverted index is more efficient than the naive implementation. Furthermore, we find that the dataset in its current form is unsuitable for meta-blocking, due to incomplete records and the presence of high-frequency tokens, which cause both implementations to approach O(n²) runtime and memory consumption (n being the number of records).
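A minimal Python sketch of inverted-index-based candidate generation (the record layout is hypothetical); it also shows why high-frequency tokens are problematic:

```python
from collections import defaultdict

def candidate_pairs(records):
    # records: {record_id: set of blocking tokens}.
    # The inverted index maps each token to the records containing it;
    # every pair sharing a token becomes a candidate. A token occurring
    # in k records alone contributes k*(k-1)/2 pairs, which is why
    # high-frequency tokens drive both implementations toward O(n^2).
    index = defaultdict(list)
    for rid, tokens in records.items():
        for tok in tokens:
            index[tok].append(rid)
    pairs = set()
    for rids in index.values():
        for i in range(len(rids)):
            for j in range(i + 1, len(rids)):
                pairs.add((min(rids[i], rids[j]), max(rids[i], rids[j])))
    return pairs
```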
Oliver Leumann, Query compilation of statement modifiers, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
Statement modifiers have been proposed in the past for querying temporal relations. A statement modifier can be prepended to an SQL query to indicate whether the query should be evaluated with non-temporal or temporal semantics by the database system. For the processing and evaluation of temporal database operators, such as aggregation and joins, reduction rules using temporal primitives have been proposed. The shortcoming of these approaches is that they have not yet been combined. The focus of this thesis is to first provide the basis for smoothly combining the two approaches, i.e., to compile queries with statement modifiers into statements with temporal primitives, and then to implement the compilation mechanism in the database system PostgreSQL, whose kernel has already been extended with the temporal primitives.
Christian Ammendola, Michael Hanspeter Böhlen, Johann Gamper, Efficient evaluation of ad-hoc range aggregates, In: DaWaK, Springer, 2013-08-26. (Conference or Workshop Paper published in Proceedings)
θ-MDA is a flexible and efficient operator for complex ad-hoc multi-dimensional aggregation queries. It separates the specification of aggregation groups for which aggregate values are computed (base table b) from the specification of aggregation tuples from which aggregate values are computed. Aggregation tuples are subsets of the detail table r and are defined by a general θ-condition. The θ-MDA requires one scan of r, during which the aggregates are incrementally updated. In this paper, we propose a two-step evaluation strategy for θ-MDA to optimize the computation of ad-hoc range aggregates by reducing them to point aggregates. The first step scans r and computes point aggregates as a partial intermediate result x̃, which can be done efficiently. The second step combines the point aggregates into the final aggregates. This transformation significantly reduces the number of incremental updates to aggregates and reduces the runtime from O(|r| · |b|) to O(|r|), provided that |b| < √|r| and |x̃| ≈ |b|, which is common for OLAP. An empirical evaluation confirms the analytical results and shows the effectiveness of our optimization: range queries are evaluated with almost the same efficiency as point queries.
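A minimal Python sketch of the two-step idea for SUM aggregates (the range shape and all names are illustrative): step 1 computes point aggregates in one scan, step 2 combines them, here via a prefix sum:

```python
import bisect
from collections import Counter
from itertools import accumulate

def range_sums(r, boundaries):
    # Step 1: one scan over the detail values computes point aggregates
    # (a SUM per distinct point), the small intermediate result x~.
    points = Counter()
    for v in r:
        points[v] += v
    # Step 2: combine point aggregates into range aggregates (here the
    # sum of all values <= b) with a prefix sum, so each detail tuple
    # updates a single point aggregate instead of many range aggregates.
    keys = sorted(points)
    prefix = list(accumulate(points[k] for k in keys))
    result = {}
    for b in boundaries:
        i = bisect.bisect_right(keys, b)      # number of keys <= b
        result[b] = prefix[i - 1] if i else 0
    return result
```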
Thomas Walter Brenner, Load-balancing implementation in Hadoop, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
MapReduce is a programming model for distributed, parallel systems. Hadoop is an implementation of this model. Some applications may produce an imbalance of work on the cluster during execution. The goal of this thesis is to implement a load-balancing algorithm in the Hadoop framework to sort a list of timestamps. The implemented algorithm, called TopCluster, was developed at the universities of Munich and Bozen-Bolzano.
This algorithm gathers the necessary information locally, combines it, and derives the distribution of the data in order to avoid skew. In this thesis the TopCluster algorithm is implemented, modified to meet the necessary requirements, and eventually tested with differently distributed random data.
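A minimal Python sketch of the general skew-avoidance idea (not TopCluster itself, which tracks only the locally most frequent keys): combine local histograms and assign heavy keys greedily to the least-loaded reducer:

```python
import heapq
from collections import Counter

def balanced_partition(local_histograms, num_reducers):
    # Combine per-mapper key histograms into a global one, then assign
    # keys to reducers greedily, heaviest key first, always to the
    # currently least-loaded reducer.
    global_hist = Counter()
    for h in local_histograms:
        global_hist.update(h)
    load = [(0, r) for r in range(num_reducers)]   # (current load, reducer)
    heapq.heapify(load)
    assignment = {}
    for key, count in global_hist.most_common():
        l, r = heapq.heappop(load)
        assignment[key] = r
        heapq.heappush(load, (l + count, r))
    return assignment
```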
Andreas Albrecht, Learning value evolution on real-world temporal data, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
Temporal record linkage studies the problem of identifying records that refer to the same real-world entities over time. This is a challenging task because (1) real-world entities may change their attribute values as time goes by (e.g., a researcher may move from one affiliation to another) and (2) different entities may share the same value over time (e.g., two researchers with the same name). To address these challenges, the concept of time decay aims at capturing how the information of entities evolves over time in order to improve the linkage quality of records.
This thesis proposes two methods for learning time decay and further investigates several string matching approaches on a real-world data set. Moreover, we consider the efficiency of the algorithms and propose an inverted-index-based approach to improve it. The experiments on real-world data sets show that the two decay-learning algorithms provide similar results on multiple sampled data sets. Furthermore, our algorithms improve on the brute-force solution by at least two orders of magnitude.
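To illustrate the decay concept, a minimal Python sketch of one plausible estimator, namely the fraction of entities that change an attribute value within a given time gap; the thesis's two learning methods are not necessarily this one:

```python
def learn_disagreement_decay(histories, gap):
    # Estimate the probability that an entity changes its attribute
    # value within a time gap of `gap`, from labeled entity histories.
    # histories: time-sorted (timestamp, value) lists, one per entity.
    changed = total = 0
    for hist in histories:
        for (t1, v1), (t2, v2) in zip(hist, hist[1:]):
            if t2 - t1 <= gap:
                total += 1
                changed += (v1 != v2)
    return changed / total if total else 0.0
```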
Anton Dignös, Michael Böhlen, Johann Gamper, Query time scaling of attribute values in interval timestamped databases, In: 29th IEEE International Conference on Data Engineering (ICDE 2013), Institute of Electrical and Electronics Engineers, Los Alamitos, CA, USA, 2013-04-08. (Conference or Workshop Paper published in Proceedings)
In valid-time databases with interval timestamping, each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.
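A minimal Python sketch of the scaling idea, assuming a quantity that accrues uniformly over its interval (the demo itself supports this in SQL):

```python
def scale(value, orig_start, orig_end, new_start, new_end):
    # Scale a cumulative attribute (e.g., a total cost recorded over
    # [orig_start, orig_end)) to an adjusted sub-interval, assuming the
    # quantity accrues uniformly over time.
    fraction = (new_end - new_start) / (orig_end - orig_start)
    return value * fraction
```

For example, a budget of 900 recorded over a three-month interval scales to 300 when the timestamp is adjusted to a one-month sub-interval.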
Markus Innerebner, Michael Böhlen, Johann Gamper, ISOGA: A system for geographical reachability analysis, In: International Symposium on Web and Wireless Geographical Information Systems, Springer, Berlin, p. 180 - 189, 2013-04-05. (Book Chapter)
In this paper, we present a web-based system, termed ISOGA, that uses isochrones to perform geographical reachability analysis. An isochrone in a spatial network covers all space points from which a query point is reachable within given time constraints. At the core of the system is an efficient algorithm for the computation of isochrones in multimodal spatial networks. By joining isochrones with other databases, various kinds of geospatial reachability analysis can be performed, such as determining how well a city is covered by public services or where to look for a moderately priced apartment close to the workplace. ISOGA adopts a service-oriented three-tier architecture and uses technologies that are compliant with OGC standards. We describe several application scenarios in urban and extra-urban areas, which show the applicability of the tool.
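A minimal Python sketch of the underlying idea as a cost-bounded Dijkstra expansion; this is single-modal and in-memory, whereas ISOGA's algorithm additionally handles multimodal networks:

```python
import heapq

def isochrone(graph, source, budget):
    # Cost-bounded Dijkstra: return all nodes reachable from `source`
    # within the given travel-time budget. graph: {node: [(nbr, cost)]}.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                       # stale queue entry
        for v, c in graph.get(u, []):
            nd = d + c
            if nd <= budget and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)
```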
Martin Noack, Data lineage and meta data analysis in data warehouse environments, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2013. (Bachelor's Thesis)
This thesis aims to provide new insights on data lineage computations within the Credit Suisse data warehouse environment. We propose a system to compute the lineage, targeted at business users without technical knowledge of IT systems. We therefore provide complete abstraction for end users. Furthermore, we process only conceptual mapping rules from the metadata warehouse, in contrast to other approaches that record transformations at runtime; consequently, we do not rely on access to potentially sensitive data. In order to process mapping rules, we developed an algorithm that is capable of extracting components generically, based on their semantic meaning and their relation to each other. This thesis describes some patterns in lineage investigations that result from our approach and gives an outlook on future projects that could be based on this work.
Nikolaus Augsten, Michael Hanspeter Böhlen, Similarity Joins in Relational Database Systems, Morgan & Claypool Publishers, 2013. (Book/Research Monograph)
State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.
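A minimal Python sketch of the size and prefix filters for an overlap threshold t (names are illustrative; the book develops these filters, including their positional and partitioning variants, in detail):

```python
def prefix(tokens, t, order):
    # Prefix filter: under a fixed global token order, two sets with
    # overlap >= t must share a token among their first len - t + 1 tokens.
    s = sorted(tokens, key=order)
    return s[:len(s) - t + 1]

def may_match(x, y, t, order):
    # Size filter: an overlap of t is impossible if either set is smaller.
    if min(len(x), len(y)) < t:
        return False
    # Prefix filter: only pairs with intersecting prefixes are candidates.
    return bool(set(prefix(x, t, order)) & set(prefix(y, t, order)))
```

The `order` argument is a key function, typically ordering tokens by increasing global frequency so that prefixes contain rare tokens and few candidates survive.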
Mourad Khayati, Michael Hanspeter Böhlen, Johann Gamper, Scalable Centroid Decomposition, No. IFI-2013.08, Version: 1, 2013. (Technical Report)
Real world applications that deal with time series data often rely on matrix decomposition techniques, such as the Singular Value Decomposition (SVD). The Centroid Decomposition (CD) approximates the Singular Value Decomposition, but does not scale to long time series because of the quadratic space complexity of the sign vector computation.
In this report, we propose a greedy algorithm, termed Scalable Sign Vector (SSV), to efficiently determine sign vectors for CD applications with long time series, i.e., where the number of rows (observations) is much larger than the number of columns (time series). The SSV algorithm starts with a sign vector consisting of only 1s and iteratively changes the sign of the element that maximizes the benefit. The space complexity of the SSV algorithm is linear in the length of the time series. We provide proofs for the scalability, the termination and the correctness of the SSV algorithm. Experiments with real world hydrological time series and data sets from the UCR repository validate the analytical results and show the scalability of SSV.
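A minimal Python sketch of the greedy strategy described above, using naive re-evaluation of the objective instead of SSV's incremental benefit computation:

```python
import numpy as np

def greedy_sign_vector(X):
    # Start from the all-ones sign vector and repeatedly flip the single
    # element whose flip increases the objective ||X^T z|| the most,
    # until no flip yields an improvement. The objective increases
    # strictly in every round, so the loop terminates.
    n = X.shape[0]
    z = np.ones(n)
    best = np.linalg.norm(X.T @ z)
    while True:
        gains = np.empty(n)
        for i in range(n):
            z[i] = -z[i]
            gains[i] = np.linalg.norm(X.T @ z) - best
            z[i] = -z[i]                       # undo the trial flip
        i = int(np.argmax(gains))
        if gains[i] <= 1e-12:
            return z
        z[i] = -z[i]                           # keep the best flip
        best += gains[i]
```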
Michael Hartmann, Implementation of a relational algebra based graphical user interface for temporal queries, 2013. (Other Publication)
In [DBG12], a relational algebra solution has been presented that provides native support for the properties of the sequenced semantics described in [BJS00]. Temporal expressions are reduced to ordinary expressions using two additional primitive operators to align time intervals. Until now, no automatic reduction has been implemented. This Facharbeit presents an application that lets expressions be built graphically and marked as sequenced or non-sequenced. If an expression is marked as sequenced, the application automatically applies the reduction rules, and the expression can be executed on a PostgreSQL database system that supports the additional primitives.
Nikolaus Augsten, Michael Böhlen, Johann Gamper, The address connector: noninvasive synchronization of hierarchical data sources, Knowledge and Information Systems (KAIS), Vol. 37 (3), 2013. (Journal Article)
Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.
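A minimal Python sketch of the global greedy matching step (the similarity function is left abstract):

```python
def greedy_match(left, right, sim):
    # Global greedy matching: scan all cross pairs in order of
    # decreasing similarity and accept a pair if both addresses are
    # still unmatched. The result is a stable matching: no two
    # unmatched-to-each-other addresses are more similar to each other
    # than to their assigned partners.
    pairs = sorted(((sim(l, r), l, r) for l in left for r in right),
                   key=lambda p: p[0], reverse=True)
    used_l, used_r, matching = set(), set(), []
    for s, l, r in pairs:
        if l not in used_l and r not in used_r:
            used_l.add(l)
            used_r.add(r)
            matching.append((l, r, s))
    return matching
```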
Mourad Khayati, Michael Hanspeter Böhlen, REBOM: Recovery of blocks of missing values in time series, In: 18th International Conference on Management of Data (COMAD), Computer Society of India, Mumbai, India, 2012-12-14. (Conference or Workshop Paper published in Proceedings)
The recovery of blocks of missing values in regular time series has been addressed by model-based techniques. Such techniques are not suitable to recover blocks of missing values in irregular time series and restore peaks and valleys. We propose REBOM (REcovery of BlOcks of Missing values): a new technique that reconstructs shape, amplitude and width of missing peaks and valleys in irregular time series. REBOM successfully reconstructs peaks and valleys by iteratively considering the time series itself and its correlation to multiple other time series. We provide an iterative algorithm to recover blocks of missing values and analytically investigate its monotonicity and termination. Our experiments with synthetic and real world hydrological data confirm that for the recovery of blocks of missing values in irregular time series REBOM is more accurate than existing methods.
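A generic Python sketch of the iterative flavor only, not REBOM's actual reconstruction: initialize the missing block, then alternately fit the series against correlated reference series and re-estimate the block until it stabilizes:

```python
import numpy as np

def recover_block(y, refs, missing, iters=50, tol=1e-6):
    # y: series with a block of missing values; missing: boolean mask;
    # refs: complete, correlated reference series. Illustrative only.
    y = np.array(y, float)
    missing = np.asarray(missing, bool)
    X = np.column_stack([np.ones(len(y))]
                        + [np.asarray(r, float) for r in refs])
    y[missing] = y[~missing].mean()                    # crude initialization
    for _ in range(iters):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit on all points
        new_vals = X[missing] @ beta                   # re-estimate block
        if np.max(np.abs(new_vals - y[missing])) < tol:
            break
        y[missing] = new_vals
    return y
```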
Andrin Betschart, Visualization of the varying spatial density information in the Swiss Feed Database, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Bachelor's Thesis)
This thesis introduces techniques to visualize the spatial information of the feed data in the online application of the Swiss Feed Database. Since the kernel methods used to compute the visualized images are not very time efficient, we take on the challenge of developing an algorithm of lower complexity by introducing several optimizations: query optimization, a sparse grid with bilinear interpolation, and a server-side implementation. The experimental evaluation shows that the implemented algorithm executes fast on all popular browsers, regardless of the criteria selected by the user.
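A minimal NumPy sketch of the sparse-grid optimization: the expensive kernel estimate is evaluated only on a coarse grid and then bilinearly interpolated to the display resolution (grid handling is illustrative):

```python
import numpy as np

def bilinear_upsample(coarse, factor):
    # Fill a fine display grid from a coarse grid of kernel estimates
    # by bilinear interpolation between the four surrounding cells.
    h, w = coarse.shape
    ys = np.linspace(0, h - 1, (h - 1) * factor + 1)
    xs = np.linspace(0, w - 1, (w - 1) * factor + 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]           # vertical interpolation weights
    wx = (xs - x0)[None, :]           # horizontal interpolation weights
    return ((1 - wy) * (1 - wx) * coarse[np.ix_(y0, x0)]
            + (1 - wy) * wx * coarse[np.ix_(y0, x1)]
            + wy * (1 - wx) * coarse[np.ix_(y1, x0)]
            + wy * wx * coarse[np.ix_(y1, x1)])
```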
Bruno Cadonna, Johann Gamper, Michael Hanspeter Böhlen, Efficient event pattern matching with match windows, In: 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, New York, USA, 2012-09-12. (Conference or Workshop Paper published in Proceedings)
In event pattern matching a sequence of input events is matched against a complex query pattern that specifies constraints on extent, order, values, and quantification of matching events. In this paper we propose a general pattern matching strategy that consists of a pre-processing step and a pattern matching step. Instead of eagerly matching incoming events, the pre-processing step buffers events in a match window to apply different pruning techniques (filtering, partitioning, and testing for necessary match conditions). In the second step, an event pattern matching algorithm, A, is called only for match windows that satisfy the necessary match conditions. This two-phase strategy with a lazy call of the matching algorithm significantly reduces the number of events that need to be processed by A as well as the number of calls to A. This is important since pattern matching algorithms tend to be expensive in terms of runtime and memory complexity, whereas the pre-processing can be done very efficiently. We conduct extensive experiments using real-world data with pattern matching algorithms for, respectively, automata and join trees. The experimental results confirm the effectiveness of our strategy for both types of pattern matching algorithms.
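A minimal Python sketch of the two-phase strategy (the windowing details are simplified to a fixed-size slide):

```python
def lazy_match(events, window_size, necessary, match):
    # Two-phase pattern matching: buffer events into match windows,
    # apply a cheap test for necessary match conditions, and invoke the
    # expensive matching algorithm `match` (the "A" above) lazily, only
    # for windows that pass the test.
    results = []
    for start in range(len(events) - window_size + 1):
        window = events[start:start + window_size]
        if necessary(window):                  # cheap pre-processing test
            results.extend(match(window))      # expensive call, made lazily
    return results
```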
Martin Leimer, Statistical comparison of regions in the Swiss Feed Database, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2012. (Bachelor's Thesis)
In this bachelor thesis we develop an approach for finding similar regions and for calculating the probability that two regions are equal. The calculation is based on nutrient measurements of feed samples in the Swiss Feed Database. We address this challenge using different statistical tests, which we implement on top of the Swiss Feed Database. Moreover, we focus on an optimized implementation of the developed algorithm. The experimental evaluation reveals that the similarity probabilities of the top-k similar regions algorithm can be computed in reasonable time. Further, we show that very similar regions do indeed exist.
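A minimal Python sketch of such a comparison, using Welch's two-sample t-test from SciPy as an illustrative choice; the thesis's actual tests and its notion of similarity probability may differ:

```python
from scipy import stats

def top_k_similar(query_samples, regions, k):
    # Compare the query region's nutrient measurements against every
    # other region with a two-sample test and return the k regions whose
    # p-values show the least evidence of a difference.
    scored = []
    for name, samples in regions.items():
        _, p = stats.ttest_ind(query_samples, samples, equal_var=False)
        scored.append((p, name))
    scored.sort(reverse=True)          # high p-value = most similar
    return scored[:k]
```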