Not logged in.

Contribution Details

Type Dissertation
Scope Discipline-based scholarship
Title Hierarchical summarization of multidimensional data
Organization Unit
Authors
  • A Taliun
Supervisors
  • Michael Hanspeter Böhlen
  • A Mazeika
Language
  • English
Institution University of Zurich
Faculty Faculty of Economics, Business Administration and Information Technology
Date 2009
Abstract Text Data summarization and clustering are key techniques to query and analyze large amounts of multidimensional data. However, the effectiveness of existing methods is limited by high intermediate memory cost and difficult to choose input parameters. This Ph.D. thesis develops a novel approach to hierarchical data summarization and clustering that overcomes these limitations. We propose the AD-Tree, an innovative data summary structure that summarizes multidimensional data in terms of its density on a hierarchy of grids. The computation of the AD-Tree is an iterative process that requires a minimal amount of intermediate memory. We start computing the AD-Tree from a sparse initial grid, which gives a rough estimate of the density function of the data. We iteratively increase the estimation quality by splitting cells along dimensions where the density function is non-linear. This ensures a minimal consumption of intermediate memory: instead of overestimating the density on a fine grid and afterwards removing unnecessary grid points, we put new grid points only in places where it increases the precision of the estimation. The key challenges of our approach are the identification of areas and dimensions where the density function exhibits a non-linear behavior and a fast organization of new grid points into a hierarchy of grids that ensures an optimal memory utilization. We introduce shape error, dimensional split, grouping and compact representation of multidimensional grids to successfully solve these problems: the shape error measures the deviation of the density function from being linear on a grid, the dimensional split implements the splitting of cells in selected dimensions, the grouping organizes new grid points into large grids, and the compact representation of multidimensional grids reduces their storage costs by a factor of the dimensionality. We develop an efficient solution to approximately answer aggregate range queries from the AD-Tree. We develop CORE, a novel clustering technique that clusters multidimensional data without any input parameters. The salient property of CORE is the explicit computation and representation of local density maxima, which permits a high-quality nonparametric clustering. CORE uses the local density maxima to determine cores of clusters. The AD-Tree, rectangular neighborhoods and gradients enable the efficient and robust computation of cores: the AD-Tree provides a uniform and compact estimation of the density of the data, the rectangular neighborhood localizes stationary points in the AD-Tree, and gradients distinguish local maxima from other types of stationary points and connect maximal cores. CORE is the first clustering technique that bases the clustering on a semantically rich data summary structure. We investigate overlapping clusters and develop an efficient solution to separate them. The separation of overlapping clusters makes it necessary to cluster the data at all levels of the density and to consider the orientation of clusters. We use the AD-Tree, which allows CORE to find fragments and overlapping cores at all levels of the density. We restore complete cores from their fragments with the help of gradient paths. Gradient paths connect fragments through overlaps and quantify the orientation of fragments. We analytically investigate our techniques and confirm the results with extensive experimental evaluations on synthetic and real world datasets. The results show the advantage of our techniques compared to existing methods
Other Identification Number merlin-id:250
PDF File Download from ZORA
Export BibTeX
EP3 XML (ZORA)