Tobias Schlaginhaufen, Design and Implementation of a Database Client Application for Inserting, Modifying, Presentation and Export of Bitemporal Personal Data, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
Conventional relational databases offer only little support for temporal data, although many applications require temporal or even bitemporal storage. This thesis describes the implementation of a bitemporal database and an accompanying client application for managing the study subjects of a longitudinal etiological study on adjustment and mental health. Designing a bitemporal database on top of a relational data model involves the dilemma of time-normalization: either one uses the well-established storage organization and query evaluation techniques of relational databases and accepts a certain amount of redundantly stored data, or one time-normalizes the data model and avoids redundancy, but must then accept a degenerate relational model that is complex, difficult to handle, and may degrade the performance of a relational database system. We present an approach that strikes a balance between these two extremes.
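To make the bitemporal setting concrete, here is a minimal sketch in Python, with invented table contents (not the thesis's actual schema): each row carries a valid-time interval, stating when the fact held in the real world, and a transaction-time interval, stating when the database considered the row current. The duplicated attribute values across version rows illustrate the redundancy mentioned above.

```python
from datetime import date

MAX = date.max  # "until changed" sentinel for open intervals

rows = [
    # (subject_id, address,  valid_from,       valid_to,         tx_from,          tx_to)
    (1, "Bahnhofstrasse 1", date(2005, 1, 1), date(2006, 6, 1), date(2005, 1, 5), date(2006, 6, 3)),
    (1, "Seestrasse 9",     date(2006, 6, 1), MAX,              date(2006, 6, 3), MAX),
]

def as_of(rows, valid_t, tx_t):
    """Return rows that were true at valid_t, as the database knew them at tx_t."""
    return [r for r in rows
            if r[2] <= valid_t < r[3] and r[4] <= tx_t < r[5]]

print(as_of(rows, date(2006, 7, 1), date(2006, 7, 1)))  # -> the "Seestrasse 9" row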
Martin Spörri, Administration of Metadata Models with Semantic Web Technologies, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
This thesis was written between September 2006 and March 2007 as a diploma thesis at the Database Technology Group, which is part of the Department of Informatics at the University of Zurich. The aim was, on the one hand, to show what Semantic Web technologies are and how they can be used to administer metadata. On the other hand, the mission was to build a standalone software application that integrates into the existing metadata management system of Helsana Versicherungen AG and provides additional flexibility and functionality. The thesis is divided into three parts: after a short introduction, the first part describes terms and technologies related to the Semantic Web and metadata management. The second part covers the planning, implementation and evaluation of the software application that was built, and the third part contains an overview of the work that was done as well as an outlook on possible further developments.
Claudio Jossen, Klaus R. Dittrich, The process of metadata modelling in industrial data warehouse environments, In: BTW Workshops 2007, Verlagshaus Mainz, Aachen, 2007-03-01. (Conference or Workshop Paper published in Proceedings)
Modern application landscapes, and especially large enterprise applications such as data warehouses used for decision support and other analytical purposes, are becoming increasingly complex. To manage, use and maintain these systems, the need for metadata management has grown. As new groups of data warehouse users identify new tasks, the role of metadata management involves more than simply browsing data schemas. It becomes necessary for metadata systems to integrate different kinds of metadata and to offer different views on the metadata as well. In this paper we discuss the process of identifying metadata model requirements, defining a new metadata model and finally implementing it in a metadata schema. The process is illustrated by a possible metadata model and schema, which were developed to meet the requirements of a complex data warehouse environment at Helsana Versicherungen AG, the largest Swiss insurance company. The paper describes the implementation of the metadata model based on the metadata standards Resource Description Framework (RDF) and RDF Schema (RDFS). The presented model and schema are just one possible solution and do not constitute a universal metadata model. The goal of this paper is to discuss the process of metadata modeling and to help metadata architects develop their own metadata models and schemas.
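To illustrate what expressing a metadata model in RDF/RDFS can look like, here is a minimal sketch with invented class and property names, using the third-party rdflib library; it is not the Helsana schema from the paper, only the general pattern of a schema level plus an instance level.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

# Illustrative only: a tiny data warehouse metadata model in RDFS.
META = Namespace("http://example.org/dwh-meta#")
g = Graph()

# Schema level: data warehouse objects modelled as RDFS classes and properties.
g.add((META.Table, RDF.type, RDFS.Class))
g.add((META.Report, RDF.type, RDFS.Class))
g.add((META.derivedFrom, RDF.type, RDF.Property))
g.add((META.derivedFrom, RDFS.domain, META.Report))
g.add((META.derivedFrom, RDFS.range, META.Table))

# Instance level: a concrete report derived from a concrete table.
g.add((META.ClaimsReport, RDF.type, META.Report))
g.add((META.ClaimsTable, RDF.type, META.Table))
g.add((META.ClaimsReport, META.derivedFrom, META.ClaimsTable))

print(g.serialize(format="turtle"))
```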
Boris Glavic, Klaus R. Dittrich, Data Provenance: A Categorization of Existing Approaches, In: BTW '07: 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web, Verlagshaus Mainz, Aachen, March 2007. (Conference or Workshop Paper)
In many application areas, such as e-science and data warehousing, detailed information about the origin of data is required. This kind of information is often referred to as data provenance or data lineage. The provenance of a data item includes information about the processes and source data items that led to its creation and current representation. The diversity of data representation models and application domains has led to a number of more or less formal definitions of provenance, most of which are limited to a specific application domain, data representation model or data processing facility. Not surprisingly, the associated implementations are also restricted to some application domain and depend on a specific data model. In this paper we give a survey of data provenance models and prototypes, present a general categorization scheme for provenance models and use this scheme to study the properties of the existing approaches. This categorization enables us to distinguish between different kinds of provenance information and could lead to a better understanding of provenance in general. Besides the categorization of provenance types, it is important to include the storage, transformation and query requirements for the different kinds of provenance information and application domains in our considerations. The analysis of existing approaches will assist us in revealing open research problems in the area of data provenance.
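As a minimal illustration of what such provenance information can look like (a generic sketch with invented names, not one of the surveyed models), the following threads source-item IDs and processing steps through a small two-step pipeline, so the final result knows which inputs and operations produced it.

```python
# Each derived item carries its provenance: contributing source IDs and steps.
def annotate(items):
    """Wrap raw source items as (value, provenance) pairs."""
    return [(v, {"sources": {i}, "steps": []}) for i, v in enumerate(items)]

def pfilter(data, pred, name):
    """Filter values, extending each surviving item's step history."""
    return [(v, {"sources": p["sources"], "steps": p["steps"] + [name]})
            for v, p in data if pred(v)]

def psum(data, name):
    """Aggregate values; the result's provenance is the union of all inputs'."""
    sources = set().union(*(p["sources"] for _, p in data)) if data else set()
    return sum(v for v, _ in data), {"sources": sources, "steps": [name]}

data = annotate([3, 8, 5])
total, prov = psum(pfilter(data, lambda v: v > 4, "filter>4"), "sum")
print(total, prov)  # 13, derived from source items {1, 2} via filter>4, sum
```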
Ionut Subasu, Patrick Ziegler, Klaus R. Dittrich, Towards Service-Based Data Management Systems, In: Datenbanksysteme in Business, Technologie und Web (BTW 2007), Workshop Proceedings, March 2007. (Conference or Workshop Paper)
Sara Khaleghi, Erstellung und Bewertung eines Konzeptes für die Archivierung und die Bereinigung von Stammdaten und Kursdaten im Bankenbereich, University of Zurich, Faculty of Economics, Business Administration and Information Technology, 2007. (Master's Thesis)
The goal of this thesis was the creation and validation of an archiving concept for valor-specific static and pricing data, as well as a data housekeeping concept, for UBS AG. In addition, proposals for optimizing the existing archiving solution were made. To achieve these goals, the current system landscape and general framework were analysed, new requirements for archiving were collected, and a concept was derived with respect to the current state of archiving technology. For the existing archiving solution, optimization suggestions were proposed, and a framework for data housekeeping was developed that helps to create a housekeeping plan as soon as the outstanding business requirements have been collected.
Arturas Mazeika, Michael Hanspeter Böhlen, Nick Koudas, Divesh Srivastava, Estimating the selectivity of approximate string queries, ACM Transactions on Database Systems, Vol. 32 (2), 2007. (Journal Article)
Approximate queries on string data are important due to the prevalence of such data in databases and the variety of conventions and errors found in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To obtain inverse strings, we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator depends only on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of the query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
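The preprocessing behind inverse strings can be sketched in a few lines of Python. Only the q-gram decomposition and inverse-string construction are shown; the edge padding is a common convention assumed here, and the signatures and clustering of the actual estimator are not reproduced.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Overlapping substrings of length q; padding so string edges appear too."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def inverse_strings(strings, q=2):
    """Map each q-gram to the IDs of all strings that contain it."""
    inv = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            inv[g].add(sid)
    return inv

inv = inverse_strings(["john", "johan", "jon"])
print(sorted(inv["jo"]))  # [0, 1, 2] -- all three strings contain "jo"
```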
Patrick Ziegler, Klaus R. Dittrich, Data Integration — Problems, Approaches, and Perspectives, In: Conceptual Modelling in Information Systems Engineering, Springer, 2007. (Conference or Workshop Paper)
Patrick Ziegler, Evaluation of SIRUP with the THALIA Benchmark for Data Integration Systems, No. IFI-2007.0008, Version: 1, 2007. (Technical Report)
Claudio Jossen, Metadaten Management - Grundlagen und industrielle Praxis, AV Akademikerverlag, 2007. (Book/Research Monograph)
Linas Baltrunas, Arturas Mazeika, Michael Böhlen, Multi-dimensional histograms with tight bounds for the error, In: IDEAS 2006, IEEE, 2006-12-11. (Conference or Workshop Paper published in Proceedings)
Histograms are used as non-parametric selectivity estimators for one-dimensional data. For high-dimensional data it is common either to compute one-dimensional histograms for each attribute or to compute a multi-dimensional equi-width histogram for a set of attributes. This yields either small low-quality or large high-quality histograms. In this paper we introduce HIRED (High-dimensional histograms with dimensionality REDuction): small, high-quality histograms for multi-dimensional data. HIRED histograms are adaptive, and they are based on the shape error and directional splits. The shape error permits precise control of the estimation error of the histogram and, together with directional splits, yields a memory complexity that does not depend on the number of uniform attributes in the dataset. We provide extensive experimental results with synthetic and real-world datasets. The experiments confirm that our method is as precise as state-of-the-art techniques and uses orders of magnitude less memory.
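To convey the general idea of error-driven directional splits, here is a toy sketch with an invented error measure (deviation of a bucket's empirical mean from its midpoint); it is not the HIRED algorithm or its shape error. A bucket is split along one dimension only when the data there deviate from uniformity, so uniform dimensions are never refined.

```python
import numpy as np

def build(points, lo, hi, tol=0.05, depth=0):
    """Toy adaptive histogram with directional splits (illustrative only)."""
    if len(points) < 16 or depth > 8:
        return {"lo": lo, "hi": hi, "count": len(points)}
    # Per-dimension deviation of the empirical mean from the bucket midpoint.
    err = np.abs(points.mean(axis=0) - (lo + hi) / 2) / (hi - lo + 1e-12)
    d = int(np.argmax(err))
    if err[d] <= tol:                      # close enough to uniform: stop
        return {"lo": lo, "hi": hi, "count": len(points)}
    mid = (lo[d] + hi[d]) / 2
    mask = points[:, d] < mid
    hi_l, lo_r = hi.copy(), lo.copy()
    hi_l[d], lo_r[d] = mid, mid
    return {"dim": d, "at": mid,
            "left": build(points[mask], lo, hi_l, tol, depth + 1),
            "right": build(points[~mask], lo_r, hi, tol, depth + 1)}

pts = np.random.rand(2000, 3); pts[:, 2] **= 4   # dimension 2 is skewed
tree = build(pts, np.zeros(3), np.ones(3))       # splits occur along dim 2 only
```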
Carme Martin, Michael Hanspeter Böhlen, Carlos Lopez, Extending ATSQL to Support Temporally Dependent Information, In: JISBD 2006, 2006-10-03. (Conference or Workshop Paper published in Proceedings)
Nikolaus Augsten, Michael Böhlen, Johann Gamper, An Incrementally Maintainable Index for Approximate Lookups in Hierarchical Data, In: VLDB 2006: 32nd International Conference on Very Large Data Bases, 2006-09-12. (Conference or Workshop Paper published in Proceedings)
Arturas Mazeika, Michael Hanspeter Böhlen, Andrej Taliun, Adaptive density estimation, In: 32nd International Conference on Very Large Data Bases, VLDB Endowment, 2006-09-12. (Conference or Workshop Paper published in Proceedings)
This demonstration illustrates the APDF tree: an adaptive tree that supports the effective and efficient computation of continuous density information. The APDF tree allocates more partition points in non-linear areas of the density function and fewer points in linear areas. This yields not only bounded but tight control of the error. The demonstration explains the core steps of the computation of the APDF tree (split, kernel additions, tree optimization, kernel additions, unsplit) and demonstrates the implementation for different datasets.
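The adapt-where-nonlinear idea can be sketched in one dimension (a toy version, not the APDF tree; the linearity test and threshold are invented, and SciPy's gaussian_kde stands in for the density function): keep subdividing an interval while the density at its midpoint differs from linear interpolation between its endpoints.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Bimodal sample data; the sharp component at 5 is where the density is non-linear.
data = np.concatenate([np.random.normal(0, 1, 500),
                       np.random.normal(5, 0.3, 500)])
kde = gaussian_kde(data)

def partition(a, b, eps=0.005, depth=0):
    """Subdivide [a, b] while the midpoint density deviates from the linear
    interpolation of the endpoint densities by more than eps."""
    m = (a + b) / 2
    linear = (kde(a)[0] + kde(b)[0]) / 2
    if depth > 12 or abs(kde(m)[0] - linear) <= eps:
        return [a, b]
    left = partition(a, m, eps, depth + 1)
    right = partition(m, b, eps, depth + 1)
    return left[:-1] + right          # drop the duplicated midpoint

pts = partition(data.min(), data.max())
print(len(pts), "partition points")   # points cluster around the sharp peak at 5
```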
Arturas Mazeika, Michael Hanspeter Böhlen, Cleansing databases of misspelled proper nouns, In: CleanDB 2006, 2006-09-11. (Conference or Workshop Paper published in Proceedings)
This paper presents a data cleansing technique for string databases. We propose and evaluate an algorithm that identifies groups of strings, each consisting of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of that group. Our method targets proper noun databases, including names and addresses, which are not handled by dictionaries. At the technical level, we give an efficient solution for computing the center of a group of strings and for determining the border of the group. We use inverse strings together with sampling to efficiently identify and cleanse a database. The experimental evaluation shows that, for proper nouns, the center calculation and border detection algorithms are robust and that even very small sample sizes yield good results.
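The grouping-and-replacement step can be sketched as follows: a naive toy version using plain Levenshtein distance and exhaustive comparison, where the paper's inverse strings and sampling are precisely what would make this efficient at scale.

```python
from collections import Counter

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cleanse(strings, radius=1):
    """Treat the most frequent unassigned string as a group center and
    replace every string within `radius` edits of it by the center."""
    counts = Counter(strings)
    mapping = {}
    for center, _ in counts.most_common():
        if center in mapping:
            continue
        for s in counts:
            if s not in mapping and edit_distance(s, center) <= radius:
                mapping[s] = center
    return [mapping[s] for s in strings]

names = ["Zurich", "Zurich", "Zurch", "Zurich", "Geneva", "Genva", "Geneva"]
print(cleanse(names))  # all misspellings collapse to "Zurich" / "Geneva"
```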
Arturas Mazeika, Janis Petersons, Michael Hanspeter Böhlen, PPPA: Push and Pull Pedigree Analyzer for large and complex pedigree databases, In: 10th East-European Conference on Advances in Databases and Information Systems, Springer, 2006-09-03. (Conference or Workshop Paper published in Proceedings)
In this paper we introduce a novel push and pull technique to analyze pedigree data. We present the Push and Pull Pedigree Analyzer (PPPA) to organize large and complex pedigrees and investigate the development of genetic diseases. PPPA receives as input a pedigree (ancestry information) of different families. For each person, the pedigree contains information about the occurrence of a specific genetic disease. We propose a new solution to arrange and visualize the individuals of the pedigree based on the relationships between individuals and information about the disease. PPPA starts with random positions of the individuals, and iteratively pushes apart non-relatives with opposite disease patterns and pulls together relatives with identical disease patterns. The goal is a visualization that groups families with homogeneous disease patterns. We investigate our solution experimentally with genetic data from people from South Tyrol, Italy. We show that the algorithm converges independently of the number of individuals n and the complexity of the relationships. The runtime of the algorithm is super-linear with respect to n, and its space complexity is linear with respect to n. The visual analysis of the method confirms that our push and pull technique successfully deals with large and complex pedigrees.
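A toy one-dimensional push-and-pull iteration conveys the shape of the update rule (illustrative only: PPPA works on full pedigree graphs, and the related/same_pattern predicates here are derived from a made-up family assignment rather than real pedigree and disease data).

```python
import random

def step(pos, related, same_pattern, lr=0.05):
    """One push-and-pull iteration over layout positions pos[i]."""
    new = pos[:]
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i == j:
                continue
            d = pos[j] - pos[i]
            if related(i, j) and same_pattern(i, j):
                new[i] += lr * d    # pull together relatives, identical patterns
            elif not related(i, j) and not same_pattern(i, j):
                new[i] -= lr * d    # push apart non-relatives, opposite patterns
    return new

fam = [0, 0, 0, 1, 1, 1]            # two families with homogeneous patterns
pos = [random.random() for _ in range(6)]
for _ in range(10):
    pos = step(pos, lambda i, j: fam[i] == fam[j], lambda i, j: fam[i] == fam[j])
print(pos)                          # the two families drift into separate clusters
```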
Stefania Leone, Ela Hunt, Thomas B. Hodel, Michael Böhlen, Klaus R. Dittrich, Design and implementation of a document database extension, In: 10th East-European Conference on Advances in Databases and Information Systems, Alexander Technological Educational Institute of Thessaloniki, 2006-09-03. (Conference or Workshop Paper published in Proceedings)
Integration of text and documents into database management systems has been the subject of much research. However, most approaches are limited to data retrieval. Collaborative text editing, i.e. the ability for multiple users to work on a document instance simultaneously, is rarely supported. Also, documents mostly consist of plain text only and offer very limited metadata storage or search. We address this problem by proposing an extended definition of the document data type which comprises not only the text itself but also structural information such as layout, template and semantics, as well as document creation metadata. We implemented a new collaborative data type, Document, which supports document manipulation via a text editing API and extended SQL syntax (TX SQL), as detailed in this work. We also report on the search capabilities of our document management system and present some of the future challenges for collaborative document management.
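As a generic illustration of such an extended document value (all names and fields invented; this is not the paper's Document type, and no TX SQL syntax is shown), a document bundles text with structure and per-edit creation metadata behind an editing API.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Document:
    """Illustrative document value: text plus structure and creation metadata."""
    text: str = ""
    template: str = "plain"
    layout: dict = field(default_factory=dict)      # e.g. {"font": "serif"}
    semantics: dict = field(default_factory=dict)   # e.g. {"section:intro": (0, 120)}
    history: list = field(default_factory=list)     # who changed what, and when

    def insert(self, offset, s, author):
        """A text-editing API call that records authorship of each change."""
        self.text = self.text[:offset] + s + self.text[offset:]
        self.history.append((author, datetime.now(), offset, s))

doc = Document(template="report")
doc.insert(0, "Collaborative editing inside the DBMS.", "alice")
doc.insert(0, "Abstract: ", "bob")
print(doc.text)      # Abstract: Collaborative editing inside the DBMS.
print(doc.history)   # two edits, each attributed to its author
```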
Michael Hanspeter Böhlen, Johann Gamper, Christian S Jensen, How would you like to aggregate your temporal data?, In: 13th International Symposium on Temporal Representation and Reasoning (TIME 2006), IEEE, 2006-06-15. (Conference or Workshop Paper published in Proceedings)
Real-world data management applications generally manage temporal data, i.e., they manage multiple states of time-varying data. The research community has contributed much on how to better model, store, and query temporal data; in particular, several dozen temporal data models and query languages have been proposed. Motivated in part by the emergence of non-traditional data management applications and the increasing proliferation of temporal data, this paper focuses on the aggregation of temporal data. In particular, it provides a general framework of temporal aggregation concepts, and it discusses the abilities of five approaches to the design of temporal query languages with respect to temporal aggregation. Rather than providing focused, polished results, the paper's aim is to explore the inherent support for temporal aggregation in an informal manner that may serve as a foundation for further exploration.
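As a worked example of one concept such a framework covers, instantaneous temporal aggregation computes an aggregate at every time instant; the result only changes at interval endpoints, so it suffices to evaluate each segment between consecutive endpoints. A minimal sketch with made-up half-open [start, end) intervals:

```python
# Count, at every instant, how many tuples are valid.
tuples = [(1, 3), (2, 6), (4, 8)]   # valid-time intervals of three tuples

endpoints = sorted({t for iv in tuples for t in iv})
result = []
for lo, hi in zip(endpoints, endpoints[1:]):
    count = sum(1 for s, e in tuples if s <= lo and hi <= e)
    result.append(((lo, hi), count))

print(result)
# [((1, 2), 1), ((2, 3), 2), ((3, 4), 1), ((4, 6), 2), ((6, 8), 1)]
```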
Stefania Leone, Extending database technology: a new document data type, In: CAiSE 2006 Doctoral Consortium, Luxembourg, June 2006. (Conference or Workshop Paper)
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Generic Similarity Detection in Ontologies with the SOQA-SimPack Toolkit, In: SIGMOD Conference, ACM, New York, NY, USA, June 2006. (Conference or Workshop Paper)
Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. In this demo, we present the SOQA-SimPack Toolkit (SST), an ontology-language-independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies.
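To give a flavor of one simple similarity measure that such toolkits generalize over (a generic edge-counting sketch over an invented mini-taxonomy; this is not the SST API), concepts that are closer in the is-a graph score as more similar.

```python
# Invented mini-taxonomy: each concept points to its is-a parent.
parents = {"Car": "Vehicle", "Bicycle": "Vehicle",
           "Vehicle": "Thing", "Person": "Thing"}

def path_length(a, b):
    """Shortest number of is-a edges between two concepts."""
    def ancestors(c):
        chain, d = {c: 0}, 0
        while c in parents:
            c, d = parents[c], d + 1
            chain[c] = d
        return chain
    ca, cb = ancestors(a), ancestors(b)
    return min(ca[c] + cb[c] for c in set(ca) & set(cb))

def similarity(a, b):
    return 1.0 / (1.0 + path_length(a, b))

print(similarity("Car", "Bicycle"))  # 1/(1+2) ~ 0.33: siblings under Vehicle
print(similarity("Car", "Person"))   # 1/(1+3) = 0.25: related only via Thing
```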