Not logged in.

Contribution Details

Type Journal Article
Scope Discipline-based scholarship
Title The pq-gram distance between ordered labeled trees
Organization Unit
Authors
  • N Augsten
  • Michael Hanspeter Böhlen
  • J Gamper
Item Subtype Original Work
Refereed Yes
Status Published in final form
Language
  • English
Journal Title ACM Transactions on Database Systems (TODS)
Publisher Association for Computing Machinery
Geographical Reach international
ISSN 0362-5915
Volume 35
Number 1
Page Range 4:1 - 4:36
Date 2010
Abstract Text When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
Digital Object Identifier 10.1145/1670243.1670247
Other Identification Number 1441; merlin-id:95
PDF File Download from ZORA
Export BibTeX
EP3 XML (ZORA)
Additional Information © ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Trans. Database Syst., VOL 35, ISS 1, (2010)