Contribution Details

Type	Bachelor's Thesis
Scope	Discipline-based scholarship
Title	Temporal Filtering to Improve Temporal Duplicate Detection
Organization Unit	Database Technology (Michael Hanspeter Böhlen)
Authors	Gionata Genazzi
Supervisors	Michael Hanspeter Böhlen, Pei Li
Language	English
Institution	University of Zurich
Faculty	Faculty of Economics, Business Administration and Information Technology
Number of Pages	66
Date	2015
Abstract Text	Duplicate detection is the problem of identifying records in a given data set that refer to the same real-world entity. A quantitative way of solving duplicate detection is to perform a similarity join. A large collection of algorithms performs the similarity join using the filter-verification framework. Algorithms of this type exploit techniques such as prefix filtering, positional filtering, and suffix filtering to perform the join efficiently. However, implementations of these algorithms ignore the temporal information of records, which can enhance duplicate detection. In this thesis, we refine the prefix filtering, positional filtering, and suffix filtering techniques so that similarity joins also use the temporal information of records. Specifically, we propose three algorithms that perform temporal similarity joins with a temporal Jaccard similarity threshold. Experimental results show that these algorithms can considerably improve the performance of exact temporal similarity joins with temporal Jaccard compared to the brute-force approach.
Summary	Duplicate detection is the task of finding multiple representations of the same real-world object. Performing a similarity join is a quantitative solution to duplicate detection problems. A broad category of similarity join algorithms uses the so-called filter-verification framework; these algorithms use techniques such as prefix filtering, positional filtering, and suffix filtering to perform the similarity join efficiently. However, these techniques do not take the temporal information of the data into account. Using temporal information can further improve duplicate detection. In this thesis, we refine the prefix filtering, positional filtering, and suffix filtering techniques so that they can exploit the temporal information of the data during similarity joins. We present three algorithms that solve temporal similarity join problems. Experiments on real-world data sets show that the three proposed algorithms can considerably improve the efficiency of temporal similarity joins compared to the brute-force approach.
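The filter-verification framework mentioned in the abstract can be illustrated with a minimal sketch of classic, non-temporal prefix filtering for a Jaccard self-join. This is not one of the three algorithms proposed in the thesis: the abstract does not define temporal Jaccard, so the sketch ignores time entirely. The function names `jaccard` and `prefix_filter_join`, the token-set record model, and the small epsilon guard are assumptions made for illustration only, and the threshold is assumed to satisfy 0 < threshold <= 1.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Plain Jaccard similarity of two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_filter_join(records, threshold):
    """Self-join: return (i, j, sim) with i < j and jaccard >= threshold.

    Hypothetical helper; assumes 0 < threshold <= 1 and that each record
    is a collection of hashable, sortable tokens.
    """
    # Global token order: rare tokens first, so prefixes hold selective tokens.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records with that token in their prefix
    result = []
    for rid, tokens in enumerate(ordered):
        # Two sets with Jaccard >= threshold must share a token within their prefixes.
        # The epsilon guards against float round-up making the prefix too short.
        min_overlap = math.ceil(threshold * len(tokens) - 1e-9)
        prefix_len = len(tokens) - min_overlap + 1

        # Filter step: collect candidates that share at least one prefix token.
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)

        # Verification step: compute the exact similarity for each candidate.
        for cid in sorted(candidates):
            sim = jaccard(tokens, ordered[cid])
            if sim >= threshold:
                result.append((cid, rid, sim))
    return result
```

Calling `prefix_filter_join(list_of_token_sets, 0.5)` returns the same pairs as comparing every pair with `jaccard`, but only verifies pairs that survive the prefix filter. Positional and suffix filtering would prune the candidate set further before verification; they are omitted here to keep the sketch short.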