Not logged in.

Quick Search - Contribution

Contribution Details

Type	Journal Article
Scope	Discipline-based scholarship
Title	Estimating the selectivity of approximate string queries
Organization Unit	Database Technology (Michael Hanspeter Böhlen)
Authors	Arturas Mazeika Michael Hanspeter Böhlen Nick Koudas Divesh Srivastava
Item Subtype	Original Work
Refereed	Yes
Status	Published in final form
Language	English
Journal Title	ACM Transactions on Database Systems
Publisher	Association for Computing Machinery
Geographical Reach	international
ISSN	0362-5915
Volume	32
Number	2
Page Range	12
Date	2007
Abstract Text	Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures.We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
Digital Object Identifier	10.1145/1242524.1242529
Other Identification Number	merlin-id:2809
PDF File	Download from ZORA
Export	BibTeX EP3 XML (ZORA)