Contribution Details

Type Journal Article
Scope Discipline-based scholarship
Title Robust and Scalable Content-and-Structure Indexing (Extended Version)
Organization Unit
Authors
  • Kevin Wellenzohn
  • Michael Hanspeter Böhlen
  • Sven Helmer
  • Antoine Pietri
  • Stefano Zacchiroli
Item Subtype Original Work
Refereed Yes
Status Published in final form
Language
  • English
Journal Title CoRR
Geographical Reach international
Volume abs/2209.05126
Page Range 1 - 28
Date 2022
Abstract Text Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest publicly available source code archive.