Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely on constructing a distance matrix for pruning the search space or building summaries from the RDF graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect/incomplete results. To address these issues, we propose an effective summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant pruning powers to exploratory keyword search and result in much better efficiency compared to previous works. Unlike existing techniques, our search algorithms always return correct results. Besides, the summaries we built can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF data sets show that our techniques are scalable and efficient.
Query execution assurance is an important concept in defeating lazy servers in the database as a service model. We show that extending query execution assurance to outsourced databases with multiple data owners is highly inefficient. To cope with lazy servers in the distributed setting, we propose query access assurance (QAA) that focuses on IO-bound queries. The goal in QAA is to enable clients to verify that the server has honestly accessed all records that are necessary to compute the correct query answer, thus eliminating the incentives for the server to be lazy if the query cost is dominated by the IO cost in accessing these records. We formalize this concept for distributed databases, and present two efficient schemes that achieve QAA with high success probabilities. The first scheme is simple to implement and deploy, but may incur excessive server to client communication cost and verification cost at the client side, when the query selectivity or the database size increases. The second scheme is more involved, but successfully addresses the limitation of the first scheme. Our design employs a few number theory techniques. Extensive experiments demonstrate the efficiency, effectiveness and usefulness of our schemes.
The database community has devoted extensiveamount of efforts to indexing and querying temporaldata in the past decades. However, insufficientamount of attention has been paid to temporal rankingqueries. More precisely, given any time instance t, thequery asks for the top-k objects at time t with respect tosome score attribute. Some generic indexing structuresbased on R-trees do support ranking queries on temporaldata, but as they are not tailored for such queries,the performance is far from satisfactory. We presentthe Seb-tree, a simple indexing scheme that supportstemporal ranking queries much more efficiently. TheSeb-tree answers a top-k query for any time instance tin the optimal number of I/Os in expectation, namely,O(log_B(N/B k/B)) I/Os, where N is the size of the data set and B is the disk block size. The index has near-linearsize (for constant and reasonable kmax values, where kmax is the maximum value for the possible values of the query parameter k), can be constructed in near-lineartime, and also supports insertions and deletions without affecting its query performance guarantee. Most ofall, the Seb-tree is especially appealing in practice dueto its simplicity as it uses the B-tree as the only building block. Extensive experiments on a number of largedata sets, show that the Seb-tree is more than an orderof magnitude faster than the R-tree based indexes for temporal ranking queries.
Temporal and multi-version databases often generate massive amounts of data, due to the increasing availability of large storage space and the increasing importance of mining and auditing opera- tions from historical data. For example, Google now allows users to limit and rank search results by setting a time range. These data- bases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve a good load balancing algorithm for storage and processing of these data, which is done by partitioning the database. In this paper, we introduce the concept of optimal splitters for temporal and multi- version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk-resident data respectively, and show that they significantly outperform com- peting baseline methods both theoretically and empirically on large real data sets.
This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to be extended to account for RDF data model and graph query patterns expressed in SPARQL. In light of the NP-hardness of the multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The comprehensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.
The problem of answering SPARQL queries over virtual SPARQL views is commonly encountered in a number of settings, including while enforcing security policies to access RDF data, or when integrating RDF data from disparate sources. We approach this problem by rewriting SPARQL queries over the views to equivalent queries over the underlying RDF data, thus avoiding the costs entailed by view materialization and maintenance. We show that SPARQL query rewriting combines the most challenging aspects of rewriting for the relational and XML cases: like the relational case, SPARQL query rewriting requires synthesizing multiple views; like the XML case, the size of the rewritten query is exponential to the size of the query and the views. In this paper, we present the first native query rewriting algorithm for SPARQL. For an input SPARQL query over a set of virtual SPARQL views, the rewritten query resembles a union of conjunctive queries and can be of exponential size. We propose optimizations overthe basic rewriting algorithm to (i) minimize each conjunctive query in the union; (ii) eliminate conjunctive queries with empty results from evaluation; and (iii) efficiently prune out big portions of the search space of empty rewritings. The experiments, performed on two RDF stores, show that our algorithms arescalable and independent of the underlying RDF stores. Furthermore, our optimizations have order of magnitude improvements over the basic rewriting algorithm in both the rewriting size and evaluation time.