Bin Yao | |

[ Book Chapter Journal Conference Workshop Tech Report]

By Feifei Li, Ke Yi, Yufei Tao, Bin Yao, Yang Li, Dong Xie, Min Wang

Vol.25(4), Pages 317-338,

Aggregate similarity search, also known as aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the simi- larity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggre- gation over the distances between p and any subset of phi objects in Q for some support 0 < phi le 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggre- gate nearest neighbor (Fann) query. We present algo- rithms for answering Fann queries exactly and approx- imately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experi- ments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

By Bin Yao, Xiaokui Xiao, Feifei Li, Yifan Wu

Vol.23(5), Pages 697-720,

Optimal location (OL) queries are a type of spatial queries that are particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an L p space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their L p distance. Motivated by the deficiency of the existing techniques, this paper presents a comprehensive study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We further extend our framework to efficiently monitor the OLs when locations for facilities and/or clients have been updated. Our dynamic update methods lead to efficient answering of continuous optimal location queries. We demonstrate the efficiency of our solutions through extensive experiments with large real data.

By Feifei Li, Bin Yao, Mingwang Tang, Marios Hadjieleftheriou

Vol.25(6), Pages 1394-1409,

This work deals with the approximate string search in large spatial databases. Speci%uFB01cally, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to %uFB01nd balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which signi%uFB01cantly outperforms the baseline algorithm in practice. The RSASSOL combines the q-gram based inverted lists and the reference nodes based pruning. Extensive experiments on large real data sets demonstrate the ef%uFB01ciency and effectiveness of our approaches.

By Feifei Li, Bin Yao, Piyush Kumar

Vol.23, No. 10, Pages 1526-1540,

Given a set of points P and a query set Q, a group enclosing query (GEQ) fetches the point p such that the maximum distance of p to all points in Q is minimized. This problem is equivalent to the Min-Max case (minimizing the maximum distance) of aggregate nearest neighbor queries for spatial databases. This work first designs a new exact solution by exploring new geometric insights, such as the minimum enclosing ball, the convex hull and the furthest voronoi diagram of the query group. To further reduce the query cost, especially when the dimensionality increases, we turn to approximation algorithms. Our main approximation algorithm has a worst case p2-approximation ratio if one can find the exact nearest neighbor of a point. In practice, its approximation ratio never exceeds 1.05 for a large number of data sets up to six dimension.We also discuss how to extend it to higher dimensions (up to 74 in our experiment) and show that it still maintains a very good approximation quality (still close to 1) and low query cost. In fixed dimensions,we extend the p2-approximation algorithm to get a (1 epsilon)-approximate solution for the GEQ problem. Both approximation algorithms have O(logN M) query cost in any fixed dimension, where N and M are the sizes of the data set P and query group Q. Extensive experiments on both synthetic and real data sets, up to 10 million points and 74 dimensions, confirm the efficiency, effectiveness and scalability of the proposed algorithms, especially their significant improvement over the state-of-the-art method.

By Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, Zhongpu Chen, Minyi Guo

In Proceedings of In Proceedings of 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (

We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba natively extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and DataFrame API. It enables the construction of indexes over RDDs inside the engine in order to work with big spatial data and complex spatial operations. Simba also comes with an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput in big spatial data analysis. This demonstration proposal describes key ideas in the design of Simba, and presents a demonstration plan.

By Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, Minyi Guo

In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (

Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces the concept and construction of indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba%u2019s superior performance compared against other spatial analytics system.

By Feifei Li , Guoliang Li, Seung-won Hwang, Bin Yao , Zhenjie Zhang,

In Proceedings of WAIM (

By Bin Yao, Feifei Li, Xiaokui Xiao

In Proceedings of 29th IEEE International Conference on Data Engineering (

The increasing popularity of the cloud drives the demands for secure queries on an encrypted database E(D) stored in the cloud. In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a server and asks for an encrypted data point in E(D) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods, even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field, and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models, our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G subseteq D), such that that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database. Note that E can be any well-established encryption schemes. The efficiency and scalability of our methods are demonstrated through extensive experiments on large real data.

By Bin Yao, Mingwang Tang, Feifei Li

In Proceedings of 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (

GIS data usually consist of both spatial and textual information,where the spatial component represents the location ofthe object and the textual element contains a set of stringsdescribing object in that location. For GIS data situated ona road network, shortest path search is a basic operation. Inpractice, however, users are often interested at routing whencertain constraints on the textual information have been alsoincorporated. This work complements the standard shortestpath search with multiple keywords and an approximatestring similarity function, where the goal is to find the shortestpath that passes through at least one matching objectper keyword; we dub this problem the multi-approximatekeywordrouting (makr) query. We present both exact andapproximate solutions. When the number %u03BA of query keywordsis small (e.g., %u03BA %u2264 6), the exact solution works efficiently.However, when %u03BA increases, it becomes increasinglyexpensive (especially on large GIS data). In this case, ourapproximate methods achieve superb query efficiency, excellentscalability, and high approximation quality, as indicatedin our extensive experiments on large, real datasets (up to 2million points on road networks with hundreds of thousandsof nodes and edges). We also prove that one approximatemethod has a %u03BA-approximation in the worst case.

By Yang Li, Feifei Li, Ke Yi, Bin Yao, Min Wang

In Proceedings of 30th ACM SIGMOD International Conference on Management of Data (

Aggregate similarity search, a.k.a. aggregate nearest neighbor (Ann) query, ﬁnds many useful applications in spatialand multimedia databases. Given a group Q of M query objects, it retrieves the most (or top-k) similar object to Q froma database P, where the similarity is an aggregation (e.g.,sum, max) of the distances between the retrieved object pand all the objects in Q. In this paper, we propose an addedﬂexibility to the query deﬁnition, where the similarity is anaggregation over the distances between p and any subset ofφM objects in Q for some support 0 < φ ≤ 1. We call thisnew deﬁnition ﬂexible aggregate similarity (Fann) search,which generalizes the Ann problem. Next, we present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing, which are simple, highly eﬃcient, and work wellin both low and high dimensions. They also return nearoptimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experiments on largereal and synthetic datasets from 2 to 74 dimensions havedemonstrated their superior eﬃciency and high quality.

By Xiaokui Xiao, Bin Yao, Feifei Li

In Proceedings of 27th IEEE International Conference on Data Engineering (

Optimal location (OL) queries are a type of spatial queries particularly useful for the strategic planning of resources.Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an Lp space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their Lp distance.Motivated by the deficiency of the existing techniques, this paper presents the first study on OL queries in road networks.We propose a unified framework that addresses three variants of OL queries that find important applications in practice,and we instantiate the framework with several novel query processing algorithms. We demonstrate the efficiency of our solutions through extensive experiments with real data.

By Bin Yao, Feifei Li, Marios Hadjieleftheriou, Kun Hou

In Proceedings of 26th IEEE International Conference on Data Engineering (

This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balancedpartitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.

By Bin Yao, Feifei Li, Piyush Kumar

In Proceedings of 26th IEEE International Conference on Data Engineering (

Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts tosolve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required,which may limit their application on large data sets that are stored in a relational database management system. Furthermore,these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join ina relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize.We design algorithms that could be implementedby SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with aconstant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.

By Bin Yao, Feifei Li, Piyush Kumar

In Proceedings of 25th IEEE International Conference on Data Engineering (

Given a set of points $P$ and a query point q, the reverse furthest neighbor (RFN) query fetches the set of points $p in P$ such that q is their furthest neighbor among all points in $Pcup{q}$. This is the monochromatic RFN (MRFN) query. Another interesting version of RFN query is the bichromatic reverse furthest neighbor (BRFN) query. Given a set of points P, a queryset Q and a query point $q in Q$, a BRFN query fetches the set of points $pin P$ such that q is the furthest neighbor of p amongall points in Q. The RFN query has many interesting applications in spatial databases and beyond. For instance, given a large residential database (as P) and a set of potential sites (as Q) for building a chemical plant complex, the construction site should be selected as the one that has the maximum number of reverse furthest neighbors. This is an instance of the BRFN query. This paper presents the challenges associated with such queries and proposes efficient, R-tree based algorithms for both monochromatic and bichromatic versions of the RFN queries. We analyze properties of the RFN query that differentiate it from the widely studied reverse nearest neighbor queries and enable the design of novel algorithms. Our approach takes advantage of the furthest Voronoi diagrams as well as the convex hulls of either the data set P (in the MRFN case) or thequery set Q (in the BRFN case). For the BRFN queries, we also extend the analysis to the situation when Q is large in size andbecomes disk-resident. Experiments on both synthetic and real data sets confirm the efficiency and scalability of proposed algorithms over the brute-force search based approach.