COFRS Award: Efficient Aggregate Similarity Search

For many geospatial and multimedia database applications, a fundamental challenge is to answer similaritysearch on large amounts of data, also known as the nearest neighbor (NN) query. Despite extensivestudies, little is known for aggregate similarity search that has witnessed an increasing number of applications.Given a data set P, an aggregate similarity query is specified by a group of query objects Q, anaggregator ? and a similarity function f involving at least the spatial distance between objects. The goalis to find an object p? ? P such that the aggregate similarity between p? and objects from Q is minimized.For example, when ? is sum over spatial distances, it is to find p? ? P that minimizes the sum of distancesbetween p? and every object from Q.The aggregate similarity search presents interesting and challenging research problems, in terms of bothquery semantics and query processing techniques. We will also explore parallel and relational (those thatcan be directly implemented by standard SQL statements) methods in this project.Due to the fundamental importance of the similarity search and the emerging applications for variousaggregate similarity search, our project will significantly improve the scientific work in geospatial intelligenceanalytics, including spatial intelligence, complex spatial data analysis, GIS, and location-based services.End-users may design customized aggregate similarity search to identify normal and discover anomalies incomplex geospatial and multimedia data.


  • FSU COFRS Award, PI, Feifei Li $14,000, 05/11-08/11.

  • Publications

  • Randomized Synopses for Query Assurance on Data Streams (Project Website), Talk
    By Ke Yi,    Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Divesh Srivastava
    In Proceedings of 24th IEEE International Conference on Data Engineering (ICDE 2008),  pages 416-425,  Cancun, Mexico,  April,  2008.

    The overwhelming flow of information in manydata stream applications forces many companies to outsourceto a third-party the deployment of a Data Stream Management System (DSMS) for performing desired computations. Remotecomputations intrinsically raise issues of trust, making query execution assurance on data streams a problem with practicalimplications. Consider a client observing the same data stream asa remote server (e.g., network traffic), that registers a continuousquery on the server’s DSMS, and receives answers upon request.The client needs to verify the integrity of the results usingsignificantly fewer resources than evaluating the query locally.Towards that goal, we propose a probabilistic algorithm forselection and aggregate/group-by queries, that uses constantspace irrespective of the result-set size, has low update cost,and arbitrarily small probability of failure. We generalize thisalgorithm to allow some tolerance on the number of errors permitted (irrespective of error magnitude), and also discuss thehardness of permitting arbitrary errors of small magnitude. Wealso perform an empirical evaluation using live network traffic.