CAREER: Novel Query Processing Techniques for Probabilistic Data


Data are increasingly generated, stored, and processed distributively. Meanwhile, when large amounts of data are generated, ambiguity, uncertainty, and errors are inherently introduced, especially in a distributed setup. It is best to represent such data in a distributed probabilistic database. In distributed data management, summary queries are useful tools for obtaining the most important answers from massive quantities of data effectively and efficiently, e.g., top-k queries, heavy hitters (aka frequent items), histograms and wavelets, threshold monitoring queries, etc. This project investigates novel query processing techniques for various, important summary queries in distributed probabilistic data. Broadly classified, this project examines both snapshot summary queries in static (i.e., no updates) distributed probabilistic databases, and continuous summary queries in dynamic (i.e., with updates) distributed probabilistic databases. A number of techniques are explored to design novel, communication and computation efficient algorithms for processing these queries. A distributed probabilistic data management system (DPDMS) prototype is implemented based on the query processing techniques developed in this project. This DPDMS is released to and used in practice by scientists and engineers from other science disciplines as well as industry. Graduate and undergraduate students, including those from minority groups, are actively involved in this project. Findings from the project have been integrated into different courses, demos, and educational projects.


Funding


  • Funded by NSF IIS Program udner the project ''CAREER: Novel Query Processing Techniques for Probabilistic Data'', sole PI, Feifei Li, 02/01/11-01/31/16, $498,138
  • http://www.cs.utah.edu/~lifeifei/dpdm/

  • People


    Feifei Li
    Associate Professor


    Mingwang Tang
    PhD Student. Graduated in Summer 2014. Current Employment: Uber.


    Jeffrey Jestes
    PhD Student, Graduated in summer 2013. Current employment: Cerner Corporation


    Dong Xie
    PhD student. Research Interest: big spatial data systems. Distributed systems.



    Source



    Dataset



    Publications


  • Semantics of Ranking Queries for Probabilistic Data (Project Website)
    By Jeffrey Jestes,    Graham Cormode,    Feifei Li,    Ke Yi
    Vol.250, Pages 545-556, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2010.
    Abstract

    Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define several fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are satisfied by ranking queries on certain data. We argue these properties should also be carefully studied in defining ranking queries in probabilistic data, and fulfilled by definition for ranking uncertain data for most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We studied the ranking definitions based on the expectation, the median and other statistics of this rank distribution for a tuple and derived the expected rank, median rank and quantile rank correspondingly. We are able to prove that the expected rank, median rank and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.

  • Semantics of Ranking Queries for Probabilistic Data and Expected Ranks (Project Website), Talk
    By Graham Cormode,    Feifei Li,    Ke Yi
    In Proceedings of 25th IEEE International Conference on Data Engineering (ICDE 2009),  pages 305-316,  Shanghai, China,  April,  2009.
    Abstract

    When dealing with massive quantities of data, topkqueries are a powerful technique for returning only the kmost relevant tuples for inspection, based on a scoring function.The problem of efficiently answering such ranking queries hasbeen studied and analyzed extensively within traditional databasesettings. The importance of the top-k is perhaps even greater inprobabilistic databases, where a relation can encode exponentiallymany possible worlds. There have been several recent attemptsto propose definitions and algorithms for ranking queries overprobabilistic data. However, these all lack many of the intuitiveproperties of a top-k over deterministic data. Specifically, wedefine a number of fundamental properties, including exact-k,containment, unique-rank, value-invariance, and stability, whichare all satisfied by ranking queries on certain data. We arguethat all these conditions should also be fulfilled by any reasonabledefinition for ranking uncertain data. Unfortunately, none of theexisting definitions is able to achieve this.To remedy this shortcoming, this work proposes an intuitivenew approach of expected rank. This uses the well-founded notionof the expected rank of each tuple across all possible worldsas the basis of the ranking. We are able to prove that, incontrast to all existing approaches, the expected rank satisfiesall the required properties for a ranking query. We provideefficient solutions to compute this ranking across the majormodels of uncertain data, such as attribute-level and tuple-leveluncertainty. For an uncertain relation of N tuples, the processingcost is O(N logN)—no worse than simply sorting the relation.In settings where there is a high cost for generating each tuple inturn, we provide pruning techniques based on probabilistic tailbounds that can terminate the search early and guarantee thatthe top-k has been found. Finally, a comprehensive experimentalstudy confirms the effectiveness of our approach.

  • Finding Frequent Items in Probabilistic Data (Project Website), Talk
    By Qin Zhang,    Feifei Li,    Ke Yi
    In Proceedings of 27th ACM SIGMOD International Conference on Management of Data (SIGMOD 2008),  pages 819-832,  Vancouver, Canada,  June,  2008.
    Abstract

    Computing statistical information on probabilistic data hasattracted a lot of attention recently, as the data generatedfrom a wide range of data sources are inherently fuzzy oruncertain. In this paper, we study an important statisti-cal query on probabilistic data: finding the frequent items.One straightforward approach to identify the frequent itemsin a probabilistic data set is to simply compute the expectedfrequency of an item and decide if it exceeds a certain frac-tion of the expected size of the whole data set. However,this simple definition misses important information aboutthe internal structure of the probabilistic data and the in-terplay among all the uncertain entities. Thus, we proposea new definition based on the possible world semantics thathas been widely adopted for many query types in uncertaindata management, trying to find all the items that are likelyto be frequent in a randomly generated possible world. Ourapproach naturally leads to the study of ranking frequentitems based on confidence as well.Finding likely frequent items in probabilistic data turnsout to be much more difficult. We first propose exact algo-rithms for offline data with either quadratic or cubic time.Next, we design novel sampling-based algorithms for stream-ing data to find all approximately likely frequent items withtheoretically guaranteed high probability and accuracy. Oursampling schemes consume sublinear memory and exhibitexcellent scalability. Finally, we verify the effectiveness andefficiency of our algorithms using both real and syntheticdata sets with extensive experimental evaluations.

  • Efficient Processing of Top-k Queries in Uncertain Databases (Project Website), Talk
    By Ke Yi,    Feifei Li,    George Kollios,    Divesh Srivastava
    In Proceedings of 24th IEEE International Conference on Data Engineering (ICDE 2008),  pages 1406-1408,  Cancun, Mexico,  April,  2008.
    Abstract

    This work introduces novel polynomial-time algorithmsfor processing top-k queries in uncertain databases,under the generally adopted model of x-relations. An x-relationconsists of a number of x-tuples, and each x-tuple randomlyinstantiates into one tuple from one or more alternatives. Ourresults signi cantly improve the best known algorithms for top-kquery processing in uncertain databases, in terms of both runningtime and memory usage. Focusing on the single-alternative case,the new algorithms are orders of magnitude faster.

  • Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations (Project Website)
    By Ke Yi,    Feifei Li,    George Kollios,    Divesh Srivastava
    Vol.20, Pages 1669-1682, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2008.
    Abstract

    This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both runtime and memory usage. In the single-alternative case, the new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multialternative case, we introduce the first-known polynomial algorithms, while the current best algorithms have exponential complexity in both time and space. Our algorithms run in near linear or low polynomial time and cover both types of top-k queries in uncertain databases. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches over existing solutions.

  • Efficient Processing of Top-k Queries on Uncertain Databases (Project Website)
    By Ke Yi,    Feifei Li,    Divesh Srivastava,    George Kollios
    Technical Report,  June,  2007.
    Abstract
  • Efficient and Scalable Monitoring and Summarization of Large Probabilistic Data
    By Mingwang Wang
    In Proceedings of 32nd ACM SIGMOD International Conference on Management of Data (SIGMOD 2013 PHD SYMPOSIUM),  pages 61-66,  June,  2013.
    Abstract

    In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to incorporate and handle data with uncertainties as a first-class citizen, where uncertain data are represented as probabilistic relations. In my thesis, my work has focused on monitoring and summarization of large probabilistic data. Specifically, we extended the distributed threshold monitoring problem to distributed probabilistic data. Instead, we actually need to monitor the aggregated value (e.g. sum) of distributed probabilistic data against both the score threshold and the probability threshold, which make the techniques designed for deterministic data are not directly applicable. Our algorithms have significantly reduced both the communication and computation costs as shown by an extensive experimental evaluation on large real datasets. On the other hand, building histograms to summarize the distribution of certain feature in a large data set is a fundamental problem in data management. Recent work have extended this studies to probabilistic data, but their methods suffer from the limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our implementations in MapReduce, when compared to the existing, state-of-the-art centralized methods.

  • Scalable Histograms on Large Probabilistic Data, Talk
    By Mingwang Tang,    Feifei Li
    In Proceedings of 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2014),  pages 631-640,  New York,  2014.
    Abstract

    Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be extremely expensive, and existing studies suffer from limited scalability. This work designs novel approximation methods to construct scalable histograms on probabilistic data. We show that our methods provide constant approximations compared to the optimal histograms produced by the state-of-the-art in the worst case. We also extend our methods to parallel and distributed settings so that they can run gracefully in a cluster of commodity machines. We introduced novel synopses to reduce communication cost when running our methods in such settings. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our methods, when compared to the state-ofthe-art methods. They also achieved excellent approximation quality in practice.

  • Distributed Online Tracking, Talk
    By Mingwang Tang,    Feifei Li,    Yufei Tao
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015),  pages 2047-2061,  Melbourne, Australia,  2015.
    Abstract

    In online tracking, an observer S receives a sequence of values, one per time instance, from a data source that is described by a function f. A tracker T wants to continuously maintain an approximation that is within an error threshold of the value f(t) at any time instance t, with small communication overhead. This problem was recently formalized and studied in, and a principled approach with optimal competitive ratio was proposed. This work extends the study of online tracking to a distributed setting, where a tracker T wants to track a function f that is computed from a set of functions {f1, . . . , fm} from m distributed observers and respective data sources. This formulation finds numerous important and natural applications, e.g., sensor networks, distributed systems, measurement networks, and pub-sub systems. We formalize this problem and present effective online algorithms for various topologies of a distributed system/network for different aggregate functions. Experiments on large real data sets demonstrate the excellent performance of our methods in practice.

  • Exact and Approximate Flexible Aggregate Similarity Search
    By Feifei Li,    Ke Yi,    Yufei Tao,    Bin Yao,    Yang Li,    Dong Xie,    Min Wang
    Vol.25(4), Pages 317-338, The International Journal on Very Large Data Bases (VLDBJ),  2016.
    Abstract

    Aggregate similarity search, also known as aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the simi- larity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggre- gation over the distances between p and any subset of phi objects in Q for some support 0 < phi le 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggre- gate nearest neighbor (Fann) query. We present algo- rithms for answering Fann queries exactly and approx- imately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experi- ments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

  • Simba: Efficient In-Memory Spatial Analytics (Project Website), Talk
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Minyi Guo
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 1071-1085,  June,  2016.
    Abstract

    Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces the concept and construction of indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba%u2019s superior performance compared against other spatial analytics system.

  • Simba: Spatial In-Memory Big Data Analytics (Project Website)
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Zhongpu Chen,    Minyi Guo
    In Proceedings of In Proceedings of 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2016),  pages TBA,  San Francisco, USA,  November,  2016.
    Abstract

    We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba natively extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and DataFrame API. It enables the construction of indexes over RDDs inside the engine in order to work with big spatial data and complex spatial operations. Simba also comes with an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput in big spatial data analysis. This demonstration proposal describes key ideas in the design of Simba, and presents a demonstration plan.