III: Small: Towards a Database Engine for Interactive and Online Sampling and Analytics


Existing databases and data management systems are designed and optimized for executing queries and analytical jobs in their entirety. User interactions with such systems are limited to binary decisions: either wait for exact answers in the end, or terminate a job while it is still running and obtain little or even zero knowledge regarding the final output. This model no loner works well with big data as waiting for the exact query or analytical results may take a long time. The good news is that often users do not need exact results in big data computation


Funding


  • instead, they are happy with approximations, especially for approximations with quality guarantees. It is even better if the approximation quality gradually improves over time in an online fashion, while the query is being executed, and users can control the efficiency-accuracy tradeoff in realtime. This project designs techniques for building a database engine that supports such interactive and online exploration and analytics. The results of this project is also useful for downstream data analytical modules such as information visualization.
  • Supported by NSF IIS program: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1619287

  • People


    Robert Christensen
    Master/PhD student. Now at Visa Research.


    Richie Frost
    Master student.


    Feifei Li
    Professor


    Zhuoyue Zhao
    PhD Student, Research Interests: distributed systems,big data analytics.


    Neeka Ebrahimi
    Master student.



    Source



    Publications


  • Spatial Online Sampling and Aggregation, Talk
    By Lu Wang,    Robert Christensen,    Feifei Li,    Ke Yi
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016),  pages 84-95,  New Delhi India,  August,  2016.
    Abstract

    The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

  • Wander Join: Online Aggregation via Random Walks, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 615-629,  June,  2016.
    Abstract

    Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. Wander join produces independent but non-uniform samples, which gives huge performance gains over the uniform but non-independent samples returned by ripple join. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Wander join works for different types of joins including chain, acyclic, cyclic joins, with selection predicates and group-by clauses. It easily extends to $theta$-joins as well. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join in both internal and external memory. We have also implemented and tested wander join in the latest version of PostgreSQL; the results show its excellent efficiency and effectiveness in a full fledged, commercial level DBMS.

  • Wander Join: Online Aggregation For Joins, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data 2016 (Demo Paper) (SIGMOD 2016),  pages 2121-2124,  June,  2016.
    Abstract

    Joins are expensive, and online aggregation over joins wasproposed to mitigate the cost, which offers a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and may also need very restrictive assumptions (e.g., tuples in a table are stored in random order). We have introduced a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We have also implemented and tested wander join in the latest version of PostgreSQL; this is the first time online aggregation has been integrated into a full fledged, commercial level DBMS.

  • Wander Join and XDB: Online Aggregation via Random Walks
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    Vol.0, ACM SIGMOD Record (SIGMOD RECORD),  2017.
    Abstract

    Joins are expensive, and online aggregation is an e↵ective approach to explore the tradeo↵ between query e"ciency and accuracy in a continuous, online fashion. However, the stateof-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and needs strong assumptions (e.g., the tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Selection predicates and group-by clauses can be handled as well. We have developed an online engine called XDB by integrating wander join in the latest version of PostgreSQL. Extensive experiments using the TPC-H benchmark have shown the superior performance of wander join. The XDB implementation has demonstrated its practicality in a full-fledged database system

  • Random Sampling Over Joins Revisited
    By Zhuoyue Zhao,    Robert Christensen,    Feifei Li,    Xiao Hu,    Ke Yi
    (To Appear) In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018),  pages ??-??,  June,  2018.
    Abstract

    Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many analytical queries. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R join S) \ne sample(R) join sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.

  • Persistent Bloom Filter: Membership Testing for the Entire History
    By Yanqing Peng,    Jinwei Guo,    Feifei Li,    Weining Qian,    Aoying Zhou
    (To Appear) In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018),  pages ??-??,  June,  2018.
    Abstract

    Membership testing is the problem of testing whether an element is in a set of elements. Performing the test exactly is expensive space-wise, requiring the storage of all elements in a set. In many applications, an approximate testing that can be done quickly using small space is often desired. Bloom filter (BF) was designed and has witnessed great success across numerous application domains. But there is no compact structure that supports set membership testing for temporal queries, e.g., has person A visited a web server between 9:30am and 9:40am? And has the same person visited the web server again between 9:45am and 9:50am? It is possible to support such ``temporal membership testing'' using a BF, but we will show that this is fairly expensive. To that end, this paper designs persistent bloom filter (PBF), a novel data structure for temporal membership testing with compact space.