Zhuoyue Zhao
PhD Student, Research Interests: distributed systems,big data analytics.


Book Chapter  Journal  Conference  Workshop  Tech Report]

Journal

2019

  • Wander Join and XDB: Online Aggregation via Random Walks
    By Feifei Li,    Bin Wu,    Zhuoyue Zhao,    Ke Yi
    Vol.44, Pages 2:1-2:41, ACM Transactions on Database Systems (ACM TODS), January, 2019.
    Abstract

    Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This article proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Compared with ripple join, wander join is particularly efficient for equality joins involving multiple tables, but also supports %u03B8joins. Selection predicates and group-by clauses can be handled as well. To demonstrate the usefulness of wander join, we have designed and implemented XDB (approXimate DB) by integrating wander join into various systems including PostgreSQL, Spark, and a stand-alone plug-in version using PL/SQL. The design and implementation of XDB has demonstrated wander join%u2019s practicality in a full-fledged database system. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join.

  • SolarDB: Towards a Shared-Everything Database on Distributed Log-Structured Storage
    By Tao Zhu,    Zhuoyue Zhao,    Feifei Li,    Weining Qian,    Aoying Zhou,    Dong Xie,    Ryan Stutsman,    Haining Li,    Huiqi HU
    Vol.25, Pages 11:1-11:26, ACM Transactions on Storage (ACM TOS), June , 2019.
    Abstract

    Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Though modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This paper presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include: 1) a shared-everything architecture based on a two-layer log-structured merge-tree; 2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; 3) fine-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank and a real-world workload, SolarDB outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.

  • 2017

  • Wander Join and XDB: Online Aggregation via Random Walks
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    Vol.0, ACM SIGMOD Record (SIGMOD Record), 2017.
    Abstract

    Joins are expensive, and online aggregation is an e↵ective approach to explore the tradeo↵ between query e"ciency and accuracy in a continuous, online fashion. However, the stateof-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and needs strong assumptions (e.g., the tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Selection predicates and group-by clauses can be handled as well. We have developed an online engine called XDB by integrating wander join in the latest version of PostgreSQL. Extensive experiments using the TPC-H benchmark have shown the superior performance of wander join. The XDB implementation has demonstrated its practicality in a full-fledged database system

  • InferSpark: Statistical Inference at Scale. (Project Website)
    By Zhuoyue Zhao,   Eric Lo,   Kenny Q. Zhu,   Chris Liu,   
    Vol.abs/1707.02047, CoRR (corr 2017), 2017.
  • Conference

    2018

  • Random Sampling Over Joins Revisited, Talk
    By Zhuoyue Zhao,    Robert Christensen,    Feifei Li,    Xiao Hu,    Ke Yi
    In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018), pages 1525-1539, June, 2018.
    Abstract

    Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many analytical queries. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R join S) ne sample(R) join sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.

  • 2016

  • Wander Join: Online Aggregation via Random Walks, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016), pages 615-629, June, 2016.
    Abstract

    Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. Wander join produces independent but non-uniform samples, which gives huge performance gains over the uniform but non-independent samples returned by ripple join. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Wander join works for different types of joins including chain, acyclic, cyclic joins, with selection predicates and group-by clauses. It easily extends to $theta$-joins as well. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join in both internal and external memory. We have also implemented and tested wander join in the latest version of PostgreSQL; the results show its excellent efficiency and effectiveness in a full fledged, commercial level DBMS.

  • Wander Join: Online Aggregation For Joins, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data 2016 (Demo Paper) (SIGMOD 2016), pages 2121-2124, June, 2016.
    Abstract

    Joins are expensive, and online aggregation over joins wasproposed to mitigate the cost, which offers a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and may also need very restrictive assumptions (e.g., tuples in a table are stored in random order). We have introduced a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We have also implemented and tested wander join in the latest version of PostgreSQL; this is the first time online aggregation has been integrated into a full fledged, commercial level DBMS.