Zhuoyue Zhao
PhD Student, Research Interests: distributed systems,big data analytics.


Book Chapter  Journal  Conference  Workshop  Tech Report]

Journal

2017

  • Wander Join and XDB: Online Aggregation via Random Walks
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    Vol.0, ACM SIGMOD Record (SIGMOD RECORD),  2017.
    Abstract

    Joins are expensive, and online aggregation is an e↵ective approach to explore the tradeo↵ between query e"ciency and accuracy in a continuous, online fashion. However, the stateof-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and needs strong assumptions (e.g., the tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Selection predicates and group-by clauses can be handled as well. We have developed an online engine called XDB by integrating wander join in the latest version of PostgreSQL. Extensive experiments using the TPC-H benchmark have shown the superior performance of wander join. The XDB implementation has demonstrated its practicality in a full-fledged database system

  • InferSpark: Statistical Inference at Scale. (Project Website)
    By Zhuoyue Zhao,   Eric Lo,   Kenny Q. Zhu,   Chris Liu,   
    Vol.abs/1707.02047, CoRR (CORR 2017),  2017.
    Abstract
  • Conference

    2018

  • Random Sampling Over Joins Revisited
    By Zhuoyue Zhao,    Robert Christensen,    Feifei Li,    Xiao Hu,    Ke Yi
    (To Appear) In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018),  pages ??-??,  June,  2018.
    Abstract

    Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many analytical queries. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R join S) \ne sample(R) join sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.

  • 2016

  • Wander Join: Online Aggregation via Random Walks, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 615-629,  June,  2016.
    Abstract

    Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. Wander join produces independent but non-uniform samples, which gives huge performance gains over the uniform but non-independent samples returned by ripple join. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Wander join works for different types of joins including chain, acyclic, cyclic joins, with selection predicates and group-by clauses. It easily extends to $theta$-joins as well. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join in both internal and external memory. We have also implemented and tested wander join in the latest version of PostgreSQL; the results show its excellent efficiency and effectiveness in a full fledged, commercial level DBMS.

  • Wander Join: Online Aggregation For Joins, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data 2016 (Demo Paper) (SIGMOD 2016),  pages 2121-2124,  June,  2016.
    Abstract

    Joins are expensive, and online aggregation over joins wasproposed to mitigate the cost, which offers a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and may also need very restrictive assumptions (e.g., tuples in a table are stored in random order). We have introduced a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We have also implemented and tested wander join in the latest version of PostgreSQL; this is the first time online aggregation has been integrated into a full fledged, commercial level DBMS.