Abstract
Joins are expensive, especially on large data and/or multiple relations.
One promising approach in mitigating their high costs is to just return
a simple random sample of the full join results, which is sufficient for
many analytical queries. Indeed, in as early as 1999, Chaudhuri et al.
posed the problem of sampling over joins as a fundamental challenge
in large database systems. They also pointed out a fundamental
barrier for this problem, that the sampling operator cannot be pushed
through a join, i.e., sample(R join S) ne sample(R) join sample(S). To
overcome this barrier, they used precomputed statistics to guide the
sampling process, but only showed how this works for two-relation
joins. This paper revisits this classic problem for both acyclic and cyclic
multi-way joins. We build upon the idea of Chaudhuri et al., but
extend it in several nontrivial directions. First, we propose a general
framework for random sampling over multi-way joins, which
includes the algorithm of Chaudhuri et al. as a special case. Second,
we explore several ways to instantiate this framework, depending
on what prior information is available about the underlying data,
and offer different tradeoffs between sample generation latency and
throughput. We analyze the properties of different instantiations
and evaluate them against the baseline methods; the results clearly
demonstrate the superiority of our new techniques.