Robert Christensen
Master/PhD student. Now at Visa Research.


Book Chapter  Journal  Conference  Workshop  Tech Report]

Conference

2018

  • Random Sampling Over Joins Revisited, Talk
    By Zhuoyue Zhao,    Robert Christensen,    Feifei Li,    Xiao Hu,    Ke Yi
    In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018),  pages 1525-1539,  June,  2018.
    Abstract

    Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many analytical queries. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R join S) ne sample(R) join sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.

  • 2016

  • Spatial Online Sampling and Aggregation, Talk
    By Lu Wang,    Robert Christensen,    Feifei Li,    Ke Yi
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016),  pages 84-95,  New Delhi India,  August,  2016.
    Abstract

    The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

  • 2015

  • STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data (Project Website), Talk
    By Robert Christensen ,    Lu Wang,    Feifei Li,    Ke Yi,    Jun Tang,    Natalee Villa
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015),  pages 1111-1116,  Melbourne, Australia,  2015.
    Abstract

    We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data. STORM supports interactive spatio-temporal analytics through novel spatial online sampling techniques. Online spatio-temporal aggregation and analytics are then derived based on the online samples, where approximate answers with approximation quality guarantees can be provided immediately from the start of query execution. The quality of these online approximations improve over time. This demonstration proposal describes key ideas in the design of the STORM system, and presents the demonstration plan.

  • 2013

  • Adaptive Log Compression for Massive Log Data (Project Website), Talk
    By Robert Christensen,    Feifei Li
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013, Undergraduate Research Poster) (SIGMOD 2013),  pages 1283-1284,  June,  2013.
    Abstract

    We present novel adaptive log compression schemes. Results show 30% improvement on compression ratios over existing approaches.

  • Optimal Splitters for Temporal and Multi-version Databases, Talk
    By Wangchao Le,    Feifei Li,    Yufei Tao,    Robert Christensen
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013),  pages 109-120,  June,  2013.
    Abstract

    Temporal and multi-version databases often generate massive amounts of data, due to the increasing availability of large storage space and the increasing importance of mining and auditing opera- tions from historical data. For example, Google now allows users to limit and rank search results by setting a time range. These data- bases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve a good load balancing algorithm for storage and processing of these data, which is done by partitioning the database. In this paper, we introduce the concept of optimal splitters for temporal and multi- version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk-resident data respectively, and show that they significantly outperform com- peting baseline methods both theoretically and empirically on large real data sets.