BIGDATA: Small: DCM: DA: Building a Mergeable and Interactive Distributed Data Layer for Big Data Summarization Systems

Big data today is stored in a distributed fashion across many different machines or data sources. This poses new algorithmic and system challenges to performing efficient analysis on the full data set. To address these difficulties, the PIs are building the MIDDLE (Mergeable and Interactive Distributed Data LayEr) Summarization System and deploying it on large real-world datasets. The MIDDLE system builds and maintains a special class of summaries that can be efficiently constructed and updated while still allowing fine-grained analysis on the heavy tail. Mergeable summaries can represent any data set with a guaranteed tradeoff between size and accuracy, and any two such summaries can be merged to create a new summary with the same size-accuracy tradeoff.


  • NSF BIGDATA Program, 09/15/13-08/31/16, $685,380.

  • People

    Robert Christensen
    Master/PhD student. Now at Visa Research.

    Klemen Simonic
    Graduated with Master Summer 2015. First Employment: Facebook.

    Jeff M. Phillips
    Associate Professor

    Feifei Li

    Wangchao Le
    PhD student graduated in 2013. Now at Microsoft.

    Mina Ghashami
    PhD student

    Guineng Zheng
    PhD Student. Research Interest: knowledge base construction and application.

    Jason Voung
    BS/MS Student

    Yanqing Peng
    PhD Student.

    Michael Matheny
    PhD Student.


  • Quality and Efficiency for Kernel Density Estimates in Large Data, Talk
    By Yan Zheng,    Jeffrey Jestes,    Jeff M. Phillips,    Feifei Li
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), pages 433-444, June, 2013.

    Kernel density estimates are important for a broad variety of applications including media databases, pattern recognition, computer vision, data mining, and the sciences. Their con- struction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more ef- ficient than previous algorithms. Our algorithms do not re- quire knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to imple- ment our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scala- bility of our techniques.

  • Optimal Splitters for Temporal and Multi-version Databases, Talk
    By Wangchao Le,    Feifei Li,    Yufei Tao,    Robert Christensen
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), pages 109-120, June, 2013.

    Temporal and multi-version databases often generate massive amounts of data, due to the increasing availability of large storage space and the increasing importance of mining and auditing opera- tions from historical data. For example, Google now allows users to limit and rank search results by setting a time range. These data- bases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve a good load balancing algorithm for storage and processing of these data, which is done by partitioning the database. In this paper, we introduce the concept of optimal splitters for temporal and multi- version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk-resident data respectively, and show that they significantly outperform com- peting baseline methods both theoretically and empirically on large real data sets.

  • Dynamic Monitoring of Optimal Locations in Road Network Databases
    By Bin Yao,    Xiaokui Xiao,    Feifei Li,    Yifan Wu
    Vol.23(5), Pages 697-720, The International Journal on Very Large Data Bases (VLDBJ), 2014.

    Optimal location (OL) queries are a type of spatial queries that are particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an L p space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their L p distance. Motivated by the deficiency of the existing techniques, this paper presents a comprehensive study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We further extend our framework to efficiently monitor the OLs when locations for facilities and/or clients have been updated. Our dynamic update methods lead to efficient answering of continuous optimal location queries. We demonstrate the efficiency of our solutions through extensive experiments with large real data.

  • Continuous Matrix Approximation on Distributed Data (Project Website), Talk
    By Mina Ghashami,    Jeff Phillips,    Feifei Li
    In Proceedings of 40th International Conference on Very Large Data Bases (VLDB 2014), pages 809-820, Hangzhou China, 2014.

    Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of %u201Ctracking approximations to a matrix%u201D in the distributed streaming model. In this model, there are m distributed sites each observing a distinct stream of data (where each element is a row of a distributed matrix) and has a communication channel with a coordinator, and the goal is to track an %u03B5-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: ||Ax||^2 %u2212 ||Bx||^2 %u2264 %u03B5||A||_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/%u03B5) log(%u03B2N)) communication, where N is the size of stream (at the time of the query) and %u03B2 is an upperbound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.

  • Scalable Keyword Search on Large RDF Data
    By Wangchao Le,    Feifei Li,    Anastasios Kementsietsidis,    Songyun Duan
    Vol.26, Pages 2774-2788, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), 2014.

    Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely on constructing a distance matrix for pruning the search space or building summaries from the RDF graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect/incomplete results. To address these issues, we propose an effective summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant pruning powers to exploratory keyword search and result in much better efficiency compared to previous works. Unlike existing techniques, our search algorithms always return correct results. Besides, the summaries we built can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF data sets show that our techniques are scalable and efficient.

  • Spatial Online Sampling and Aggregation, Talk
    By Lu Wang,    Robert Christensen,    Feifei Li,    Ke Yi
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016), pages 84-95, New Delhi India, August, 2016.

    The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

  • Exact and Approximate Flexible Aggregate Similarity Search
    By Feifei Li,    Ke Yi,    Yufei Tao,    Bin Yao,    Yang Li,    Dong Xie,    Min Wang
    Vol.25(4), Pages 317-338, The International Journal on Very Large Data Bases (VLDBJ), 2016.

    Aggregate similarity search, also known as aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the simi- larity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggre- gation over the distances between p and any subset of phi objects in Q for some support 0 < phi le 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggre- gate nearest neighbor (Fann) query. We present algo- rithms for answering Fann queries exactly and approx- imately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experi- ments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

  • Matrix Sketching Over Sliding Windows, Talk
    By Zhewei Wei,    Xuancheng Liu,    Feifei Li,    Shuo Shang,    Xiaoyong Du,    Ji-Rong Wen
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIMGOD 2016), pages 1465-1480, June, 2016.

    Large-scale matrix computation becomes essential for many data data applications, and hence the problem of sketching matrix with small space and high precision has received extensive study for the past few years. This problem is often considered in the row-update streaming model, where the data set is a matrix $A in mathbb{R}^{n times d}$, and the processor receives a row ($1times d$) of $A$ at each timestamp. The goal is to maintain a smaller matrix (termed approximation matrix, or simply approximation) $Bin mathbb{R}^{elltimes d}$ as an approximation to $A$, such that the covariance error $|A^T A - B^TB|$ is small and $ell ll n$.

    This paper studies continuous tracking approximations to the matrix defined by a sliding window of most recent rows. We consider both sequence-based and time-based window. We show that maintaining $A^TA$ exactly requires linear space in the sliding window model, as opposed to $O(d^2)$ space in the streaming model. With this observation, we present three general frameworks for matrix sketching on sliding windows. The sampling techniques give random samples of the rows in the window according to their squared norms. The textsf{Logarithmic Method} converts a {em mergeable} streaming matrix sketch into a matrix sketch on time-based sliding windows. The textsf{Dyadic Interval} framework converts arbitrary streaming matrix sketch into a matrix sketch on sequence-based sliding windows. In addition to proving all algorithmic properties theoretically, we also conduct extensive empirical study with real datasets to demonstrate the efficiency of these algorithms.

  • Graph Analytics Through Fine-Grained Parallelism, Talk
    By Zechao Shang,    Feifei Li,    Jeffrey Xu Yu,    Zhiwei Zhang,    Hong Chen
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016), pages 463-478, June, 2016.

    Large graphs are getting increasingly popular and even indispensable in applications from a number of important application domains, for example, in social media data, large networks, and knowledge bases. Efficient execution of various analytics over large graphs thus becomes an important subject of study. To increase efficiency and scalability, in-memory computation and parallelism have been explored extensively to speed up various graph analytical workloads. In many existing graph analytical engines (e.g., Pregel, Neo4j, GraphLab), parallelism is achieved via one of the three concurrency control models, namely, bulk synchronization processing (BSP), asynchronous processing, and synchronous processing. Among them, synchronous processing has the potential to achieve the best performance due to fine-grained parallelism, while ensuring the correctness and the convergence of the computation, if an effective concurrency control scheme is used. This paper explores the topological properties of the underlying graph to design and implement a highly effective concurrency control scheme for efficient synchronous processing in an in-memory graph analytical engine. Our design uses a novel hybrid approach that combines 2PL (two-phase locking) with OCC (optimistic concurrency control), for high degree and low degree vertices in a graph respectively. Our results show that the proposed hybrid synchronous scheduler has significantly outperformed other synchronous schedulers in existing graph analytical engines, as well as BSP and asynchronous schedulers.

  • Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration
    By Jiaxin Shi,    Youyang Yao,    Rong Chen,    Haibo Chen,    Feifei Li
    In Proceedings of 12th USENIX Symposium on Operating Systems Design and Implebbmentation (OSDI 2016), pages 317-332, Savannah, GA, November, 2016.

    Many knowledge bases like Google and Facebook%u2019s knowledge/social graphs are represented and stored as RDF graphs, where users can issue structured queries on such graphs using SPARQL. With massive queries over large and constantly growing RDF data, it is im- perative that an RDF graph store should provide low latency and high throughput for concurrent query processing. However, prior systems still experience high per-query latency over large datasets and most prior designs have poor resource utilization such that each query is processed in sequence. We present Wukong, a distributed graph-based RDF store that leverages RDMA-based graph exploration to support highly concurrent and low-latency queries over large data. Following a graph-centric design, Wukong builds indexes by extending the graphs with index vertices and leverages differentiated graph partitioning to retain locality (for normal vertices) while exploiting parallelism (for index vertices). To explore the low-latency feature of RDMA, Wukong leverages a new RDMA-friendly, predicate-based key/value store to store the partitioned graphs. To provide low latency and high parallelism, Wukong decomposes a query into sub-queries, each of which may be distributed over and handled by a set of machines in parallel. For each sub-query, Wukong leverages RDMA to provide communication-aware optimizations to balance between execution and data migration. To reduce inter-query interference, Wukong leverages a worker-obliger work stealing mechanism to oblige queries in straggler workers. Evaluation on a 6-node RDMA-capable cluster shows that Wukong signif- icantly outperforms state-of-the-art systems like TriAD and Trinity.RDF for both latency and throughput, usually at the scale of orders of magnitude.

  • Efficient Frequent Directions Algorithm for Sparse Matrices
    By Mina Ghashami,    Edo Liberty,    and Jeff M. Phillips
    In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD 2016), pages ??-??, Feburary, 2016.

    This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketch- ing sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A^{nd} one by one in the streaming setting and compute a small sketch B in R^{l x d}. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space- accuracy tradeoff in the streaming setting. However, unlike Frequent Directions which runs in O(ndl) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in O~(nnz(A)*l nl^2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.

  • Coresets and Sketches
    by Jeff Phillips
    Handbook of Discrete and Computational Geometry. 3rd edition, CRC Press, Janurary, 2016.

    Geometric data summarization has become an essential tool in both geometric approximation algorithms and where geometry intersects with big data problems. In linear or near-linear time large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries whose results approximate those of the full data set. Coresets and sketches are the two most important classes of these summaries.

  • Streaming Kernel Principal Component Analysis
    By Mina Ghashami,    Daniel Perry,    and Jeff M. Phillips
    In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages ??-??, December, 2015.

    Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full nn kernel matrix is constructed over n data points, but this requires too much space and time for large values of n. Techniques such as the Nystrom method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and also improves the dependence on the error parameter. Our technique combines together random feature maps with recent advances in matrix sketching, it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.

  • Improved Practical Matrix Sketching with Guarantees
    By Mina Ghashami,    Amey Desai,    and Jeff M. Phillips
    Vol.28, Pages 1678--1690, Transactions on Knowledge and Data Engineering (IEEE TKDE), Janurary, 2015.

    Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these ap- proaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic iSVD, with no guarantees, tends to outperform all known approaches in terms of size/error trade-off. We modify the best performing method with guarantees FREQUENTDIRECTIONS under the size/error trade-off to match the performance of iSVD and retain its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. In comparing techniques in the time/error trade-off, techniques based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain error guarantee but obtain dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings.

  • Frequent Directions: Simple and Deterministic Matrix Sketching.
    By Mina Ghashami,    Edo Liberty,    Jeff M. Phillips and David P. Woodruff
    Vol.0, SIAM Journal of Computing (SICOMP), Janurary, 2015.

    We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented an arbitrary input matrix A %u2208 Rn%uFFFDd one row at a time. It performed O(dl) operations per row and maintains a sketch matrix B in R^{l%uFFFDd} such that for any k < l ||A^TA %u2212 B^TB||_2 <= ||A%u2212A_k||^2_F/(l%u2212k) and ||A%u2212pi_{B_k}(A)||^2_F <= (1 k/(l-k)) ||A%u2212A_k||^2_F. Here, A_k stands for the minimizer of ||A %u2212 A_k||_F over all rank k matrices (similarly B_k ) and pi_{B_k}(A) is the rank k matrix resulting from projecting A on the row span of B_k. We show both of these bounds are the best possible for the space allowed. The summary is mergeable, and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations of existing streaming algorithms in the space-error tradeoff.

  • Persistent Bloom Filter: Membership Testing for the Entire History, Talk
    By Yanqing Peng,    Jinwei Guo,    Feifei Li,    Weining Qian,    Aoying Zhou
    In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018), pages 1037-1052, June, 2018.

    Membership testing is the problem of testing whether an element is in a set of elements. Performing the test exactly is expensive space-wise, requiring the storage of all elements in a set. In many applications, an approximate testing that can be done quickly using small space is often desired. Bloom filter (BF) was designed and has witnessed great success across numerous application domains. But there is no compact structure that supports set membership testing for temporal queries, e.g., has person A visited a web server between 9:30am and 9:40am? And has the same person visited the web server again between 9:45am and 9:50am? It is possible to support such ``temporal membership testing'' using a BF, but we will show that this is fairly expensive. To that end, this paper designs persistent bloom filter (PBF), a novel data structure for temporal membership testing with compact space.

  • Relative Error Embeddings for the Gaussian Kernel Distance
    By Di Chen and Jeff M. Phillips
    In Proceedings of Algorithmic Learning Theory (ALT), pages ??-??, October, 2017.

    This studies a way to linearize kernel methods for machine learning — essentially sketching the non-linear information. It studies a non-linear mapping from a low-dimensional space (where there may be no linear separator or linear regression fit) to a high-dimensional space where there would be. This is most commonly done using a Gaussian kernel, and that is what we study. We show that not only is the inner product information preserved under an additive error (this has been known for a decade) but that also the distance is preserved under a relative error. We show that this is true with the same bounds as a Johnson-Liderstrauss random project, with the small exception of some lower-order data dependent factors, and that these are required. It immediately implies a few heuristics, like k-center clustering after this mapping, now have formal error guarantees.

  • Coresets for Kernel Regression (Project Website)
    By Yan Zheng and Jeff M. Phillips
    In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD), pages ??-??, August, 2017.

    This project developed new data summaries for kernel regression — these have not been formally studied before. These approaches are related to kernel density estimates, for where there are several effective summaries, but only model density of the data, not an associated weight. We produce both sample complexity results as well as deterministic approaches which adapt to the structure of the data. We demonstrate these approaches on a variety of 1-dimensional (e.g., time series), 2-dimensional (e.g., spatial) and high-dimensional data sets. It can summarize sets of enormous size down to a few megabytes with minimal loss of accuracy.