About This Group

How do Google web servers handle trillions of clicks from all around the world? How do GPS devices give drivers routes based on the shortest distance or shortest time? How does facebook suggest friends based on your profile? The answers to these questions involve data management to a great extent.

Our group focuses on research in query processing on both streaming and relational database systems, management and indexing of large databases, spatio-temporal database applications, security issues in the data management field, query processing in probabilistic data, and distributed DBMS. We strive for innovative algorithm design in the context of the above research areas for extreme scale data.

Recent Publications

  • (Approximate) Uncertain Skylines
    By Peyman Afshani, Pankaj K. Agarwal, Lars Arge, Kasper Green Larsen and Jeff M. Phillips
    Vol.0, To Appear THEORY OF COMPUTING SYSTEMS (TOCS), Jan, 2013.

    Abstract

    Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point’s uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.

  • Spatial Approximate String Search
    By Feifei Li, Bin Yao, Mingwang Tang, Marios Hadjieleftheriou
    Vol.0, To Appear IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), December, 2012.

    Abstract

    This work deals with the approximate string search in large spatial databases. Speci%uFB01cally, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to %uFB01nd balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which signi%uFB01cantly outperforms the baseline algorithm in practice. The RSASSOL combines the q-gram based inverted lists and the reference nodes based pruning. Extensive experiments on large real data sets demonstrate the ef%uFB01ciency and effectiveness of our approaches.

  • Building Wavelet Histograms on Large Data in MapReduce
    By Jeffrey Jestes, Ke Yi, Feifei Li
    (To Appear) In Proceedings of 38th International Conference on Very Large Databases (VLDB 2012), pages ??-??, Istanbul, Turkey, August, 2012.

    Abstract

    MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compactaccurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often several orders of magnitude) performance improvement achieved by our new algorithms.

  • Ranking Large Temporal Data
    By J. Jestes, J. Phillips, F. Li, M. Tang
    In Proceedings of 38th International Conference on Very Large Databases (VLDB 2012), pages TBA, Istanbul, Turkey, August, 2012.

    Abstract

    Ranking temporal data has not been studied until recently [14], even though ranking is an important operator (being promoted as a first-class citizen) in database systems [8]. However, only the instant top-k queries on temporal data were studied in [14], where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from02/05/2011 to 02/07/2011. This work presents a comprehensive study to this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis on the construction cost, the index size, the update and the query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.

  • Towards Fair Sharing of Block Storage in a Multi-tenant Cloud
    By X. Lin, Y. Mao, F. Li, R. Ricci
    (To Appear) In Proceedings of 4th USENIX Workshop on Hot Topics in Cloud Computing (USENIX HotCloud 2012), pages ??-??, June, 2012.

    Abstract

    A common problem with disk-based cloud storage services is that performance can vary greatly and become highly unpredictable in a multi-tenant environment. A fundamental reason is the interference between workloads co-located on the same physical disk. We observe that different IO patterns interfere with each other significantly, which makes the performance of different types of workloads unpredictable when they are executed concurrently. Unpredictability implies that users may not get a fair share of the system resources from the cloud services they are using. At the same time, replication is commonly used in cloud storage for high reliability. Connecting these two facts, we propose a cloud storage system designed to minimize workload interference without increasing storage costs or sacrificing the overall system throughput. Our design leverages log-structured disk layout, chain replication and a workload-based replica selection strategy to minimize interference, striking a balance between performance and fairness. Our initial results suggest that this approach is a promising way to improve the performance and predictability of cloud storage.

View All

Recent and Upcoming Seminars

  • April 19, 2012   No Free Lunch in Data Privacy by Ashwin Machanavajjhala at Flux Conference Room

    [PDF][Slides][ Abstract]

    Tremendous amounts of personal data about individuals isbeing collected and shared online. Legal requirements and an increasein public awareness due to egregious breaches of individual privacyhave made data privacy an important field of research. Recentresearch, culminating in the development of a powerful notion calleddifferential privacy, have transformed this field from a black artinto a rigorous mathematical discipline.This talk critically analyzes trade-off between accuracy and privacyin the context of social advertising – recommending people, productsor services to users based on their social neighborhood. I willpresent a theoretical upper bound on the accuracy of performingrecommendations that are solely based on a user's social network, fora given level of (differential) privacy of sensitive links in thesocial graph. I will show using real networks that good private socialrecommendations are feasible only for a small subset of the users inthe social network or for a lenient setting of privacy parameters.I will also describe some exciting future research directions inprivacy, security and big data management in building privacy awarecloud-based systems where individuals reclaim ownership of theirpersonal information.

    Bio : Ashwin Machanavajjhala is a Senior Research Scientist in theKnowledge Management group at Yahoo! Research. His primary researchinterests lie in data privacy with a specific focus on formallyreasoning about privacy under probabilistic adversary models. He isalso interested in big-data management and statistical methods forinformation integration. Ashwin graduated with a Ph.D. from theDepartment of Computer Science, Cornell University. His thesis work ondefining and enforcing privacy was awarded the 2008 ACM SIGMOD JimGray Dissertation Award Honorable Mention. He has also received anM.S. from Cornell University and a B.Tech in Computer Science andEngineering from the Indian Institute of Technology, Madras.

  • April 12, 2012   Clustering in MapReduce by Samira Daruki and Chenxu Ding at Flux Conference Room

    [PDF][Slides][ Abstract]

    we will have a joint presentation by Samira Daruki and Chenxu Ding on two papers that are about clustering in MapReduce. The plan is to have a short overview of the MapReduce model, then the algorithms in both papers will be presented. Finally we will plan to leave some time for discussion. My impression is that both of the papers make nice contributions, but in no-way do they close the problem of large scale clustering. I think it will be great to have a group brain storm about what is known, and what is not. To best serve this, it will be best if you are able to at least peruse the papers, and bring them to the meeting so we can have an in depth discussion.

  • March 22, 2012   Efficient Parallel kNN Joins for Large Data in MapReduce by Jeffrey Jestes at Flux seminar room

    [PDF][Slides][ Abstract]

    In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce [6]. Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.

  • March 22, 2012   Scalable Multi-Query Optimization for SPARQL by Wangchao Le at

    [PDF][Slides][ Abstract]

    This paper revisits the classical problem of multiquery optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to be extended to account for RDF data model and graph query patterns expressed in SPARQL. In light of the NP-hardness of the multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.

  • March 22, 2012   Efficient Parallel kNN Joins for Large Data in MapReduce by Jeffrey Jestes at Flux Seminar Room

    [PDF][Slides][ Abstract]

    In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce [6]. Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.

View All