Jeffrey Jestes
PhD Student, Graduated in summer 2013. Current employment: Cerner Corporation

Book Chapter  Journal  Conference  Workshop  Tech Report]



  • Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads
    By Seth H. Pugsley,    Jeffrey Jestes,    Rajeev Balasubramonian,    Vijayalakshmi Srinivasan,    Alper Buyuktosunoglu,    Al Davis,    Feifei Li
    Vol.34, Pages 44-52, IEEE Micro Special Issue on Big Data (IEEE MICRO),  2014.


  • 2010

  • Semantics of Ranking Queries for Probabilistic Data (Project Website)
    By Jeffrey Jestes,    Graham Cormode,    Feifei Li,    Ke Yi
    Vol.250, Pages 545-556, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2010.

    Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define several fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are satisfied by ranking queries on certain data. We argue these properties should also be carefully studied in defining ranking queries in probabilistic data, and fulfilled by definition for ranking uncertain data for most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We studied the ranking definitions based on the expectation, the median and other statistics of this rank distribution for a tuple and derived the expected rank, median rank and quantile rank correspondingly. We are able to prove that the expected rank, median rank and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.

  • Conference


  • NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads
    By Seth Pugsley,    Jeffrey Jestes,    Huihui Zhang,    Rajeev Balasubramonian,    Vijayalakshmi Srinivasan,    Alper Buyuktosunoglu,    Al Davis,    Feifei Li
    In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2014),  pages 190-200,  2014.

    While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron’s Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to 15X reduction in execution time and 18X reduction in system energy.

  • 2013

  • Quality and Efficiency for Kernel Density Estimates in Large Data, Talk
    By Yan Zheng,    Jeffrey Jestes,    Jeff M. Phillips,    Feifei Li
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013),  pages 433-444,  June,  2013.

    Kernel density estimates are important for a broad variety of applications including media databases, pattern recognition, computer vision, data mining, and the sciences. Their con- struction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more ef- ficient than previous algorithms. Our algorithms do not re- quire knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to imple- ment our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scala- bility of our techniques.

  • 2012

  • Building Wavelet Histograms on Large Data in MapReduce (Project Website), Talk
    By Jeffrey Jestes,    Ke Yi,    Feifei Li
    In Proceedings of 38th International Conference on Very Large Databases (VLDB 2012),  pages 109-120,  Istanbul, Turkey,  August,  2012.

    MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compactaccurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often several orders of magnitude) performance improvement achieved by our new algorithms.

  • Efficient Threshold Monitoring for Distributed Probabilistic Data, Talk
    By Mingwang Tang,    Feifei Li,    Jeff M. Phillips,    Jeffrey Jestes
    In Proceedings of 28th IEEE International Conference on Data Engineering (ICDE 2012),  pages 1120-1131,  Washington DC,  April,  2012.

    In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user specified constraint is violated. A particular useful instance is the threshold based constraint, which is commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data that emerge in a lot of applications, where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units and many more. When dealing with probabilistic data, there are two thresholds involved, the score and the probability thresholds. One must monitor both simultaneously, as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study to this problem. Our algorithms have significantly outperformed the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several, real large datasets.

  • Efficient Parallel kNN Joins for Large Data in MapReduce (Project Website), Talk
    By Chi Zhang,    Feifei Li,    Jeffrey Jestes
    In Proceedings of 15th International Conference on Extending Database Technology (EDBT 2012),  pages 38-49,  March,  2012.

    In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce. Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.

  • 2010

  • Probabilistic String Similarity Joins (Project Website), Talk
    By Jeffrey Jestes,    Feifei Li,    Zhepeng Yan,    Ke Yi
    In Proceedings of 29th ACM SIGMOD International Conference on Management of Data (SIGMOD 2010),  pages 327-338,  Indianapolis, Indiana,  June,  2010.

    Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted in processing (deterministic) string joins and managing probabilistic data respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.

  • 2009

  • Ranking Distributed Probabilistic Data (Project Website), Talk
    By Feifei Li,    Ke Yi,    Jeffrey Jestes
    In Proceedings of 28th ACM SIGMOD International Conference on Management of Data (SIGMOD 2009),  pages 361-374,  Providence, USA,  June,  2009.

    Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.