III: Small: Persistent Data Summaries: Temporal Analytics on Big Data Histories

Project and Award Information: NSF III: Small: Persistent Data Summaries: Temporal Analytics on Big Data Histories, NSF IIS 1816149, Official NSF Link, PI, 08/01/18- 08/31/21, $499,934. PI: Feifei Li, co-PI: Jeff Phillips.

Abstract: An increasing number of applications require the storage of and access to all historical data to support rich analytics, learning, and mining operations. This project develops a series of methods to summarize data so that it can be queried with respect to not just the full data set, as is standard, but with respect to the state of the data set at any historical time. These summaries integrate with large temporal databases, in both offline batched-processing and online streaming application scenarios. The effectiveness of these methods will be demonstrated on an enormous scientific database of atmospheric data collected for 20 years from over 40,000 weather stations. We will work with industry collaborators to help deploy our new algorithms, and the results will be integrated into education and outreach efforts surrounding the growth of data science initiatives.

More specifically, this project extends and combines approximate query processing with temporal big data. In particular, instead of (or on top of) using a multi-version database, this project designs and implements persistent data summaries (PDSs) that offer interactive temporal analytics with strong theoretical guarantees on their approximation quality. In additional to formalizing these models, this project develops practical PDS implementations for sampling-based summaries, data sketches, and core sets that support advanced analytical queries.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This material is based upon work supported by the National Science Foundation under Grant No. 1816149


  • NSF IIS 1816149, Official NSF Link

  • People

    Feifei Li

    Yanqing Peng
    PhD Student.

    Jeff M. Phillips
    Associate Professor

    Debjyoti Paul
    PhD Student. Research Interest: learning driven big data analytics.

    Benwei Shi
    PhD Student



  • Persistent Bloom Filter: Membership Testing for the Entire History, Talk
    By Yanqing Peng,    Jinwei Guo,    Feifei Li,    Weining Qian,    Aoying Zhou
    In Proceedings of 37th ACM SIGMOD International Conference on Management of Data (SIGMOD 2018),  pages 1037-1052,  June,  2018.

    Membership testing is the problem of testing whether an element is in a set of elements. Performing the test exactly is expensive space-wise, requiring the storage of all elements in a set. In many applications, an approximate testing that can be done quickly using small space is often desired. Bloom filter (BF) was designed and has witnessed great success across numerous application domains. But there is no compact structure that supports set membership testing for temporal queries, e.g., has person A visited a web server between 9:30am and 9:40am? And has the same person visited the web server again between 9:45am and 9:50am? It is possible to support such ``temporal membership testing'' using a BF, but we will show that this is fairly expensive. To that end, this paper designs persistent bloom filter (PBF), a novel data structure for temporal membership testing with compact space.

  • Bursty Event Detection Throughout Histories
    By Deb Paul,    Yanqing Peng,    Feifei Li
    In Proceedings of 34th IEEE International Conference on Data Engineering (ICDE 2019),  pages 1370-1381,  April,  2019.

    The widespread use of social media and the active trend of moving towards more web- and mobile-based reporting for traditional media outlets have created an avalanche of information streams. These information streams bring in first-hand reporting on live events to massive crowds in real time as they are happening. It is important to study the phenomenon of burst in this context so that end-users can quickly identify important events that are emerging and developing in their early stages. In this paper, we investigate the problem of bursty event detection where we define burst as the acceleration over the incoming rate of an event mentioning. Existing works focus on the detection of current trending events, but it is important to be able to go back in time and explore bursty events throughout the history, while without the needs of storing and traversing the entire information stream from the past. We present a succinct probabilistic data structure and its associated query strategy to find bursty events at any time instance for the entire history. Extensive empirical results on real event streams have demonstrated the effectiveness of our approach.