CIF21 DIBBs: STORM: Spatio-Temporal Online Reasoning and Management of Large Data


A fundamental challenge for many research projects is the ability to handle large quantities of heterogeneous data. Data collected from different sources and time periods can be inconsistent, or stored in different formats and data management systems. Thus, a critical step in many projects is to develop a customized query and analytical engine to translate inputs. But for each new dataset, or for each new query type or analytic task for an existing dataset, a new query interface or program must be developed, requiring significant investments of time and effort. This project will develop an automatic engine for searching large, heterogeneous data collections for weather and meteorology, particularly from instruments in the western US, in a regional network called MesoWest.This project develops an automatic query and analytical engine for large, heterogeneous spatial and temporal data. This capability allows users to automatically deploy a query and analytical engine instance over their large, heterogeneous data with spatial and temporal dimensions. The system supports a simple search-box and map-like query interface that allows numerous powerful analytical queries. Techniques to make these queries robust, relevant, and highly scalable will be developed. The project also enables users to execute queries over multiple data sources simultaneously and seamlessly. The goal of the work is to dramatically simplify the management and analysis of large spatio-temporal data at different institutions, groups, and corporations.


Funding


  • NSF DIBBs Program, 11/01/14-10/31/17, $1,157,975
  • http://www.nsf.gov/awardsearch/showAward?AWD_ID=1443046

  • People


    Jun Tang
    Master Student. Google.


    Rui Dai
    Master Student


    Robert Christensen
    PhD student. Research Interest: interactive data analytics and systems.


    Jason Voung
    BS/MS Student


    Hoa Nguyen
    PhD Student


    Feifei Li
    Associate Professor


    Jeff M. Phillips
    Assistant Professor


    Alex Clemmer
    Undergraduate student, Graduated in Spring 2013.


    Michael Matheny
    PhD Student.



    Source



    Dataset



    Publications


  • STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data (Project Website), Talk
    By Robert Christensen ,    Lu Wang,    Feifei Li,    Ke Yi,    Jun Tang,    Natalee Villa
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015),  pages 1111-1116,  Melbourne, Australia,  2015.
    Abstract

    We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data. STORM supports interactive spatio-temporal analytics through novel spatial online sampling techniques. Online spatio-temporal aggregation and analytics are then derived based on the online samples, where approximate answers with approximation quality guarantees can be provided immediately from the start of query execution. The quality of these online approximations improve over time. This demonstration proposal describes key ideas in the design of the STORM system, and presents the demonstration plan.

  • Spatial Online Sampling and Aggregation, Talk
    By Lu Wang,    Robert Christensen,    Feifei Li,    Ke Yi
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016),  pages 84-95,  New Delhi India,  August,  2016.
    Abstract

    The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

  • Exact and Approximate Flexible Aggregate Similarity Search
    By Feifei Li,    Ke Yi,    Yufei Tao,    Bin Yao,    Yang Li,    Dong Xie,    Min Wang
    Vol.25(4), Pages 317-338, The International Journal on Very Large Data Bases (VLDBJ),  2016.
    Abstract

    Aggregate similarity search, also known as aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the simi- larity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggre- gation over the distances between p and any subset of phi objects in Q for some support 0 < phi le 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggre- gate nearest neighbor (Fann) query. We present algo- rithms for answering Fann queries exactly and approx- imately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experi- ments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

  • Simba: Efficient In-Memory Spatial Analytics (Project Website), Talk
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Minyi Guo
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 1071-1085,  June,  2016.
    Abstract

    Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces the concept and construction of indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba%u2019s superior performance compared against other spatial analytics system.

  • Scalable Spatial Scan Statistics through Sampling (Project Website), Talk
    By Michael Matheny,    Raghvendra Singh,    Kaiqiang Wang,    Liang Zhang and Jeff M. Phillips
    In Proceedings of ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL),  pages ??-??,  November,  2016.
    Abstract

    Finding anomalous regions within spatial data sets is a central task for biosurveillance, homeland security, policy making, and many other important areas. These communities have mainly settled on spatial scan statistics as a rigorous way to discover regions where a measured quantity (e.g., crime) is statistically significant in its difference from a baseline population. However, most common approaches are ineffi- cient and thus, can only be run with very modest data sizes (a few thousand data points) or make assumptions on the geographic distributions of the data. We address these challenges by designing, exploring, and analyzing sample-then-scan algorithms. These algorithms randomly sample data at two scales, one to define regions and the other to approximate the counts in these regions. Our experiments demonstrate that these algorithms are efficient and accurate independent of the size of the original data set, and our analysis explains why this is the case. For the first time, these sample-then-scan algorithms allow spatial scan statistics to run on a million or more data points without making assumptions on the spatial distribution of the data. Moreover, our experiments and analysis give insight into when it is appropriate to trust the various types of spatial anomalies when the data is modeled as a random sample from a larger but unknown data set.