CIF21 DIBBs: STORM: Spatio-Temporal Online Reasoning and Management of Large Data

A fundamental challenge for many research projects is the ability to handle large quantities of heterogeneous data. Data collected from different sources and time periods can be inconsistent, or stored in different formats and data management systems. Thus, a critical step in many projects is to develop a customized query and analytical engine to translate inputs. But for each new dataset, or for each new query type or analytic task for an existing dataset, a new query interface or program must be developed, requiring significant investments of time and effort. This project will develop an automatic engine for searching large, heterogeneous data collections for weather and meteorology, particularly from instruments in the western US, in a regional network called MesoWest.This project develops an automatic query and analytical engine for large, heterogeneous spatial and temporal data. This capability allows users to automatically deploy a query and analytical engine instance over their large, heterogeneous data with spatial and temporal dimensions. The system supports a simple search-box and map-like query interface that allows numerous powerful analytical queries. Techniques to make these queries robust, relevant, and highly scalable will be developed. The project also enables users to execute queries over multiple data sources simultaneously and seamlessly. The goal of the work is to dramatically simplify the management and analysis of large spatio-temporal data at different institutions, groups, and corporations.


  • NSF DIBBs Program, 11/01/14-10/31/17, $1,157,975

  • People

    Jun Tang
    Master Student. Google.

    Rui Dai
    Master Student. Now at Airbnb

    Robert Christensen
    Master/PhD student. Now at Visa Research.

    Jason Voung
    BS/MS Student

    Hoa Nguyen
    PhD Student. Google.

    Feifei Li

    Jeff M. Phillips
    Associate Professor

    Alex Clemmer
    Undergraduate student, Graduated in Spring 2013.

    Michael Matheny
    PhD Student.




  • STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data (Project Website), Talk
    By Robert Christensen ,    Lu Wang,    Feifei Li,    Ke Yi,    Jun Tang,    Natalee Villa
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015), pages 1111-1116, Melbourne, Australia, 2015.

    We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data. STORM supports interactive spatio-temporal analytics through novel spatial online sampling techniques. Online spatio-temporal aggregation and analytics are then derived based on the online samples, where approximate answers with approximation quality guarantees can be provided immediately from the start of query execution. The quality of these online approximations improve over time. This demonstration proposal describes key ideas in the design of the STORM system, and presents the demonstration plan.

  • Spatial Online Sampling and Aggregation, Talk
    By Lu Wang,    Robert Christensen,    Feifei Li,    Ke Yi
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016), pages 84-95, New Delhi India, August, 2016.

    The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

  • Exact and Approximate Flexible Aggregate Similarity Search
    By Feifei Li,    Ke Yi,    Yufei Tao,    Bin Yao,    Yang Li,    Dong Xie,    Min Wang
    Vol.25(4), Pages 317-338, The International Journal on Very Large Data Bases (VLDBJ), 2016.

    Aggregate similarity search, also known as aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the simi- larity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggre- gation over the distances between p and any subset of phi objects in Q for some support 0 < phi le 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggre- gate nearest neighbor (Fann) query. We present algo- rithms for answering Fann queries exactly and approx- imately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experi- ments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

  • Simba: Efficient In-Memory Spatial Analytics (Project Website), Talk
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Minyi Guo
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016), pages 1071-1085, June, 2016.

    Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces the concept and construction of indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba%u2019s superior performance compared against other spatial analytics system.

  • Scalable Spatial Scan Statistics through Sampling (Project Website), Talk
    By Michael Matheny,    Raghvendra Singh,    Kaiqiang Wang,    Liang Zhang and Jeff M. Phillips
    In Proceedings of ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pages ??-??, November, 2016.

    Finding anomalous regions within spatial data sets is a central task for biosurveillance, homeland security, policy making, and many other important areas. These communities have mainly settled on spatial scan statistics as a rigorous way to discover regions where a measured quantity (e.g., crime) is statistically significant in its difference from a baseline population. However, most common approaches are ineffi- cient and thus, can only be run with very modest data sizes (a few thousand data points) or make assumptions on the geographic distributions of the data. We address these challenges by designing, exploring, and analyzing sample-then-scan algorithms. These algorithms randomly sample data at two scales, one to define regions and the other to approximate the counts in these regions. Our experiments demonstrate that these algorithms are efficient and accurate independent of the size of the original data set, and our analysis explains why this is the case. For the first time, these sample-then-scan algorithms allow spatial scan statistics to run on a million or more data points without making assumptions on the spatial distribution of the data. Moreover, our experiments and analysis give insight into when it is appropriate to trust the various types of spatial anomalies when the data is modeled as a random sample from a larger but unknown data set.

  • Distributed Trajectory Similarity Search
    By Dong Xie,    Feifei Li,    Jeff Phillips
    In Proceedings of 43rd International Conference on Very Large Data Bases (VLDB 2017), pages 1478-1489, September, 2017.

    Mobile and sensing devices have already become ubiquitous. They have made tracking moving objects an easy task. As a result, mobile applications like Uber and many IoT projects have generated massive amounts of trajectory data that can no longer be processed by a single machine efficiently. Among the typical query operations over trajectories, similarity search is a common yet expensive operator in querying trajectory data. It is useful for applications in different domains such as traffic and transportation optimizations, weather forecast and modeling, and sports analytics. It is also a fundamental operator for many important mining operations such as clustering and classification of trajectories. In this paper, we propose a distributed query framework to process trajectory similarity search over a large set of trajectories. We have implemented the proposed framework in Spark, a popular distributed data processing engine, by carefully considering different design choices. Our query framework supports both the Hausdorff distance the Fréchet distance. Extensive experiments have demonstrated the excellent scalability and query efficiency achieved by our design, compared to other methods and design alternatives.

  • Compass: Spatio Temporal Sentiment Analysis of US Election
    By Debjyoti Paul,    Feifei Li,    Murali Krishna Teja,    Xin Yu,    Richie Frost
    In Proceedings of 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017), pages 1585-1594, August, 2017.

    With the widespread growth of various social network tools and platforms, analyzing and understanding societal response and crowd reaction to important and emerging social issues and events through social media data is increasingly an important problem. However, there are numerous challenges towards realizing this goal effectively and efficiently, due to the unstructured and noisy nature of social media data. The large volume of the underlying data also presents a fundamental challenge. Furthermore, in many application scenarios, it is often interesting, and in some cases critical, to discover patterns and trends based on geographical and/or temporal partitions, and keep track of how they will change overtime. This brings up the interesting problem of spatio-temporal sentiment analysis from large-scale social media data. This paper investigates this problem through a data science project called US Election 2016, What Twitter Says. The objective is to discover sentiment on Twitter towards either the democratic or the republican party at US county and state levels over any arbitrary temporal intervals, using a large collection of geotagged tweets from a period of 6 months leading up to the US Presidential Election in 2016. Our results demonstrate that by integrating and developing a combination of machine learning and data management techniques, it is possible to do this at scale with effective outcomes. The results of our project have the potential to be adapted towards solving and influencing other interesting social issues such as building neighborhood happiness and health indicators.

  • Geotagged US Tweets As Predictors Of County-Level Health Outcomes (Project Website)
    By Quynh C. Nguyen. Hsien-wen Meng,    Geoffrey Loomis,    Matt McCullough,    Dapeng Li,   . Debjyoti Paul,    Suraj Kath,    Feifei Li,    Elaine O. Nsoesie,    Ken R. Smith
    Vol.0, To Appear American Journal of Public Health (AJPH), 2017.

    To leverage geotagged Twitter data to create national indicators of the social environment, with small-area indicators of prevalent sentiment and social modeling of health behaviors, and to test associations with county-level health outcomes, while controlling for demographic characteristics. We used Twitter's streaming application programming interface to continuously collect a random 1% subset of publicly available geo-located tweets in the contiguous United States. We collected approximately 80 million geotagged tweets from 603 363 unique Twitter users in a 12-month period (April 2015-March 2016). Across 3135 US counties, Twitter indicators of happiness, food, and physical activity were associated with lower premature mortality, obesity, and physical inactivity. Alcohol-use tweets predicted higher alcohol-use-related mortality. Social media represents a new type of real-time data that may enable public health officials to examine movement of norms, sentiment, and behaviors that may portend emerging issues or outbreaks-thus providing a way to intervene to prevent adverse health events and measure the impact of health interventions.

  • Twitter-derived neighborhood characteristics associated with obesity and diabetes (Project Website)
    By Quynh C. Nguyen,    Kimberly D. Brunisholz,    Weijun Yu,    Matt McCullough,    Heidi A. Hanson,    Michelle L. Litchman,    Feifei Li,    Yuan Wan,    James A. VanDerslice,    Ming Wen,    Ken R. Smith
    Vol.7, Nature Scientific Reports (Nature Scientific Reports), December, 2017.

    Neighborhood characteristics are increasingly connected with health outcomes. Social processes affect health through the maintenance of social norms, stimulation of new interests, and dispersal of knowledge. We created zip code level indicators of happiness, food, and physical activity culture from geolocated Twitter data to examine the relationship between these neighborhood characteristics and obesity and diabetes diagnoses (Type 1 and Type 2). We collected 422,094 tweets sent from Utah between April 2015 and March 2016. We leveraged administrative and clinical records on 1.86 million individuals aged 20 years and older in Utah in 2015. Individuals living in zip codes with the greatest percentage of happy and physically-active tweets had lower obesity prevalence%u2014accounting for individual age, sex, nonwhite race, Hispanic ethnicity, education, and marital status, as well as zip code population characteristics. More happy tweets and lower caloric density of food tweets in a zip code were associated with lower individual prevalence of diabetes. Results were robust in sibling random effects models that account for family background characteristics shared between siblings. Findings suggest the possible influence of sociocultural factors on individual health. The study demonstrates the utility and cost-effectiveness of utilizing existing big data sources to conduct population health studies.