Feifei Li
Associate Professor


[ Book Chapter | Journal | Conference | Workshop | Tech Report ]

Book Chapter

2010

  • Privacy Preservation on Time Series
    by Spiros Papadimitriou,    Feifei Li,    George Kollios,    Philip S. Yu
    Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques, ISBN: 978-1439803653,  December,  2010.
    Abstract
2008

  • Trip Planning Queries in Road Network Databases
    by Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Dihan Cheng,    Shang-Hua Teng
    Encyclopedia of GIS, ISBN: 978-0-387-30858-6 (PRINT), 978-0-387-35973-1 (ONLINE), SPRINGER,   February,  2008.
    Abstract
2007

  • Authenticated Index Structures for Outsourced Databases
    by Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Leonid Reyzin
    Handbook of Database Security, ISBN: 978-0-387-48532-4, SPRINGER,  November,  2007.
    Abstract
Journal

    2017

  • ATOM: Efficient Tracking, Monitoring, and Orchestration of Cloud Resources (Project Website)
    By Min Du,    Feifei Li
    To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS),  2017.
    Abstract

    The emergence of the Infrastructure as a Service framework brings new opportunities, along with new challenges in auto scaling, resource allocation, and security. A fundamental challenge underpinning these problems is the continuous tracking and monitoring of resource usage in the system. In this paper, we present ATOM, an efficient and effective framework to automatically track, monitor, and orchestrate resource usage in an Infrastructure as a Service (IaaS) system that is widely used in cloud infrastructure. We use a novel tracking method to continuously track important system usage metrics with low overhead, and develop a Principal Component Analysis (PCA) based approach to continuously monitor and automatically find anomalies based on the approximated tracking results. We show how to dynamically set the tracking threshold based on the detection results, and further, how to adjust the tracking algorithm to ensure its optimality under dynamic workloads. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on VMs, guided by the analyzed results from tracking and monitoring, to identify malicious behavior inside a VM. We demonstrate the extensibility of ATOM through virtual machine (VM) clustering. The performance of our framework is evaluated in an open source IaaS system.
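    The PCA-based monitoring step lends itself to a compact illustration. The sketch below is our own minimal reading of that idea, not ATOM's implementation: it fits a principal subspace on historical metric vectors and scores new vectors by their residual energy outside that subspace, so points that break the learned correlation structure stand out. All names and the toy data are invented for the example.

```python
import numpy as np

def pca_anomaly_scores(history, current, k=3):
    """Score metric vectors by residual energy outside the top-k
    principal subspace fitted on historical measurements."""
    mean = history.mean(axis=0)
    _, _, Vt = np.linalg.svd(history - mean, full_matrices=False)
    top = Vt[:k]                             # top-k principal directions
    centered = current - mean
    inside = centered @ top.T @ top          # component within the subspace
    residual = centered - inside             # component the model cannot explain
    return np.linalg.norm(residual, axis=1)  # large score => likely anomaly

# Toy usage: three correlated metrics (think CPU, IO, network), plus one
# probe vector that violates the learned correlation pattern.
rng = np.random.default_rng(0)
history = rng.normal(size=(500, 1)) @ np.array([[1.0, 0.5, 0.2]])
history += 0.01 * rng.normal(size=(500, 3))
probes = np.vstack([history[:3], [[5.0, -4.0, 3.0]]])
print(pca_anomaly_scores(history, probes, k=1))  # the last score dominates
```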

2016

  • Leveraging geotagged Twitter data to examine neighborhood happiness, diet, and physical activity (Project Website)
    By Quynh C. Nguyen,    Suraj Kath,    Hsien-Wen Meng,    Dapeng Li,    Ken R. Smith,    James A. VanDerslice,    Ming Wen,    Feifei Li
    Vol.73, Pages 77-88, Applied Geography (APPLIED GEOGRAPHY),  May,  2016.
    Abstract

    Using publicly available, geotagged Twitter data, we created neighborhood indicators for happiness, food and physical activity for three large counties: Salt Lake, San Francisco and New York. Methods: We utilize 2.8 million tweets collected between February and August 2015 in our analysis. Geocoordinates of where tweets were sent allow us to spatially join them to 2010 census tract locations. We implemented quality control checks and tested associations between Twitter-derived variables and sociodemographic characteristics.

  • Exact and Approximate Flexible Aggregate Similarity Search
    By Feifei Li,    Ke Yi,    Yufei Tao,    Bin Yao,    Yang Li,    Dong Xie,    Min Wang
    Vol.25(4), Pages 317-338, The International Journal on Very Large Data Bases (VLDBJ),  2016.
    Abstract

    Aggregate similarity search, also known as aggregate nearest neighbor (ANN) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the similarity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of φM objects in Q for some support 0 < φ ≤ 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggregate nearest neighbor (FANN) query. We present algorithms for answering FANN queries exactly and approximately. Our approximation algorithms are especially appealing, being simple, highly efficient, and working well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimension. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.
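    To make the definition concrete: for the sum and max aggregates, the optimal subset of ⌈φM⌉ objects for a candidate point p is simply the ⌈φM⌉ objects of Q nearest to p, which yields the brute-force baseline sketched below (a naive reference point of our own, nothing like the paper's optimized algorithms; all names are invented).

```python
import math
import numpy as np

def fann(P, Q, phi=0.5, agg="sum"):
    """Brute-force flexible aggregate nearest neighbor: for each p in P,
    aggregate the ceil(phi * |Q|) smallest distances from p to Q and
    return the point minimizing that cost."""
    k = math.ceil(phi * len(Q))
    best, best_cost = None, float("inf")
    for p in P:
        d = np.sort(np.linalg.norm(Q - p, axis=1))[:k]  # k closest query objects
        cost = d.sum() if agg == "sum" else d[-1]       # sum or max over the k
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

rng = np.random.default_rng(1)
P, Q = rng.random((1000, 2)), rng.random((8, 2))
print(fann(P, Q, phi=0.5, agg="max"))
```

    With φ = 1 this degenerates to the classic ANN query over the whole group.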

  • Building a National Neighborhood Dataset From Geotagged Twitter Data for Indicators of Happiness, Diet, and Physical Activity (Project Website)
    By Quynh C Nguyen,    Dapeng Li,    Hsien-Wen Meng,    Suraj Kath,    Elaine Nsoesie,    Feifei Li,    Ming Wen
    Vol.2, Pages e158 1-16, JMIR Public Health & Surveillance (JMIR),  2016.
    Abstract

    Studies suggest that where people live, play, and work can influence health and well-being. However, the dearth of neighborhood data, especially data that is timely and consistent across geographies, hinders understanding of the effects of neighborhoods on health. Social media data represents a possible new data resource for neighborhood research. The aim of this study was to build, from geotagged Twitter data, a national neighborhood database with area-level indicators of well-being and health behaviors.

2014

  • Dynamic Monitoring of Optimal Locations in Road Network Databases
    By Bin Yao,    Xiaokui Xiao,    Feifei Li,    Yifan Wu
    Vol.23(5), Pages 697-720, The International Journal on Very Large Data Bases (VLDBJ),  2014.
    Abstract

    Optimal location (OL) queries are a type of spatial queries that are particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an L_p space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their L_p distance. Motivated by the deficiency of the existing techniques, this paper presents a comprehensive study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We further extend our framework to efficiently monitor the OLs when locations for facilities and/or clients have been updated. Our dynamic update methods lead to efficient answering of continuous optimal location queries. We demonstrate the efficiency of our solutions through extensive experiments with large real data.

  • Scalable Keyword Search on Large RDF Data
    By Wangchao Le,    Feifei Li,    Anastasios Kementsietsidis,    Songyun Duan
    Vol.26, Pages 2774-2788, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2014.
    Abstract

    Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely on constructing a distance matrix for pruning the search space or building summaries from the RDF graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect/incomplete results. To address these issues, we propose an effective summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant pruning powers to exploratory keyword search and result in much better efficiency compared to previous works. Unlike existing techniques, our search algorithms always return correct results. Besides, the summaries we built can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF data sets show that our techniques are scalable and efficient.

  • Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads
    By Seth H. Pugsley,    Jeffrey Jestes,    Rajeev Balasubramonian,    Vijayalakshmi Srinivasan,    Alper Buyuktosunoglu,    Al Davis,    Feifei Li
    Vol.34, Pages 44-52, IEEE Micro Special Issue on Big Data (IEEE MICRO),  2014.
    Abstract

    Moving computation near memory has become more practical because of 3D stacking technology. This article discusses in-memory MapReduce in the context of near-data computing (NDC). The authors consider two NDC architectures: one that exploits Hybrid Memory Cube devices and one that does not. They examine the benefits of different NDC approaches and quantify the potential for improvement for an important emerging big-data workload.

  • Scalable data summarization on big data. (Project Website)
    By Feifei Li,    Suman Nath
    Vol.32, Pages 313-314, Distributed and Parallel Databases (DPD 2014),  2014.
    Abstract
2013

  • Spatial Approximate String Search
    By Feifei Li,    Bin Yao,    Mingwang Tang,    Marios Hadjieleftheriou
    Vol.25(6), Pages 1394-1409, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  May,  2013.
    Abstract

    This work deals with the approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram based inverted lists and reference-node based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
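    The pruning signal behind the MHR-tree can be sketched in isolation: min-wise signatures over q-gram sets let the resemblance of two strings be estimated from the fraction of agreeing minima. The snippet below is a generic minhash illustration with invented names, not the paper's code; note that Python's built-in hash() is randomized per process but stable within a run, which suffices for a demo.

```python
import random

def qgrams(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def make_hashers(n, seed=42):
    rnd = random.Random(seed)
    # Each lambda captures its own salt (default args bind at creation time).
    return [lambda x, salt=rnd.getrandbits(32): hash((salt, x)) for _ in range(n)]

def signature(items, hashers):
    # Keep only the minimum hash per function: a constant-size summary of
    # an arbitrarily large q-gram set (unions take the element-wise min).
    return [min(h(x) for x in items) for h in hashers]

hashers = make_hashers(64)
sig_a = signature(qgrams("mississippi"), hashers)
sig_b = signature(qgrams("missisippi"), hashers)
# The fraction of agreeing minima estimates the set resemblance
# |A ∩ B| / |A ∪ B| that is used for pruning sub-trees.
print(sum(a == b for a, b in zip(sig_a, sig_b)) / len(hashers))
```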

  • Rethinking Abstractions for Big Data: Why, Where, How, and What. (Project Website)
    By Mary W. Hall,    Robert M. Kirby,    Feifei Li,    Miriah D. Meyer,    Valerio Pascucci,    Jeff M. Phillips,    Robert Ricci,    Jacobus E. van der Merwe,    Suresh Venkatasubramanian
    Vol.abs/1306.3295, CoRR (CORR 2013),  2013.
    Abstract
2012

  • Query Access Assurance in Outsourced Databases
    By Wangchao Le,    Feifei Li
    Vol.5, No. 2, Pages 178-191, IEEE Transactions on Services Computing (IEEE TSC),  2012.
    Abstract

    Query execution assurance is an important concept in defeating lazy servers in the database as a service model. We show that extending query execution assurance to outsourced databases with multiple data owners is highly inefficient. To cope with lazy servers in the distributed setting, we propose query access assurance (QAA) that focuses on IO-bound queries. The goal in QAA is to enable clients to verify that the server has honestly accessed all records that are necessary to compute the correct query answer, thus eliminating the incentives for the server to be lazy if the query cost is dominated by the IO cost in accessing these records. We formalize this concept for distributed databases, and present two efficient schemes that achieve QAA with high success probabilities. The first scheme is simple to implement and deploy, but may incur excessive server-to-client communication cost and verification cost at the client side when the query selectivity or the database size increases. The second scheme is more involved, but successfully addresses the limitation of the first scheme. Our design employs several number-theoretic techniques. Extensive experiments demonstrate the efficiency, effectiveness and usefulness of our schemes.

2011

  • Group Enclosing Queries
    By Feifei Li,    Bin Yao,    Piyush Kumar
    Vol.23, No. 10, Pages 1526-1540, IEEE Transactions on Knowledge and Data Engineering (TKDE),  2011.
    Abstract

    Given a set of points P and a query set Q, a group enclosing query (GEQ) fetches the point p such that the maximum distance of p to all points in Q is minimized. This problem is equivalent to the Min-Max case (minimizing the maximum distance) of aggregate nearest neighbor queries for spatial databases. This work first designs a new exact solution by exploring new geometric insights, such as the minimum enclosing ball, the convex hull, and the furthest Voronoi diagram of the query group. To further reduce the query cost, especially when the dimensionality increases, we turn to approximation algorithms. Our main approximation algorithm has a worst-case √2-approximation ratio if one can find the exact nearest neighbor of a point. In practice, its approximation ratio never exceeds 1.05 for a large number of data sets up to six dimensions. We also discuss how to extend it to higher dimensions (up to 74 in our experiment) and show that it still maintains a very good approximation quality (still close to 1) and low query cost. In fixed dimensions, we extend the √2-approximation algorithm to get a (1 + ε)-approximate solution for the GEQ problem. Both approximation algorithms have O(log N + M) query cost in any fixed dimension, where N and M are the sizes of the data set P and query group Q. Extensive experiments on both synthetic and real data sets, up to 10 million points and 74 dimensions, confirm the efficiency, effectiveness and scalability of the proposed algorithms, especially their significant improvement over the state-of-the-art method.
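    For intuition, both the exact GEQ answer and the enclosing-ball heuristic fit in a few lines of numpy. The sketch below is ours, not the paper's: the heuristic answers with the nearest neighbor of a center of Q, using the centroid as a crude stand-in for the true minimum-enclosing-ball center on which the paper's √2 analysis relies.

```python
import numpy as np

def geq_exact(P, Q):
    """Exact group enclosing query by brute force: the point of P
    minimizing the maximum distance to the query group Q."""
    costs = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).max(axis=1)
    i = costs.argmin()
    return P[i], costs[i]

def geq_center_heuristic(P, Q):
    """Answer with the nearest neighbor of a center of Q. The centroid
    here approximates the minimum-enclosing-ball center used in the
    sqrt(2)-approximation argument."""
    c = Q.mean(axis=0)
    p = P[np.linalg.norm(P - c, axis=1).argmin()]
    return p, np.linalg.norm(Q - p, axis=1).max()

rng = np.random.default_rng(2)
P, Q = rng.random((5000, 3)), rng.random((10, 3))
print(geq_exact(P, Q)[1], geq_center_heuristic(P, Q)[1])  # near-identical costs
```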

  • The World in a Nutshell: Concise Range Queries
    By Ke Yi,    Xiang Lian,    Feifei Li,    Lei Chen
    Vol.23, No. 1, Pages 139-154, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2011.
    Abstract

    With the advance of wireless communication technology, it is quite common for people to view maps or get related services from handheld devices, such as mobile phones and PDAs. Range queries, as one of the most commonly used tools, are often posed by the users to retrieve needful information from a spatial database. However, due to the limits of communication bandwidth and hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communication efficient nor informative to the users. This is simply because there are often too many results returned from a range query. In view of this problem, we present a novel idea that a concise representation of a specified size for the range query results, while incurring minimal information loss, shall be computed and returned to the user. Such a concise range query not only reduces communication costs, but also offers better usability to the users, providing an opportunity for interactive exploration. The usefulness of the concise range queries is confirmed by comparing it with other possible alternatives, such as sampling and clustering. Unfortunately, we prove that finding the optimal representation with minimum information loss is an NP-hard problem. Therefore, we propose several effective and non-trivial algorithms to find a good approximate result. Extensive experiments on real-world data have demonstrated the effectiveness and efficiency of the proposed techniques.

2010

  • Semantics of Ranking Queries for Probabilistic Data (Project Website)
    By Jeffrey Jestes,    Graham Cormode,    Feifei Li,    Ke Yi
    Vol.250, Pages 545-556, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2010.
    Abstract

    Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define several fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are satisfied by ranking queries on certain data. We argue these properties should also be carefully studied in defining ranking queries in probabilistic data, and fulfilled by definition for ranking uncertain data for most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We studied the ranking definitions based on the expectation, the median and other statistics of this rank distribution for a tuple and derived the expected rank, median rank and quantile rank correspondingly. We are able to prove that the expected rank, median rank and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
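    A small worked example of the expected-rank idea, under the simplest tuple-level model with independent tuples (our simplification; the paper's definition also accounts for possible worlds in which the tuple itself is absent): by linearity of expectation, the expected number of higher-scoring co-existing tuples needs no enumeration of possible worlds.

```python
# Tuple-level uncertainty: each tuple exists independently with
# probability p and carries a fixed score.
tuples = {"a": (0.9, 100), "b": (0.6, 90), "c": (0.4, 80)}

def expected_rank(t):
    """Rank of t in a possible world = number of co-existing tuples with
    a higher score; its expectation is just the sum of the existence
    probabilities of all higher-scoring tuples (linearity of expectation)."""
    _, score = tuples[t]
    return sum(p for u, (p, s) in tuples.items() if u != t and s > score)

for t in sorted(tuples, key=expected_rank):
    print(t, expected_rank(t))   # a: 0.0, b: 0.9, c: 1.5
```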

  • Top-k Queries on Temporal Data (Project Website)
    By Feifei Li,    Ke Yi,    Wangchao Le
    Vol.19, No.5, Pages 715-733, the International Journal on Very Large Data Bases (VLDBJ),  2010.
    Abstract

    The database community has devoted an extensive amount of effort to indexing and querying temporal data in the past decades. However, an insufficient amount of attention has been paid to temporal ranking queries. More precisely, given any time instance t, the query asks for the top-k objects at time t with respect to some score attribute. Some generic indexing structures based on R-trees do support ranking queries on temporal data, but as they are not tailored for such queries, the performance is far from satisfactory. We present the Seb-tree, a simple indexing scheme that supports temporal ranking queries much more efficiently. The Seb-tree answers a top-k query for any time instance t in the optimal number of I/Os in expectation, namely, O(log_B(N/B) + k/B) I/Os, where N is the size of the data set and B is the disk block size. The index has near-linear size (for constant and reasonable kmax values, where kmax is the maximum value for the possible values of the query parameter k), can be constructed in near-linear time, and also supports insertions and deletions without affecting its query performance guarantee. Most of all, the Seb-tree is especially appealing in practice due to its simplicity, as it uses the B-tree as the only building block. Extensive experiments on a number of large data sets show that the Seb-tree is more than an order of magnitude faster than R-tree based indexes for temporal ranking queries.

  • Authenticated Index Structures for Aggregation Queries (Project Website)
    By Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Leonid Reyzin
    Vol.13, Pages 32:1-32:35, ACM Transactions on Information and System Security (ACM TISSEC),  2010.
    Abstract

    Query authentication is an essential component in outsourced database (ODB) systems. This article introduces efficient index structures for authenticating aggregation queries over large data sets. First, we design an index that features good performance characteristics for static environments. Then, we propose more involved structures for the dynamic case. Our structures feature excellent performance for authenticating queries with multiple aggregate attributes and multiple selection predicates. Furthermore, our techniques cover a large number of aggregate types, including distributive aggregates (such as SUM, COUNT, MIN and MAX), algebraic aggregates (such as the AVG), and holistic aggregates (such as MEDIAN and QUANTILE). We have also addressed the issue of authenticating aggregation queries efficiently when the database is encrypted to protect data confidentiality. Finally, we implemented a working prototype of the proposed techniques and experimentally validated the effectiveness and efficiency of our methods.

2009

  • Small Synopses for Group-By Query Verification on Outsourced Data Streams (Project Website)
    By Ke Yi,    Feifei Li,    Graham Cormode,    Marios Hadjieleftheriou,    George Kollios,    Divesh Srivastava
    Vol.34, Pages 1-42, ACM Transactions on Database Systems (ACM TODS),  2009.
    Abstract

    Due to the overwhelming flow of information in many data stream applications, data outsourcing is a natural and effective paradigm for individual businesses to address the issue of scale. In the standard data outsourcing model, the data owner outsources streaming data to one or more third-party servers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Existing solutions proposed in this model all build upon cryptographic primitives such as signatures and collision-resistant hash functions, which only work for certain types of queries, for example, simple selection/aggregation queries. In this article, we consider another common type of queries, namely, “GROUP BY, SUM” queries, which previous techniques fail to support. Our new solutions are not based on cryptographic primitives, but instead use algebraic and probabilistic techniques to compute a small synopsis on the true query result, which is then communicated to the client so as to verify the correctness of the query result returned by the server. The synopsis uses a constant amount of space irrespective of the result size, has an extremely small probability of failure, and can be maintained using no extra space when the query result changes as elements stream by. We then generalize our synopsis to allow some tolerance on the number of erroneous groups, in order to support semantic load shedding on the server. When the number of erroneous groups is indeed tolerable, the synopsis can be strengthened so that we can locate and even correct these errors. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.
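    The flavor of such an algebraic, constant-size synopsis can be conveyed compactly. The sketch below is our own simplification in the spirit of the paper's polynomial-identity technique, not a faithful reimplementation: it fingerprints a {group: SUM} result as a product of linear factors modulo a prime, supports incremental maintenance as group sums change, and catches a wrong answer except with tiny probability.

```python
import random

p = (1 << 61) - 1                  # a large Mersenne prime
alpha = random.randrange(1, p)     # secret evaluation point, kept by the client

def encode(group, value, max_value=10**6):
    return group * (max_value + 1) + value   # injective (group, value) pairing

def synopsis(result):
    """Constant-size fingerprint of a {group: SUM} query result."""
    x = 1
    for g, v in result.items():
        x = x * (alpha - encode(g, v)) % p
    return x

def update(x, g, old_v, new_v):
    """Maintain the fingerprint when group g's sum changes: divide out the
    old factor (modular inverse via Fermat's little theorem) and multiply
    in the new one -- no extra space beyond x itself."""
    x = x * pow(alpha - encode(g, old_v), p - 2, p) % p
    return x * (alpha - encode(g, new_v)) % p

truth = {1: 10, 2: 7}
x = synopsis(truth)
x = update(x, 2, 7, 12); truth[2] = 12
print(x == synopsis(truth))            # True: the correct answer verifies
print(x == synopsis({1: 10, 2: 11}))   # False: a wrong answer is caught
```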

  • Robust Approximate Aggregation in Sensor Data Management Systems (Project Website)
    By Jeffrey Considine,    Marios Hadjieleftheriou,    Feifei Li,    John Byers,    George Kollios
    Vol.34, Pages 1-35, ACM Transactions on Database Systems (ACM TODS),  2009.
    Abstract

    In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, which allow users to perform aggregation queries such as MIN, COUNT, and AVG on the readings of a sensor network. In addition, more advanced queries such as frequency counting and quantile estimation can be supported. Due to energy limitations in sensor-based networks, centralized data collection is generally impractical, so most systems use in-network aggregation to reduce network traffic. However, even these aggregation strategies remain bandwidth-intensive when combined with the fault-tolerant, multipath routing methods often used in these environments. To avoid this expense, we investigate the use of approximate in-network aggregation using small sketches. We present duplicate-insensitive sketching techniques that can be implemented efficiently on small sensor devices with limited hardware support, and we analyze both their performance and accuracy. Finally, we present an experimental evaluation that validates the effectiveness of our methods.
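    Duplicate-insensitivity is the crux: a sketch must give the same answer no matter how many times multipath routing delivers a reading. The classic example is the Flajolet-Martin distinct-count sketch, a minimal version of which is below (our illustration of the general idea, not the paper's exact construction):

```python
import hashlib

def fm_insert(sketches, item):
    """Set one bit per sketch at the position given by the trailing zeros
    of the item's hash. Re-inserting a duplicate is a no-op, and two
    sketches merge by bitwise OR, so multipath duplication is harmless."""
    for j in range(len(sketches)):
        digest = hashlib.blake2b(f"{j}:{item}".encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        r = (h & -h).bit_length() - 1 if h else 63   # trailing zeros of h
        sketches[j] |= 1 << min(r, 63)

def fm_estimate(sketches):
    total = 0
    for s in sketches:
        r = 0
        while s >> r & 1:
            r += 1                 # lowest unset bit tracks log2(count)
        total += r
    return 2 ** (total / len(sketches)) / 0.77351   # classic FM correction

sketches = [0] * 32
for i in range(5000):              # 5000 readings, only 500 distinct sensors
    fm_insert(sketches, f"sensor-{i % 500}")
print(fm_estimate(sketches))       # roughly 500 despite the duplicates
```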

2008

  • Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations (Project Website)
    By Ke Yi,    Feifei Li,    George Kollios,    Divesh Srivastava
    Vol.20, Pages 1669-1682, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  2008.
    Abstract

    This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both runtime and memory usage. In the single-alternative case, the new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multialternative case, we introduce the first-known polynomial algorithms, while the current best algorithms have exponential complexity in both time and space. Our algorithms run in near linear or low polynomial time and cover both types of top-k queries in uncertain databases. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches over existing solutions.

2005

  • Robust Aggregation in Sensor Networks
    By George Kollios,    John W. Byers,    Jeffrey Considine,    Marios Hadjieleftheriou,    Feifei Li
    Vol.28, Pages 26-32, IEEE Data Engineering Bulletin,  2005.
    Abstract

    In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor “database” systems, which allow users to perform aggregation queries on the readings of a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. However, these aggregation strategies become bandwidth intensive when combined with the fault-tolerant, multi-path routing methods often used in these environments. In order to avoid this expense, we investigate the use of approximate in-network aggregation using small sketches and we survey robust and scalable methods for computing duplicate-sensitive aggregates.

2004

  • Towards Building Logical Views of Websites
    By Zehua Liu,    Wee Keong Ng,    Ee-Peng Lim,    Feifei Li
    Data & Knowledge Engineering (DKE 2004, ELSEVIER),  2004.
    Abstract
Conference

    2016

  • Spell: Streaming Parsing of System Event Logs, Talk
    By Min Du,    Feifei Li
    In Proceedings of the 16th IEEE International Conference on Data Mining (ICDM 2016),  pages 859-864,  Barcelona, Spain,  December,  2016.
    Abstract

    As modern computer systems grow in complexity, system event logs contain critical information for diagnosis and monitoring purposes. They have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs into structured data, and then apply data mining and machine learning techniques and/or build workflow models from the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. As a result, a streaming method to parse unstructured logs is needed. We propose an online streaming method, Spell, which utilizes a longest common subsequence based approach to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in streaming fashion. An enhancement to find more accurate message types is also proposed. We compare Spell against two popular offline batched methods to extract patterns from system event logs on large real data. The results demonstrate that, even compared with the offline alternatives, Spell shows its superiority in terms of both efficiency and effectiveness.
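    A toy rendition of the LCS-based idea (our own sketch with invented names, using difflib's matcher as an LCS stand-in, not the Spell implementation): each incoming entry is matched against the stored message types, and constant tokens survive while parameter positions degrade to a wildcard.

```python
from difflib import SequenceMatcher

def common_tokens(a, b):
    """Common subsequence of two token lists (difflib approximates LCS)."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for block in sm.get_matching_blocks():
        out.extend(a[block.a:block.a + block.size])
    return out

def parse_stream(lines, tau=0.5):
    """Match each new entry to the stored message type sharing the most
    tokens; merge (parameter tokens become '*') or start a new type if
    no match covers at least tau of the entry."""
    types = []
    for line in lines:
        toks = line.split()
        best, best_len = None, 0
        for t in types:
            n = len(common_tokens([x for x in t if x != "*"], toks))
            if n > best_len:
                best, best_len = t, n
        if best is not None and best_len >= tau * len(toks):
            keep = common_tokens([x for x in best if x != "*"], toks)
            best[:] = [x if x in keep else "*" for x in best]
        else:
            types.append(toks)
    return [" ".join(t) for t in types]

logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.7 closed",
    "Disk /dev/sda1 is 91% full",
]
print(parse_stream(logs))  # ['Connection from * closed', 'Disk /dev/sda1 is 91% full']
```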

  • Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration
    By Jiaxin Shi,    Youyang Yao,    Rong Chen,    Haibo Chen,    Feifei Li
    In Proceedings of 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016),  pages 317-332,  Savannah, GA,  November,  2016.
    Abstract

    Many knowledge bases like Google and Facebook's knowledge/social graphs are represented and stored as RDF graphs, where users can issue structured queries on such graphs using SPARQL. With massive queries over large and constantly growing RDF data, it is imperative that an RDF graph store should provide low latency and high throughput for concurrent query processing. However, prior systems still experience high per-query latency over large datasets and most prior designs have poor resource utilization such that each query is processed in sequence. We present Wukong, a distributed graph-based RDF store that leverages RDMA-based graph exploration to support highly concurrent and low-latency queries over large data. Following a graph-centric design, Wukong builds indexes by extending the graphs with index vertices and leverages differentiated graph partitioning to retain locality (for normal vertices) while exploiting parallelism (for index vertices). To explore the low-latency feature of RDMA, Wukong leverages a new RDMA-friendly, predicate-based key/value store to store the partitioned graphs. To provide low latency and high parallelism, Wukong decomposes a query into sub-queries, each of which may be distributed over and handled by a set of machines in parallel. For each sub-query, Wukong leverages RDMA to provide communication-aware optimizations to balance between execution and data migration. To reduce inter-query interference, Wukong leverages a worker-obliger work stealing mechanism to oblige queries in straggler workers. Evaluation on a 6-node RDMA-capable cluster shows that Wukong significantly outperforms state-of-the-art systems like TriAD and Trinity.RDF for both latency and throughput, usually at the scale of orders of magnitude.

  • Simba: Spatial In-Memory Big Data Analytics (Project Website)
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Zhongpu Chen,    Minyi Guo
    In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2016),  pages TBA,  San Francisco, USA,  November,  2016.
    Abstract

    We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba natively extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and DataFrame API. It enables the construction of indexes over RDDs inside the engine in order to work with big spatial data and complex spatial operations. Simba also comes with an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput in big spatial data analysis. This demonstration proposal describes key ideas in the design of Simba, and presents a demonstration plan.

  • Oblivious RAM: A Dissection and Experimental Evaluation (Project Website), Talk
    By Zhao Chang,    Dong Xie,    Feifei Li
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016),  pages 1113-1124,  New Delhi, India,  September,  2016.
    Abstract

    Many companies choose the cloud as their data and IT infrastructure platform. The remote access of the data brings the issue of trust, and the potential risk of compromising sensitive information should not be underestimated. Despite the use of strong encryption schemes, adversaries can still learn valuable information regarding encrypted data by observing the data access patterns. To that end, one can hide the access patterns, which may leak sensitive information, using Oblivious RAMs (ORAMs). Numerous works have proposed different ORAM constructions. Nevertheless, many such ORAM constructions are of only theoretical interest, and hence are not useful in practice. Several more practical ORAM constructions do exist, but they have never been thoroughly compared against each other and tested on large databases. There is no open source implementation of these schemes, making such a study challenging to carry out (since most ORAMs are quite contrived in terms of both theoretical analysis and practical implementations). These limitations make it difficult for researchers and practitioners to choose and adopt a suitable ORAM for their applications. To address this issue, we provide a thorough study over several practical ORAM constructions, and implement them under the same library. We perform extensive experiments to provide insights into their performance characteristics with respect to efficiency, scalability, and communication cost. Lastly, we plan to release our ORAM implementations through GitHub so that the communities at large may benefit from and contribute to an open source ORAM library under one unified framework.

  • Spatial Online Sampling and Aggregation, Talk
    By Lu Wang,    Robert Christensen,    Feifei Li,    Ke Yi
    In Proceedings of International Conference on Very Large Data Bases (VLDB 2016),  pages 84-95,  New Delhi, India,  August,  2016.
    Abstract

    The massive adoption of smart phones and other mobile devices has generated a humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and cannot meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

  • Matrix Sketching Over Sliding Windows, Talk
    By Zhewei Wei,    Xuancheng Liu,    Feifei Li,    Shuo Shang,    Xiaoyong Du,    Ji-Rong Wen
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 1465-1480,  June,  2016.
    Abstract

    Large-scale matrix computation becomes essential for many data applications, and hence the problem of sketching a matrix with small space and high precision has received extensive study over the past few years. This problem is often considered in the row-update streaming model, where the data set is a matrix $A \in \mathbb{R}^{n \times d}$, and the processor receives a row ($1 \times d$) of $A$ at each timestamp. The goal is to maintain a smaller matrix (termed approximation matrix, or simply approximation) $B \in \mathbb{R}^{\ell \times d}$ as an approximation to $A$, such that the covariance error $\|A^T A - B^T B\|$ is small and $\ell \ll n$.

    This paper studies continuously tracking approximations to the matrix defined by a sliding window of the most recent rows. We consider both sequence-based and time-based windows. We show that maintaining $A^T A$ exactly requires linear space in the sliding window model, as opposed to $O(d^2)$ space in the streaming model. With this observation, we present three general frameworks for matrix sketching on sliding windows. The sampling techniques give random samples of the rows in the window according to their squared norms. The Logarithmic Method converts a mergeable streaming matrix sketch into a matrix sketch on time-based sliding windows. The Dyadic Interval framework converts an arbitrary streaming matrix sketch into a matrix sketch on sequence-based sliding windows. In addition to proving all algorithmic properties theoretically, we also conduct an extensive empirical study with real datasets to demonstrate the efficiency of these algorithms.
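    The mergeable streaming building block that the Logarithmic Method lifts to sliding windows can be illustrated with Frequent Directions, the standard row-update sketch (our generic rendition under that assumption, not code from the paper; it assumes ell <= d):

```python
import numpy as np

def frequent_directions(rows, ell):
    """Row-update streaming sketch: maintain B with 2*ell rows; whenever
    the buffer fills, shrink all singular values by the ell-th largest so
    that half the rows become zero. Keeps the covariance error
    ||A^T A - B^T B||_2 below roughly ||A||_F^2 / ell."""
    B = np.zeros((2 * ell, rows.shape[1]))
    free = 0
    for row in rows:
        if free == 2 * ell:                       # buffer full: shrink
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s**2 - s[ell - 1]**2, 0.0))
            B = np.zeros_like(B)
            B[:Vt.shape[0]] = s[:, None] * Vt     # bottom rows are now zero
            free = ell
        B[free] = row
        free += 1
    return B

rng = np.random.default_rng(3)
A = rng.normal(size=(10_000, 20))
B = frequent_directions(A, ell=10)
print(np.linalg.norm(A.T @ A - B.T @ B, 2),      # covariance error ...
      np.linalg.norm(A, "fro")**2 / 10)          # ... stays below the bound
```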

  • Graph Analytics Through Fine-Grained Parallelism, Talk
    By Zechao Shang,    Feifei Li,    Jeffrey Xu Yu,    Zhiwei Zhang,    Hong Chen
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 463-478,  June,  2016.
    Abstract

    Large graphs are getting increasingly popular and even indispensable in applications from a number of important application domains, for example, in social media data, large networks, and knowledge bases. Efficient execution of various analytics over large graphs thus becomes an important subject of study. To increase efficiency and scalability, in-memory computation and parallelism have been explored extensively to speed up various graph analytical workloads. In many existing graph analytical engines (e.g., Pregel, Neo4j, GraphLab), parallelism is achieved via one of the three concurrency control models, namely, bulk synchronization processing (BSP), asynchronous processing, and synchronous processing. Among them, synchronous processing has the potential to achieve the best performance due to fine-grained parallelism, while ensuring the correctness and the convergence of the computation, if an effective concurrency control scheme is used. This paper explores the topological properties of the underlying graph to design and implement a highly effective concurrency control scheme for efficient synchronous processing in an in-memory graph analytical engine. Our design uses a novel hybrid approach that combines 2PL (two-phase locking) with OCC (optimistic concurrency control), for high degree and low degree vertices in a graph respectively. Our results show that the proposed hybrid synchronous scheduler has significantly outperformed other synchronous schedulers in existing graph analytical engines, as well as BSP and asynchronous schedulers.

  • Simba: Efficient In-Memory Spatial Analytics (Project Website), Talk
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Minyi Guo
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 1071-1085,  June,  2016.
    Abstract

    Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces the concept and construction of indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics systems.

  • Wander Join: Online Aggregation via Random Walks, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 615-629,  June,  2016.
    Abstract

    Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. Wander join produces independent but non-uniform samples, which gives huge performance gains over the uniform but non-independent samples returned by ripple join. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Wander join works for different types of joins, including chain, acyclic, and cyclic joins, with selection predicates and group-by clauses. It easily extends to θ-joins as well. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join in both internal and external memory. We have also implemented and tested wander join in the latest version of PostgreSQL; the results show its excellent efficiency and effectiveness in a full-fledged, commercial level DBMS.
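    The core estimator is easy to demonstrate on a two-table join. The sketch below (our toy with fabricated tables, far from the paper's optimizer and multi-way walks) samples one join path per walk and reweights it by the inverse of its sampling probability, which makes the SUM estimate unbiased even though paths are sampled non-uniformly:

```python
import random
from collections import defaultdict

R = [("a",), ("b",), ("c",)] * 100                   # R(k), 300 rows
S = [("a", 5), ("b", 2), ("b", 3), ("c", 7)] * 50    # S(k, v), 200 rows

index_S = defaultdict(list)                 # join index used to walk R -> S
for s in S:
    index_S[s[0]].append(s)

def wander_join_sum(num_walks=20_000):
    """Estimate SUM(S.v) over R JOIN S via random walks."""
    total = 0.0
    for _ in range(num_walks):
        r = random.choice(R)                # step 1: uniform row of R
        matches = index_S[r[0]]
        if not matches:
            continue                        # a failed walk contributes 0
        s = random.choice(matches)          # step 2: uniform joining row of S
        prob = 1 / (len(R) * len(matches))  # probability of this exact path
        total += s[1] / prob                # Horvitz-Thompson reweighting
    return total / num_walks

exact = sum(s[1] for r in R for s in index_S[r[0]])
print(round(wander_join_sum()), exact)      # the estimate converges to 85000
```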

  • Wander Join: Online Aggregation For Joins, Talk
    By Feifei Li,    Bin Wu,    Ke Yi,    Zhuoyue Zhao
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data 2016 (Demo Paper) (SIGMOD 2016),  pages 2121-2124,  June,  2016.
    Abstract

    Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and may also need very restrictive assumptions (e.g., tuples in a table are stored in random order). We have introduced a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We have also implemented and tested wander join in the latest version of PostgreSQL; this is the first time online aggregation has been integrated into a full-fledged, commercial level DBMS.

  • Privacy Preserving Subgraph Matching on Large Graphs in Cloud, Talk
    By Zhao Chang,    Lei Zou,    Feifei Li
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 199-213,  2016.
    Abstract

    The wide presence of large graph data and the increasing popularity of storing data in the cloud drive the needs for graph query processing on a remote cloud. But a fundamental challenge is to process user queries without compromising sensitive information. This work focuses on privacy preserving subgraph matching in a cloud server. The goal is to minimize the overhead on both cloud and client sides for subgraph matching, without compromising users' sensitive information. To that end, we transform an original graph G into a privacy preserving graph Gk, which meets the requirement of an existing privacy model known as k-automorphism. By making use of the symmetry in a k-automorphic graph, a subgraph matching query can be efficiently answered using a graph Go, a small subset of Gk. This approach saves both space and query cost in the cloud server. In addition, we anonymize the original query graphs to protect their label information using a label generalization technique. To reduce the search space for a subgraph matching query, we propose a cost model to select the more effective label combinations. The effectiveness and efficiency of our method are demonstrated through extensive experimental results on real datasets.

  • Web Technologies and Applications - 18th Asia-Pacific Web Conference, APWeb 2016, Suzhou, China, September 23-25, 2016. Proceedings, Part II (Project Website)
    By Feifei Li,    Kyuseok Shim,    Kai Zheng,    Guanfeng Liu
    In Proceedings of APWeb (2) (APWEB 2016),  pages ??-??,  2016.
    Abstract
  • Web Technologies and Applications - 18th Asia-Pacific Web Conference, APWeb 2016, Suzhou, China, September 23-25, 2016. Proceedings, Part I (Project Website)
    By Feifei Li,    Kyuseok Shim,    Kai Zheng,    Guanfeng Liu
    In Proceedings of APWeb (1) (APWEB 2016),  pages ??-??,  2016.
    Abstract
2015

  • ATOM: Automated Tracking, Orchestration, and Monitoring of Resource Usage in Infrastructure as a Service Systems (Project Website), Talk
    By Min Du,    Feifei Li
    In Proceedings of the IEEE International Conference on Big Data (IEEE BIGDATA 2015),  pages 271-278,  Santa Clara, CA,  November,  2015.
    Abstract

    We present ATOM, an efficient and effective framework to enable automated tracking, monitoring, and orchestration of resource usage in an Infrastructure as a Service (IaaS) system. We design a novel tracking method to continuously track important performance metrics with low overhead, and develop a Principal Component Analysis (PCA) approach with quality guarantees to continuously monitor and automatically find anomalies based on the approximate tracking results. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on virtual machines (VMs) to identify malicious behavior inside a VM. We deploy ATOM in an IaaS system to monitor VM resource usage, and to detect anomalies. Various attacks are used as an example to demonstrate how ATOM is both effective and efficient to track and monitor resource usage, detect anomalies, and orchestrate system resource usage.

  • Fixed-Function Hardware Sorting Accelerators for Near Data MapReduce Execution
    By Seth Pugsley,    Arjun Deb,    Rajeev Balasubramonian,    Feifei Li
    In Proceedings of 33rd IEEE International Conference on Computer Design (IEEE ICCD-33),  pages 439-442,  New York,  October,  2015.
    Abstract

    A large fraction of MapReduce execution time is spent processing the Map phase, and a large fraction of Map phase execution time is spent sorting the intermediate key-value pairs generated by the Map function. Sorting accelerators can achieve high performance and low power because they lack the overheads of sorting implementations on general purpose hardware, such as instruction fetch and decode. We find that sorting accelerators are a good match for 3D-stacked Near Data Processing (NDP) because their sorting throughput is so high that it saturates the memory bandwidth available in other memory organizations. The increased sorting performance and low power requirement of fixed-function hardware lead to very high Map phase performance and energy efficiency, reducing Map phase execution time by up to 92%, and reducing energy consumption by up to 91%. We further find that sorting accelerators in a less exotic form of NDP outperform more expensive forms of 3D-stacked NDP without accelerators. We also implement the accelerator on an FPGA to validate our claims.

  • STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data (Project Website), Talk
    By Robert Christensen,    Lu Wang,    Feifei Li,    Ke Yi,    Jun Tang,    Natalee Villa
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015),  pages 1111-1116,  Melbourne, Australia,  2015.
    Abstract

    We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data. STORM supports interactive spatio-temporal analytics through novel spatial online sampling techniques. Online spatio-temporal aggregation and analytics are then derived based on the online samples, where approximate answers with approximation quality guarantees can be provided immediately from the start of query execution. The quality of these online approximations improve over time. This demonstration proposal describes key ideas in the design of the STORM system, and presents the demonstration plan.

  • Distributed Online Tracking, Talk
    By Mingwang Tang,    Feifei Li,    Yufei Tao
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015),  pages 2047-2061,  Melbourne, Australia,  2015.
    Abstract

    In online tracking, an observer S receives a sequence of values, one per time instance, from a data source that is described by a function f. A tracker T wants to continuously maintain an approximation that is within an error threshold of the value f(t) at any time instance t, with small communication overhead. This problem was recently formalized and studied in prior work, where a principled approach with an optimal competitive ratio was proposed. This work extends the study of online tracking to a distributed setting, where a tracker T wants to track a function f that is computed from a set of functions {f1, . . . , fm} from m distributed observers and respective data sources. This formulation finds numerous important and natural applications, e.g., sensor networks, distributed systems, measurement networks, and pub-sub systems. We formalize this problem and present effective online algorithms for various topologies of a distributed system/network for different aggregate functions. Experiments on large real data sets demonstrate the excellent performance of our methods in practice.
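    The single-observer primitive underneath this setting fits in a few lines: push a new value to the tracker only when the source has drifted beyond the error threshold. This naive baseline (our sketch, not the paper's competitively optimal algorithm or its distributed aggregates) already shows where the communication savings come from.

```python
import math

def online_track(stream, delta):
    """Observer-side filter: communicate only when the source value
    drifts more than delta from the last value sent, so the tracker's
    copy is always within delta of the true f(t)."""
    messages, last_sent = [], None
    for t, value in enumerate(stream):
        if last_sent is None or abs(value - last_sent) > delta:
            last_sent = value
            messages.append((t, value))     # one message to the tracker
    return messages

stream = [10 * math.sin(t / 50) for t in range(1000)]
msgs = online_track(stream, delta=0.5)
print(f"{len(msgs)} messages instead of {len(stream)}")
```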

2014

  • NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads
    By Seth Pugsley,    Jeffrey Jestes,    Huihui Zhang,    Rajeev Balasubramonian,    Vijayalakshmi Srinivasan,    Alper Buyuktosunoglu,    Al Davis,    Feifei Li
    In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2014),  pages 190-200,  2014.
    Abstract

    While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron’s Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to 15X reduction in execution time and 18X reduction in system energy.

  • Continuous Matrix Approximation on Distributed Data (Project Website), Talk
    By Mina Ghashami,    Jeff Phillips,    Feifei Li
    In Proceedings of 40th International Conference on Very Large Data Bases (VLDB 2014),  pages 809-820,  Hangzhou China,  2014.
    Abstract

    Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of “tracking approximations to a matrix” in the distributed streaming model. In this model, there are m distributed sites, each observing a distinct stream of data (where each element is a row of a distributed matrix) and having a communication channel with a coordinator, and the goal is to track an ε-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: ||Ax||^2 − ||Bx||^2 ≤ ε||A||_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/ε) log(βN)) communication, where N is the size of the stream (at the time of the query) and β is an upper bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.

  • Scalable Histograms on Large Probabilistic Data, Talk
    By Mingwang Tang,    Feifei Li
    In Proceedings of 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2014),  pages 631-640,  New York,  2014.
    Abstract

    Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be extremely expensive, and existing studies suffer from limited scalability. This work designs novel approximation methods to construct scalable histograms on probabilistic data. We show that our methods provide constant approximations compared to the optimal histograms produced by the state-of-the-art in the worst case. We also extend our methods to parallel and distributed settings so that they can run gracefully in a cluster of commodity machines. We introduce novel synopses to reduce communication cost when running our methods in such settings. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our methods, when compared to the state-of-the-art methods. They also achieved excellent approximation quality in practice.

  • Web-Age Information Management - 15th International Conference, WAIM 2014, Macau, China, June 16-18, 2014. Proceedings (Project Website)
    By Feifei Li,    Guoliang Li,    Seung-won Hwang,    Bin Yao,    Zhenjie Zhang
    In Proceedings of WAIM (WAIM 2014),  pages ??-??,  2014.
    Abstract
  • International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014 (Project Website)
    By Curtis E. Dyreson,    Feifei Li,    M. Tamer Özsu
    In Proceedings of SIGMOD Conference (SIGMOD 2014),  pages ??-??,  2014.
    Abstract
2013

  • Adaptive Log Compression for Massive Log Data (Project Website), Talk
    By Robert Christensen,    Feifei Li
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013, Undergraduate Research Poster),  pages 1283-1284,  June,  2013.
    Abstract

    We present novel adaptive log compression schemes. Results show a 30% improvement in compression ratios over existing approaches.

  • Quality and Efficiency for Kernel Density Estimates in Large Data, Talk
    By Yan Zheng,    Jeffrey Jestes,    Jeff M. Phillips,    Feifei Li
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013),  pages 433-444,  June,  2013.
    Abstract

    Kernel density estimates are important for a broad variety of applications including media databases, pattern recognition, computer vision, data mining, and the sciences. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques.
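    A minimal picture of the quality/efficiency tension (our illustration; the paper's coresets achieve far stronger guarantees than the plain random sample used here): exact Gaussian KDE over all points versus the same estimate over a small uniform sample.

```python
import numpy as np

def kde(queries, data, bandwidth):
    """Exact 1-D Gaussian kernel density estimate at each query point."""
    d2 = (queries[:, None] - data[None, :]) ** 2
    return np.exp(-d2 / (2 * bandwidth**2)).mean(axis=1)

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2, 1, 25_000), rng.normal(2, 1, 25_000)])
queries = np.linspace(-4, 4, 9)

full = kde(queries, data, bandwidth=0.5)              # O(n) work per query
sample = rng.choice(data, size=1_000, replace=False)  # cheap approximation
approx = kde(queries, sample, bandwidth=0.5)
print(np.abs(full - approx).max())                    # small L_inf error
```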

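    As a point of reference for what is being approximated, here is a one-dimensional Gaussian KDE together with the simple uniform-sampling baseline that such coreset methods improve on (a toy sketch with a fixed, hand-picked bandwidth; the paper's constructions achieve the same additive error with far smaller summaries and with guarantees):

        import numpy as np

        def kde(centers, x, h):
            # Gaussian kernel density estimate at x with bandwidth h.
            z = (x - np.asarray(centers)) / h
            return np.mean(np.exp(-0.5 * z ** 2)) / (h * np.sqrt(2 * np.pi))

        rng = np.random.default_rng(0)
        data = rng.normal(size=200_000)
        # Baseline summary: a uniform sample of size O(1/eps^2) approximates
        # the KDE within +/- eps at a fixed query point (Chernoff bound).
        sample = rng.choice(data, size=4_000, replace=False)
        for q in (0.0, 0.5, 2.0):
            print(q, kde(data, q, 0.2), kde(sample, q, 0.2))
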
  • Optimal Splitters for Temporal and Multi-version Databases, Talk
    By Wangchao Le,    Feifei Li,    Yufei Tao,    Robert Christensen
    In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2013),  pages 109-120,  June,  2013.
    Abstract

    Temporal and multi-version databases often generate massive amounts of data, due to the increasing availability of large storage space and the increasing importance of mining and auditing operations on historical data. For example, Google now allows users to limit and rank search results by setting a time range. These databases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve a good load balancing algorithm for storage and processing of these data, which is done by partitioning the database. In this paper, we introduce the concept of optimal splitters for temporal and multi-version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket is minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk-resident data respectively, and show that they significantly outperform competing baseline methods both theoretically and empirically on large real data sets.

  • Secure Nearest Neighbor Revisited, Talk
    By Bin Yao,    Feifei Li,    Xiaokui Xiao
    In Proceedings of 29th IEEE International Conference on Data Engineering (ICDE 2013),  pages 733-744,  Brisbane, Australia,  April,  2013.
    Abstract

    The increasing popularity of the cloud drives the demand for secure queries on an encrypted database E(D) stored in the cloud. In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a server and asks for an encrypted data point in E(D) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods, even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field, and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models, our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide a customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database. Note that E can be any well-established encryption scheme. The efficiency and scalability of our methods are demonstrated through extensive experiments on large real data.

  • LogKV: Exploiting Key-Value Stores for Log Processing, Talk
    By Zhao Cao,    Shimin Chen,    Feifei Li,    Min Wang,    Xiaoyang Sean Wang
    In Proceedings of 6th Biennial Conference on Innovative Data System Research (CIDR 2013),  pages TBA,  Asilomar, California,  January,  2013.
    Abstract

    Event log processing and analysis play a key role in applications ranging from security management and IT troubleshooting to user behavior analysis. Recent years have seen a rapid growth in system scales and a corresponding rapid increase in the amount of log event data. At the same time, as logs are found to be a valuable information source, log analysis tasks have become more sophisticated, demanding both interactive exploratory query processing and batch computation. Desirable query types include selection with time ranges and value filtering criteria, join within time windows, join between log data and reference tables, and various aggregation types. In such a situation, parallel solutions are necessary, but existing parallel and distributed solutions either support limited query types or perform only batch computations on logs. With a system called LogKV, this paper reports a first study of using Key-Value stores to support log processing and analysis, exploiting the scalability, reliability, and efficiency commonly found in Key-Value store systems. LogKV contains a number of unique techniques that are needed to handle log data in terms of event ingestion, load balancing, storage optimization, and query processing. Preliminary experimental results show that LogKV is a promising solution.

  • Proceedings of the 12th International ACM Workshop on Data Engineering for Wireless and Mobile Access, MobiDE 2013, New York, NY, USA, June 23, 2013 (Project Website)
    By Mario A. Nascimento,    Mohamed A. Sharaf,    Feifei Li,    Kyriakos Mouratidis
    In Proceedings of MobiDE (MOBIDE 2013),  pages ??-??,  2013.
    Abstract
  • Proceedings of the fifth international workshop on Cloud data management, CloudDB 2013, San Francisco, California, USA, October 28, 2013 (Project Website)
    By Xiaofeng Meng,    Fusheng Wang,    Feifei Li,    Cong Yu
    In Proceedings of CloudDB@CIKM (CIKM 2013),  pages ??-??,  2013.
    Abstract
  • CloudDB 2013: fifth international workshop on cloud data management. (Project Website)
    By Feifei Li,    Xiaofeng Meng,    Fusheng Wang,    Cong Yu
    In Proceedings of CIKM (CIKM 2013),  pages 2549-2550,  2013.
    Abstract
    2012

  • Building Wavelet Histograms on Large Data in MapReduce (Project Website), Talk
    By Jeffrey Jestes,    Ke Yi,    Feifei Li
    In Proceedings of 38th International Conference on Very Large Databases (VLDB 2012),  pages 109-120,  Istanbul, Turkey,  August,  2012.
    Abstract

    MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful, and the wavelet histogram is one of the most widely used. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often several orders of magnitude) performance improvements achieved by our new algorithms.

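    For reference, a wavelet histogram keeps only the largest Haar coefficients of a frequency vector. A toy single-machine version (illustrating the object being computed, not the paper's MapReduce algorithms; the un-normalized transform below is one of several common conventions):

        import numpy as np

        def haar(v):
            # Un-normalized Haar transform of a frequency vector whose
            # length is a power of two: split into averages and details.
            v = np.asarray(v, dtype=float)
            out = []
            while len(v) > 1:
                out.append((v[0::2] - v[1::2]) / 2)   # detail coefficients
                v = (v[0::2] + v[1::2]) / 2           # averages, next level
            out.append(v)                             # overall average
            return np.concatenate(out[::-1])

        freq = np.array([3, 5, 10, 8, 2, 2, 9, 9])
        w = haar(freq)
        keep = np.argsort(-np.abs(w))[:3]   # the 3-term wavelet histogram
        print(w, keep)
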
  • Ranking Large Temporal Data, Talk
    By Jeffrey Jestes,    Jeff M. Phillips,    Feifei Li,    Mingwang Tang
    In Proceedings of 38th International Conference on Very Large Databases (VLDB 2012),  pages 1412-1423,  Istanbul, Turkey,  August,  2012.
    Abstract

    Ranking temporal data has not been studied until recently [14], even though ranking is an important operator (being promoted as a first-class citizen) in database systems [8]. However, only the instant top-k queries on temporal data were studied in [14], where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example: return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from 02/05/2011 to 02/07/2011. This work presents a comprehensive study of this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis of the construction cost, the index size, and the update and query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.

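    The query semantics are easy to state with a toy example (brute force over per-day scores with made-up station names; the paper's contribution is indexing that avoids this scan):

        from heapq import nlargest

        # One score per object per day; the aggregate top-k query on
        # [t1, t2] ranks objects by an aggregate (here: sum) in range.
        scores = {
            "station_A": [61, 64, 70, 68, 59, 55, 62],
            "station_B": [80, 30, 31, 29, 33, 32, 30],
            "station_C": [65, 66, 67, 65, 64, 66, 65],
        }

        def aggregate_topk(scores, t1, t2, k):
            return nlargest(k, scores, key=lambda o: sum(scores[o][t1:t2 + 1]))

        print(aggregate_topk(scores, 0, 6, 2))   # aggregate top-2 over the week
        print(aggregate_topk(scores, 0, 0, 2))   # instant top-2 at t = 0 differs
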
  • Towards Fair Sharing of Block Storage in a Multi-tenant Cloud
    By Xing Lin,    Yun Mao,    Feifei Li,    Robert Ricci
    In Proceedings of 4th USENIX Workshop on Hot Topics in Cloud Computing (USENIX HOTCLOUD 2012),  pages 1-6,  June,  2012.
    Abstract

    A common problem with disk-based cloud storage services is that performance can vary greatly and become highly unpredictable in a multi-tenant environment. A fundamental reason is the interference between workloads co-located on the same physical disk. We observe that different IO patterns interfere with each other significantly, which makes the performance of different types of workloads unpredictable when they are executed concurrently. Unpredictability implies that users may not get a fair share of the system resources from the cloud services they are using. At the same time, replication is commonly used in cloud storage for high reliability. Connecting these two facts, we propose a cloud storage system designed to minimize workload interference without increasing storage costs or sacrificing the overall system throughput. Our design leverages log-structured disk layout, chain replication and a workload-based replica selection strategy to minimize interference, striking a balance between performance and fairness. Our initial results suggest that this approach is a promising way to improve the performance and predictability of cloud storage.

  • ColumbuScout: Towards Building Local Search Engines over Large Databases (Project Website), Talk
    By Cody Hansen,    Feifei Li
    In Proceedings of 31st ACM SIGMOD International Conference on Management of Data (SIGMOD 2012, SYSTEM DEMO),  pages 617-620,  May,  2012.
    Abstract

    In many database applications, search is still executed via form-based query interfaces, which are then translated into SQL statements to find matching records. Ranking is usually not implemented unless users have explicitly indicated how to rank the matching records, e.g., in the ascending order of year. Often, this approach is neither intuitive nor user-friendly (especially with many search fields in a query form). It also requires application developers to design schema-specific query forms and develop specific user programs that understand these forms. In this work, we propose to demonstrate the ColumbuScout system that aims at quickly building and deploying a local search engine over one or more large databases. The ColumbuScout system adopts a keyword-centric search approach. It integrates the keyword-centric principle with the latest results from approximate string search, and designs search-engine-style ranking functions. It also introduces some of its own indexing structures and storage designs to improve its overall efficiency and scalability. We will demonstrate that it is almost effortless for application developers to deploy ColumbuScout over any database, and that ColumbuScout is able to support search-engine-like types of search over large databases (more than 1.7 billion records in the examples we used) efficiently and effectively.

  • Scalable Multi-Query Optimization for SPARQL, Talk
    By Wangchao Le,    Anastasios Kementsietsidis,    Songyun Duan,    Feifei Li
    In Proceedings of 28th IEEE International Conference on Data Engineering (ICDE 2012),  pages 666-677,  Washington DC,  April,  2012.
    Abstract

    This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to extend to account for the RDF data model and the graph query patterns expressed in SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. Comprehensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient, and scalable.

  • Efficient Threshold Monitoring for Distributed Probabilistic Data, Talk
    By Mingwang Tang,    Feifei Li,    Jeff M. Phillips,    Jeffrey Jestes
    In Proceedings of 28th IEEE International Conference on Data Engineering (ICDE 2012),  pages 1120-1131,  Washington DC,  April,  2012.
    Abstract

    In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, which is commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data that emerge in many applications, where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more. When dealing with probabilistic data, there are two thresholds involved, the score and the probability thresholds, and one must monitor both simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms significantly outperform the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several large real datasets.

  • Efficient Parallel kNN Joins for Large Data in MapReduce (Project Website), Talk
    By Chi Zhang,    Feifei Li,    Jeffrey Jestes
    In Proceedings of 15th International Conference on Extending Database Technology (EDBT 2012),  pages 38-49,  March,  2012.
    Abstract

    In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which produces the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce. Hence, how to execute kNN joins efficiently on large data stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments on large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.

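    A single-machine mimic of the block nested loop baseline conveys the map/shuffle/reduce structure of a kNN join (function names and the hash partitioning here are illustrative only; the paper's z-order-based algorithms avoid this full block replication):

        import numpy as np
        from collections import defaultdict

        def map_phase(R, S, nr, ns):
            # Block nested loop: every (R-block, S-block) pair meets at one
            # reducer, so each R row is replicated ns times (S rows nr times).
            jobs = defaultdict(lambda: ([], []))
            for i, r in enumerate(R):
                for b in range(ns):
                    jobs[(i % nr, b)][0].append((i, r))
            for j, s in enumerate(S):
                for a in range(nr):
                    jobs[(a, j % ns)][1].append(s)
            return jobs

        def reduce_phase(jobs, k):
            # Brute-force kNN inside each block pair, then merge per R row.
            cand = defaultdict(list)
            for Rblk, Sblk in jobs.values():
                for i, r in Rblk:
                    dists = sorted(float(np.linalg.norm(s - r)) for s in Sblk)
                    cand[i].extend(dists[:k])
            return {i: sorted(ds)[:k] for i, ds in cand.items()}

        rng = np.random.default_rng(1)
        R, S = rng.random((6, 2)), rng.random((40, 2))
        print(reduce_phase(map_phase(R, S, 2, 3), k=3)[0])   # 3-NN distances of r_0
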
    2011

  • Multi-Approximate-Keyword Routing in GIS Data (Project Website), Talk
    By Bin Yao,    Mingwang Tang,    Feifei Li
    In Proceedings of 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS 2011),  pages 201-210,  Chicago, USA,  November,  2011.
    Abstract

    GIS data usually consist of both spatial and textual information, where the spatial component represents the location of the object and the textual element contains a set of strings describing the object in that location. For GIS data situated on a road network, shortest path search is a basic operation. In practice, however, users are often interested in routing where certain constraints on the textual information are also incorporated. This work complements the standard shortest path search with multiple keywords and an approximate string similarity function, where the goal is to find the shortest path that passes through at least one matching object per keyword; we dub this problem the multi-approximate-keyword routing (makr) query. We present both exact and approximate solutions. When the number κ of query keywords is small (e.g., κ ≤ 6), the exact solution works efficiently. However, when κ increases, it becomes increasingly expensive (especially on large GIS data). In this case, our approximate methods achieve superb query efficiency, excellent scalability, and high approximation quality, as indicated in our extensive experiments on large, real datasets (up to 2 million points on road networks with hundreds of thousands of nodes and edges). We also prove that one approximate method has a κ-approximation in the worst case.

  • Flexible Aggregate Similarity Search, Talk
    By Yang Li,    Feifei Li,    Ke Yi,    Bin Yao,    Min Wang
    In Proceedings of 30th ACM SIGMOD International Conference on Management of Data (SIGMOD 2011),  pages 1009-1020,  Athens, Greece,  June,  2011.
    Abstract

    Aggregate similarity search, a.k.a. aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves the most (or top-k) similar object to Q from a database P, where the similarity is an aggregation (e.g., sum, max) of the distances between the retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of φM objects in Q for some support 0 < φ ≤ 1. We call this new definition flexible aggregate similarity (Fann) search, which generalizes the Ann problem. Next, we present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing: they are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimension. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

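    The flexible semantics fit in one brute-force function (semantics only; the paper's fast exact and approximate algorithms are the actual contribution):

        import numpy as np

        def fann(P, Q, phi, k=1, agg=np.sum):
            # Score p by aggregating its distances to the ceil(phi*|Q|)
            # closest query points only, then return the k best objects.
            m = int(np.ceil(phi * len(Q)))
            def score(p):
                d = np.sort(np.linalg.norm(Q - p, axis=1))
                return agg(d[:m])
            return sorted(P, key=score)[:k]

        rng = np.random.default_rng(2)
        P, Q = rng.random((100, 2)), rng.random((10, 2))
        print(fann(P, Q, phi=0.3, k=1))              # sum over 3 nearest of Q
        print(fann(P, Q, phi=0.3, k=1, agg=np.max))  # max variant
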
  • Optimal Location Queries in Road Network Databases, Talk
    By Xiaokui Xiao,    Bin Yao,    Feifei Li
    In Proceedings of 27th IEEE International Conference on Data Engineering (ICDE 2011),  pages 804-815,  Hannover, Germany,  April,  2011.
    Abstract

    Optimal location (OL) queries are a type of spatial queries particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an Lp space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their Lp distance. Motivated by the deficiency of the existing techniques, this paper presents the first study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We demonstrate the efficiency of our solutions through extensive experiments with real data.

  • Rewriting Queries on SPARQL Views, Talk
    By Wangchao Le,    Songyun Duan,    Anastasios Kementsietsidis,    Feifei Li,    Min Wang
    In Proceedings of 20th International World Wide Web Conference (WWW 2011),  pages 655-664,  Hyderabad, India,  March,  2011.
    Abstract

    The problem of answering SPARQL queries over virtual SPARQL views is commonly encountered in a number of settings, including while enforcing security policies to access RDF data, or when integrating RDF data from disparate sources. We approach this problem by rewriting SPARQL queries over the views to equivalent queries over the underlying RDF data, thus avoiding the costs entailed by view materialization and maintenance. We show that SPARQL query rewriting combines the most challenging aspects of rewriting for the relational and XML cases: like the relational case, SPARQL query rewriting requires synthesizing multiple views; like the XML case, the size of the rewritten query is exponential in the size of the query and the views. In this paper, we present the first native query rewriting algorithm for SPARQL. For an input SPARQL query over a set of virtual SPARQL views, the rewritten query resembles a union of conjunctive queries and can be of exponential size. We propose optimizations over the basic rewriting algorithm to (i) minimize each conjunctive query in the union; (ii) eliminate conjunctive queries with empty results from evaluation; and (iii) efficiently prune out big portions of the search space of empty rewritings. The experiments, performed on two RDF stores, show that our algorithms are scalable and independent of the underlying RDF stores. Furthermore, our optimizations yield order-of-magnitude improvements over the basic rewriting algorithm in both the rewriting size and the evaluation time.

    2010

  • Probabilistic String Similarity Joins (Project Website), Talk
    By Jeffrey Jestes,    Feifei Li,    Zhepeng Yan,    Ke Yi
    In Proceedings of 29th ACM SIGMOD International Conference on Management of Data (SIGMOD 2010),  pages 327-338,  Indianapolis, Indiana,  June,  2010.
    Abstract

    Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted to processing (deterministic) string joins and managing probabilistic data, respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.

  • Logging Every Footstep: Quantile Summaries for the Entire History
    By Yufei Tao,    Ke Yi,    Cheng Sheng,    Jian Pei,    Feifei Li
    In Proceedings of 29th ACM SIGMOD International Conference on Management of Data (SIGMOD 2010),  pages 639-650,  Indianapolis, Indiana,  June,  2010.
    Abstract

    Quantiles are a crucial type of order statistics in databases. Extensive research has been focused on maintaining a space-efficient structure for approximate quantile computation as the underlying dataset is updated. The existing solutions, however, are designed to support only the current, most-updated snapshot of the dataset. Queries on past versions of the data cannot be answered. This paper studies the problem of historical quantile search. The objective is to enable ε-approximate quantile retrieval on any snapshot of the dataset in history. The problem is very important in analyzing the evolution of a distribution, monitoring the quality of services, query optimization in temporal databases, and so on. We present the first formal results in the literature. First, we prove a novel theoretical lower bound on the space cost of supporting ε-approximate historical quantile queries. The bound reveals the fundamental difference between answering quantile queries about the past and those about the present time. Second, we propose a structure for finding ε-approximate historical quantiles, and show that it consumes more space than the lower bound by only a squared logarithmic factor. Extensive experiments demonstrate that in practice our technique performs much better than predicted by theory. In particular, the quantiles it returns are remarkably more accurate than the theoretical precision guarantee.

  • Approximate String Search in Spatial Databases (Project Website), Talk
    By Bin Yao,    Feifei Li,    Marios Hadjieleftheriou,    Kun Hou
    In Proceedings of 26th IEEE International Conference on Data Engineering (ICDE 2010),  pages 4-15,  Long Beach, California,  March,  2010.
    Abstract

    This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. The MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.

  • K Nearest Neighbor Queries and kNN-Joins in Large Relational Databases (Almost) for Free, Talk
    By Bin Yao,    Feifei Li,    Piyush Kumar
    In Proceedings of 26th IEEE International Conference on Data Engineering (ICDE 2010),  pages 545-556,  Long Beach, California,  March,  2010.
    Abstract

    Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join), are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses user-defined functions (UDFs), which a query optimizer cannot optimize. We design algorithms that can be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only a logarithmic number of page accesses in expectation, with a constant approximation ratio, and it can be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.

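    A sketch of the z-order/random-shift idea in Python (hypothetical scaffolding under the assumption of 2-d points in [0,1)^2; in the paper this logic is expressed purely in SQL over a sorted z-value column so the optimizer can use a B-tree):

        import numpy as np

        def zvalue(point, bits=12):
            # Morton/z-order value: interleave the bits of the scaled coords.
            x, y = (int(c * (1 << bits)) for c in point)
            z = 0
            for b in range(bits):
                z |= ((x >> b) & 1) << (2 * b + 1)
                z |= ((y >> b) & 1) << (2 * b)
            return z

        def approx_knn(data, q, k, shifts=2, bits=12, seed=0):
            # For each random shift, sort by z-value (the "index"), take the
            # 2k z-order neighbors of q, then rank the union by true distance.
            rng = np.random.default_rng(seed)
            cand = set()
            for _ in range(shifts):
                v = rng.random(2) * 0.5                  # random translation
                zs = np.array([zvalue((p + v) / 1.5, bits) for p in data])
                order = np.argsort(zs)
                pos = np.searchsorted(zs[order], zvalue((q + v) / 1.5, bits))
                cand |= set(order[max(0, pos - 2 * k): pos + 2 * k].tolist())
            return sorted(cand, key=lambda i: np.linalg.norm(data[i] - q))[:k]

        rng = np.random.default_rng(3)
        data, q = rng.random((1000, 2)), rng.random(2)
        print(approx_knn(data, q, k=5))   # indexes of (approximate) 5-NN
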
  • Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA (Project Website)
    By Feifei Li,    Mirella M. Moro,    Shahram Ghandeharizadeh,    Jayant R. Haritsa,    Gerhard Weikum,    Michael J. Carey,    Fabio Casati,    Edward Y. Chang,    Ioana Manolescu,    Sharad Mehrotra,    Umeshwar Dayal,    Vassilis J. Tsotras
    In Proceedings of ICDE (ICDE 2010),  pages ??-??,  2010.
    Abstract
    2009

  • Ranking Distributed Probabilistic Data (Project Website), Talk
    By Feifei Li,    Ke Yi,    Jeffrey Jestes
    In Proceedings of 28th ACM SIGMOD International Conference on Management of Data (SIGMOD 2009),  pages 361-374,  Providence, USA,  June,  2009.
    Abstract

    Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.

  • Semantics of Ranking Queries for Probabilistic Data and Expected Ranks (Project Website), Talk
    By Graham Cormode,    Feifei Li,    Ke Yi
    In Proceedings of 25th IEEE International Conference on Data Engineering (ICDE 2009),  pages 305-316,  Shanghai, China,  April,  2009.
    Abstract

    When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N), no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach.

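    To make the definition concrete, here is the expected rank under the simplest setting (independent tuple-level uncertainty with distinct scores, and ignoring the paper's careful handling of worlds in which the tuple itself is absent): conditioned on existing, a tuple's expected rank is the total probability mass of higher-scoring tuples, computable by one sort, consistent with the O(N log N) bound quoted above.

        import numpy as np

        def expected_ranks(s, p):
            # Tuple i exists with prob p[i] and has (distinct) score s[i].
            # E[rank of t_i | t_i exists] = sum of p[j] over j: s[j] > s[i].
            order = np.argsort(-np.asarray(s, dtype=float))  # by score, desc
            pref = np.concatenate(([0.0], np.cumsum(np.asarray(p)[order])))
            r = np.empty(len(s))
            for pos, i in enumerate(order):
                r[i] = pref[pos]      # mass of strictly higher-scored tuples
            return r

        print(expected_ranks([90, 80, 70], [0.5, 0.9, 1.0]))   # [0.  0.5 1.4]
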
  • Reverse Furthest Neighbors in Spatial Databases (Project Website), Talk
    By Bin Yao,    Feifei Li,    Piyush Kumar
    In Proceedings of 25th IEEE International Conference on Data Engineering (ICDE 2009),  pages 664-675,  Shanghai, China,  April,  2009.
    Abstract

    Given a set of points P and a query point q, the reverse furthest neighbor (RFN) query fetches the set of points p in P such that q is their furthest neighbor among all points in P ∪ {q}. This is the monochromatic RFN (MRFN) query. Another interesting version of the RFN query is the bichromatic reverse furthest neighbor (BRFN) query. Given a set of points P, a query set Q and a query point q in Q, a BRFN query fetches the set of points p in P such that q is the furthest neighbor of p among all points in Q. The RFN query has many interesting applications in spatial databases and beyond. For instance, given a large residential database (as P) and a set of potential sites (as Q) for building a chemical plant complex, the construction site should be selected as the one that has the maximum number of reverse furthest neighbors. This is an instance of the BRFN query. This paper presents the challenges associated with such queries and proposes efficient, R-tree based algorithms for both monochromatic and bichromatic versions of the RFN queries. We analyze properties of the RFN query that differentiate it from the widely studied reverse nearest neighbor queries and enable the design of novel algorithms. Our approach takes advantage of furthest Voronoi diagrams as well as the convex hulls of either the data set P (in the MRFN case) or the query set Q (in the BRFN case). For the BRFN queries, we also extend the analysis to the situation when Q is large in size and becomes disk-resident. Experiments on both synthetic and real data sets confirm the efficiency and scalability of the proposed algorithms over the brute-force search based approach.

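    The query semantics, stated as brute force (the paper's algorithms replace this linear scan with furthest Voronoi cells and convex hull pruning):

        import numpy as np

        def mrfn(P, q):
            # Monochromatic RFN: points p in P whose furthest neighbor among
            # P plus {q} is q itself (ties resolved in favor of q here).
            P, q = np.asarray(P, float), np.asarray(q, float)
            allpts = np.vstack([P, q])
            keep = []
            for p in P:
                if np.linalg.norm(q - p) >= np.linalg.norm(allpts - p, axis=1).max():
                    keep.append(p)
            return np.array(keep)

        P = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 1.0]])
        print(mrfn(P, q=np.array([5.0, 0.0])))
        # (4,0) is pruned: (0,0) is further from it than q is.
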
  • Improving Transaction-time DBMS Performance and Functionality, Talk
    By David B. Lomet,    Feifei Li
    In Proceedings of 25th IEEE International Conference on Data Engineering (ICDE 2009),  pages 581-591,  Shanghai, China,  April,  2009.
    Abstract

    Immortal DB is a transaction time database system that is built into a commercial database system rather than being layered on top. This enables it to have performance that is very close to the performance of an unversioned current time database system. Achieving such competitive performance is essential for wide acceptance of this temporal functionality. In this paper we describe further performance improvements in two critical dimensions. First, Immortal DB range search performance is improved for current time data via improved current version storage utilization, making this performance essentially the same as unversioned performance. Second, Immortal DB update performance is increased by further reducing the cost of timestamping versions. Finally, we show how a simple modification, integrated into the timestamping mechanism, can provide a foundation for auditing database activity. Our algorithms have been incorporated into a commercial database engine, and experiments using this database engine demonstrate the effectiveness of our approach.

  • A Concise Representation of Range Queries
    By Ke Yi,    Xiang Lian,    Feifei Li,    Lei Chen
    In Proceedings of 25th IEEE International Conference on Data Engineering (ICDE 2009),  pages 1179-1182,  Shanghai, China,  April,  2009.
    Abstract

    With the advance of wireless communication technology, it is quite common for people to view maps or get related services from handheld devices, such as mobile phones and PDAs. Range queries, as one of the most commonly used tools, are often posed by the users to retrieve needed information from a spatial database. However, due to the limits of communication bandwidth and hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communication efficient nor informative to the users. This is simply because there are often too many results returned from a range query. In view of this problem, we present a novel idea: a concise representation of a specified size for the range query results, while incurring minimal information loss, shall be computed and returned to the user. Such a concise range query not only reduces communication costs, but also offers better usability to the users, providing an opportunity for interactive exploration. The usefulness of concise range queries is confirmed by comparing them with other possible alternatives, such as sampling and clustering. We then propose algorithms to find a good concise representation.

    2008

  • Finding Frequent Items in Probabilistic Data (Project Website), Talk
    By Qin Zhang,    Feifei Li,    Ke Yi
    In Proceedings of 27th ACM SIGMOD International Conference on Management of Data (SIGMOD 2008),  pages 819-832,  Vancouver, Canada,  June,  2008.
    Abstract

    Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identify the frequent items in a probabilistic data set is to simply compute the expected frequency of an item and decide if it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. Thus, we propose a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items based on confidence as well. Finding likely frequent items in probabilistic data turns out to be much more difficult. We first propose exact algorithms for offline data with either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data to find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of our algorithms using both real and synthetic data sets with extensive experimental evaluations.

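    The possible-world definition can be checked directly by Monte Carlo, close in spirit to the sampling route the abstract describes (parameter names here are illustrative, and the paper's algorithms come with explicit probability and accuracy guarantees that this naive version lacks):

        import numpy as np
        from collections import Counter

        def likely_frequent(items, probs, phi, tau, trials=2000, seed=0):
            # Sample worlds: occurrence i survives with prob probs[i].
            # Report items phi-frequent in >= tau fraction of worlds.
            rng = np.random.default_rng(seed)
            probs = np.asarray(probs)
            hits = Counter()
            for _ in range(trials):
                mask = rng.random(len(items)) < probs
                world = [it for it, m in zip(items, mask) if m]
                for it, c in Counter(world).items():
                    if c >= phi * len(world):
                        hits[it] += 1
            return sorted(it for it, h in hits.items() if h >= tau * trials)

        items = ["a", "a", "b", "c", "a", "b"]
        probs = [0.9, 0.8, 0.5, 0.1, 0.9, 0.4]
        print(likely_frequent(items, probs, phi=0.4, tau=0.8))
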
  • Efficient Processing of Top-k Queries in Uncertain Databases (Project Website), Talk
    By Ke Yi,    Feifei Li,    George Kollios,    Divesh Srivastava
    In Proceedings of 24th IEEE International Conference on Data Engineering (ICDE 2008),  pages 1406-1408,  Cancun, Mexico,  April,  2008.
    Abstract

    This work introduces novel polynomial-time algorithms for processing top-k queries in uncertain databases, under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both running time and memory usage. Focusing on the single-alternative case, the new algorithms are orders of magnitude faster.

  • Randomized Synopses for Query Assurance on Data Streams (Project Website), Talk
    By Ke Yi,    Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Divesh Srivastava
    In Proceedings of 24th IEEE International Conference on Data Engineering (ICDE 2008),  pages 416-425,  Cancun, Mexico,  April,  2008.
    Abstract

    The overwhelming flow of information in many data stream applications forces many companies to outsource to a third-party the deployment of a Data Stream Management System (DSMS) for performing desired computations. Remote computations intrinsically raise issues of trust, making query execution assurance on data streams a problem with practical implications. Consider a client observing the same data stream as a remote server (e.g., network traffic), that registers a continuous query on the server’s DSMS, and receives answers upon request. The client needs to verify the integrity of the results using significantly fewer resources than evaluating the query locally. Towards that goal, we propose a probabilistic algorithm for selection and aggregate/group-by queries, that uses constant space irrespective of the result-set size, has low update cost, and arbitrarily small probability of failure. We generalize this algorithm to allow some tolerance on the number of errors permitted (irrespective of error magnitude), and also discuss the hardness of permitting arbitrary errors of small magnitude. We also perform an empirical evaluation using live network traffic.

    2007

  • Time Series Compressibility and Privacy, Talk
    By Spiros Papadimitriou,    Feifei Li,    George Kollios,    Philip S. Yu
    In Proceedings of 33rd International Conference on Very Large Databases (VLDB 2007),  pages 459-470,  Vienna, Austria,  September,  2007.
    Abstract

    In this paper we study the trade-offs between time series compressibility and partial information hiding and their fundamental implications on how we should introduce uncertainty about individual values by perturbing them. More specifically, if the perturbation does not have the same compressibility properties as the original data, then it can be detected and filtered out, reducing uncertainty. Thus, by making the perturbation “similar” to the original data, we can both preserve the structure of the data better, while simultaneously making breaches harder. However, as data become more compressible, a fraction of the uncertainty can be removed if true values are leaked, revealing how they were perturbed. We formalize these notions, study the above trade-offs on real data and develop practical schemes which strike a good balance and can also be extended for on-the-fly data hiding in a streaming environment.

  • Proof-Infused Streams: Enabling Authentication of Sliding Window Queries On Streams, Talk
    By Feifei Li,    Ke Yi,    Marios Hadjieleftheriou,    George Kollios
    In Proceedings of 33rd International Conference on Very Large Databases (VLDB 2007),  pages 147-158,  Vienna, Austria,  September,  2007.
    Abstract

    As computer systems are essential components of many critical commercial services, the need for secure online transactions is now becoming evident. The demand for such applications, as the market grows, exceeds the capacity of individual businesses to provide fast and reliable services, making outsourcing technologies a key player in alleviating issues of scale. Consider a stock broker that needs to provide a real-time stock trading monitoring service to clients. Since the cost of multicasting this information to a large audience might become prohibitive, the broker could outsource the stock feed to third-party providers, who are in turn responsible for forwarding the appropriate sub-feed to clients. Evidently, in critical applications the integrity of the third-party should not be taken for granted. In this work we study a variety of authentication algorithms for selection and aggregation queries over sliding windows. Our algorithms enable the end-users to prove that the results provided by the third-party are correct, i.e., equal to the results that would have been computed by the original provider. Our solutions are based on Merkle hash trees over a forest of space partitioning data structures, and try to leverage key features, like update, query, signing, and authentication costs. We present detailed theoretical analysis for our solutions and empirically evaluate the proposed techniques.

  • Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking, Talk
    By Feifei Li,    Jimeng Sun,    Spiros Papadimitriou,    George Mihaila,    Ioana Stanoi
    In Proceedings of 23rd IEEE International Conference on Data Engineering (ICDE 2007),  pages 686-695,  April,  2007.
    Abstract

    We address the problem of preserving privacy in streams, which has received surprisingly limited attention. For static data, a well-studied and widely used approach is based on random perturbation of the data values. However, streams pose additional challenges. First, analysis of the data has to be performed incrementally, using limited processing time and buffer space, making batch approaches unsuitable. Second, the characteristics of streams evolve over time. Consequently, approaches based on global analysis of the data are not adequate. We show that it is possible to efficiently and effectively track the correlation and autocorrelation structure of multivariate streams and leverage it to add noise which maximally preserves privacy, in the sense that it is very hard to remove. Our techniques achieve much better results than previous static, global approaches, while requiring limited processing time and memory. We provide both a mathematical analysis and experimental evaluation on real data to validate the correctness, efficiency, and effectiveness of our algorithms.

    2006

  • Dynamic Authenticated Index Structures for Outsourced Databases (Project Website), Talk
    By Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Leonid Reyzin
    In Proceedings of 25th ACM SIGMOD International Conference on Management of Data (SIGMOD 2006),  pages 121-132,  Chicago, USA,  June,  2006.
    Abstract

    In outsourced database (ODB) systems the database owner publishes its data through a number of remote servers, with the goal of enabling clients at the edge of the network to access and query the data more efficiently. As servers might be untrusted or can be compromised, query authentication becomes an essential component of ODB systems. Existing solutions for this problem concentrate mostly on static scenarios and are based on idealistic properties for certain cryptographic primitives. In this work, first we define a variety of essential and practical cost metrics associated with ODB systems. Then, we analytically evaluate a number of different approaches, in search for a solution that best leverages all metrics. Most importantly, we look at solutions that can handle dynamic scenarios, where owners periodically update the data residing at the servers. Finally, we discuss query freshness, a new dimension in data authentication that has not been explored before. A comprehensive experimental evaluation of the proposed and existing approaches is used to validate the analytical models and verify our claims. Our findings exhibit that the proposed solutions improve performance substantially over existing approaches, both for static and dynamic environments.

  • Characterizing and Exploiting Reference Locality in Data Stream Applications (Project Website), Talk
    By Feifei Li,    Ching Chang,    George Kollios,    Azer Bestavros
    In Proceedings of 22nd IEEE International Conference on Data Engineering (ICDE 2006),  pages 81-92,  Atlanta, Georgia,  April,  2006.
    Abstract

    In this paper, we investigate a new approach to process queries in data stream applications. We show that reference locality characteristics of data streams could be exploited in the design of superior and flexible data stream query processing techniques. We identify two different causes of reference locality: popularity over long time scales and temporal correlations over shorter time scales. An elegant mathematical model is shown to precisely quantify the degree of those sources of locality. Furthermore, we analyze the impact of locality-awareness on achievable performance gains over traditional algorithms on applications such as MAX-subset approximate sliding window join and approximate count estimation. In a comprehensive experimental study, we compare several existing algorithms against our locality-aware algorithms over a number of real datasets. The results validate the usefulness and efficiency of our approach.

    2005

  • On Trip Planning Queries in Spatial Databases (Project Website), Talk
    By Feifei Li,    Dihan Cheng,    Marios Hadjieleftheriou,    George Kollios,    Shang-Hua Teng
    In Proceedings of 9th International Symposium on Spatial and Temporal Databases (SSTD 2005),  pages 273-290,  Angra dos Reis, Brazil,  August,  2005.
    Abstract

    In this paper we discuss a new type of query in Spatial Databases, called the Trip Planning Query (TPQ). Given a set of points of interest P in space, where each point belongs to a specific category, a starting point S and a destination E, TPQ retrieves the best trip that starts at S, passes through at least one point from each category, and ends at E. For example, a driver traveling from Boston to Providence might want to stop at a gas station, a bank, and a post office along the way, and the goal is to provide him with the best possible route (in terms of distance, traffic, road conditions, etc.). The difficulty of this query lies in the existence of multiple choices per category. In this paper, we study fast approximation algorithms for TPQ in a metric space. We provide a number of approximation algorithms with approximation ratios that depend on either the number of categories, the maximum number of points per category, or both. Therefore, for different instances of the problem, we can choose the algorithm with the best approximation ratio, since they all run in polynomial time. Furthermore, we use some of the proposed algorithms to derive efficient heuristics for large datasets stored in external memory. Finally, we give an experimental evaluation of the proposed algorithms using both synthetic and real datasets.

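    One of the simplest approximation strategies in this family, a nearest-neighbor greedy, fits in a few lines (a sketch only; category names below are made up, and the paper analyzes several algorithms with better ratios):

        import numpy as np

        def greedy_trip(start, end, categories):
            # From the current location, repeatedly visit the nearest point
            # of a not-yet-covered category; finish at the destination.
            cur = np.asarray(start, float)
            route = [tuple(cur)]
            todo = {c: np.asarray(pts, float) for c, pts in categories.items()}
            while todo:
                c, p = min(((c, p) for c, pts in todo.items() for p in pts),
                           key=lambda cp: np.linalg.norm(cp[1] - cur))
                route.append(tuple(p))
                cur = p
                del todo[c]
            route.append(tuple(np.asarray(end, float)))
            return route

        cats = {"gas": [(1, 1), (5, 0)], "bank": [(2, 0)], "post": [(3, 1)]}
        print(greedy_trip((0, 0), (6, 0), cats))
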
    2004

  • Approximate Aggregation Techniques for Sensor Databases (Project Website), Talk
    By Jeffrey Considine,    Feifei Li,    George Kollios,    John Byers
    In Proceedings of 20th IEEE International Conference on Data Engineering (ICDE 2004),  pages 449-460,  Boston, MA,  March,  2004.
    Abstract

    In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, exemplified by TinyDB and Cougar, which allow users to perform aggregation queries such as MIN, COUNT and AVG on a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. However, these aggregation strategies become bandwidth-intensive when combined with the fault-tolerant, multi-path routing methods often used in these environments. For example, duplicate-sensitive aggregates such as SUM cannot be computed exactly using substantially less bandwidth than explicit enumeration. To avoid this expense, we investigate the use of approximate in-network aggregation using small sketches. Our contributions are as follows: 1) we generalize well known duplicate-insensitive sketches for approximating COUNT to handle SUM, 2) we present and analyze methods for using sketches to produce accurate results with low communication and computation overhead, and 3) we present an extensive experimental validation of our methods.

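    A minimal Flajolet-Martin-style sketch shows why in-network aggregation tolerates multi-path duplication: merging is a bitwise OR, so re-sent partial aggregates do not double count. This is a rough sketch only; the paper's O(log value) insertion trick is replaced here by naive enumeration, and a single sketch is coarse (practical deployments average many independent sketches):

        import hashlib

        def lowest_bit(key, seed=0):
            # FM level: index of the lowest set bit of a hash of the key.
            h = int(hashlib.blake2b(f"{seed}:{key}".encode(),
                                    digest_size=8).hexdigest(), 16)
            return (h & -h).bit_length() - 1 if h else 63

        class FMSketch:
            def __init__(self):
                self.bits = 0
            def add(self, key):
                self.bits |= 1 << lowest_bit(key)
            def merge(self, other):
                self.bits |= other.bits        # duplicate-insensitive union
            def estimate(self):
                r = 0                          # first zero bit position
                while (self.bits >> r) & 1:
                    r += 1
                return 2 ** r / 0.77351        # FM correction constant

        # SUM of nonnegative integers = COUNT of distinct (node, j) keys.
        s = FMSketch()
        for node, val in [("n1", 5), ("n2", 3), ("n2", 3)]:   # n2 re-sent
            for j in range(1, val + 1):
                s.add((node, j))
        print(s.estimate())   # near 8 (coarse, power-of-two resolution)
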
  • Spatio-Temporal Aggregation Using Sketches (Project Website), Talk
    By Yufei Tao,    George Kollios,    Jeffrey Considine,    Feifei Li,    Dimitris Papadias
    In Proceedings of 20th IEEE International Conference on Data Engineering (ICDE 2004),  pages 214-226,  Boston, MA,  March,  2004.
    Abstract

    Several spatio-temporal applications require the retrieval of summarized information about moving objects that lie in a query region during a query interval (e.g., the number of mobile users covered by a cell, traffic volume in a district, etc.). Existing solutions have the distinct counting problem: if an object remains in the query region for several timestamps during the query interval, it will be counted multiple times in the result. The paper solves this problem by integrating spatio-temporal indexes with sketches, traditionally used for approximate query processing. The proposed techniques can also be applied to reduce the space requirements of conventional spatiotemporal data and to mine spatio-temporal association rules.

    2002

  • WICCAP Data Model: Mapping Physical Websites to Logical Views
    By Zehua Liu,    Feifei Li,    Wee Keong Ng
    In Proceedings of 21st International Conference on Conceptual Modeling, Springer (ER 2002),  pages 120-134,  Tampere, Finland,  October,  2002.
    Abstract

    Information sources over the WWW contain a large amount of data organized according to different interests and values. Thus, it is important that facilities exist to enable users to extract information of interest in a simple and effective manner. To do this, information from the Web sources needs to be extracted automatically according to users' interests. However, the extraction of information requires in-depth knowledge of relevant technologies, and the extraction process is slow, tedious and difficult for ordinary users. We propose the Wiccap Data Model, an XML data model that maps Web information sources into commonly perceived logical models. Based on this data model, ordinary users are able to extract information easily and efficiently. To accelerate the creation of data models, we also define a formal process for creating such data models and have implemented a software tool to facilitate and automate the process of producing Wiccap Data Models.

    2001

  • Web Information Collection, Collaging and Programming
    By Feifei Li,    Zehua Liu,    Yangfeng Huang,    Wee Keong Ng
    In Proceedings of the 3rd IEEE International Conference on Information, Communications & Signal Processing (ICICS 2001),  pages ??-??,  Singapore,  October 15-18,  2001.
    Abstract
Workshop

    2002

  • A Visual Tool for Building Logical Data Models of Websites
    by Zehua Liu,    Wee Keong Ng,    Feifei Li,    Ee-Peng Lim
    In Proceedings of the Fourth International Workshop on Web Information and Data Management (WIDM'02), in conjunction with the Eleventh International Conference on Information and Knowledge Management (CIKM 2002), pages 92-95, McLean, Virginia, USA,  November,  2002.  ACM Press.
    Abstract

    Information sources over the WWW contain a large amount of data organized according to different interests and values. Thus, it is important that facilities are there to enable users to extract information of interest in a simple and effective manner. To do this, We propose the Wiccap Data Model, an XML data model that maps Web information sources into commonly perceived logical models, so that information can be extracted automatically according to users' interests. To accelerate the creation of data models, we have implemented a visual tool, called the Mapping Wizard, to facilitate and automate the process of producing Wiccap Data Models. Using the tool, the time required to construct a logical data model for a given website is significantly reduced.

    2001

  • An Information Concierge for the Web
    by Feifei Li,    Zehua Liu,    Yangfeng Huang,    Wee Keong Ng
    In Proceedings of the First International Workshop on Internet Bots: Systems and Applications (INBOSA2001), in conjunction with the 12th International Conference on Database and Expert System Applications (DEXA'2001), pages 672-676, Munich, Germany,  September,  2001.  IEEE Computer Society.
    Abstract

    The WWW Information Collection, Collaging and Programming (WICCAP) system is a software system for the generation of logical views of web resources and the extraction of the desired information into a structured document. It is designed to enable people to obtain information of interest in a simple and effective manner, as well as to make information from the WWW accessible to applications, in order to offer automation, inter-operation and Web-awareness among services. A key factor in making this system useful in practice is that it provides tools to automate and facilitate the process of constructing the logical representation of Web sites, defining the information of interest and subsequently retrieving it. In this work, we present the design of the WICCAP system and its two main components, namely the Mapping Wizard and the Network Extraction Agent.

Tech Report

    2007

  • Efficient Processing of Top-k Queries on Uncertain Databases (Project Website)
    By Ke Yi,    Feifei Li,    Divesh Srivastava,    George Kollios
    Technical Report,  June,  2007.
    Abstract
  • Randomized Synopses for Query Assurance on Data Streams (Project Website)
    By Ke Yi,    Feifei Li,    Marios Hadjieleftheriou,    Divesh Srivastava,    George Kollios
    AT&T Technical Report,  June,  2007.
    Abstract
    2006

  • Authenticated Index Structures for Aggregation Queries in Outsourced Databases (Project Website)
    By Feifei Li,    Marios Hadjieleftheriou,    George Kollios,    Leonid Reyzin
    BUCS-TR-2006-011,  2006.
    Abstract
    2004

  • On Trip Planning Queries in Spatial Databases (Project Website)
    By Feifei Li,    Dihan Cheng,    Marios Hadjieleftheriou,    George Kollios,    Shang-Hua Teng
    BUCS-TR-2004-027,  2004.
    Abstract
  • GreedyDual-Join: Locality-Aware Buffer Management for Approximate Join Processing Over Data Streams
    By Feifei Li,    Ching Chang,    Azer Bestavros,    George Kollios
    BUCS-TR-2004-028,  2004.
    Abstract