Dong Xie
PhD student. Research Interest: big spatial data systems. Distributed systems.


Book Chapter  Journal  Conference  Workshop  Tech Report]

Journal

2019

  • SolarDB: Towards a Shared-Everything Database on Distributed Log-Structured Storage
    By Tao Zhu,    Zhuoyue Zhao,    Feifei Li,    Weining Qian,    Aoying Zhou,    Dong Xie,    Ryan Stutsman,    Haining Li,    Huiqi HU
    Vol.25, Pages 11:1-11:26, ACM Transactions on Storage (ACM TOS),  June ,  2019.
    Abstract

    Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Though modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This paper presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include: 1) a shared-everything architecture based on a two-layer log-structured merge-tree; 2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; 3) fine-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank and a real-world workload, SolarDB outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.

  • 2016

  • Exact and Approximate Flexible Aggregate Similarity Search
    By Feifei Li,    Ke Yi,    Yufei Tao,    Bin Yao,    Yang Li,    Dong Xie,    Min Wang
    Vol.25(4), Pages 317-338, The International Journal on Very Large Data Bases (VLDBJ),  2016.
    Abstract

    Aggregate similarity search, also known as aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the simi- larity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggre- gation over the distances between p and any subset of phi objects in Q for some support 0 < phi le 1. We call this new definition flexible aggregate similarity search, and accordingly refer to a query as a flexible aggre- gate nearest neighbor (Fann) query. We present algo- rithms for answering Fann queries exactly and approx- imately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experi- ments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.

  • Conference

    2019

  • Closing the Gap Between Serverless and its State with Storage Functions
    By Tian Zhang,    Dong Xie,    Chinmay Kulkarni,    Ryan Stutsman
    In Proceedings of 10th ACM Symposium of Cloud Computing (SOCC 2019),  pages ??-??,  November,  2019.
    Abstract

    Serverless computing has gained attention due to its fine-grained provisioning, large-scale multi-tenancy, and on-demand scaling. However, it also forces applications to externalize state in remote storage, adding substantial overheads. To fix this “data shipping problem” we built Shredder, a low-latency multi-tenant cloud store that allows small units of computation to be performed directly within storage nodes. Storage tenants provide Shredder with JavaScript functions (or WebAssembly programs), which can interact directly with data without moving them over the network. The key challenge in Shredder is safely isolating thousands of tenant storage functions while minimizing data interaction costs. Shredder uses a unique approach where its data store and network- ing paths are implemented in native code to ensure performance, while isolated tenant functions interact with data using a V8-specific intermediate representation that avoids expensive cross-protection- domain calls and data copying. As a result, Shredder can execute 4 million remotely-invoked tenant functions per second spread over thousands of tenants with median and 99th-percentile response la- tencies of less than 50 μs and 500 μs, respectively. Our evaluation shows that Shredder achieves a 14% to 78% speedup against conven- tional remote storage when fetching items with just one to three data dependencies between them. We also demonstrate Shredder’s effectiveness in accelerating data-intensive applications, including a k-hop query on social graphs that shows orders of magnitude gain.

  • 2018

  • Solar: Towards a Shared-Everything Database on Distributed Log-Structured Storage, Talk
    By T.ao Zhu,    Zhuoyue Zhao,    Feifei Li,    Weining Qian,    Aoying Zhou,    Dong Xie,    Ryan Stutsman,    Huiqi Li,    Huihing Hu
    In Proceedings of In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 2018),  pages 795-807,  August,  2018.
    Abstract

    Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Though modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This paper presents Solar, a distributed relational database system that has been successfully deployed at a large commercial bank. The key features of Solar include: 1) a shared-everything architecture based on a two-layer log-structured merge-tree; 2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; 3) fine-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank and a real-world workload, Solar outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.

  • 2017

  • Distributed Trajectory Similarity Search
    By Dong Xie,    Feifei Li,    Jeff Phillips
    In Proceedings of 43rd International Conference on Very Large Data Bases (VLDB 2017),  pages 1478-1489,  September,  2017.
    Abstract

    Mobile and sensing devices have already become ubiquitous. They have made tracking moving objects an easy task. As a result, mobile applications like Uber and many IoT projects have generated massive amounts of trajectory data that can no longer be processed by a single machine efficiently. Among the typical query operations over trajectories, similarity search is a common yet expensive operator in querying trajectory data. It is useful for applications in different domains such as traffic and transportation optimizations, weather forecast and modeling, and sports analytics. It is also a fundamental operator for many important mining operations such as clustering and classification of trajectories. In this paper, we propose a distributed query framework to process trajectory similarity search over a large set of trajectories. We have implemented the proposed framework in Spark, a popular distributed data processing engine, by carefully considering different design choices. Our query framework supports both the Hausdorff distance the Frchet distance. Extensive experiments have demonstrated the excellent scalability and query efficiency achieved by our design, compared to other methods and design alternatives.

  • 2016

  • Simba: Spatial In-Memory Big Data Analytics (Project Website)
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Zhongpu Chen,    Minyi Guo
    In Proceedings of In Proceedings of 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2016),  pages TBA,  San Francisco, USA,  November,  2016.
    Abstract

    We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba natively extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and DataFrame API. It enables the construction of indexes over RDDs inside the engine in order to work with big spatial data and complex spatial operations. Simba also comes with an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput in big spatial data analysis. This demonstration proposal describes key ideas in the design of Simba, and presents a demonstration plan.

  • Oblivious RAM: A Dissection and Experimental Evaluation (Project Website), Talk
    By Zhao Chang,    Dong Xie,    Feifei Li
    In Proceedings of Very Large Data Bases (VLDB 2016),  pages 1113-1124,  New Delhi, India,  September,  2016.
    Abstract

    Many companies choose the cloud as their data and IT infrastructure platform. The remote access of the data brings the issue of trust, and the potential risk of compromising sensitive information should not be underestimated. Despite the use of strong encryption schemes, adversaries can still learn valuable information regarding encrypted data by observing the data access patterns. To that end, one can hide the access patterns, which may leak sensitive information, using Oblivious RAMs (ORAMs). Numerous works have proposed different ORAM constructions. Nevertheless, many such ORAM constructions are of only theoretical interest, hence, are notuseful in practice. Several more practical ORAM constructions do exist, but they have never been thoroughly compared against and tested on large databases. There are no open source implementation of these schemes, making such a study challenging to carry out (since most ORAMs are quite contrived in terms of both theoretical analysis and practical implementations).These limitations make it difficult for researchers and practitioners to choose and adopt a suitable ORAM for their applications. To address this issue, we provide a thorough study over several practical ORAM constructions, and implement them under the same library. We perform extensive experiments to provide insights into their performance characteristics with respect to efficiency, scalability, and communication cost. Lastly, we plan to release our ORAM implementations through GitHub so that the communities at large may benefit from and contribute to an open source ORAM library under one unified framework.

  • Simba: Efficient In-Memory Spatial Analytics (Project Website), Talk
    By Dong Xie,    Feifei Li,    Bin Yao,    Gefei Li,    Liang Zhou,    Minyi Guo
    In Proceedings of 35th ACM SIGMOD International Conference on Management of Data (SIGMOD 2016),  pages 1071-1085,  June,  2016.
    Abstract

    Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces the concept and construction of indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba%u2019s superior performance compared against other spatial analytics system.