Mingwang Tang
PhD Student. Graduated in Summer 2014. Current Employment: Uber.



Journal

2013

  • Spatial Approximate String Search
    By Feifei Li,    Bin Yao,    Mingwang Tang,    Marios Hadjieleftheriou
    Vol.25(6), Pages 1394-1409, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  May,  2013.
    Abstract

    This work deals with the approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. The RSASSOL combines the q-gram based inverted lists and the reference nodes based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
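
    The min-wise signature idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the MHR-tree implementation from the paper: the function names, the unpadded q-gram extraction, and the seeded BLAKE2 hash family are all assumptions made for illustration. It shows only the standard property that the signature of a union of q-gram sets is the element-wise minimum of the individual signatures, which is what lets a parent node summarize its sub-tree.

```python
import hashlib

def qgrams(s, q=2):
    """Set of q-grams of s (no padding characters; an assumption of this sketch)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def _hash(gram, seed):
    # Hypothetical hash family: one 64-bit BLAKE2 hash per seed.
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minwise_signature(grams, num_hashes=64):
    """Min-wise signature of a non-empty set of q-grams."""
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]

def merge_signatures(sig_a, sig_b):
    """Signature of the union of two sets: element-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of matching positions, an estimate of set resemblance."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

    For example, merging the signatures of the q-gram sets of "theatre" and "theater" gives the same signature as computing it directly on the union of the two sets.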

Conference

2015

  • Distributed Online Tracking, Talk
    By Mingwang Tang,    Feifei Li,    Yufei Tao
    In Proceedings of 34th ACM SIGMOD International Conference on Management of Data (SIGMOD 2015),  pages 2047-2061,  Melbourne, Australia,  2015.
    Abstract

    In online tracking, an observer S receives a sequence of values, one per time instance, from a data source that is described by a function f. A tracker T wants to continuously maintain an approximation that is within an error threshold of the value f(t) at any time instance t, with small communication overhead. This problem was recently formalized and studied, and a principled approach with optimal competitive ratio was proposed. This work extends the study of online tracking to a distributed setting, where a tracker T wants to track a function f that is computed from a set of functions {f1, ..., fm} from m distributed observers and respective data sources. This formulation finds numerous important and natural applications, e.g., sensor networks, distributed systems, measurement networks, and pub-sub systems. We formalize this problem and present effective online algorithms for various topologies of a distributed system/network for different aggregate functions. Experiments on large real data sets demonstrate the excellent performance of our methods in practice.
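
    The single-observer setting described at the start of the abstract can be sketched as follows. This is a naive baseline written for illustration only, not the optimal-competitive-ratio approach or the distributed algorithms of the paper; the function name `track` and the parameter `delta` (the error threshold) are assumptions. The observer sends a new value only when the tracker's last known value has drifted by more than `delta`.

```python
def track(values, delta):
    """Return the (time, value) messages a naive observer would send.

    The tracker's approximation is the last value it received, so it
    stays within `delta` of f(t) at every time instance t.
    """
    messages = []
    last = None  # value currently held by the tracker
    for t, v in enumerate(values):
        if last is None or abs(v - last) > delta:
            last = v
            messages.append((t, v))
    return messages
```

    With the input `[0, 1, 2, 5, 6, 5, 9]` and `delta = 2`, only three of the seven values are sent.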

2014

  • Scalable Histograms on Large Probabilistic Data, Talk
    By Mingwang Tang,    Feifei Li
    In Proceedings of 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2014),  pages 631-640,  New York,  2014.
    Abstract

    Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be extremely expensive, and existing studies suffer from limited scalability. This work designs novel approximation methods to construct scalable histograms on probabilistic data. We show that our methods provide constant approximations compared to the optimal histograms produced by the state-of-the-art in the worst case. We also extend our methods to parallel and distributed settings so that they can run gracefully in a cluster of commodity machines. We introduced novel synopses to reduce communication cost when running our methods in such settings. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our methods, when compared to the state-of-the-art methods. They also achieved excellent approximation quality in practice.

2012

  • Ranking Large Temporal Data, Talk
    By Jeffrey Jestes,    Jeff M. Phillips,    Feifei Li,    Mingwang Tang
    In Proceedings of 38th International Conference on Very Large Databases (VLDB 2012),  pages 1412-1423,  Istanbul, Turkey,  August,  2012.
    Abstract

    Ranking temporal data has not been studied until recently [14], even though ranking is an important operator (being promoted as a first-class citizen) in database systems [8]. However, only the instant top-k queries on temporal data were studied in [14], where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from 02/05/2011 to 02/07/2011. This work presents a comprehensive study of this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis on the construction cost, the index size, the update and the query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.
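
    The aggregate top-k query defined in the abstract can be made concrete with a brute-force sketch. This is a plain scan, not one of the indexed exact or approximate methods of the paper, and it assumes each object is a list of discrete (time, score) readings; the function name, the `agg` parameter, and the data layout are hypothetical.

```python
import heapq

def aggregate_top_k(series, t1, t2, k, agg="sum"):
    """Top-k objects by aggregated score over the query interval [t1, t2].

    series: dict mapping object id -> list of (time, score) pairs.
    agg: "sum" for total score, "avg" for average score.
    Objects with no reading in the interval are skipped.
    """
    totals = {}
    for obj, readings in series.items():
        in_range = [score for t, score in readings if t1 <= t <= t2]
        if not in_range:
            continue
        total = sum(in_range)
        totals[obj] = total / len(in_range) if agg == "avg" else total
    # Largest aggregated score first.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
```

    Calling it with `agg="avg"` corresponds to the weather-station example, and `agg="sum"` to the stock-volume example.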

  • Efficient Threshold Monitoring for Distributed Probabilistic Data, Talk
    By Mingwang Tang,    Feifei Li,    Jeff M. Phillips,    Jeffrey Jestes
    In Proceedings of 28th IEEE International Conference on Data Engineering (ICDE 2012),  pages 1120-1131,  Washington DC,  April,  2012.
    Abstract

    In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, which is commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data that emerge in many applications, where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units and many more. When dealing with probabilistic data, there are two thresholds involved, the score and the probability thresholds. One must monitor both simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms have significantly outperformed the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several large real datasets.

2011

  • Multi-Approximate-Keyword Routing in GIS Data (Project Website), Talk
    By Bin Yao,    Mingwang Tang,    Feifei Li
    In Proceedings of 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS 2011),  pages 201-210,  Chicago, USA,  November,  2011.
    Abstract

    GIS data usually consist of both spatial and textual information, where the spatial component represents the location of the object and the textual element contains a set of strings describing object in that location. For GIS data situated on a road network, shortest path search is a basic operation. In practice, however, users are often interested at routing when certain constraints on the textual information have been also incorporated. This work complements the standard shortest path search with multiple keywords and an approximate string similarity function, where the goal is to find the shortest path that passes through at least one matching object per keyword; we dub this problem the multi-approximate-keyword routing (makr) query. We present both exact and approximate solutions. When the number κ of query keywords is small (e.g., κ ≤ 6), the exact solution works efficiently. However, when κ increases, it becomes increasingly expensive (especially on large GIS data). In this case, our approximate methods achieve superb query efficiency, excellent scalability, and high approximation quality, as indicated in our extensive experiments on large, real datasets (up to 2 million points on road networks with hundreds of thousands of nodes and edges). We also prove that one approximate method has a κ-approximation in the worst case.