![]() Mingwang Tang | |
[ Book Chapter Journal Conference Workshop Tech Report]
This work deals with the approximate string search in large spatial databases. Speci%uFB01cally, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to %uFB01nd balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which signi%uFB01cantly outperforms the baseline algorithm in practice. The RSASSOL combines the q-gram based inverted lists and the reference nodes based pruning. Extensive experiments on large real data sets demonstrate the ef%uFB01ciency and effectiveness of our approaches.
In online tracking, an observer S receives a sequence of values, one per time instance, from a data source that is described by a function f. A tracker T wants to continuously maintain an approximation that is within an error threshold of the value f(t) at any time instance t, with small communication overhead. This problem was recently formalized and studied in, and a principled approach with optimal competitive ratio was proposed. This work extends the study of online tracking to a distributed setting, where a tracker T wants to track a function f that is computed from a set of functions {f1, . . . , fm} from m distributed observers and respective data sources. This formulation finds numerous important and natural applications, e.g., sensor networks, distributed systems, measurement networks, and pub-sub systems. We formalize this problem and present effective online algorithms for various topologies of a distributed system/network for different aggregate functions. Experiments on large real data sets demonstrate the excellent performance of our methods in practice.
Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be extremely expensive, and existing studies suffer from limited scalability. This work designs novel approximation methods to construct scalable histograms on probabilistic data. We show that our methods provide constant approximations compared to the optimal histograms produced by the state-of-the-art in the worst case. We also extend our methods to parallel and distributed settings so that they can run gracefully in a cluster of commodity machines. We introduced novel synopses to reduce communication cost when running our methods in such settings. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our methods, when compared to the state-ofthe-art methods. They also achieved excellent approximation quality in practice.
Ranking temporal data has not been studied until recently [14], even though ranking is an important operator (being promoted as a first-class citizen) in database systems [8]. However, only the instant top-k queries on temporal data were studied in [14], where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from02/05/2011 to 02/07/2011. This work presents a comprehensive study to this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis on the construction cost, the index size, the update and the query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.
In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user specified constraint is violated. A particular useful instance is the threshold based constraint, which is commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data that emerge in a lot of applications, where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units and many more. When dealing with probabilistic data, there are two thresholds involved, the score and the probability thresholds. One must monitor both simultaneously, as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study to this problem. Our algorithms have significantly outperformed the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several, real large datasets.
GIS data usually consist of both spatial and textual information,where the spatial component represents the location ofthe object and the textual element contains a set of stringsdescribing object in that location. For GIS data situated ona road network, shortest path search is a basic operation. Inpractice, however, users are often interested at routing whencertain constraints on the textual information have been alsoincorporated. This work complements the standard shortestpath search with multiple keywords and an approximatestring similarity function, where the goal is to find the shortestpath that passes through at least one matching objectper keyword; we dub this problem the multi-approximatekeywordrouting (makr) query. We present both exact andapproximate solutions. When the number %u03BA of query keywordsis small (e.g., %u03BA %u2264 6), the exact solution works efficiently.However, when %u03BA increases, it becomes increasinglyexpensive (especially on large GIS data). In this case, ourapproximate methods achieve superb query efficiency, excellentscalability, and high approximation quality, as indicatedin our extensive experiments on large, real datasets (up to 2million points on road networks with hundreds of thousandsof nodes and edges). We also prove that one approximatemethod has a %u03BA-approximation in the worst case.