[ Book Chapter Journal Conference Workshop Tech Report]
In this chapter, we survey the various approaches that have been proposed to measure privacy (and the loss of privacy). Since most privacy concerns (especially those related to health-care information) are raised in the context of legal concerns, it is instructive to view privacy from a legal perspective, rather than from purely technical considerations. It is beyond the scope of this survey\footnote{…and the expertise of the author!} to review the legal interpretations of privacy. However, one essay on privacy that appears directly relevant (and has inspired at least one paper surveyed here) is the view of privacy in terms of access that others have to us and our information, presented by Ruth Gavison. In her view, a general definition of privacy must be one that is measurable, of value, and actionable. The first property needs no explanation; the second means that the entity being considered private must be valuable, and the third property argues that from a legal perspective, only those losses of privacy are interesting that can be prosecuted. This survey, and much of the research on privacy, concerns itself with the measuring of privacy.
The notion of `depth’ has been used in statistics as a way to identify the center of the bivariate distribution given by the point set $P$ in $R^2$. We present a general framework for computing such statistical estimators, that makes extensive use of modern graphics architectures. As a result, we derive improved algorithms for a number of depth measures such location depth, simplicial depth, Oja depth, colored depth, and dynamic location depth. Our algorithms perform significantly better than currently known implementations, outperforming them by at least one order of magnitude and having a strictly better asymptotic growth rate.
Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point%u2019s uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.
This work deals with the approximate string search in large spatial databases. Speci%uFB01cally, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to %uFB01nd balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which signi%uFB01cantly outperforms the baseline algorithm in practice. The RSASSOL combines the q-gram based inverted lists and the reference nodes based pruning. Extensive experiments on large real data sets demonstrate the ef%uFB01ciency and effectiveness of our approaches.
Query execution assurance is an important concept in defeating lazy servers in the database as a service model. We show that extending query execution assurance to outsourced databases with multiple data owners is highly inefficient. To cope with lazy servers in the distributed setting, we propose query access assurance (QAA) that focuses on IO-bound queries. The goal in QAA is to enable clients to verify that the server has honestly accessed all records that are necessary to compute the correct query answer, thus eliminating the incentives for the server to be lazy if the query cost is dominated by the IO cost in accessing these records. We formalize this concept for distributed databases, and present two efficient schemes that achieve QAA with high success probabilities. The first scheme is simple to implement and deploy, but may incur excessive server to client communication cost and verification cost at the client side, when the query selectivity or the database size increases. The second scheme is more involved, but successfully addresses the limitation of the first scheme. Our design employs a few number theory techniques. Extensive experiments demonstrate the efficiency, effectiveness and usefulness of our schemes.
Given a set of points P and a query set Q, a group enclosing query (GEQ) fetches the point p such that the maximum distance of p to all points in Q is minimized. This problem is equivalent to the Min-Max case (minimizing the maximum distance) of aggregate nearest neighbor queries for spatial databases. This work first designs a new exact solution by exploring new geometric insights, such as the minimum enclosing ball, the convex hull and the furthest voronoi diagram of the query group. To further reduce the query cost, especially when the dimensionality increases, we turn to approximation algorithms. Our main approximation algorithm has a worst case p2-approximation ratio if one can find the exact nearest neighbor of a point. In practice, its approximation ratio never exceeds 1.05 for a large number of data sets up to six dimension.We also discuss how to extend it to higher dimensions (up to 74 in our experiment) and show that it still maintains a very good approximation quality (still close to 1) and low query cost. In fixed dimensions,we extend the p2-approximation algorithm to get a (1 epsilon)-approximate solution for the GEQ problem. Both approximation algorithms have O(logN M) query cost in any fixed dimension, where N and M are the sizes of the data set P and query group Q. Extensive experiments on both synthetic and real data sets, up to 10 million points and 74 dimensions, confirm the efficiency, effectiveness and scalability of the proposed algorithms, especially their significant improvement over the state-of-the-art method.
With the advance of wireless communication technology, it is quite common for people to view maps or get related servicesfrom the handheld devices, such as mobile phones and PDAs. Range queries, as one of the most commonly used tools, are oftenposed by the users to retrieve needful information from a spatial database. However, due to the limits of communication bandwidthand hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communicationefficient nor informative to the users. This is simply because that there are often too many results returned from a range query. In viewof this problem, we present a novel idea that a concise representation of a specified size for the range query results, while incurringminimal information loss, shall be computed and returned to the user. Such a concise range query not only reduces communicationcosts, but also offers better usability to the users, providing an opportunity for interactive exploration.The usefulness of the concise range queries is confirmed by comparing it with other possible alternatives, such as sampling andclustering. Unfortunately, we prove that finding the optimal representation with minimum information loss is an NP-hard problem.Therefore, we propose several effective and non-trivial algorithms to find a good approximate result. Extensive experiments on realworlddata have demonstrated the effectiveness and efficiency of the proposed techniques.
Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define several fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are satisfied by ranking queries on certain data. We argue these properties should also be carefully studied in defining ranking queries in probabilistic data, and fulfilled by definition for ranking uncertain data for most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We studied the ranking definitions based on the expectation, the median and other statistics of this rank distribution for a tuple and derived the expected rank, median rank and quantile rank correspondingly. We are able to prove that the expected rank, median rank and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
The database community has devoted extensiveamount of efforts to indexing and querying temporaldata in the past decades. However, insufficientamount of attention has been paid to temporal rankingqueries. More precisely, given any time instance t, thequery asks for the top-k objects at time t with respect tosome score attribute. Some generic indexing structuresbased on R-trees do support ranking queries on temporaldata, but as they are not tailored for such queries,the performance is far from satisfactory. We presentthe Seb-tree, a simple indexing scheme that supportstemporal ranking queries much more efficiently. TheSeb-tree answers a top-k query for any time instance tin the optimal number of I/Os in expectation, namely,O(log_B(N/B k/B)) I/Os, where N is the size of the data set and B is the disk block size. The index has near-linearsize (for constant and reasonable kmax values, where kmax is the maximum value for the possible values of the query parameter k), can be constructed in near-lineartime, and also supports insertions and deletions without affecting its query performance guarantee. Most ofall, the Seb-tree is especially appealing in practice dueto its simplicity as it uses the B-tree as the only building block. Extensive experiments on a number of largedata sets, show that the Seb-tree is more than an orderof magnitude faster than the R-tree based indexes for temporal ranking queries.
Query authentication is an essential component in outsourced database (ODB) systems. This arti-cle introduces efficient index structures for authenticating aggregation queries over large data sets. First, we design an index that features good performance characteristics for static environments.Then, we propose more involved structures for the dynamic case. Our structures feature excellentperformance for authenticating queries with multiple aggregate attributes and multiple selectionpredicates. Furthermore, our techniques cover a large number of aggregate types, including distributive aggregates (such as SUM, COUNT, MIN and MAX), algebraic aggregates (such as the AVG), and holistic aggregates (such as MEDIAN and QUANTILE). We have also addressed the issue of authenticating aggregation queries efficiently when the database is encrypted to protect data confidentiality. Finally, we implemented a working prototype of the proposed techniques andexperimentally validated the effectiveness and efficiency of our methods.
Due to the overwhelming flow of information in many data stream applications, data outsourcing isa natural and effective paradigm for individual businesses to address the issue of scale. In the standard data outsourcing model, the data owner outsources streaming data to one or more third-partyservers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Existing solutions proposed in this model all build upon cryptographic primitives such as signatures and collision-resistant hash functions, which only work for certain types of queries, for example, simple selection/aggregation queries.In this article, we consider another common type of queries, namely, “GROUP BY, SUM” queries, which previous techniques fail to support. Our new solutions are not based on cryptographic primitives, but instead use algebraic and probabilistic techniques to compute a small synopsis on the true query result, which is then communicated to the client so as to verify the correctness of the query result returned by the server. The synopsis uses a constant amount of space irrespective of the result size, has an extremely small probability of failure, and can be maintained using no extra space when the query result changes as elements stream by. We then generalize our synopsisto allow some tolerance on the number of erroneous groups, in order to support semantic load shedding on the server.When the number of erroneous groups is indeed tolerable, the synopsis can be strengthened so that we can locate and even correct these errors. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach tothis data management problem is the use of sensor database systems, which allow users to performaggregation queries such as MIN, COUNT, and AVG on the readings of a sensor network. In addition, more advanced queries such as frequency counting and quantile estimation can be supported. Due to energy limitations in sensor-based networks, centralized data collection is generally impractical, so most systems use in-network aggregation to reduce network traffic. However, even these aggregation strategies remain bandwidth-intensive when combined with the fault-tolerant, multipath routing methods often used in these environments. To avoid this expense, we investigate the use of approximate in-network aggregation using small sketches.We present duplicate-insensitive sketching techniques that can be implemented efficiently on small sensor devices with limited hardware support and we analyze both their performance and accuracy. Finally, we present anexperimental evaluation that validates the effectiveness of our methods.
One of the primary goals of computational anatomy is the statistical analysis of anatomical variability in large populations of images. The study of anatomical shape is inherently related to the construction of transformations of the underlying coordinate space, which map one anatomy to another. It is now well established that representing the geometry of shapes or images in Euclidean spaces undermines our ability to represent natural variability in populations. In our previous work we have extended classical statistical analysis techniques, such as averaging, principal components analysis, and regression, to Riemannian manifolds, which are more appropriate representations for describing anatomical variability. In this paper we extend the notion of robust estimation, a well established and powerful tool in traditional statistical analysis of Euclidian data, to manifold valued representations of anatomical variability. In particular, we extend the geometric median, a classic robust estimator of centrality for data in Euclidean spaces.We formulate the geometric median of data on a Riemannian manifold as the minimizer of the sum of geodesic distances to the data points.We prove existence and uniqueness of the geometric median on manifolds with non-positive sectional curvature and give sufficient conditions for uniqueness on positively curved manifolds. Generalizing the Weiszfeld procedure for finding the geometric median of Euclidean data, we present an algorithm for computing the geometric median on an arbitrary manifold. We show that this algorithm converges to the unique solution when it exists. In this paper we exemplify the robustness of the estimation technique by applying the procedure to various manifolds commonly used in the analysis of medical images. Using this approach, we also present a robust brain atlas estimation technique based on the geometric median in the space of deformable images.
Contact graphs of isothetic rectangles unify many concepts from applications including VLSI and architectural design, computational geometry, and GIS. Minimizing the area of their corresponding rectangular layouts is a key problem. We study the area-optimization problem and show that it is NP-hard to find a minimum-area rectangular layout of a given contact graph. We present O(n)-time algorithms that construct O(n^2)-area rectangular layouts for general contact graphs and O(n log n)-area rectangular layouts for trees. (For trees, this is an O(log n)-approximation algorithm.) We also present an infinite family of graphs (respectively, trees) that require Ω(n^2) (respectively, Ω(n log n))area. We derive these results by presenting a new characterization of graphs that admit rectangular layouts, using the related concept of rectangular duals. A corollary to our results relates the class of graphs that admit rectangular layouts to rectangle-of-influence drawings.
This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both runtime and memory usage. In the single-alternative case, the new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multialternative case, we introduce the first-known polynomial algorithms, while the current best algorithms have exponential complexity in both time and space. Our algorithms run in near linear or low polynomial time and cover both types of top-k queries in uncertain databases. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches over existing solutions.
The problem of curve matching appears in many application domains, like time series analysis, shape matching, speech recognition, and signature verification, among others. Curve matching has been studied extensively by computational geometers, and many measures of similarity have been examined, among them being the Frechet distance (sometimes referred to as the “dog-man” distance).A measure that is very closely related to the Frechet distance but has never been studied in a geometric context is the Dynamic Time Warping measure (DTW), first used in the context of speech recognition. This measure is ubiquitous in different domains, already a surprising fact because notions of similarity usually vary significantly depending on the application. However, this measure suffers from a few obvious drawbacks, resulting from the fact that it is defined between sequences of points, rather than curves and the way in which a curves is sampled to yield such a sequence can dramatically affect the quality of the result. Some attempts have been made to generalize the DTW to continuous domains, but the resulting algorithms have exponential complexity.In this paper we propose similarity measures that attempt to capture the “spirit” of dynamic time warping while being defined over continuous domains, and present efficient algorithms for computing them. Our formulation leads to a very interesting connection with finding short paths in a combinatorial manifold defined on the input chains, and in a deeper sense relate to the way light travels in a medium of variable refractivity.
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor “database” systems, which allow users to perform aggregation queries on the readings of a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. However, these aggregation strategies become bandwidth intensive when combined with the fault-tolerant, multi-path routing methods often used in these environments. In order to avoid this expense, we investigate the use of approximate in-network aggregation using small sketches and we survey robust and scalable methods for computing duplicate-sensitive aggregates.
Nearest-neighbor search (NN), which returns the nearest neighbor of a query point in a set of points, is an important and widely studied problem in many fields, and it has wide range of applications. In many of them, such as sensor databases, location-based services, face recognition, and mobile data, the location of data is imprecise. We therefore study nearest neighbor queries in a probabilistic framework in which the location of each input point is specified as a probability density function. We present efficient algorithms for (i) computing all points that are nearest neighbors of a query point with nonzero probability; (ii) estimating, within a specified additive error, the probability of a point being the nearest neighbor of a query point; (iii) using it to return the point that maximizes the probability being the nearest neighbor, or all the points with probabilities greater than some threshold to be the NN. We also present a few experimental results to demonstrate the effectiveness of our approach.
We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries returns the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. And for each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.
We present novel adaptive log compression schemes. Results show 30% improvement on compression ratios over existing approaches.
Kernel density estimates are important for a broad variety of applications including media databases, pattern recognition, computer vision, data mining, and the sciences. Their con- struction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more ef- ficient than previous algorithms. Our algorithms do not re- quire knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to imple- ment our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scala- bility of our techniques.
Temporal and multi-version databases often generate massive amounts of data, due to the increasing availability of large storage space and the increasing importance of mining and auditing opera- tions from historical data. For example, Google now allows users to limit and rank search results by setting a time range. These data- bases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve a good load balancing algorithm for storage and processing of these data, which is done by partitioning the database. In this paper, we introduce the concept of optimal splitters for temporal and multi- version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk-resident data respectively, and show that they significantly outperform com- peting baseline methods both theoretically and empirically on large real data sets.
In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to incorporate and handle data with uncertainties as a first-class citizen, where uncertain data are represented as probabilistic relations. In my thesis, my work has focused on monitoring and summarization of large probabilistic data. Specifically, we extended the distributed threshold monitoring problem to distributed probabilistic data. Instead, we actually need to monitor the aggregated value (e.g. sum) of distributed probabilistic data against both the score threshold and the probability threshold, which make the techniques designed for deterministic data are not directly applicable. Our algorithms have significantly reduced both the communication and computation costs as shown by an extensive experimental evaluation on large real datasets. On the other hand, building histograms to summarize the distribution of certain feature in a large data set is a fundamental problem in data management. Recent work have extended this studies to probabilistic data, but their methods suffer from the limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our implementations in MapReduce, when compared to the existing, state-of-the-art centralized methods.
The increasing popularity of the cloud drives the demands for secure queries on an encrypted database E(D) stored in the cloud. In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a server and asks for an encrypted data point in E(D) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods, even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field, and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models, our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G subseteq D), such that that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database. Note that E can be any well-established encryption schemes. The efficiency and scalability of our methods are demonstrated through extensive experiments on large real data.
Network radio frequency (RF) environment sensing (NRES) systems pinpoint and track people in buildings using changes in the signal strength measurements made by a wireless sensor network. It has been shown that such systems can locate people who do not participate in the system by wearing any radio device, even through walls, because of the changes that moving people cause to the static wireless sensor network. However, many such systems cannot locate stationary people. We present and evaluate a system which can locate stationary or moving people, without calibration, by using kernel distance to quantify the difference between two histograms of signal strength measurements. From five experiments, we show that our kernel distance-based radio tomographic localization system performs better than the state-of-the-art NRES systems in different non line-of-sight environments.
Event log processing and analysis play a key role in applications ranging from security management, IT trouble shooting, to user behavior analysis. Recent years have seen a rapid growth in system scales and the corresponding rapid increase in the amount of log event data. At the same time, as logs are found to be a valuable information source, log analysis tasks have become more sophisticated demanding both interactive exploratory query processing and batch computation. Desirable query types include selection with time ranges and value filtering criteria, join within time windows, join between log data and reference tables, and various aggregation types. In such a situation, parallel solutions are necessary, but existing parallel and distributed solutions either support limited query types or perform only batch computations on logs. With a system called LogKV, this paper reports a first study of using Key-Value stores to support log processing and analysis, exploiting the scalability, reliability, and efficiency commonly found in Key-Value store systems. LogKV contains a number of unique techniques that are needed to handle log data in terms of event ingestion, load balancing, storage optimization, and query processing. Preliminary experimental results show that LogKV is a promising solution.
We study the worst case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel density estimates of the input distribution with that of the subset and bound the worst case error. If the maximum error is eps, then this subset can be thought of as an eps-sample (aka an eps-approximation) of the range space defined with the input distribution as the ground set and the fixed kernel representing the family of ranges. Interestingly, in this case the ranges are not binary, but have a continuous range (for simplicity we focus on kernels with range of [0,1]); these allow for smoother notions of range spaces. It turns out, the use of this smoother family of range spaces has an added benefit of greatly decreasing the size required for eps-samples. For instance, in the plane the size is O((1/eps^{4/3}) log^{2/3}(1/eps)) for disks (based on VC-dimension arguments) but is only O((1/eps) sqrt{log (1/eps)}) for Gaussian kernels and for kernels with bounded slope that only affect a bounded domain. These bounds are accomplished by studying the discrepancy of these "kernel" range spaces, and here the improvement in bounds are even more pronounced. In the plane, we show the discrepancy is O(sqrt{log n}) for these kernels, whereas for balls there is a lower bound of Omega(n^{1/4}).
In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daum ́e III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d2 log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd2 log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.
MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compactaccurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often several orders of magnitude) performance improvement achieved by our new algorithms.
Ranking temporal data has not been studied until recently [14], even though ranking is an important operator (being promoted as a first-class citizen) in database systems [8]. However, only the instant top-k queries on temporal data were studied in [14], where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from02/05/2011 to 02/07/2011. This work presents a comprehensive study to this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis on the construction cost, the index size, the update and the query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.
A common problem with disk-based cloud storage services is that performance can vary greatly and become highly unpredictable in a multi-tenant environment. A fundamental reason is the interference between workloads co-located on the same physical disk. We observe that different IO patterns interfere with each other significantly, which makes the performance of different types of workloads unpredictable when they are executed concurrently. Unpredictability implies that users may not get a fair share of the system resources from the cloud services they are using. At the same time, replication is commonly used in cloud storage for high reliability. Connecting these two facts, we propose a cloud storage system designed to minimize workload interference without increasing storage costs or sacrificing the overall system throughput. Our design leverages log-structured disk layout, chain replication and a workload-based replica selection strategy to minimize interference, striking a balance between performance and fairness. Our initial results suggest that this approach is a promising way to improve the performance and predictability of cloud storage.
In many database applications, search is still executed via formbased query interfaces, which are then translated into SQL statements to find matching records. Ranking is usually not implemented unless users have explicitly indicated how to rank thematching records, e.g., in the ascending order of year. Often, this approach is neither intuitive nor user-friendly (especially with many search fields in a query form). It also requires application developers to design schema-specific query forms and develop specific user programs that understand these forms. In this work, we propose to demonstrate the ColumbuScout system that aims at quickly building and deploying a local search engine over one ormore large databases. The ColumbuScout system adopts a keyword-centric search approach. It integrates the keyword-centric principle with the latest results from approximate string search, and designs search-enginestyle ranking functions. It also introduces some of its own indexing structures and storage designs, to improve its overall efficiency and scalability. We will demonstrate that it is almost effortless for application developers to deploy ColumbuScout over any databases, and ColumbuScout is able to support search-engine-like types of search over large databases (more than 1.7 billion records in the examples we used) efficiently and effectively.
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a summary on the two data sets combined together, while preserving the error and size guarantees. This property means that the summary can be treated like other algebraic objects such as sum and max, which is especially useful for computing summaries on massive distributed data. Many data summaries are trivially mergeable by construction, most notably those based on linear transformations. But some other fundamental ones like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modi%uFB01cations. Speci%uFB01cally, we show that for eps-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/eps); for eps-approximate quantiles, there is a deterministic summary of size O(1/epslog(eps n)) that has a restricted form of mergeability, and a randomized one of size O(1/epslog^3/2 1/eps) with full mergeability. We also extend our results to geometric summaries such as eps-approximations and eps-kernels.
This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to be extended to account for RDF data model and graph query patterns expressed in SPARQL. In light of the NP-hardness of the multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The comprehensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.
In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user specified constraint is violated. A particular useful instance is the threshold based constraint, which is commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data that emerge in a lot of applications, where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units and many more. When dealing with probabilistic data, there are two thresholds involved, the score and the probability thresholds. One must monitor both simultaneously, as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study to this problem. Our algorithms have significantly outperformed the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several, real large datasets.
We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.
In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce. Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.
We consider a model for multiparty communication complexity which we call private message mul- tiparty communication complexity, essentially multiparty communication complexity with card-in-hand model but no blackboard. That is, any communication between two players is not witnessed by any other players. Techniques developed for this model provides simpler lower bounds in other models as well. Furthermore, this model has applications for proving lower bounds for the distributed sensing framework.
In this paper, we propose a new and accurate technique for uncertainty analysis and uncertainty visualization based on fiber orientation distribution function (ODF) glyphs, associated with high angular resolution diffusion imaging (HARDI). Our visualization applies volume rendering techniques to an ensemble of 3D ODF glyphs, which we call SIP functions of diffusion shapes, to capture their variability due to underlying uncertainty. This rendering elucidates the complex heteroscedastic structural variation in these shapes. Furthermore, we quantify the extent of this variation by measuring the fraction of the volume of these shapes, which is consistent across all noise levels, the certain volume ratio. Our uncertainty analysis and visualization framework is then applied to synthetic data, as well as to HARDI human-brain data, to study the impact of various image acquisition parameters and background noise levels on the diffusion shapes.
GIS data usually consist of both spatial and textual information,where the spatial component represents the location ofthe object and the textual element contains a set of stringsdescribing object in that location. For GIS data situated ona road network, shortest path search is a basic operation. Inpractice, however, users are often interested at routing whencertain constraints on the textual information have been alsoincorporated. This work complements the standard shortestpath search with multiple keywords and an approximatestring similarity function, where the goal is to find the shortestpath that passes through at least one matching objectper keyword; we dub this problem the multi-approximatekeywordrouting (makr) query. We present both exact andapproximate solutions. When the number %u03BA of query keywordsis small (e.g., %u03BA %u2264 6), the exact solution works efficiently.However, when %u03BA increases, it becomes increasinglyexpensive (especially on large GIS data). In this case, ourapproximate methods achieve superb query efficiency, excellentscalability, and high approximation quality, as indicatedin our extensive experiments on large, real datasets (up to 2million points on road networks with hundreds of thousandsof nodes and edges). We also prove that one approximatemethod has a %u03BA-approximation in the worst case.
In this paper, we harness the synergy between two important learning paradigms, namely, active learning and domain adaptation. We show how active learning in a target domain can leverage information from a different but related source domain. Our proposed framework, Active Learning Domain Adapted (Alda), uses source domain knowledge to transfer information that facilitates active learning in the target domain. We propose two variants of Alda: a batch B-Alda and an online O-Alda. Empirical comparisons with numerous baselines on real-world datasets establish the efficacy of the proposed methods.
We study computing with indecisive point sets. Such points have spatial uncertainty where the true location is one of a finite numberof possible locations. This data arises from probing distributions a few times or when the location is one of a few locations from a knowndatabase. In particular, we study computing distributions of geometric functions such as the radius of the smallest enclosing ball and the diameter. Surprisingly, we can compute the distribution of the radius of the smallest enclosing ball exactly in polynomial time, but computing the same distribution for the diameter is #P-hard. We generalize ourpolynomial-time algorithm to all LP-type problems. We also utilize our indecisive framework to deterministically and approximately compute on a more general class of uncertain data where the location of each point is given by a probability distribution.
The space of positive definite matrices P(n) is a Riemannian manifold with variable nonpositive curvature. It includes Euclidean space and hyperbolicspace as submanifolds, and poses significant challenges for the design of algorithms for data analysis. In this paper, we develop foundational geometric structures and algorithms for analyzing collections of such matrices. A key technical contribution of this work is the use of horoballs, a natural generalization of halfspaces for non-positively curved Riemannian manifolds. We propose generalizations of the notion of a convex hull and a center point and approximations of these structures using horoballs and based on novel decompositions of P(n). This leads to an algorithm for approximate hulls using a generalization of extents.
Aggregate similarity search, a.k.a. aggregate nearest neighbor (Ann) query, finds many useful applications in spatialand multimedia databases. Given a group Q of M query objects, it retrieves the most (or top-k) similar object to Q froma database P, where the similarity is an aggregation (e.g.,sum, max) of the distances between the retrieved object pand all the objects in Q. In this paper, we propose an addedflexibility to the query definition, where the similarity is anaggregation over the distances between p and any subset ofφM objects in Q for some support 0 < φ ≤ 1. We call thisnew definition flexible aggregate similarity (Fann) search,which generalizes the Ann problem. Next, we present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work wellin both low and high dimensions. They also return nearoptimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experiments on largereal and synthetic datasets from 2 to 74 dimensions havedemonstrated their superior efficiency and high quality.
Starting with a similarity function between objects, it is possible to define a distance metric (the kernel distance) on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis and geometric measure theory, and have a rich structure that includes an isometric embedding into a Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that runs in near-linear time in the size of P ? Q (an explicit calculation would take quadratic time). We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n poly(1/e, log n)). We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of points sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets.
Optimal location (OL) queries are a type of spatial queries particularly useful for the strategic planning of resources.Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an Lp space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their Lp distance.Motivated by the deficiency of the existing techniques, this paper presents the first study on OL queries in road networks.We propose a unified framework that addresses three variants of OL queries that find important applications in practice,and we instantiate the framework with several novel query processing algorithms. We demonstrate the efficiency of our solutions through extensive experiments with real data.
This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.
In this paper, we propose an online multitask learning framework where the weight vectors are updated in an adaptive fashion based on inter-task relatedness. Our work is in contrast with earlier work on online multitask learning where the authors use a fixed interaction matrix of tasks to derive (fixed) update rules for all the tasks. In this work, we propose to update this interaction matrix itself in an adaptive fashion so that the weight vector updates are no longer fixed but are instead adaptive. Our framework can be extended to an active learning setting where the informativeness of an incoming instance across all the tasks can be evaluated using this adaptive interaction matrix. Empirical results on standardized datasets show improved performance in terms of accuracy, label complexity and number of mistakes made.
The problem of answering SPARQL queries over virtual SPARQL views is commonly encountered in a number of settings, including while enforcing security policies to access RDF data, or when integrating RDF data from disparate sources. We approach this problem by rewriting SPARQL queries over the views to equivalent queries over the underlying RDF data, thus avoiding the costs entailed by view materialization and maintenance. We show that SPARQL query rewriting combines the most challenging aspects of rewriting for the relational and XML cases: like the relational case, SPARQL query rewriting requires synthesizing multiple views; like the XML case, the size of the rewritten query is exponential to the size of the query and the views. In this paper, we present the first native query rewriting algorithm for SPARQL. For an input SPARQL query over a set of virtual SPARQL views, the rewritten query resembles a union of conjunctive queries and can be of exponential size. We propose optimizations overthe basic rewriting algorithm to (i) minimize each conjunctive query in the union; (ii) eliminate conjunctive queries with empty results from evaluation; and (iii) efficiently prune out big portions of the search space of empty rewritings. The experiments, performed on two RDF stores, show that our algorithms arescalable and independent of the underlying RDF stores. Furthermore, our optimizations have order of magnitude improvements over the basic rewriting algorithm in both the rewriting size and evaluation time.
Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point?s uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can eciently determine the probability of a query point lying on the skyline.
This paper evaluates features of graph coloring algorithms implemented on graphics processing units (GPUs), comparing coloring heuristics and thread decompositions. As compared to prior work on graph coloring for other parallel architectures, we find that the large number of cores and relatively high global memory bandwidth of a GPU lead to different strategies for the paral lel implementation. Specifically, we find that a simple uniform block partitioning is very effective on GPUs and our parallel coloring heuristics lead to the same or fewer colors than prior approaches for distributed-memory cluster architecture. Our a lgorithm resolves many coloring conflicts across partitioned blocks on the GPU by iterating through the coloring process, befo re returning to the CPU to resolve remaining conflicts. With this approach we get as few color (if not fewer) than the best se quential graph coloring algorithm and performance is close to the fastest sequential graph coloring algorithms which have poor color quality.
We propose an algorithm for dimensionality reduction on the simplex, mapping a set of high-dimensional distributions to a space of lower-dimensional distributions, whilst approximately preserving pairwise Hellinger distance between distributions. By introducing a restriction on the input data to distributions that are in some sense quite smooth, we can map $n$ points on the $d$-simplex to the simplex of $O(eps^{-2}log n)$ dimensions with $eps$-distortion with high probability. The techniques used rely on a classical result by Johnson and Lindenstrauss on dimensionality reduction for Euclidean point sets and require the same number of random bits as non-sparse methods proposed by Achlioptas for database-friendly dimensionality reduction.
Given a set P of n points in |R^d, an eps-kernel K subset P approximates the directional width of P in every direction within a relative (1-eps) factor. In this paper we study the stability of eps-kernels under dynamic insertion and deletion of points to P and by changing the approximation factor eps. In the first case, we say an algorithm for dynamically maintaining a eps-kernel is stable if at most O(1) points change in K as one point is inserted or deleted from P. We describe an algorithm to maintain an eps-kernel of size O(1/eps^{(d-1)/2}) in O(1/eps^{(d-1)/2} log n) time per update. Not only does our algorithm maintain a stable eps-kernel, its update time is faster than any known algorithm that maintains an eps-kernel of size O(1/eps^{(d-1)/2}). Next, we show that if there is an eps-kernel of P of size k, which may be dramatically less than O(1/eps^{(d-1)/2}), then there is an (eps/2)-kernel of P of size O(min {1/eps^{(d-1)/2}, k^{floor(d/2)} log^{d-2} (1/eps)}). Moreover, there exists a point set P in |R^d and a parameter eps > 0 such that if every eps-kernel of P has size at least k, then any (eps/2)-kernel of P has size Omega(k^{floor(d/2)}).
In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods in comparable time. Moreover, they have a small memory footprint and scale effectively for large data sets. We expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a complement to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, by showing that projecting to a random O((1/epsilon) log n)- dimensional sphere causes only an "epsilon distortion in the geodesic distances.
Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted in processing (deterministic) string joins and managing probabilistic data respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
Quantiles are a crucial type of order statistics in databases. Extensive research has been focused on maintaining a space-efficient structure for approximate quantile computation as the underlyingdataset is updated. The existing solutions, however, are designed to support only the current, most-updated, snapshot of the dataset. Queries on the past versions of the data cannot be answered. This paper studies the problem of historical quantile search. The objective is to enable epsilon-approximate quantile retrieval on any snapshot of the dataset in history. The problem is very important in analyzing the evolution of a distribution, monitoring the quality of services, query optimization in temporal databases, and so on. We present the first formal results in the literature. First, we prove a novel theoretical lower bound on the space cost of supporting-approximate historical quantile queries. The bound reveals the fundamental difference between answering quantile queries about the past and those about the present time. Second, we propose a structure for finding -approximate historical quantiles, and show that it consumes more space than the lower bound by only a square logarithmic factor. Extensive experiments demonstrate that in practice our technique performs much better than predicted by theory. In particular, the quantiles it returns are remarkably more accurate than the theoretical precision guarantee.
We describe algorithms for finding the regression of t, a sequence of values, to the closest sequence s by mean squared error, so that s is always increasing (isotonicity) and so the values of two consecutive points do not increase by too much (Lipschitz). The isotonicity constraint can be replaced with a unimodular constraint, where there is exactly one local maximum in s. These algorithm are generalized from sequences of values to trees of values. For each scenario we describe near-linear time algorithms.
This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balancedpartitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.
Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts tosolve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required,which may limit their application on large data sets that are stored in a relational database management system. Furthermore,these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join ina relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize.We design algorithms that could be implementedby SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with aconstant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.
We consider problems on data sets where each data point has uncertainty described by an individual probability distribution. We develop several frameworks and algorithms for calculating statistics on these uncertain data sets. Our examples focus on geometric shape tting problems. We prove approximation guarantees for the algorithms with respect to the full probability distributions. We then empirically demonstrate that our algorithms are simple and practical, solving for a constant hidden by asymptotic analysis so that a user can reliably trade speed and size for accuracy.
We present a streaming model for large scale classification (in the context of $ell_2$-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The $ell_2$-SVM is known to have an equivalent formulation in terms of minimum enclosing balls (MEB) and an efficient algorithm based on the idea of core sets exists (CVM) (Tsang et al., 2005) which learns a (1 $epsilon$) approximate MEB for a set of points and yield an approximate solution to corresponding SVM instance. However CVM works in batch mode requiring multiple passes over the data. We present a single-pass SVM based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector using simple Perceptron-like update equations. Our algorithm performs polylogarithmic computation at each example, requires very small and constant storage, and finds simpler solutions (measured in terms of the number of support vectors). Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other state-of-the-art SVM solvers. We also discuss some open issues and possible extensions.
Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.
In this paper, we describe a system for approximate shape matching and symmetry (rotation and reflection) detection of geometric shapes represented as point clouds. Rather than using the leastsquares distance as a measure of similarity between shapes, we use the Hausdorff distance between point sets as the underlying shape metric. This allows us to exploit methods from geometric pattern matching to return symmetries and rigid transformation matches with guaranteed error bounds on the quality of our solution. The approximation is determined by intuitive user-specified input precision and distance threshold parameters. Another important feature of our method is that it leverages FFT-based techniques for string matching to compute all approximate symmetries simultaneously. Our algorithm is simple to implement and is efficient; we present a detailed experimental study.
When dealing with massive quantities of data, topkqueries are a powerful technique for returning only the kmost relevant tuples for inspection, based on a scoring function.The problem of efficiently answering such ranking queries hasbeen studied and analyzed extensively within traditional databasesettings. The importance of the top-k is perhaps even greater inprobabilistic databases, where a relation can encode exponentiallymany possible worlds. There have been several recent attemptsto propose definitions and algorithms for ranking queries overprobabilistic data. However, these all lack many of the intuitiveproperties of a top-k over deterministic data. Specifically, wedefine a number of fundamental properties, including exact-k,containment, unique-rank, value-invariance, and stability, whichare all satisfied by ranking queries on certain data. We arguethat all these conditions should also be fulfilled by any reasonabledefinition for ranking uncertain data. Unfortunately, none of theexisting definitions is able to achieve this.To remedy this shortcoming, this work proposes an intuitivenew approach of expected rank. This uses the well-founded notionof the expected rank of each tuple across all possible worldsas the basis of the ranking. We are able to prove that, incontrast to all existing approaches, the expected rank satisfiesall the required properties for a ranking query. We provideefficient solutions to compute this ranking across the majormodels of uncertain data, such as attribute-level and tuple-leveluncertainty. For an uncertain relation of N tuples, the processingcost is O(N logN)—no worse than simply sorting the relation.In settings where there is a high cost for generating each tuple inturn, we provide pruning techniques based on probabilistic tailbounds that can terminate the search early and guarantee thatthe top-k has been found. Finally, a comprehensive experimentalstudy confirms the effectiveness of our approach.
Given a set of points $P$ and a query point q, the reverse furthest neighbor (RFN) query fetches the set of points $p in P$ such that q is their furthest neighbor among all points in $Pcup{q}$. This is the monochromatic RFN (MRFN) query. Another interesting version of RFN query is the bichromatic reverse furthest neighbor (BRFN) query. Given a set of points P, a queryset Q and a query point $q in Q$, a BRFN query fetches the set of points $pin P$ such that q is the furthest neighbor of p amongall points in Q. The RFN query has many interesting applications in spatial databases and beyond. For instance, given a large residential database (as P) and a set of potential sites (as Q) for building a chemical plant complex, the construction site should be selected as the one that has the maximum number of reverse furthest neighbors. This is an instance of the BRFN query. This paper presents the challenges associated with such queries and proposes efficient, R-tree based algorithms for both monochromatic and bichromatic versions of the RFN queries. We analyze properties of the RFN query that differentiate it from the widely studied reverse nearest neighbor queries and enable the design of novel algorithms. Our approach takes advantage of the furthest Voronoi diagrams as well as the convex hulls of either the data set P (in the MRFN case) or thequery set Q (in the BRFN case). For the BRFN queries, we also extend the analysis to the situation when Q is large in size andbecomes disk-resident. Experiments on both synthetic and real data sets confirm the efficiency and scalability of proposed algorithms over the brute-force search based approach.
Immortal DB is a transaction time database systemthat is built into a commercial database system rather than beinglayered on top. This enables it to have performance that is veryclose to the performance of an unversioned current timedatabase system. Achieving such competitive performance isessential for wide acceptance of this temporal functionality. Inthis paper we describe further performance improvements in twocritical dimensions. First Immortal DB range searchperformance is improved for current time data via improvedcurrent version storage utilization, making this performanceessentially the same as unversioned performance. Second,Immortal DB update performance is increased by furtherreducing the cost for the timestamping of versions. Finally, weshow how a simple modification, integrated into thetimestamping mechanism, can provide a foundation for auditingdatabase activity. Our algorithms have been incorporated into acommercial database engine and experiments using this databaseengine demonstrate the effectiveness of our approach.
With the advance of wireless communication technology,it is quite common for people to view maps or get relatedservices from the handheld devices, such as mobile phones andPDAs. Range queries, as one of the most commonly used tools,are often posed by the users to retrieve needful information froma spatial database. However, due to the limits of communicationbandwidth and hardware power of handheld devices, displayingall the results of a range query on a handheld device is neithercommunication efficient nor informative to the users. This issimply because that there are often too many results returnedfrom a range query. In view of this problem, we present a novelidea that a concise representation of a specified size for the rangequery results, while incurring minimal information loss, shall becomputed and returned to the user. Such a concise range querynot only reduces communication costs, but also offers betterusability to the users, providing an opportunity for interactiveexploration. The usefulness of the concise range queries isconfirmed by comparing it with other possible alternatives, suchas sampling and clustering. Then we propose algorithms to finda good concise representation.
In this work we concentrate on categorization of relational attributes based on their data type. Assuming that attribute type/characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying the attributes based on the semantic type of the data contained therein (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, like clustering and indexing based on data types. This application is useful in cases where very large data collections that are generated in a distributed, ungoverned fashion end up having unknown, incomplete, inconsistent or very complex schemata and schema level meta-data. We concentrate on heuristically generating type-based attribute signatures based on both local and global computation approaches. We show experimentally that by decomposing data into q-grams and then considering signatures based on q-gram distributions, we achieve very good classification accuracy under the assumption that a large sample of the data is available for building the signatures. Then, we turn our attention to cases where a very small sample of the data is available, and hence accurately capturing the q-gram distribution of a given data type is almost impossible. We propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.
When merging data from various sources, it is often the case that small variations in data format and interpretation cause traditional functional dependencies (FDs) to be violated, without there being an intrinsic violation of semantics. Examples include differing address formats, or different reported latitude/longitudes for a given address. In such cases, we would like to specify a dependency structure on the merged data that is robust to such small differences.In this paper, we define metric functional dependencies, which strictly generalize traditional FDs by allowing small differences (controlled by a metric) in values of the consequent attribute of an FD. We show that this notion satisfies many of the standard properties of functional dependencies, and we present efficient algorithms for the verification problem: determining whether a given metric FD (MFD) holds for a given relation. We show that MFDs can be combined with approximate FDs, allowing tuples with identical antecedents to map to different consequents, some of which correspond to small (acceptable) variations, with others indicating more serious data quality issues. We experimentally demonstrate the validity and efficiency of our approach on various data sets that possess different underlying metrics, and lie in multidimensional spaces.
In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (4 GB RAM) desktop machine. Statistical machine translation experimental results corroborate that the resulting high-n approximate small language model is as effective as models obtained from other count pruning methods.
For a set P of n points in R^2, the Euclidean 2-center problem computes a pair of congruent disks of the minimal radius that cover P.We extend this to the (2, k)-center problem where we compute the minimal radius pair of congruent disks to cover n-k points of P. We present a randomized algorithm with O(nk^7 log^3 n) expected running time for the (2, k)-center problem. We also study the (p,k)-center problem in R^2 under the `l_{infty}-metric. We give solutions for p = 4 in O(k^{O(1)} n log n) timeand for p = 5 in O(k^{O(1)} n log^5 n) time.
The geometric median is a classic robust estimator of centrality for data in Euclidean spaces. In this paper we formulate the geometric median of data on a Riemannian manifold as the minimizer of the sum of geodesic distances to the data points. We prove existence and uniqueness of the geometric median on manifolds with non-positive sectional curvature and give sufficient conditions for uniqueness on positively curved manifolds. Generalizing the Weiszfeld procedure for finding the geometric median of Euclidean data, we present an algorithm for computing the geometric median on an arbitrary manifold. We show that this algorithm converges to the unique solution when it exists. This method produces a robust central point for data lying on a manifold, and should have use in a variety of vision applications involving manifolds. We give examples of the geometric median computation and demonstrate its robustness for three types of manifold data: the 3D rotation group, tensor manifolds, and shape spaces.
Consider a point set D with a measure function mu : D rightarrow R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axis-aligned rectangles, half planes, balls, k-oriented polygons). We say a range space (D, A) has an epsilon-approximation P if | mu(R cap P )/mu(P ) - mu(R cap D)/mu(D) | = epsilon.max R in AWe describe algorithms for deterministically constructing discrete epsilon-approximations for continuous point sets such as distributions or terrains. Furthermore, for certain families of subsets A, such as those described by axis-aligned rectangles, we reduce the size of the epsilon-approximations by almost a square root from O( 1/epsilon^2 log (1/epsilon )) to O( 1/epsilon polylog (1/epsilon) ). This is often the first step in transforming a continuous problem into a discrete one for which combinatorial techniques can be applied. We describe applications of this result in geo-spatial analysis, biosurveillance, and sensor networks.
Computing statistical information on probabilistic data hasattracted a lot of attention recently, as the data generatedfrom a wide range of data sources are inherently fuzzy oruncertain. In this paper, we study an important statisti-cal query on probabilistic data: finding the frequent items.One straightforward approach to identify the frequent itemsin a probabilistic data set is to simply compute the expectedfrequency of an item and decide if it exceeds a certain frac-tion of the expected size of the whole data set. However,this simple definition misses important information aboutthe internal structure of the probabilistic data and the in-terplay among all the uncertain entities. Thus, we proposea new definition based on the possible world semantics thathas been widely adopted for many query types in uncertaindata management, trying to find all the items that are likelyto be frequent in a randomly generated possible world. Ourapproach naturally leads to the study of ranking frequentitems based on confidence as well.Finding likely frequent items in probabilistic data turnsout to be much more difficult. We first propose exact algo-rithms for offline data with either quadratic or cubic time.Next, we design novel sampling-based algorithms for stream-ing data to find all approximately likely frequent items withtheoretically guaranteed high probability and accuracy. Oursampling schemes consume sublinear memory and exhibitexcellent scalability. Finally, we verify the effectiveness andefficiency of our algorithms using both real and syntheticdata sets with extensive experimental evaluations.
This work introduces novel polynomial-time algorithmsfor processing top-k queries in uncertain databases,under the generally adopted model of x-relations. An x-relationconsists of a number of x-tuples, and each x-tuple randomlyinstantiates into one tuple from one or more alternatives. Ourresults signi cantly improve the best known algorithms for top-kquery processing in uncertain databases, in terms of both runningtime and memory usage. Focusing on the single-alternative case,the new algorithms are orders of magnitude faster.
The overwhelming flow of information in manydata stream applications forces many companies to outsourceto a third-party the deployment of a Data Stream Management System (DSMS) for performing desired computations. Remotecomputations intrinsically raise issues of trust, making query execution assurance on data streams a problem with practicalimplications. Consider a client observing the same data stream asa remote server (e.g., network traffic), that registers a continuousquery on the server’s DSMS, and receives answers upon request.The client needs to verify the integrity of the results usingsignificantly fewer resources than evaluating the query locally.Towards that goal, we propose a probabilistic algorithm forselection and aggregate/group-by queries, that uses constantspace irrespective of the result-set size, has low update cost,and arbitrarily small probability of failure. We generalize thisalgorithm to allow some tolerance on the number of errors permitted (irrespective of error magnitude), and also discuss thehardness of permitting arbitrary errors of small magnitude. Wealso perform an empirical evaluation using live network traffic.
In this paper, we present a measure associated with detection and inference of statistically anomalous clusters of a graph based on the likelihood test of observed and expected edgesin a subgraph. This measure is adapted from spatial scan statistics for point sets and provides quantitative assessmentfor clusters. We discuss some important properties of this statistic and its relation to modularity and Bregman divergences. We apply a simple clustering algorithm to findclusters with large values of this measure in a variety of real-world data sets, and we illustrate its ability to identifystatistically significant clusters of selected granularity.
Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have similar “semantic type”, and that this can be captured using distributions over values (“statistical types”), we develop a method for validating 1-1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for “splitting” the string values in a column to identify substrings that are likely to match with the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.
In this paper we study the trade-offs between time series compressibility and partial information hiding and their fundamental implications on how we should introduce uncertainty about individual values by perturbing them. More specifically, if the perturbation does not have the same compressibility properties as the original data, then it can be detected and filtered out, reducing uncertainty. Thus, by making the perturbation “similar” to the original data, we can both preserve the structure of the data better, while simultaneously making breaches harder. However, as data become more compressible, a fraction of the uncertainty can be removed if true values are leaked, revealing how they were perturbed. We formalize these notions, study the above trade-offs on real data and develop practical schemes which strike a good balance and can also be extended for on-the-fly data hiding in a streaming environment.
As computer systems are essential components of many critical commercial services, the need for secure online transactions is now becoming evident. The demand for such applications, as the market grows, exceeds the capacity of individual businesses to provide fast and reliable services, making outsourcing technologies a key player in alleviating issues of scale. Consider a stock broker that needs to provide a real-time stock trading monitoring service to clients. Since the cost of multicasting this information to a large audience might become prohibitive, the broker could outsource the stock feed to third-party providers, who are in turn responsible for forwarding the appropriate sub-feed to clients. Evidently, in critical applications the integrity of the third-party should not be taken for granted. In this work we study a variety of authentication algorithms for selection and aggregation queries over sliding windows. Our algorithms enable the end-users to prove that the results provided by the third-party are correct, i.e., equal to the results that would have been computed by the original provider. Our solutions are based on Merkle hash trees over a forest of space partitioning data structures, and try to leverage key features,like update, query, signing, and authentication costs. We present detailed theoretical analysis for our solutions and empirically evaluate the proposed techniques.
We address the problem of providing scalable support for subscriptions with personalized value-based notication conditions in wideareapublish/subscribe systems. Notication conditions can be netuned by subscribers, allowing precise and exible control of when events are delivered to the subscribers. For example, a user may specify that she should be notied if and only if the price of aparticular stock moves outside a ?radius? around her last notied value. Naive techniques for handling notication conditions are notscalable. It is challenging to share subscription processing and notication dissemination of subscriptions with personalized valuebased notication conditions, because two subscriptions may seetwo completely different sequences of notications even if they specify the same radius. We develop and experimentally evaluate scalable processing and dissemination techniques for these subscriptions.Our approach uses standard network substrates for notication dissemination, and avoids pushing complex application processing into the network. Compared with other alternatives, our approach generates orders of magnitude lower network trafc, and incurs lower server processing cost.
We describe a variation of the iterative closest point (ICP) algorithm for aligning two point sets under a set of transformations. Our algorithm is superior to previous algorithms because (1) in determining the optimal alignment, it identifies and discards likely outliers in a statistically robust manner, and (2) it is guaranteed to converge to a locally optimal solution. To this end, we formalize a new distance measure, fractional root mean squared distance (FRMSD), which incorporates the fraction of inliers into the distance function. Our framework can easily incorporate most techniques and heuristics from modern registration algorithms. We experimentally validate our algorithm against previous techniques on 2 and 3 dimensional data exposed to a variety of outlier types.
We address the problem of preserving privacy in streams, which has received surprisingly limited attention. For static data, a well-studied and widely used approach is based on random perturbation of the data values. However, streams pose additional challenges. First, analysis of the data has to be performed incrementally, using limited processing time and buffer space, making batch approaches unsuitable. Second, the characteristics of streams evolve over time. Consequently, approaches based on global analysis of the data are not adequate. We show that it is possible to efficiently and effectively track the correlation and autocorrelation structure of multivariate streams and leverage it to add noise which maximally preserves privacy, in the sense that it is very hard to remove. Our techniques achieve much better results than previous static, global approaches, while requiring limited processing time and memory. We provide both a mathematical analysis and experimental evaluation on real data to validate the correctness, efficiency, and effectiveness of our algorithms.
The $k$-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain “identifying” attributes) contains at least $k$ records. Recently, several authors have recognized that $k$-anonymity cannot prevent attribute disclosure. The notion of $ell$-diversity has been proposed to address this; $ell$-diversity requires that each equivalence class has at least $ell$ well-represented values for each sensitive attribute.In this paper we show that $ell$-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called $t$-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold $t$). We choose to use the Earth Mover Distance measure for our $t$-closeness requirement. We discuss the rationale for $t$-closeness and illustrate its advantages through examples and experiments.
This paper deals with the problem, arising in practice, of drawing a directed graph as a collection of disjoint, isothetic rectangles, where the rectangles of the nodes of each edge must touch and where the placement of the rectangles respects the ordering of the edges. It provides characterizations for those graphs having the special type of rectangular layout known as a rectangular dual. It then characterizes the st-graphs having rectangular layouts in terms of the existence of certain planar embeddings and the non-existence of a particular subgraph.
Given a set of objects with durations (jobs) that cover a base region, can we schedule the jobs to maximize the duration the original region remains covered? We call this problem the sensor cover problem. This problem arises in the context of covering a region with sensors. For example, suppose you wish to monitor activity along a fence by sensors placed at various fixed locations. Each sensor has a range and limited battery life. The problem is to schedule when to turn on the sensors so that the fence is fully monitored for as long as possible. This one dimensional problem involves intervals on the real line. Associating a duration to each yields a set of rectangles in space and time, each specified by a pair of fixed horizontal endpoints and a height. The objective is to assign a position to each rectangle to maximize the height at which the spanning interval is fully covered. We call this one dimensional problem restricted strip covering. If we replace the covering constraint by a packing constraint, the problem is identical to dynamic storage allocation, a scheduling problem that is a restricted case of the strip packing problem. We show that the restricted strip covering problem is NP-hard and present an O(log log n)-approximation algorithm. We present better approximations or exact algorithms for some special cases. For the uniform-duration case of restricted strip covering we give a polynomial-time, exact algorithm but prove that the uniform-duration case for higher-dimensional regions is NP-hard. Finally, we consider regions that are arbitrary sets, and we present an O(log n)-approximation algorithm.
Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, e.g., accuracy, freshness and completeness, to capture common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and describe our technique to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present detailed experimental results, using diverse data sets of different types, to demonstrate that our approach provides a robust mechanism for identifying and quantifying database column heterogeneity.
Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance.In recent years, there has been much effortinvested in designing efficient algorithms for finding such ?high discrepancy? regions, with methods ranging from fast heuristics for special cases, to general grid-based methods,and to efficient approximation algorithms with provable guarantees on performance and quality.In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describea simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancyfunctions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of priormethods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct adetailed experimental comparison of these methods with a number of known methods, demonstrating that our approximationalgorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off.All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of thesemethods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancymaximization problem must use space linear in the input.
Given two sets A and B of n points each in R 2, we study the problem of computing a matching between A and B that minimizes the root mean square (rms) distance of matched pairs. We can compute an optimal matching in O(n^( 2 delta)) time, for any delta > 0, and an epsilon-approximation in time O((n/epsilon)^(3/2) (logn)^6). If the set B is allowed to move rigidly to minimize the rms distance, we can compute a rigid motion of B and a matching in O((n^4 /epsilon^(5/2))(logn)^6 ) time whose cost is within (1 epsilon) factor of the optimal one
Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such %u201Chigh discrepancy%u201D regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality. In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off. All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.
In outsourced database (ODB) systems the database owner publishes its data through a number of remote servers, with the goal of enabling clients at the edge of the network to access and query the data more efficiently. As servers might be untrusted or can be compromised, query authentication becomes an essential component of ODB systems. Exist-ing solutions for this problem concentrate mostly on static scenarios and are based on idealistic properties for certain cryptographic primitives. In this work, first we define a variety of essential and practical cost metrics associated with ODB systems. Then, we analytically evaluate a number of different approaches, in search for a solution that best lever-ages all metrics. Most importantly, we look at solutions that can handle dynamic scenarios, where owners periodi-cally update the data residing at the servers. Finally, we discuss query freshness, a new dimension in data authentication that has not been explored before. A comprehensive experimental evaluation of the proposed and existing approaches is used to validate the analytical models and verify our claims. Our findings exhibit that the proposed solutions improve performance substantially over existing approaches, both for static and dynamic environments.
In this paper, we investigate a new approach to process queries in data stream applications. We show that reference locality characteristics of data streams could be exploited in the design of superior and flexible data stream query processing techniques. We identify two different causes of reference locality: popularity over long time scales and temporal correlations over shorter time scales. An elegant mathematical model is shown to precisely quantify the degree of those sources of locality. Furthermore, we analyze the impact of locality-awareness on achievable performance gains over traditional algorithms on applications such as MAX-subset approximate sliding window join and approximate count estimation. In a comprehensive experimental study, we compare several existing algorithms against our locality-aware algorithms over a number of real datasets. The results validate the usefulness and efficiency of our approach.
Anomaly detection has important applications in bio-surveilance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely, finding clusters in the measured data may not actually represent true anomalies. These clusters may likely be the clusters of the baseline distribution. Hence, a discrep-ancy function is often used to examine how different measured data is to baseline data within a region. An anomalous region is thus defined to be one with high discrepancy.In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees,both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy;roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the sameset of shapes.We derive general discrepancy functions for data generated from a one- parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for datafrom a Poisson distribution. We present an algorithm running in O( 1/epsilon n^2 (logn)^2) that computes the maximum discrepancy rectangle to within additive error ., for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian,Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).
In this paper we discuss a new type of query in Spatial Databases, called the Trip Planning Query (TPQ). Given a set of points of interest P in space, where each point belongs to a specific category, a starting point S and a destination E, TPQ retrieves the best trip that starts at S, passes through at least one point from each category, and ends atE. For example, a driver traveling from Boston to Providence might want to stop to a gas station, a bank and a post office on his way, and the goal is to provide him with the best possible route (in terms of distance, traffic, road conditions, etc.). The difficulty of this query lies in the existence of multiple choices per category. In this paper, we study fast approximation algorithms for TPQ in a metric space. We provide a number of approximation algorithms with approximation ratios that depend on either the number of categories, the maximum number of points per category or both. Therefore, for different instances of the problem, we can choose the algorithm with the best approximation ratio, since they all run in polynomial time. Furthermore, we use some of the proposed algorithms to derive efficient heuristics for large datasets stored in external memory. Finally, we give an experimental evaluation of the proposed algorithms using both synthetic and real datasets.
Motion planning for systems with constraints on controls or the need for relatively straight paths for real-time actions presents challenges for modern planners. This paper presents an approach which addresses these types of systems by building on existing motion planning approaches. Guided Expansive Spaces Trees are introduced to search for a low cost and relatively straight path in a space with motion constraints. Path Gradient Descent, which builds on the idea of Elastic Strips, finds the locally optimal path for an existing path. These techniques are tested on simulations of rendezvous and docking of the space shuttle to the International Space Station and of a 4-foot fan-controlled blimp in a factory setting
In the emerging area of sensor-based systems, a sig-nificant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, exemplified by TinyDB and Cougar, which allow users to perform aggregation queries such as MIN, COUNT and AVG on a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traÆc. However, these aggregation strategies become bandwidth-intensive when combined with the fault-tolerant, multi-path routing methods often used in these environments. For example, duplicate-sensitive aggregates such as SUM cannot be computed exactly using substantially less bandwidth than explicit enumeration. To avoid this expense, we investigate the use of approximate in-network aggregation using small sketches. Our contributions are as follows: 1) we generalize well known duplicate-insensitive sketches for approximating COUNTto handle SUM,2)we present and analyze methods for using sketches to produce accurate results with low communication and computation overhead, and 3) we present an extensive experimental validation of our methods.
Several spatio-temporal applications require the retrieval of summarized information about moving objects that lie in a query region during a query interval (e.g., the number of mobile users covered by a cell, traffic volume in a district, etc.). Existing solutions have the distinct counting problem: if an object remains in the query region for several timestamps during the query interval, it will be counted multiple times in the result. The paper solves this problem by integrating spatio-temporal indexes with sketches, traditionally used for approximate query processing. The proposed techniques can also be applied to reduce the space requirements of conventional spatiotemporal data and to mine spatio-temporal association rules.
This paper presents a probabilistic approach to solve optimal control problems with appli-cation to spacecraft proximity operations. The 6 degree-of-freedom rendezvous and dockingproblem, using impulsive control, and avoidance of known obstacles and plume impingement issolved. Our solution is then extended to real-time obstacle avoidance. The space is searchedby expanding from the start location by applying only feasible controls and coasts, reducingby nearly 50% the variables perturbed in the search. A randomized expansion techniqueexplores the search space. A gradient descent approach smoothes the path and avoids newobstacles in real-time by 'stretching' the best pre-computed path in a locally optimal manner.
This paper presents a probabilistic approach to solve optimal control problems with appli-cation to spacecraft proximity operations. The 6 degree-of-freedom rendezvous and dockingproblem, using impulsive control, and avoidance of known obstacles and plume impingement issolved. Our solution is then extended to real-time obstacle avoidance. The space is searchedby expanding from the start location by applying only feasible controls and coasts, reducingby nearly 50% the variables perturbed in the search. A randomized A* expansion techniqueexplores the search space. A gradient descent approach smoothes the path and avoids newobstacles in real-time by 'stretching' the best pre-computed path in a locally optimal manner.
Information sources over the WWW contain a large amount of data organized according to di®erent interests and values. Thus, it is important that facilities are there to enable users to extract information of interests in a simple and e®ective manner. To do this, information from the Web sources need to be extracted automatically according tousers' interests. However, the extraction of information requires in-depth knowledge of relevant technologies and the extraction process is slow, tedious and di±cult for ordinary users. We propose the Wiccap Data Model, an XML data model that maps Web information sources into commonly perceived logical models. Based on this data model, ordinary users are able to extract information easily and e±ciently. To accelerate the creation of data models, we also de¯ne a formal process for creating such data model and have implemented a software tool to facilitate andautomate the process of producing Wiccap Data Models.
Applications such as suturing in medical simulations require the modeling of knot tying in physically realistic rope. The paper describes the design and implementation of such a system. Our model uses a spline of linear springs, adaptive subdivision and a dynamics simulation. Collisions are discrete event simulated and follow the impulse model. Although some care must be taken to maintain stable knots, we demonstrate our simple model is sufficient for this task. In particular, we do not use friction or explicit constraints to maintain the knot. As examples, we tie an overhand knot and a reef knot.
Sensor network localization (SNL) is the problem of determining the locations of the sensors given sparse and usually noisy inter-communication distances among them. In this work we propose an iterative algorithm named PLACEMENT to solve the SNL problem. This iterative algorithm requires an initial estimation of the locations and in each iteration, is guaranteed to reduce the cost function. The proposed algorithm is able to take advantage of the good initial estimation of sensor locations making it suitable for localizing moving sensors, and also suitable for the refinement of the results produced by other algorithms. Our algorithm is very scalable. We have experimented with a variety of sensor networks and have shown that the proposed algorithm outperforms existing algorithms both in terms of speed and accuracy in almost all experiments. Our algorithm can embed 120,000 sensors in less than 20 minutes.
We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.
The Johnson-Lindenstrauss Lemma states that a set of $n$ points may be embedded in a space of dimension $O(\log n/\eps^2)$ while preserving all pairwise distances within a factor of $(1+\epsilon)$ with high probability. It has inspired a number of proofs that extend the result, simplify it, and improve the efficiency of computing the resulting embedding. The lemma is a critical tool in the realm of dimensionality reduction and high dimensional approximate computational geometry. It is also employed for data mining in domains that analyze intrinsically high dimensional objects such as images and text. However, while algorithms for performing the dimensionality reduction have become increasingly sophisticated, there is little understanding of the behavior of these embeddings in practice. In this paper, we present the first comprehensive study of the empirical behavior of algorithms for dimensionality reduction based on the JL Lemma. Our study answers a number of important questions about the quality of the embeddings and the performance of algorithms used to compute them. Among our key results: Determining a likely range for the big-Oh constant in practice for the dimension of the target space, and demonstrating the accuracy of the predicted bounds. Finding "best in class" algorithms over wide ranges of data size and source dimensionality, and showing that these depend heavily on parameters of the data as well its sparsity. Developing the best implementation for each method, making use of non-standard optimized codes for key subroutines. Identifying critical computational bottleneck that can spur further theoretical study of efficient algorithms.
In this paper, we propose three metrics to quantify the differences between the results of diffusion tensor magnetic resonance imaging (DT-MRI) fiber tracking algorithms: the area between corresponding fibers of each bundle, the Earth Mover?s Distance (EMD) between two fiber bundle volumes, and the current distance between two fiber bundle volumes. We also discuss an interactive fiber track comparison visualization toolkit we have developed based on the three proposed fiber difference metrics and have tested on six widely-used fiber tracking algorithms. To show the effectiveness and robustness of our metrics andvisualization toolkit, we present results on both synthetic data and high resolution monkey brain DT-MRI data. Our toolkit can be used for testing the noise effects on fiber tracking analysis and visualization and to quantify the difference between any pair of DT-MRI techniques, compare single subjects within an image atlas.
In this work, we show how active learning in some (target) domain can leverage information from a different but related source) domain. We present an algorithm that harnesses the source domain data to learn the best possible initializer hypothesis for doing active learning in the target domain, resulting in improved label complexity. We also present a variant of this algorithm which additionally uses the domain divergence information to selectively query the most informative points in the target domain, leading to further reductions in label complexity. Experimental results on a variety of datasets establish the efficacy of the proposed methods.
In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in context of streaming data. We explore sketch techniques, especially Count-Min Sketch, which approximates the frequency of a word-pair in the corpus without explicitly storing the word-pairs themselves. We further use the idea of a conservative update with Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all words and word-pairs counts computed from 37 GB of web data in just 2 billion counters (8 GB main memory). The number of these counters is upto 30 times less than the stream size which is really a big memory and space gain. In Semantic Orientation experiments, the PMI scores computed from 2 billion counters are as effective as exact PMI scores.
Multi-Dimensional Scaling (MDS) is a widely used method for embedding a given distance matrix into a low dimensional space, used both as a preprocessing step for many machine learning problems, as well as a visualization tool in its own right. In this paper, we present an incremental version of MDS (iMDS). In iMDS, d-dimensional data points are presented in a stream, and the task is to embed the current d-dimensional data point into a k-dimensional space for k < d such that distances from the current point to the previous points are preserved. Let {x1,..., xt-1} ? R k be the data points at time step t that have already been embedded into a k-dimensional space, {r1,..., rt-1} be the given distances computed in R d, then objective of iMDS is to find a point xt: xt = arg min p?R k ?t-1 i=1
Protein-protein interactions form the basis for many intercellular events. In this paper we develop a tool for understanding the structure of these interactions. Specifically, we define a method for identifying a set of structural motifs on protein-protein interface surfaces. These motifs are secondary structures, akin to alpha-helices and beta-sheets in protein structure; they describe how multiple residues form knob-into-holefeatures across the interface. These motifs are generated entirely from geometric properties and are easily annotated with additional biologicaldata. We point to the use of these motifs in analyzing hotspot residues.
Information sources over the WWW contain a large amount of data organized according to different interests and values. Thus, it is important that facilities are there to enable users to extract information of interest in a simple and effective manner. To do this, We propose the Wiccap Data Model, an XML data model that maps Web information sources into commonly perceived logical models, so that information can be extracted automatically according to users' interests. To accelerate the creation of data models, we have implemented a visual tool, called the Mapping Wizard, to facilitate and automate the process of producing Wiccap Data Models. Using the tool, the time required to construct a logical data model for a given website is significantly reduced.
WWW information Collection, Collaging and Programming (WICCAP) system is a software system for generation of logical views of web resources and extraction of the desired information to a structured document. It is designed to enable people to obtain their interested information in a simple and effective manner as well as to make information from the WWW accessible to applications, in order to offer automation, inter-operation and Web-awareness among services. A key factor in making this system useful in practice is that it provides tools to automate and facilitate the process of constructing the logical representation of Web Sites, defining the interested information and subsequently retrieving them. In this work, we present the design of the WICCAP system and its two main components, namely Mapping Wizard and Network Extraction Agent.