Suresh Venkatasubramanian
In this chapter, we survey the various approaches that have been proposed to measure privacy (and the loss of privacy). Since most privacy concerns (especially those related to health-care information) are raised in the context of legal concerns, it is instructive to view privacy from a legal perspective, rather than from purely technical considerations. It is beyond the scope of this survey (...and the expertise of the author!) to review the legal interpretations of privacy. However, one essay on privacy that appears directly relevant (and has inspired at least one paper surveyed here) is Ruth Gavison's view of privacy in terms of the access that others have to us and our information. In her view, a general definition of privacy must be one that is measurable, of value, and actionable. The first property needs no explanation; the second means that the entity being considered private must be valuable; and the third argues that, from a legal perspective, only those losses of privacy that can be prosecuted are of interest. This survey, and much of the research on privacy, concerns itself with the measurement of privacy.
The notion of 'depth' has been used in statistics as a way to identify the center of a bivariate distribution given by a point set $P$ in $R^2$. We present a general framework for computing such statistical estimators that makes extensive use of modern graphics architectures. As a result, we derive improved algorithms for a number of depth measures such as location depth, simplicial depth, Oja depth, colored depth, and dynamic location depth. Our algorithms perform significantly better than currently known implementations, outperforming them by at least one order of magnitude while having a strictly better asymptotic growth rate.
One of the primary goals of computational anatomy is the statistical analysis of anatomical variability in large populations of images. The study of anatomical shape is inherently related to the construction of transformations of the underlying coordinate space, which map one anatomy to another. It is now well established that representing the geometry of shapes or images in Euclidean spaces undermines our ability to represent natural variability in populations. In our previous work we have extended classical statistical analysis techniques, such as averaging, principal components analysis, and regression, to Riemannian manifolds, which are more appropriate representations for describing anatomical variability. In this paper we extend the notion of robust estimation, a well established and powerful tool in traditional statistical analysis of Euclidean data, to manifold-valued representations of anatomical variability. In particular, we extend the geometric median, a classic robust estimator of centrality for data in Euclidean spaces. We formulate the geometric median of data on a Riemannian manifold as the minimizer of the sum of geodesic distances to the data points. We prove existence and uniqueness of the geometric median on manifolds with non-positive sectional curvature and give sufficient conditions for uniqueness on positively curved manifolds. Generalizing the Weiszfeld procedure for finding the geometric median of Euclidean data, we present an algorithm for computing the geometric median on an arbitrary manifold. We show that this algorithm converges to the unique solution when it exists. In this paper we exemplify the robustness of the estimation technique by applying the procedure to various manifolds commonly used in the analysis of medical images. Using this approach, we also present a robust brain atlas estimation technique based on the geometric median in the space of deformable images.
Contact graphs of isothetic rectangles unify many concepts from applications including VLSI and architectural design, computational geometry, and GIS. Minimizing the area of their corresponding rectangular layouts is a key problem. We study the area-optimization problem and show that it is NP-hard to find a minimum-area rectangular layout of a given contact graph. We present O(n)-time algorithms that construct O(n^2)-area rectangular layouts for general contact graphs and O(n log n)-area rectangular layouts for trees. (For trees, this is an O(log n)-approximation algorithm.) We also present an infinite family of graphs (respectively, trees) that require Ω(n^2) (respectively, Ω(n log n)) area. We derive these results by presenting a new characterization of graphs that admit rectangular layouts, using the related concept of rectangular duals. A corollary to our results relates the class of graphs that admit rectangular layouts to rectangle-of-influence drawings.
The problem of curve matching appears in many application domains, such as time series analysis, shape matching, speech recognition, and signature verification, among others. Curve matching has been studied extensively by computational geometers, and many measures of similarity have been examined, among them the Fréchet distance (sometimes referred to as the "dog-man" distance). A measure that is very closely related to the Fréchet distance but has never been studied in a geometric context is the Dynamic Time Warping measure (DTW), first used in the context of speech recognition. This measure is ubiquitous across different domains, a surprising fact in itself because notions of similarity usually vary significantly with the application. However, DTW suffers from a few obvious drawbacks, stemming from the fact that it is defined between sequences of points rather than curves, and the way in which a curve is sampled to yield such a sequence can dramatically affect the quality of the result. Some attempts have been made to generalize DTW to continuous domains, but the resulting algorithms have exponential complexity. In this paper we propose similarity measures that attempt to capture the "spirit" of dynamic time warping while being defined over continuous domains, and present efficient algorithms for computing them. Our formulation leads to a very interesting connection with finding short paths in a combinatorial manifold defined on the input chains, and in a deeper sense relates to the way light travels in a medium of variable refractivity.
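For reference, the discrete DTW measure discussed above is the classic dynamic program sketched below (in Python, with an assumed Euclidean local cost); it is the sequence-based formulation whose sampling sensitivity motivates the continuous measures proposed in the paper.

```python
import numpy as np

def dtw(a, b):
    """Discrete dynamic time warping between two point sequences.

    a, b: arrays of shape (n, d) and (m, d). Returns the minimum total cost
    of a monotone alignment, using Euclidean distance as the local cost.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```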
In the setting of streaming interactive proofs (SIPs), a client (verifier) needs to compute a given function on a massive stream of data, arriving online, but is unable to store even a small fraction of the data. It outsources the processing to a third party service (prover), but is unwilling to blindly trust answers returned by this service. Thus, the service cannot simply supply the desired answer; it must convince the verifier of its correctness via a short interaction after the stream has been seen. In this work we study “barely interactive” SIPs. Specifically, we show that two or three rounds of interaction suffice to solve several query problems — including Index, Median, Nearest Neighbor Search, Pattern Matching, and Range Counting — with polylogarithmic space and communication costs. Such efficiency with O(1) rounds of interaction was thought to be impossible based on previous work. On the other hand, we initiate a formal study of the limitations of constant-round SIPs by introducing a new hierarchy of communication models called Online Interactive Proofs (OIPs). The online nature of these models is analogous to the streaming restriction placed upon the verifier in an SIP. We give upper and lower bounds that (1) characterize, up to quadratic blowups, every finite level of the OIP hierarchy in terms of other well-known communication complexity classes, (2) separate the first four levels of the hierarchy, and (3) reveal that the hierarchy collapses to the fourth level. Our study of OIPs reveals marked contrasts and some parallels with the classic Turing Machine theory of interactive proofs, establishes limits on the power of existing techniques for developing constant-round SIPs, and provides a new characterization of (non-online) Arthur–Merlin communication in terms of an online model.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pretraining: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pretraining step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process. When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses. We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that, while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
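For intuition, disparate impact is commonly operationalized via the "80% rule": the ratio of selection rates between the protected group and the remaining population should not fall below 0.8. The sketch below computes only that ratio; it is an illustrative helper (hypothetical names), not the paper's classification-accuracy-based test.

```python
def disparate_impact_ratio(outcomes, groups, protected):
    """Selection-rate ratio behind the '80% rule'.

    outcomes: iterable of 0/1 decisions; groups: iterable of group labels;
    protected: the label of the protected group. A ratio below 0.8 is the
    usual threshold for a prima facie disparate-impact claim.
    """
    protected_outcomes = [o for o, g in zip(outcomes, groups) if g == protected]
    other_outcomes = [o for o, g in zip(outcomes, groups) if g != protected]
    rate = lambda xs: sum(xs) / len(xs)  # assumes both groups are non-empty
    return rate(protected_outcomes) / rate(other_outcomes)
```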
Network radio frequency (RF) environment sensing (NRES) systems pinpoint and track people in buildings using changes in the signal strength measurements made by a wireless sensor network. It has been shown that such systems can locate people who do not participate in the system by wearing any radio device, even through walls, because of the changes that moving people cause to the static wireless sensor network. However, many such systems cannot locate stationary people. We present and evaluate a system which can locate stationary or moving people, without calibration, by using kernel distance to quantify the difference between two histograms of signal strength measurements. From five experiments, we show that our kernel distance-based radio tomographic localization system performs better than the state-of-the-art NRES systems in different non-line-of-sight environments.
In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daumé III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d^2 log(1/ε)) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd^2 log(1/ε)) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than the baselines we compare against, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high-dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.
We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.
In this paper, we harness the synergy between two important learning paradigms, namely, active learning and domain adaptation. We show how active learning in a target domain can leverage information from a different but related source domain. Our proposed framework, Active Learning Domain Adapted (Alda), uses source domain knowledge to transfer information that facilitates active learning in the target domain. We propose two variants of Alda: a batch B-Alda and an online O-Alda. Empirical comparisons with numerous baselines on real-world datasets establish the efficacy of the proposed methods.
The space of positive definite matrices P(n) is a Riemannian manifold with variable nonpositive curvature. It includes Euclidean space and hyperbolic space as submanifolds, and poses significant challenges for the design of algorithms for data analysis. In this paper, we develop foundational geometric structures and algorithms for analyzing collections of such matrices. A key technical contribution of this work is the use of horoballs, a natural generalization of halfspaces for non-positively curved Riemannian manifolds. We propose generalizations of the notion of a convex hull and a center point, and approximations of these structures using horoballs, based on novel decompositions of P(n). This leads to an algorithm for approximate hulls using a generalization of extents.
Starting with a similarity function between objects, it is possible to define a distance metric (the kernel distance) on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis and geometric measure theory, and have a rich structure that includes an isometric embedding into a Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that run in near-linear time in the size of P ∪ Q (an explicit calculation would take quadratic time). We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n poly(1/ε, log n)). We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of point sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets.
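As a point of reference, the quadratic-time "explicit calculation" mentioned above follows directly from the definition of the kernel distance. The sketch below assumes a Gaussian kernel and uses illustrative names; it is the brute-force baseline, not the paper's near-linear-time approximation.

```python
import numpy as np

def kernel_distance(P, Q, sigma=1.0):
    """Kernel distance between point sets P (n, d) and Q (m, d).

    Uses a Gaussian kernel k(p, q) = exp(-||p - q||^2 / (2 sigma^2)) and
    D_K(P, Q)^2 = kappa(P, P) + kappa(Q, Q) - 2 kappa(P, Q), where
    kappa(A, B) = sum_{a in A} sum_{b in B} k(a, b).
    """
    def kappa(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2)).sum()
    return np.sqrt(max(kappa(P, P) + kappa(Q, Q) - 2 * kappa(P, Q), 0.0))
```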
This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.
In this paper, we propose an online multitask learning framework where the weight vectors are updated in an adaptive fashion based on inter-task relatedness. Our work is in contrast with earlier work on online multitask learning where the authors use a fixed interaction matrix of tasks to derive (fixed) update rules for all the tasks. In this work, we propose to update this interaction matrix itself in an adaptive fashion so that the weight vector updates are no longer fixed but are instead adaptive. Our framework can be extended to an active learning setting where the informativeness of an incoming instance across all the tasks can be evaluated using this adaptive interaction matrix. Empirical results on standardized datasets show improved performance in terms of accuracy, label complexity and number of mistakes made.
This paper evaluates features of graph coloring algorithms implemented on graphics processing units (GPUs), comparing coloring heuristics and thread decompositions. As compared to prior work on graph coloring for other parallel architectures, we find that the large number of cores and relatively high global memory bandwidth of a GPU lead to different strategies for the parallel implementation. Specifically, we find that a simple uniform block partitioning is very effective on GPUs, and our parallel coloring heuristics lead to the same or fewer colors than prior approaches for distributed-memory cluster architectures. Our algorithm resolves many coloring conflicts across partitioned blocks on the GPU by iterating through the coloring process, before returning to the CPU to resolve remaining conflicts. With this approach we get as few colors (if not fewer) than the best sequential graph coloring algorithm, and performance is close to the fastest sequential graph coloring algorithms, which have poor color quality.
We propose an algorithm for dimensionality reduction on the simplex, mapping a set of high-dimensional distributions to a space of lower-dimensional distributions, whilst approximately preserving pairwise Hellinger distance between distributions. By introducing a restriction on the input data to distributions that are in some sense quite smooth, we can map $n$ points on the $d$-simplex to the simplex of $O(\epsilon^{-2}\log n)$ dimensions with $\epsilon$-distortion with high probability. The techniques used rely on a classical result by Johnson and Lindenstrauss on dimensionality reduction for Euclidean point sets and require the same number of random bits as non-sparse methods proposed by Achlioptas for database-friendly dimensionality reduction.
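To see why a Johnson-Lindenstrauss-style projection is relevant here, note that the Hellinger distance between two distributions equals (up to a constant factor) the Euclidean distance between their coordinate-wise square roots. The sketch below illustrates only that square-root embedding followed by a dense Gaussian projection; mapping the projected points back onto a lower-dimensional simplex, which is the paper's actual contribution, is not shown, and all names are illustrative.

```python
import numpy as np

def hellinger(p, q):
    # H(p, q) = (1/sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def jl_project(X, k, seed=0):
    # Dense Gaussian Johnson-Lindenstrauss projection to k dimensions.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    return X @ rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))

# Square-root map: Euclidean distances between sqrt(p) vectors are (scaled)
# Hellinger distances, so a JL projection of the sqrt vectors approximately
# preserves pairwise Hellinger distance.
```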
In this paper, we propose a unified algorithmic framework for solving many known variants of multidimensional scaling (MDS). Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods in comparable time. Moreover, they have a small memory footprint and scale effectively for large data sets. We expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower-dimensional sphere, preserving geodesic distances. As a complement to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, by showing that projecting to a random O((1/ε) log n)-dimensional sphere causes only an ε distortion in the geodesic distances.
We present a streaming model for large-scale classification (in the context of $\ell_2$-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The $\ell_2$-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB), and an efficient core-set-based algorithm (CVM) (Tsang et al., 2005) exists that learns a $(1+\epsilon)$-approximate MEB for a set of points and yields an approximate solution to the corresponding SVM instance. However, CVM works in batch mode, requiring multiple passes over the data. We present a single-pass SVM based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector using simple Perceptron-like update equations. Our algorithm performs polylogarithmic computation per example, requires very small and constant storage, and finds simpler solutions (measured in terms of the number of support vectors). Experimental results show that, even in such a restrictive setting, we can learn efficiently in just one pass and obtain accuracies comparable to other state-of-the-art SVM solvers. We also discuss some open issues and possible extensions.
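For intuition, a minimal one-pass minimum-enclosing-ball update can be sketched as follows. This is a standard online heuristic shown purely for illustration; it is not the paper's Perceptron-like weight-vector update, and the class and parameter names are hypothetical.

```python
import numpy as np

class StreamingMEB:
    """Simple one-pass MEB heuristic: when a point falls outside the current
    ball, grow the radius just enough and shift the center toward the point."""

    def __init__(self, first_point):
        self.center = np.asarray(first_point, dtype=float)
        self.radius = 0.0

    def update(self, p):
        p = np.asarray(p, dtype=float)
        d = np.linalg.norm(p - self.center)
        if d > self.radius:
            delta = (d - self.radius) / 2.0
            self.radius += delta
            self.center += delta * (p - self.center) / d
        return self.center, self.radius
```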
In this paper, we describe a system for approximate shape matching and symmetry (rotation and reflection) detection of geometric shapes represented as point clouds. Rather than using the least-squares distance as a measure of similarity between shapes, we use the Hausdorff distance between point sets as the underlying shape metric. This allows us to exploit methods from geometric pattern matching to return symmetries and rigid transformation matches with guaranteed error bounds on the quality of our solution. The approximation is determined by intuitive user-specified input precision and distance threshold parameters. Another important feature of our method is that it leverages FFT-based techniques for string matching to compute all approximate symmetries simultaneously. Our algorithm is simple to implement and is efficient; we present a detailed experimental study.
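For reference, the underlying shape metric is the Hausdorff distance, whose brute-force form is simple to state. The sketch below is the quadratic-time baseline, not the paper's FFT-based matching algorithm.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A (n, d) and B (m, d);
    brute-force O(nm) computation of all pairwise distances."""
    D = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    h_ab = D.min(axis=1).max()   # max over A of distance to nearest point in B
    h_ba = D.min(axis=0).max()   # max over B of distance to nearest point in A
    return max(h_ab, h_ba)
```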
In this work we concentrate on categorization of relational attributes based on their data type. Assuming that attribute type/characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying the attributes based on the semantic type of the data contained therein (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, like clustering and indexing based on data types. This application is useful in cases where very large data collections that are generated in a distributed, ungoverned fashion end up having unknown, incomplete, inconsistent or very complex schemata and schema level meta-data. We concentrate on heuristically generating type-based attribute signatures based on both local and global computation approaches. We show experimentally that by decomposing data into q-grams and then considering signatures based on q-gram distributions, we achieve very good classification accuracy under the assumption that a large sample of the data is available for building the signatures. Then, we turn our attention to cases where a very small sample of the data is available, and hence accurately capturing the q-gram distribution of a given data type is almost impossible. We propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.
When merging data from various sources, it is often the case that small variations in data format and interpretation cause traditional functional dependencies (FDs) to be violated, without there being an intrinsic violation of semantics. Examples include differing address formats, or different reported latitude/longitude values for a given address. In such cases, we would like to specify a dependency structure on the merged data that is robust to such small differences. In this paper, we define metric functional dependencies, which strictly generalize traditional FDs by allowing small differences (controlled by a metric) in values of the consequent attribute of an FD. We show that this notion satisfies many of the standard properties of functional dependencies, and we present efficient algorithms for the verification problem: determining whether a given metric FD (MFD) holds for a given relation. We show that MFDs can be combined with approximate FDs, allowing tuples with identical antecedents to map to different consequents, some of which correspond to small (acceptable) variations, with others indicating more serious data quality issues. We experimentally demonstrate the validity and efficiency of our approach on various data sets that possess different underlying metrics, and lie in multidimensional spaces.
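A minimal sketch of the verification problem for a single metric FD, assuming the relation fits in memory and using a brute-force pairwise check within each antecedent group (the paper gives more efficient algorithms); all names are illustrative.

```python
from collections import defaultdict

def verify_metric_fd(rows, antecedent, consequent, dist, delta):
    """Check whether the metric FD antecedent -> consequent holds on rows.

    rows: iterable of dicts; antecedent: tuple of attribute names;
    consequent: attribute name; dist: metric on consequent values;
    delta: allowed variation in the consequent.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in antecedent)].append(row[consequent])
    for values in groups.values():
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                if dist(values[i], values[j]) > delta:
                    return False
    return True
```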
In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (4 GB RAM) desktop machine. Statistical machine translation experimental results corroborate that the resulting approximate high-order language model is as effective as models obtained from other count-pruning methods.
The geometric median is a classic robust estimator of centrality for data in Euclidean spaces. In this paper we formulate the geometric median of data on a Riemannian manifold as the minimizer of the sum of geodesic distances to the data points. We prove existence and uniqueness of the geometric median on manifolds with non-positive sectional curvature and give sufficient conditions for uniqueness on positively curved manifolds. Generalizing the Weiszfeld procedure for finding the geometric median of Euclidean data, we present an algorithm for computing the geometric median on an arbitrary manifold. We show that this algorithm converges to the unique solution when it exists. This method produces a robust central point for data lying on a manifold, and should have use in a variety of vision applications involving manifolds. We give examples of the geometric median computation and demonstrate its robustness for three types of manifold data: the 3D rotation group, tensor manifolds, and shape spaces.
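For reference, the classical Euclidean special case of the Weiszfeld procedure generalized in this paper can be sketched as follows; the manifold version replaces straight-line averaging with geodesic steps, and the names here are illustrative.

```python
import numpy as np

def weiszfeld(points, iters=100, eps=1e-9):
    """Weiszfeld iteration for the Euclidean geometric median.

    points: array of shape (n, d). Repeatedly takes a distance-weighted
    average of the data until the iterate stops moving.
    """
    X = np.asarray(points, dtype=float)
    y = X.mean(axis=0)                      # start at the centroid
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1)
        d = np.maximum(d, eps)              # avoid division by zero at data points
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y
```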
Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have similar “semantic type”, and that this can be captured using distributions over values (“statistical types”), we develop a method for validating 1-1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for “splitting” the string values in a column to identify substrings that are likely to match with the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.
The $k$-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain "identifying" attributes) contains at least $k$ records. Recently, several authors have recognized that $k$-anonymity cannot prevent attribute disclosure. The notion of $\ell$-diversity has been proposed to address this; $\ell$-diversity requires that each equivalence class has at least $\ell$ well-represented values for each sensitive attribute. In this paper we show that $\ell$-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called $t$-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold $t$). We choose to use the Earth Mover Distance measure for our $t$-closeness requirement. We discuss the rationale for $t$-closeness and illustrate its advantages through examples and experiments.
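For ordered (numerical) sensitive attributes, the Earth Mover Distance between two distributions over the same m ordered values has a simple closed form via cumulative sums. The sketch below assumes that ordered-distance ground metric (adjacent values at distance 1/(m-1)); function names are illustrative.

```python
import numpy as np

def emd_ordered(p, q):
    """Earth Mover Distance between two distributions over the same ordered
    domain of m values, with adjacent values at distance 1/(m-1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = len(p)
    return np.abs(np.cumsum(p - q)).sum() / (m - 1)

def satisfies_t_closeness(class_dist, table_dist, t):
    # An equivalence class satisfies t-closeness if its sensitive-attribute
    # distribution is within EMD t of the whole-table distribution.
    return emd_ordered(class_dist, table_dist) <= t
```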
This paper deals with the problem, arising in practice, of drawing a directed graph as a collection of disjoint, isothetic rectangles, where the rectangles of the nodes of each edge must touch and where the placement of the rectangles respects the ordering of the edges. It provides characterizations for those graphs having the special type of rectangular layout known as a rectangular dual. It then characterizes the st-graphs having rectangular layouts in terms of the existence of certain planar embeddings and the non-existence of a particular subgraph.
Given a set of objects with durations (jobs) that cover a base region, can we schedule the jobs to maximize the duration the original region remains covered? We call this problem the sensor cover problem. This problem arises in the context of covering a region with sensors. For example, suppose you wish to monitor activity along a fence by sensors placed at various fixed locations. Each sensor has a range and limited battery life. The problem is to schedule when to turn on the sensors so that the fence is fully monitored for as long as possible. This one-dimensional problem involves intervals on the real line. Associating a duration with each yields a set of rectangles in space and time, each specified by a pair of fixed horizontal endpoints and a height. The objective is to assign a position to each rectangle to maximize the height at which the spanning interval is fully covered. We call this one-dimensional problem restricted strip covering. If we replace the covering constraint by a packing constraint, the problem is identical to dynamic storage allocation, a scheduling problem that is a restricted case of the strip packing problem. We show that the restricted strip covering problem is NP-hard and present an O(log log n)-approximation algorithm. We present better approximations or exact algorithms for some special cases. For the uniform-duration case of restricted strip covering we give a polynomial-time, exact algorithm but prove that the uniform-duration case for higher-dimensional regions is NP-hard. Finally, we consider regions that are arbitrary sets, and we present an O(log n)-approximation algorithm.
Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, e.g., accuracy, freshness and completeness, to capture common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and describe our technique to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present detailed experimental results, using diverse data sets of different types, to demonstrate that our approach provides a robust mechanism for identifying and quantifying database column heterogeneity.
Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such "high discrepancy" regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality. In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off. All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.
Anomaly detection has important applications in biosurveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not actually represent true anomalies; these clusters may simply be clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different the measured data is from the baseline data within a region. An anomalous region is thus defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one need only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in O((1/ε) n^2 (log n)^2) time that computes the maximum discrepancy rectangle to within additive error ε, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).
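For concreteness, one standard form of the Kulldorff (Poisson) discrepancy for a region containing a fraction m of the measured points and a fraction b of the baseline points is sketched below; exact conventions (e.g., one-sided vs. two-sided) vary, so treat this as illustrative.

```python
import math

def kulldorff(m, b):
    """Kulldorff (Poisson) discrepancy of a region, one-sided form.

    m: fraction of measured (case) points in the region, b: fraction of
    baseline points in the region, both strictly between 0 and 1.
    Returns 0 when the region is not elevated (m <= b).
    """
    if m <= b:
        return 0.0
    return m * math.log(m / b) + (1 - m) * math.log((1 - m) / (1 - b))
```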
Sensor network localization (SNL) is the problem of determining the locations of the sensors given sparse and usually noisy inter-communication distances among them. In this work we propose an iterative algorithm named PLACEMENT to solve the SNL problem. This iterative algorithm requires an initial estimate of the locations and, in each iteration, is guaranteed to reduce the cost function. The proposed algorithm is able to take advantage of a good initial estimate of sensor locations, making it suitable for localizing moving sensors and for refining the results produced by other algorithms. Our algorithm is very scalable. We have experimented with a variety of sensor networks and have shown that the proposed algorithm outperforms existing algorithms in terms of both speed and accuracy in almost all experiments. Our algorithm can embed 120,000 sensors in less than 20 minutes.
We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.
The Johnson-Lindenstrauss Lemma states that a set of $n$ points may be embedded in a space of dimension $O(\log n/\epsilon^2)$ while preserving all pairwise distances within a factor of $(1+\epsilon)$ with high probability. It has inspired a number of proofs that extend the result, simplify it, and improve the efficiency of computing the resulting embedding. The lemma is a critical tool in the realm of dimensionality reduction and high-dimensional approximate computational geometry. It is also employed for data mining in domains that analyze intrinsically high-dimensional objects such as images and text. However, while algorithms for performing the dimensionality reduction have become increasingly sophisticated, there is little understanding of the behavior of these embeddings in practice. In this paper, we present the first comprehensive study of the empirical behavior of algorithms for dimensionality reduction based on the JL Lemma. Our study answers a number of important questions about the quality of the embeddings and the performance of algorithms used to compute them. Among our key results: determining a likely range for the big-Oh constant in practice for the dimension of the target space, and demonstrating the accuracy of the predicted bounds; finding "best in class" algorithms over wide ranges of data size and source dimensionality, and showing that these depend heavily on parameters of the data as well as its sparsity; developing the best implementation for each method, making use of non-standard optimized codes for key subroutines; and identifying critical computational bottlenecks that can spur further theoretical study of efficient algorithms.
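For reference, two families of projections of the kind studied empirically here, a dense Gaussian matrix and the sparse construction of Achlioptas, can be sketched in a few lines, together with a brute-force check of the worst-case pairwise distortion. Names and parameters are illustrative, and these are not the paper's tuned implementations.

```python
import numpy as np

def gaussian_jl(X, k, seed=0):
    # Dense Gaussian projection: entries drawn from N(0, 1/k).
    rng = np.random.default_rng(seed)
    return X @ rng.normal(scale=1.0 / np.sqrt(k), size=(X.shape[1], k))

def achlioptas_jl(X, k, seed=0):
    # Sparse Achlioptas projection: entries sqrt(3/k) * {+1, 0, -1}
    # with probabilities 1/6, 2/3, 1/6.
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 0.0, 1.0], size=(X.shape[1], k), p=[1 / 6, 2 / 3, 1 / 6])
    return X @ (np.sqrt(3.0 / k) * R)

def max_distortion(X, Y):
    # Worst multiplicative distortion of pairwise distances under the map X -> Y.
    iu = np.triu_indices(len(X), k=1)
    dX = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))[iu]
    dY = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))[iu]
    ratios = dY / dX
    return max(ratios.max(), 1.0 / ratios.min())
```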
In this work, we show how active learning in some (target) domain can leverage information from a different but related (source) domain. We present an algorithm that harnesses the source domain data to learn the best possible initializer hypothesis for doing active learning in the target domain, resulting in improved label complexity. We also present a variant of this algorithm which additionally uses the domain divergence information to selectively query the most informative points in the target domain, leading to further reductions in label complexity. Experimental results on a variety of datasets establish the efficacy of the proposed methods.
In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in the context of streaming data. We explore sketch techniques, especially the Count-Min Sketch, which approximates the frequency of a word pair in the corpus without explicitly storing the word pairs themselves. We further use the idea of a conservative update with the Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all word and word-pair counts computed from 37 GB of web data in just 2 billion counters (8 GB of main memory). The number of these counters is up to 30 times smaller than the stream size, a substantial saving in memory. In Semantic Orientation experiments, the PMI scores computed from 2 billion counters are as effective as exact PMI scores.
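A minimal sketch of a Count-Min sketch with the conservative-update rule described above: a counter cell is raised only as far as the item's new estimate. This is illustrative only; the hashing scheme and sizing differ from the paper's implementation.

```python
import random

class CountMinCU:
    """Count-Min sketch with conservative update."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by salting Python's built-in hash (sketch only).
        return [hash((salt, item)) % self.width for salt in self.salts]

    def update(self, item, count=1):
        cells = self._cells(item)
        est = min(self.tables[i][c] for i, c in enumerate(cells)) + count
        for i, c in enumerate(cells):
            # Conservative update: never raise a cell beyond the new estimate.
            self.tables[i][c] = max(self.tables[i][c], est)

    def query(self, item):
        return min(self.tables[i][c] for i, c in enumerate(self._cells(item)))
```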
Multi-Dimensional Scaling (MDS) is a widely used method for embedding a given distance matrix into a low-dimensional space, used both as a preprocessing step for many machine learning problems and as a visualization tool in its own right. In this paper, we present an incremental version of MDS (iMDS). In iMDS, d-dimensional data points are presented in a stream, and the task is to embed the current d-dimensional data point into a k-dimensional space for k < d such that distances from the current point to the previous points are preserved. Let $\{x_1, \ldots, x_{t-1}\} \subset \mathbb{R}^k$ be the data points at time step $t$ that have already been embedded into a $k$-dimensional space, and let $\{r_1, \ldots, r_{t-1}\}$ be the given distances computed in $\mathbb{R}^d$. The objective of iMDS is then to find the point $x_t = \arg\min_{p \in \mathbb{R}^k} \sum_{i=1}^{t-1} \left(\|p - x_i\| - r_i\right)^2$.
Bregman divergences $D_\phi$ are a class of divergences parametrized by a convex function $\phi$ and include well-known distance functions like $\ell_2^2$ and the Kullback-Leibler divergence. There has been extensive research on algorithms for problems like clustering and near neighbor search with respect to Bregman divergences; in all cases, the algorithms depend not just on the data size $n$ and dimensionality $d$, but also on a structure constant $\mu \ge 1$ that depends solely on $\phi$ and can grow without bound, independently of $n$ and $d$. In this paper, we provide the first evidence that this dependence on $\mu$ might be intrinsic. We focus on the problem of approximate near neighbor search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for $c$-approximate near-neighbor search that admits $r$ probes must use space $\Omega(n^{1+\mu/(cr)})$. In contrast, for LSH under $\ell_1$ the best bound is $\Omega(n^{1+1/(cr)})$. Our new tool is a directed variant of the standard boolean noise operator. We show that a generalization of the Bonami-Beckner hypercontractivity inequality exists "in expectation" or upon restriction to certain subsets of the Hamming cube, and that this is sufficient to prove the desired isoperimetric inequality that we use in our data structure lower bound. We also present a structural result reducing the Hamming cube to a Bregman cube. This structure allows us to obtain lower bounds for problems under Bregman divergences from their $\ell_1$ analogs. In particular, we get a (weaker) lower bound for approximate near neighbor search of the form $\Omega(n^{1+1/(cr)})$ for an $r$-query non-adaptive data structure, and new cell probe lower bounds for a number of other near neighbor questions in Bregman space.
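For reference, the Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> specializes to the squared Euclidean distance and the (generalized) KL divergence for the choices of phi shown in this illustrative sketch.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 recovers the squared Euclidean distance:
sq_euclid = lambda x, y: bregman(lambda v: np.dot(v, v), lambda v: 2 * v, x, y)

# phi(x) = sum_i x_i log x_i (negative entropy, for strictly positive vectors)
# recovers the generalized Kullback-Leibler divergence:
kl = lambda x, y: bregman(lambda v: np.sum(v * np.log(v)),
                          lambda v: np.log(v) + 1, x, y)
```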