Jeff M. Phillips


By Jeff M. Phillips, Wai Ming Tai,

Vol.abs/1802.01751,

By Jeff M. Phillips, Pingfan Tang,

Vol.abs/1804.11284,

By Michael Matheny, Jeff M. Phillips,

Vol.abs/1804.11287,

By Michael Matheny, Jeff M. Phillips,

Vol.abs/1804.11307,

By Ran Wei, Daoqin Tong, Jeff M. Phillips,

Vol.63, Pages 95-103,

By Yan Zheng, Yi Ou, Alexander Lex, Jeff M. Phillips,

Vol.abs/1709.04453,

By Jeff M. Phillips, Wai Ming Tai,

Vol.abs/1710.04325,

By Tim Sodergren, Jessica Hair, Jeff M. Phillips, Bei Wang,

Vol.abs/1710.06925,

By Pingfan Tang, Jeff M. Phillips,

Vol.abs/1609.01226,

By Di Chen, Jeff M. Phillips,

Vol.abs/1602.05350,

By Jeff M. Phillips, Pingfan Tang,

Vol.abs/1601.00630,

By Jian Pei, Leman Akoglu, Hongrae Lee, Justin J. Levandoski, Xuelong Li, Rosa Meo, Carlos Ordonez, Jeff M. Phillips, Barbara Poblete, K. Selçuk Candan, Meng Wang, Ji-Rong Wen, Li Xiong, Wenjie Zhang,

Vol.28, Pages 2535-2537,

By Mina Ghashami, Amey Desai, and Jeff M. Phillips

Vol.28, Pages 1678--1690,

Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic, iSVD, with no guarantees, tends to outperform all known approaches in terms of size/error trade-off. We modify the best performing method with guarantees under the size/error trade-off, FrequentDirections, to match the performance of iSVD while retaining its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. In comparing techniques under the time/error trade-off, techniques based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain its error guarantee while obtaining dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings.

By Mina Ghashami, Edo Liberty, Jeff M. Phillips and David P. Woodruff

Vol.0,

We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented an arbitrary input matrix A ∈ R^{n×d} one row at a time. It performs O(dl) operations per row and maintains a sketch matrix B ∈ R^{l×d} such that for any k < l, ||A^T A − B^T B||_2 <= ||A − A_k||^2_F / (l − k) and ||A − pi_{B_k}(A)||^2_F <= (1 + k/(l − k)) ||A − A_k||^2_F. Here, A_k stands for the best rank-k approximation of A, the minimizer of ||A − X||_F over all rank-k matrices X (similarly B_k), and pi_{B_k}(A) is the rank-k matrix resulting from projecting A on the row span of B_k. We show both of these bounds are the best possible for the space allowed. The summary is mergeable, and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations of existing streaming algorithms in the space-error tradeoff.
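The core loop described above is compact enough to sketch. Below is a minimal, unoptimized Python/NumPy version that shrinks the singular values on every row (the paper's algorithm batches rows and shrinks only when the sketch fills, which yields the stated O(dl) amortized per-row cost); the function name is illustrative, not from the paper's code.

```python
import numpy as np

def frequent_directions(A, l):
    """Minimal Frequent Directions sketch: stream the rows of A and maintain
    an l x d sketch B whose Gram matrix B^T B underestimates A^T A by at most
    ||A||_F^2 / l in spectral norm (the paper proves sharper bounds)."""
    n, d = A.shape
    B = np.zeros((l, d))
    for row in A:
        # the last row of B is always zero after a shrink step; fill it
        B[-1] = row
        # rotate to the singular basis and shrink by the smallest squared
        # singular value, which zeroes out the last row again
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        s_shrunk = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
        B = s_shrunk[:, None] * Vt
    return B
```

The shrink step removes exactly l times the smallest squared singular value of Frobenius mass, which is what drives the error analysis.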

By Jian Li, Jeff M. Phillips, Haitao Wang,

Vol.abs/1411.0194,

By Peyman Afshani, Pankaj K. Agarwal, Lars Arge, Kasper Green Larsen and Jeff M. Phillips

Vol.0, To Appear

Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point's uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.

By Jeff M. Phillips, Bei Wang, Yan Zheng,

Vol.abs/1307.7760,

By Mina Ghashami, Jeff M. Phillips,

Vol.abs/1307.7454,

By Mary W. Hall, Robert M. Kirby, Feifei Li, Miriah D. Meyer, Valerio Pascucci, Jeff M. Phillips, Robert Ricci, Jacobus E. van der Merwe, Suresh Venkatasubramanian,

Vol.abs/1306.3295,

By Jeff M. Phillips,

Vol.abs/1209.6396,

By Allan Jørgensen, Maarten Löffler, Jeff M. Phillips,

Vol.abs/1205.0273,

By Jeff M. Phillips,

Vol.abs/1112.4105,

By Jeff M. Phillips, Suresh Venkatasubramanian,

Vol.abs/1103.1625,

By Sarang C. Joshi, Raj Varma Kommaraju, Jeff M. Phillips, Suresh Venkatasubramanian,

Vol.abs/1001.0591,

By Arvind Agarwal, Jeff M. Phillips, Suresh Venkatasubramanian,

Vol.abs/1003.0529,

By P. Thomas Fletcher, John Moeller, Jeff M. Phillips, Suresh Venkatasubramanian,

Vol.abs/0912.1580,

By Jeff M. Phillips,

Vol.abs/0801.2793,

By Pankaj K. Agarwal, Jeff M. Phillips,

Vol.abs/0806.4326,

By Jeff M. Phillips, Wai Ming Tai,

In Proceedings of SODA (

By Di Chen and Jeff M. Phillips

In Proceedings of Algorithmic Learning Theory (

This studies a way to linearize kernel methods for machine learning — essentially sketching the non-linear information. It studies a non-linear mapping from a low-dimensional space (where there may be no linear separator or linear regression fit) to a high-dimensional space where there would be. This is most commonly done using a Gaussian kernel, and that is what we study. We show that not only is the inner product information preserved under an additive error (this has been known for a decade), but also that the distance is preserved under a relative error. We show that this holds with the same bounds as a Johnson-Lindenstrauss random projection, with the small exception of some lower-order data-dependent factors, and that these are required. It immediately implies that a few heuristics, like k-center clustering after this mapping, now have formal error guarantees.

By Yan Zheng and Jeff M. Phillips

In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (

This project developed new data summaries for kernel regression — these have not been formally studied before. These approaches are related to kernel density estimates, for which there are several effective summaries, but those only model the density of the data, not an associated weight. We produce both sample complexity results as well as deterministic approaches which adapt to the structure of the data. We demonstrate these approaches on a variety of 1-dimensional (e.g., time series), 2-dimensional (e.g., spatial), and high-dimensional data sets. They can summarize sets of enormous size down to a few megabytes with minimal loss of accuracy.

By Michael Matheny, Raghvendra Singh, Kaiqiang Wang, Liang Zhang and Jeff M. Phillips

In Proceedings of ACM International Conference on Advances in Geographic Information Systems (

Finding anomalous regions within spatial data sets is a central task for biosurveillance, homeland security, policy making, and many other important areas. These communities have mainly settled on spatial scan statistics as a rigorous way to discover regions where a measured quantity (e.g., crime) is statistically significant in its difference from a baseline population. However, most common approaches are inefficient and thus, can only be run with very modest data sizes (a few thousand data points) or make assumptions on the geographic distributions of the data. We address these challenges by designing, exploring, and analyzing sample-then-scan algorithms. These algorithms randomly sample data at two scales, one to define regions and the other to approximate the counts in these regions. Our experiments demonstrate that these algorithms are efficient and accurate independent of the size of the original data set, and our analysis explains why this is the case. For the first time, these sample-then-scan algorithms allow spatial scan statistics to run on a million or more data points without making assumptions on the spatial distribution of the data. Moreover, our experiments and analysis give insight into when it is appropriate to trust the various types of spatial anomalies when the data is modeled as a random sample from a larger but unknown data set.

By Mina Ghashami, Edo Liberty, and Jeff M. Phillips

In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (

This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ R^{n×d} one by one in the streaming setting and compute a small sketch B ∈ R^{l×d}. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions which runs in O(ndl) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in O~(nnz(A)*l + n*l^2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.

By Pingfan Tang, Jeff M. Phillips,

In Proceedings of NIPS (

By Lingxiao Huang, Jian Li, Jeff M. Phillips, Haitao Wang,

In Proceedings of ESA (

By Mina Ghashami, Daniel Perry, and Jeff M. Phillips

In Proceedings of International Conference on Artificial Intelligence and Statistics (

Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full n×n kernel matrix is constructed over n data points, but this requires too much space and time for large values of n. Techniques such as the Nystrom method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and also improves the dependence on the error parameter. Our technique combines random feature maps with recent advances in matrix sketching; it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.

By Jeff M. Phillips, Bei Wang, Yan Zheng,

In Proceedings of Symposium on Computational Geometry (

By Jeff M. Phillips, Yan Zheng,

In Proceedings of ALT (

By Yan Zheng, Jeff M. Phillips,

In Proceedings of KDD (

By Mina Ghashami, Jeff M. Phillips,

In Proceedings of SODA (

By Pankaj K. Agarwal, Boris Aronov, Sariel Har-Peled, Jeff M. Phillips, Ke Yi, and Wuzhou Zhang

(To Appear) In Proceedings of 32nd ACM Symposium on Principles of Database Systems (

Nearest-neighbor search (NN), which returns the nearest neighbor of a query point in a set of points, is an important and widely studied problem in many fields, and it has a wide range of applications. In many of them, such as sensor databases, location-based services, face recognition, and mobile data, the location of data is imprecise. We therefore study nearest neighbor queries in a probabilistic framework in which the location of each input point is specified as a probability density function. We present efficient algorithms for (i) computing all points that are nearest neighbors of a query point with nonzero probability; (ii) estimating, within a specified additive error, the probability of a point being the nearest neighbor of a query point; (iii) using it to return the point that maximizes the probability of being the nearest neighbor, or all the points with probability greater than some threshold of being the NN. We also present a few experimental results to demonstrate the effectiveness of our approach.

By Amirali Abdullah, Samira Daruki, and Jeff M. Phillips

(To Appear) In Proceedings of 29th Annual ACM Symposium on Computational Geometry (

We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries return the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. For each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.

By Yan Zheng, Jeffrey Jestes, Jeff M. Phillips, Feifei Li

In Proceedings of ACM SIGMOD International Conference on Management of Data (

Kernel density estimates are important for a broad variety of applications including media databases, pattern recognition, computer vision, data mining, and the sciences. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques.

By Yang Zhao, Neal Patwari, Jeff M. Phillips, and Suresh Venkatasubramanian

(To Appear) In Proceedings of 12th ACM-IEEE Conference on Information Processing in Sensor Networks (

Network radio frequency (RF) environment sensing (NRES) systems pinpoint and track people in buildings using changes in the signal strength measurements made by a wireless sensor network. It has been shown that such systems can locate people who do not participate in the system by wearing any radio device, even through walls, because of the changes that moving people cause to the static wireless sensor network. However, many such systems cannot locate stationary people. We present and evaluate a system which can locate stationary or moving people, without calibration, by using kernel distance to quantify the difference between two histograms of signal strength measurements. From five experiments, we show that our kernel distance-based radio tomographic localization system performs better than the state-of-the-art NRES systems in different non line-of-sight environments.

By Jeff M. Phillips

In Proceedings of 24th Annual ACM-SIAM Symposium on Discrete Algorithms (

We study the worst case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel density estimates of the input distribution with that of the subset and bound the worst case error. If the maximum error is eps, then this subset can be thought of as an eps-sample (aka an eps-approximation) of the range space defined with the input distribution as the ground set and the fixed kernel representing the family of ranges. Interestingly, in this case the ranges are not binary, but have a continuous range (for simplicity we focus on kernels with range of [0,1]); these allow for smoother notions of range spaces. It turns out that the use of this smoother family of range spaces has an added benefit of greatly decreasing the size required for eps-samples. For instance, in the plane the size is O((1/eps^{4/3}) log^{2/3}(1/eps)) for disks (based on VC-dimension arguments) but is only O((1/eps) sqrt{log (1/eps)}) for Gaussian kernels and for kernels with bounded slope that only affect a bounded domain. These bounds are accomplished by studying the discrepancy of these "kernel" range spaces, and here the improvement in bounds is even more pronounced. In the plane, we show the discrepancy is O(sqrt{log n}) for these kernels, whereas for balls there is a lower bound of Omega(n^{1/4}).
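To make the eps-sample notion concrete, here is a small hypothetical Python/NumPy experiment: it measures the largest observed kernel density difference between a point set and a random subset over a collection of query points. (The improved bounds above come from discrepancy-based constructions, not random sampling; this only illustrates the quantity being bounded, and all names are illustrative.)

```python
import numpy as np

def kde(points, queries, sigma=1.0):
    """Gaussian kernel density estimate, with the kernel normalized to [0, 1]
    and averaged over the point set, evaluated at each query point."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1)

rng = np.random.RandomState(1)
P = rng.randn(2000, 2)                          # input point set in the plane
Q = rng.randn(500, 2) * 2                       # query points
S = P[rng.choice(len(P), 200, replace=False)]   # random subset: candidate eps-sample
# worst observed error: if this is at most eps over all queries,
# S behaves as an eps-sample for this kernel range space
eps = np.max(np.abs(kde(P, Q) - kde(S, Q)))
```

Sweeping the subset size down until `eps` crosses a target threshold gives an empirical feel for the size/error trade-off the bounds describe.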

By Jeff M. Phillips,

In Proceedings of SODA (

By Hal Daume III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian

In Proceedings of 23rd International Conference on Algorithmic Learning Theory (

In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daumé III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d^2 log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd^2 log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than the baselines we compare against, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high-dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.

By Jeffrey Jestes, Jeff M. Phillips, Feifei Li, Mingwang Tang

In Proceedings of 38th International Conference on Very Large Databases (

Ranking temporal data has not been studied until recently [14], even though ranking is an important operator (being promoted as a first-class citizen) in database systems [8]. However, only the instant top-k queries on temporal data were studied in [14], where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from 02/05/2011 to 02/07/2011. This work presents a comprehensive study of this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis on the construction cost, the index size, the update and the query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.

By Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi

(To Appear) In Proceedings of ACM Symposium on Principles of Database Systems (

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a summary on the two data sets combined together, while preserving the error and size guarantees. This property means that the summary can be treated like other algebraic objects such as sum and max, which is especially useful for computing summaries on massive distributed data. Many data summaries are trivially mergeable by construction, most notably those based on linear transformations. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for eps-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/eps); for eps-approximate quantiles, there is a deterministic summary of size O((1/eps) log(eps n)) that has a restricted form of mergeability, and a randomized one of size O((1/eps) log^{3/2}(1/eps)) with full mergeability. We also extend our results to geometric summaries such as eps-approximations and eps-kernels.
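As an illustration of the heavy-hitters case, here is a hedged Python sketch of a Misra-Gries summary together with the merge rule sketched above (combine the counters, then subtract the k-th largest count and drop nonpositive counters); the function names are illustrative, not from the paper.

```python
from collections import Counter

def mg_update(summary, item, k):
    """Misra-Gries update: keep at most k-1 counters, so each estimated
    count is within n/k of the true count after n updates."""
    summary[item] = summary.get(item, 0) + 1
    if len(summary) >= k:
        # decrement every counter; drop the ones that reach zero
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]

def mg_merge(s1, s2, k):
    """Merge two Misra-Gries summaries: add counters, then subtract the
    k-th largest value, keeping only positive counters (at most k-1)."""
    merged = Counter(s1)
    merged.update(s2)
    if len(merged) >= k:
        kth = sorted(merged.values(), reverse=True)[k - 1]
        merged = Counter({key: c - kth for key, c in merged.items() if c > kth})
    return dict(merged)
```

The key point is that the merged summary keeps the same size bound and the error bounds simply add, which is exactly the mergeability property.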

By Mingwang Tang, Feifei Li, Jeff M. Phillips, Jeffrey Jestes

In Proceedings of 28th IEEE International Conference on Data Engineering (

In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data, which emerge in many applications where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more. When dealing with probabilistic data, there are two thresholds involved: the score and the probability thresholds. One must monitor both simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms significantly outperform the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several real, large datasets.

By Hal Daume III, Jeff M. Phillips, Avishek Saha, Suresh Venkatasubramanian

In Proceedings of 15th International Conference on Artificial Intelligence and Statistics (

We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

By Jeff M. Phillips, Elad Verbin, and Qin Zhang

In Proceedings of 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (

We consider a model for multiparty communication complexity which we call private-message multiparty communication complexity: essentially multiparty communication complexity with the card-in-hand model but no blackboard. That is, any communication between two players is not witnessed by any other players. Techniques developed for this model provide simpler lower bounds in other models as well. Furthermore, this model has applications for proving lower bounds in the distributed sensing framework.

By Fangxiang Jiao, Jeff M. Phillips, Yaniv Gur, and Chris R. Johnson

In Proceedings of 5th IEEE Pacific Visualization Symposium (

In this paper, we propose a new and accurate technique for uncertainty analysis and uncertainty visualization based on fiber orientation distribution function (ODF) glyphs, associated with high angular resolution diffusion imaging (HARDI). Our visualization applies volume rendering techniques to an ensemble of 3D ODF glyphs, which we call SIP functions of diffusion shapes, to capture their variability due to underlying uncertainty. This rendering elucidates the complex heteroscedastic structural variation in these shapes. Furthermore, we quantify the extent of this variation by measuring the fraction of the volume of these shapes that is consistent across all noise levels, the certain volume ratio. Our uncertainty analysis and visualization framework is then applied to synthetic data, as well as to HARDI human-brain data, to study the impact of various image acquisition parameters and background noise levels on the diffusion shapes.

By Allan G. Jorgensen, Maarten Loffler, Jeff M. Phillips

In Proceedings of 12th Algorithms and Data Structure Symposium (

We study computing with indecisive point sets. Such points have spatial uncertainty where the true location is one of a finite number of possible locations. This data arises from probing distributions a few times or when the location is one of a few locations from a known database. In particular, we study computing distributions of geometric functions such as the radius of the smallest enclosing ball and the diameter. Surprisingly, we can compute the distribution of the radius of the smallest enclosing ball exactly in polynomial time, but computing the same distribution for the diameter is #P-hard. We generalize our polynomial-time algorithm to all LP-type problems. We also utilize our indecisive framework to deterministically and approximately compute on a more general class of uncertain data where the location of each point is given by a probability distribution.

By P. Thomas Fletcher, John Moeller, Jeff M. Phillips, Suresh Venkatasubramanian

In Proceedings of 12th Algorithms and Data Structure Symposium (

The space of positive definite matrices P(n) is a Riemannian manifold with variable nonpositive curvature. It includes Euclidean space and hyperbolic space as submanifolds, and poses significant challenges for the design of algorithms for data analysis. In this paper, we develop foundational geometric structures and algorithms for analyzing collections of such matrices. A key technical contribution of this work is the use of horoballs, a natural generalization of halfspaces for non-positively curved Riemannian manifolds. We propose generalizations of the notion of a convex hull and a center point and approximations of these structures using horoballs and based on novel decompositions of P(n). This leads to an algorithm for approximate hulls using a generalization of extents.

By Sarang Joshi, Raj Varma Kommaraju, Jeff M. Phillips, Suresh Venkatasubramanian

In Proceedings of 27th Annual Symposium on Computational Geometry (

Starting with a similarity function between objects, it is possible to define a distance metric (the kernel distance) on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis and geometric measure theory, and have a rich structure that includes an isometric embedding into a Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that run in near-linear time in the size of P ∪ Q (an explicit calculation would take quadratic time). We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n poly(1/eps, log n)). We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of point sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets.

By Jeff M. Phillips, Parasaran Raman, and Suresh Venkatasubramanian

In Proceedings of 10th SIAM International Conference on Data Mining (

This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.

By Peyman Afshani, Pankaj K. Agarwal, Lars Arge, Kasper Green Larsen, and Jeff M. Phillips

In Proceedings of 14th International Conference on Database Theory (

Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point's uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.

By Rasmus J. Kyng, Jeff M. Phillips, Suresh Venkatasubramanian

In Proceedings of the 20th Fall Workshop on Computational Geometry (

We propose an algorithm for dimensionality reduction on the simplex, mapping a set of high-dimensional distributions to a space of lower-dimensional distributions, whilst approximately preserving pairwise Hellinger distance between distributions. By introducing a restriction on the input data to distributions that are in some sense quite smooth, we can map $n$ points on the $d$-simplex to the simplex of $O(eps^{-2}log n)$ dimensions with $eps$-distortion with high probability. The techniques used rely on a classical result by Johnson and Lindenstrauss on dimensionality reduction for Euclidean point sets and require the same number of random bits as non-sparse methods proposed by Achlioptas for database-friendly dimensionality reduction.
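The classical connection used here can be sketched directly: the Hellinger distance between discrete distributions equals 1/sqrt(2) times the Euclidean distance between their entrywise square roots, so a Johnson-Lindenstrauss projection of the square-root vectors approximately preserves Hellinger distances. The hypothetical Python snippet below illustrates only this connection; the projected points are no longer distributions, and mapping back to a lower-dimensional simplex under the smoothness restriction is the paper's actual contribution.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

rng = np.random.RandomState(0)
d, m, n = 1000, 200, 50
P = rng.dirichlet(np.ones(d), size=n)   # n random distributions on the d-simplex
R = rng.randn(m, d) / np.sqrt(m)        # JL projection matrix into m dimensions
X = np.sqrt(P) @ R.T                    # project the square-root vectors

# pairwise Euclidean distances of rows of X, divided by sqrt(2),
# approximate the Hellinger distances of the original distributions
```

With m on the order of eps^{-2} log n, the relative distortion of each pairwise distance is eps with high probability, mirroring the dimension bound quoted above.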

By Pankaj K. Agarwal, Jeff M. Phillips, Hai Yu

In Proceedings of 18th Annual European Symposium on Algorithms.

Given a set P of n points in |R^d, an eps-kernel K subset P approximates the directional width of P in every direction within a relative (1-eps) factor. In this paper we study the stability of eps-kernels under dynamic insertion and deletion of points to P and by changing the approximation factor eps. In the first case, we say an algorithm for dynamically maintaining an eps-kernel is stable if at most O(1) points change in K as one point is inserted or deleted from P. We describe an algorithm to maintain an eps-kernel of size O(1/eps^{(d-1)/2}) in O(1/eps^{(d-1)/2} log n) time per update. Not only does our algorithm maintain a stable eps-kernel, but its update time is also faster than any known algorithm that maintains an eps-kernel of size O(1/eps^{(d-1)/2}). Next, we show that if there is an eps-kernel of P of size k, which may be dramatically less than O(1/eps^{(d-1)/2}), then there is an (eps/2)-kernel of P of size O(min {1/eps^{(d-1)/2}, k^{floor(d/2)} log^{d-2} (1/eps)}). Moreover, there exists a point set P in |R^d and a parameter eps > 0 such that if every eps-kernel of P has size at least k, then any (eps/2)-kernel of P has size Omega(k^{floor(d/2)}).

By Arvind Agarwal, Jeff M. Phillips, Suresh Venkatasubramanian

In Proceedings of 16th Annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods in comparable time. Moreover, they have a small memory footprint and scale effectively for large data sets. We expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a complement to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, by showing that projecting to a random O((1/epsilon) log n)-dimensional sphere causes only an epsilon distortion in the geodesic distances.

By Pankaj K. Agarwal, Jeff M. Phillips, Bardia Sadri

In Proceedings of 9th Latin American Theoretical Informatics Symposium.

We describe algorithms for finding the regression of t, a sequence of values, to the closest sequence s by mean squared error, so that s is always increasing (isotonicity) and so the values of two consecutive points do not increase by too much (Lipschitz). The isotonicity constraint can be replaced with a unimodular constraint, where there is exactly one local maximum in s. These algorithms are generalized from sequences of values to trees of values. For each scenario we describe near-linear time algorithms.
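For the plain isotonic part of this problem, the classic pool-adjacent-violators algorithm computes the L2-closest nondecreasing sequence; a minimal sketch of that building block (the Lipschitz and tree generalizations in the paper require additional machinery not shown here):

```python
def isotonic_l2(t):
    # Pool Adjacent Violators: L2-closest nondecreasing sequence to t.
    # Maintain blocks as (mean, weight, size); merge while means are out of order.
    levels, weights, sizes = [], [], []
    for v in t:
        levels.append(float(v)); weights.append(1.0); sizes.append(1)
        while len(levels) > 1 and levels[-2] > levels[-1]:
            total = levels[-1] * weights[-1] + levels[-2] * weights[-2]
            w = weights[-1] + weights[-2]
            sz = sizes[-1] + sizes[-2]
            levels[-2:] = [total / w]; weights[-2:] = [w]; sizes[-2:] = [sz]
    out = []
    for lv, sz in zip(levels, sizes):
        out.extend([lv] * sz)   # expand each block to its fitted level
    return out
```

Each merged block is fitted to the mean of its values, which is the L2-optimal level for a flat segment.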

By Pankaj K. Agarwal, Jeff M. Phillips, Hai Yu,

In Proceedings of ESA (1).

By Maarten Loffler, Jeff M. Phillips

In Proceedings of 17th Annual European Symposium on Algorithms.

We consider problems on data sets where each data point has uncertainty described by an individual probability distribution. We develop several frameworks and algorithms for calculating statistics on these uncertain data sets. Our examples focus on geometric shape fitting problems. We prove approximation guarantees for the algorithms with respect to the full probability distributions. We then empirically demonstrate that our algorithms are simple and practical, solving for a constant hidden by asymptotic analysis so that a user can reliably trade speed and size for accuracy.

By Pankaj K. Agarwal, Jeff M. Phillips

In Proceedings of 16th Annual European Symposium on Algorithms.

For a set P of n points in R^2, the Euclidean 2-center problem computes a pair of congruent disks of the minimal radius that cover P. We extend this to the (2, k)-center problem, where we compute the minimal radius pair of congruent disks to cover n-k points of P. We present a randomized algorithm with O(nk^7 log^3 n) expected running time for the (2, k)-center problem. We also study the (p, k)-center problem in R^2 under the l_infty metric. We give solutions for p = 4 in O(k^{O(1)} n log n) time and for p = 5 in O(k^{O(1)} n log^5 n) time.

By Jeff M. Phillips

In Proceedings of 35th International Colloquium on Automata, Languages, and Programming.

Consider a point set D with a measure function mu : D -> R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axis-aligned rectangles, half planes, balls, k-oriented polygons). We say a range space (D, A) has an epsilon-approximation P if max_{R in A} | mu(R cap P)/mu(P) - mu(R cap D)/mu(D) | <= epsilon. We describe algorithms for deterministically constructing discrete epsilon-approximations for continuous point sets such as distributions or terrains. Furthermore, for certain families of subsets A, such as those described by axis-aligned rectangles, we reduce the size of the epsilon-approximations by almost a square root from O(1/epsilon^2 log(1/epsilon)) to O(1/epsilon polylog(1/epsilon)). This is often the first step in transforming a continuous problem into a discrete one for which combinatorial techniques can be applied. We describe applications of this result in geo-spatial analysis, biosurveillance, and sensor networks.
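The randomized baseline that deterministic constructions like these compete against is simple: a uniform random sample of size O(1/epsilon^2) is an epsilon-approximation with high probability for any range family of bounded VC dimension. A minimal sketch over 1-D interval ranges (the shape family, sample-size constant, and interval grid here are illustrative choices, not from the paper):

```python
import random

def eps_approximation(points, eps, rng):
    # random-sampling baseline: a sample of size ~c/eps^2 is an
    # eps-approximation for bounded-VC ranges with high probability
    m = max(1, round(4 / eps ** 2))
    return [rng.choice(points) for _ in range(m)]

def range_error(P, S, lo, hi):
    # |fraction of S in [lo, hi) - fraction of P in [lo, hi)|
    fP = sum(lo <= x < hi for x in P) / len(P)
    fS = sum(lo <= x < hi for x in S) / len(S)
    return abs(fP - fS)

rng = random.Random(0)
P = [rng.random() for _ in range(20000)]
S = eps_approximation(P, 0.1, rng)
# worst observed error over a grid of interval ranges
err = max(range_error(P, S, a / 10, b / 10)
          for a in range(10) for b in range(a + 1, 11))
```

The paper's contribution is doing better than this deterministically, with near-1/epsilon sizes for families like axis-aligned rectangles.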

By Bei Wang, Jeff M. Phillips, Robert Schrieber, Dennis Wilkinson, Nina Mishra, Robert Tarjan

In Proceedings of 8th SIAM International Conference on Data Mining.

In this paper, we present a measure associated with detection and inference of statistically anomalous clusters of a graph based on the likelihood test of observed and expected edges in a subgraph. This measure is adapted from spatial scan statistics for point sets and provides quantitative assessment for clusters. We discuss some important properties of this statistic and its relation to modularity and Bregman divergences. We apply a simple clustering algorithm to find clusters with large values of this measure in a variety of real-world data sets, and we illustrate its ability to identify statistically significant clusters of selected granularity.

By Pankaj K. Agarwal, Jeff M. Phillips,

In Proceedings of ESA.

By Badrish Chandramouli, Jeff M. Phillips, Jun Yang

In Proceedings of 33rd International Conference on Very Large Data Bases.

We address the problem of providing scalable support for subscriptions with personalized value-based notification conditions in wide-area publish/subscribe systems. Notification conditions can be fine-tuned by subscribers, allowing precise and flexible control of when events are delivered to the subscribers. For example, a user may specify that she should be notified if and only if the price of a particular stock moves outside a "radius" around her last notified value. Naive techniques for handling notification conditions are not scalable. It is challenging to share subscription processing and notification dissemination of subscriptions with personalized value-based notification conditions, because two subscriptions may see two completely different sequences of notifications even if they specify the same radius. We develop and experimentally evaluate scalable processing and dissemination techniques for these subscriptions. Our approach uses standard network substrates for notification dissemination, and avoids pushing complex application processing into the network. Compared with other alternatives, our approach generates orders of magnitude lower network traffic, and incurs lower server processing cost.

By Jeff M. Phillips, Ran Liu, Carlo Tomasi

In Proceedings of 6th International Conference on 3-D Digital Imaging and Modeling.

We describe a variation of the iterative closest point (ICP) algorithm for aligning two point sets under a set of transformations. Our algorithm is superior to previous algorithms because (1) in determining the optimal alignment, it identifies and discards likely outliers in a statistically robust manner, and (2) it is guaranteed to converge to a locally optimal solution. To this end, we formalize a new distance measure, fractional root mean squared distance (FRMSD), which incorporates the fraction of inliers into the distance function. Our framework can easily incorporate most techniques and heuristics from modern registration algorithms. We experimentally validate our algorithm against previous techniques on 2 and 3 dimensional data exposed to a variety of outlier types.
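The fractional root mean squared distance idea can be sketched compactly: take the RMS over only the closest fraction f of matches, then divide by f^lambda so that trivially small inlier sets are penalized. In this hedged sketch the exponent value 1.3 and the synthetic inlier/outlier split are illustrative assumptions, and distances are taken as given rather than computed inside an ICP loop:

```python
import numpy as np

def frmsd(dists, f, lam=1.3):
    # fractional RMSD: RMS over the closest fraction f of matches,
    # divided by f^lam so tiny inlier fractions are penalized
    d = np.sort(np.asarray(dists))
    k = max(1, int(f * len(d)))
    rms = np.sqrt(np.mean(d[:k] ** 2))
    return rms / f ** lam

def best_fraction(dists, lam=1.3):
    # choose the inlier fraction minimizing FRMSD
    n = len(dists)
    fracs = [(k + 1) / n for k in range(n)]
    return min(fracs, key=lambda f: frmsd(dists, f, lam))

# 80 tight inliers and 20 far outliers: the minimizer should drop the outliers
rng = np.random.default_rng(1)
dists = np.concatenate([rng.uniform(0, 0.1, 80), rng.uniform(5, 10, 20)])
f = best_fraction(dists)
```

Because including even one far outlier inflates the RMS term far more than shrinking f^lam helps, the chosen fraction lands near the true inlier rate.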

By Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, Suresh Venkatasubramanian, Zhengyuan Zhu

In Proceedings of 12th Annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such "high discrepancy" regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality. In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off. All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.

By Pankaj K. Agarwal, Jeff M. Phillips

In Proceedings of 18th Canadian Conference on Computational Geometry.

Given two sets A and B of n points each in R^2, we study the problem of computing a matching between A and B that minimizes the root mean square (rms) distance of matched pairs. We can compute an optimal matching in O(n^{2+delta}) time, for any delta > 0, and an epsilon-approximation in time O((n/epsilon)^{3/2} (log n)^6). If the set B is allowed to move rigidly to minimize the rms distance, we can compute a rigid motion of B and a matching in O((n^4/epsilon^{5/2}) (log n)^6) time whose cost is within a (1+epsilon) factor of the optimal one.
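On tiny inputs the RMS matching objective can be checked by brute force over permutations; this sketch is purely for intuition about the objective (exponential time, not the paper's O(n^{2+delta}) algorithm), and the sample points are made up:

```python
from itertools import permutations
from math import sqrt

def rms_matching(A, B):
    # brute-force the matching A[i] <-> B[perm[i]] minimizing RMS distance
    n = len(A)
    def cost(perm):  # sum of squared distances under a permutation
        return sum((A[i][0] - B[j][0]) ** 2 + (A[i][1] - B[j][1]) ** 2
                   for i, j in enumerate(perm))
    best = min(permutations(range(n)), key=cost)
    return list(best), sqrt(cost(best) / n)

A = [(0, 0), (1, 0), (2, 0)]
B = [(2.1, 0), (0.1, 0), (1.1, 0)]
match, rms = rms_matching(A, B)  # pairs each A-point with its shifted copy
```

Minimizing the sum of squared distances and minimizing RMS are equivalent here, since RMS is a monotone function of that sum.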

By Deepak Agarwal, Jeff M. Phillips, Suresh Venkatasubramanian

In Proceedings of 17th Annual ACM-SIAM Symposium on Discrete Algorithms.

Anomaly detection has important applications in bio-surveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not actually represent true anomalies; these clusters may simply be the clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is from baseline data within a region. An anomalous region is thus defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in O((1/epsilon) n^2 log^2 n) time that computes the maximum discrepancy rectangle to within additive error epsilon, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).
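One standard form of the Kulldorff discrepancy, and the naive exact scan over all axis-parallel rectangles of a grid, can be sketched as follows. This is the slow baseline that such papers improve on, written for gridded counts with 2-D prefix sums; the 4x4 example grid and hotspot values are made up for illustration:

```python
import math

def kulldorff(m, b):
    # Kulldorff scan statistic for measured fraction m vs baseline fraction b
    if m <= b or b <= 0 or m >= 1:   # only excesses score; avoid log(0)
        return 0.0
    return m * math.log(m / b) + (1 - m) * math.log((1 - m) / (1 - b))

def max_discrepancy_rect(measured, baseline):
    # exact scan over all O(g^4) axis-parallel grid rectangles,
    # each evaluated in O(1) via 2-D prefix sums
    g = len(measured)
    M = sum(map(sum, measured)); B = sum(map(sum, baseline))
    def prefix(grid):
        P = [[0] * (g + 1) for _ in range(g + 1)]
        for i in range(g):
            for j in range(g):
                P[i+1][j+1] = grid[i][j] + P[i][j+1] + P[i+1][j] - P[i][j]
        return P
    Pm, Pb = prefix(measured), prefix(baseline)
    def rect_sum(P, i0, j0, i1, j1):  # sum over rows [i0,i1), cols [j0,j1)
        return P[i1][j1] - P[i0][j1] - P[i1][j0] + P[i0][j0]
    best, best_rect = 0.0, None
    for i0 in range(g):
        for i1 in range(i0 + 1, g + 1):
            for j0 in range(g):
                for j1 in range(j0 + 1, g + 1):
                    m = rect_sum(Pm, i0, j0, i1, j1) / M
                    b = rect_sum(Pb, i0, j0, i1, j1) / B
                    d = kulldorff(m, b)
                    if d > best:
                        best, best_rect = d, (i0, j0, i1, j1)
    return best, best_rect

measured = [[1] * 4 for _ in range(4)]
measured[1][1] = 5; measured[1][2] = 5   # planted hotspot in two cells
baseline = [[1] * 4 for _ in range(4)]   # uniform baseline
best, best_rect = max_discrepancy_rect(measured, baseline)
```

The scan recovers exactly the two hotspot cells as the maximizing rectangle.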

By Jeff M. Phillips, Pankaj K. Agarwal,

In Proceedings of CCCG.

By Jeff M. Phillips, Nazareth Bedrossian, and Lydia E. Kavraki

In Proceedings of IEEE International Conference on Robotics and Automation.

Motion planning for systems with constraints on controls or the need for relatively straight paths for real-time actions presents challenges for modern planners. This paper presents an approach which addresses these types of systems by building on existing motion planning approaches. Guided Expansive Spaces Trees are introduced to search for a low cost and relatively straight path in a space with motion constraints. Path Gradient Descent, which builds on the idea of Elastic Strips, finds the locally optimal path for an existing path. These techniques are tested on simulations of rendezvous and docking of the space shuttle to the International Space Station and of a 4-foot fan-controlled blimp in a factory setting.

By Jeff M. Phillips, Lydia E. Kavraki, and Nazareth Bedrossian

In Proceedings of AIAA Guidance, Navigation, and Control.

This paper presents a probabilistic approach to solve optimal control problems with application to spacecraft proximity operations. The 6 degree-of-freedom rendezvous and docking problem, using impulsive control, and avoidance of known obstacles and plume impingement is solved. Our solution is then extended to real-time obstacle avoidance. The space is searched by expanding from the start location by applying only feasible controls and coasts, reducing by nearly 50% the variables perturbed in the search. A randomized expansion technique explores the search space. A gradient descent approach smoothes the path and avoids new obstacles in real-time by 'stretching' the best pre-computed path in a locally optimal manner.

By Jeff M. Phillips, Lydia E. Kavraki, and Nazareth Bedrossian

In Proceedings of AAS/AIAA Space Flight Mechanics Meeting, pages ??-??, Ponce, Puerto Rico, February, 2003.

This paper presents a probabilistic approach to solve optimal control problems with application to spacecraft proximity operations. The 6 degree-of-freedom rendezvous and docking problem, using impulsive control, and avoidance of known obstacles and plume impingement is solved. Our solution is then extended to real-time obstacle avoidance. The space is searched by expanding from the start location by applying only feasible controls and coasts, reducing by nearly 50% the variables perturbed in the search. A randomized A* expansion technique explores the search space. A gradient descent approach smoothes the path and avoids new obstacles in real-time by 'stretching' the best pre-computed path in a locally optimal manner.

By Jeff M. Phillips, Andrew M. Ladd, Lydia E. Kavraki

In Proceedings of IEEE International Conference on Robotics and Automation.

Applications such as suturing in medical simulations require the modeling of knot tying in physically realistic rope. This paper describes the design and implementation of such a system. Our model uses a spline of linear springs, adaptive subdivision, and a dynamics simulation. Collisions are handled by discrete-event simulation and follow the impulse model. Although some care must be taken to maintain stable knots, we demonstrate that our simple model is sufficient for this task. In particular, we do not use friction or explicit constraints to maintain the knot. As examples, we tie an overhand knot and a reef knot.

by Arvind Agarwal, Hal Daume III, Jeff M. Phillips, and Suresh Venkatasubramanian

In Proceedings of

Sensor network localization (SNL) is the problem of determining the locations of the sensors given sparse and usually noisy inter-communication distances among them. In this work we propose an iterative algorithm named PLACEMENT to solve the SNL problem. This iterative algorithm requires an initial estimate of the locations and, in each iteration, is guaranteed to reduce the cost function. The proposed algorithm is able to take advantage of a good initial estimate of sensor locations, making it suitable for localizing moving sensors, and also suitable for the refinement of the results produced by other algorithms. Our algorithm is very scalable. We have experimented with a variety of sensor networks and have shown that the proposed algorithm outperforms existing algorithms both in terms of speed and accuracy in almost all experiments. Our algorithm can embed 120,000 sensors in less than 20 minutes.

by Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

In Proceedings of

We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.

by Fangxiang Jiao, Jeff M. Phillips, Jeroen Stinstra, Jens Krueger, Raj Varma Kummaraju, Edward Hsu, Julie Korenberg, Chris R. Johnson

In Proceedings of

In this paper, we propose three metrics to quantify the differences between the results of diffusion tensor magnetic resonance imaging (DT-MRI) fiber tracking algorithms: the area between corresponding fibers of each bundle, the Earth Mover's Distance (EMD) between two fiber bundle volumes, and the current distance between two fiber bundle volumes. We also discuss an interactive fiber track comparison visualization toolkit we have developed based on the three proposed fiber difference metrics and have tested on six widely-used fiber tracking algorithms. To show the effectiveness and robustness of our metrics and visualization toolkit, we present results on both synthetic data and high resolution monkey brain DT-MRI data. Our toolkit can be used for testing the noise effects on fiber tracking analysis and visualization, to quantify the difference between any pair of DT-MRI techniques, and to compare single subjects within an image atlas.

by Arvind Agarwal, Jeff M. Phillips, Hal Daume III, Suresh Venkatasubramanian

In Proceedings of

Multi-Dimensional Scaling (MDS) is a widely used method for embedding a given distance matrix into a low dimensional space, used both as a preprocessing step for many machine learning problems, as well as a visualization tool in its own right. In this paper, we present an incremental version of MDS (iMDS). In iMDS, d-dimensional data points are presented in a stream, and the task is to embed the current d-dimensional data point into a k-dimensional space for k < d such that distances from the current point to the previous points are preserved. Let {x_1, ..., x_{t-1}} in R^k be the data points at time step t that have already been embedded into a k-dimensional space, and let {r_1, ..., r_{t-1}} be the given distances computed in R^d; the objective of iMDS is then to find the point x_t = argmin_{p in R^k} sum_{i=1}^{t-1} (||p - x_i|| - r_i)^2.
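An objective of the form sum_i (||p - x_i|| - r_i)^2 can be attacked directly with gradient descent. This is a hedged sketch of that placement step only, not the paper's iMDS update rule; the anchor points, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def place_point(X, r, iters=500, lr=0.05):
    # place a new point p in R^k minimizing sum_i (||p - x_i|| - r_i)^2
    # via plain gradient descent; gradient of each term is
    # 2 (||p - x_i|| - r_i) * (p - x_i) / ||p - x_i||
    p = X.mean(axis=0)                     # start at the centroid
    for _ in range(iters):
        diff = p - X                       # shape (t-1, k)
        dist = np.linalg.norm(diff, axis=1)
        dist = np.maximum(dist, 1e-12)     # guard against division by zero
        grad = ((dist - r) / dist) @ diff  # sum of weighted unit vectors
        p = p - lr * grad / len(X)
    return p

# three anchors in the plane and exact distances to the point (1, 1)
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
target = np.array([1.0, 1.0])
r = np.linalg.norm(X - target, axis=1)
p = place_point(X, r)  # should recover (1, 1) up to small numerical error
```

With exact, consistent distances the objective has value zero at the true location, so the descent recovers it; with noisy distances it settles at a least-squares compromise.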

by Jeff M. Phillips, Johannes Rudolph, Pankaj K. Agarwal

In Proceedings of

Protein-protein interactions form the basis for many intercellular events. In this paper we develop a tool for understanding the structure of these interactions. Specifically, we define a method for identifying a set of structural motifs on protein-protein interface surfaces. These motifs are secondary structures, akin to alpha-helices and beta-sheets in protein structure; they describe how multiple residues form knob-into-hole features across the interface. These motifs are generated entirely from geometric properties and are easily annotated with additional biological data. We point to the use of these motifs in analyzing hotspot residues.