Suresh VenkatasubramanianProfessor

### Book Chapter

#### 2008

• Clustering on streams
by Suresh Venkatasubramanian
Springer, ISBN 978-0-387-35544-3,  March,  2008.
###### Abstract
• Privacy-Preserving Data Mining: Models and Algorithms
by Suresh Venkatasubramanian
Springer, ISBN 978-0-387-70991-8,  March,  2008.
###### Abstract

In this chapter, we survey the various approaches that have been proposed to measure privacy (and the loss of privacy). Since most privacy concerns (especially those related to health-care information) are raised in the context of legal concerns, it is instructive to view privacy from a legal perspective, rather than from purely technical considerations. It is beyond the scope of this survey\footnote{…and the expertise of the author!} to review the legal interpretations of privacy. However, one essay on privacy that appears directly relevant (and has inspired at least one paper surveyed here) is the view of privacy in terms of access that others have to us and our information, presented by Ruth Gavison. In her view, a general definition of privacy must be one that is measurable, of value, and actionable. The first property needs no explanation; the second means that the entity being considered private must be valuable, and the third property argues that from a legal perspective, only those losses of privacy are interesting that can be prosecuted. This survey, and much of the research on privacy, concerns itself with the measuring of privacy.

• #### 2006

• Statistical Data Depth and the Graphics Hardware
by Nabil Mustafa,    Shankar Krishnan,    Suresh Venkatasubramanian
Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications, ISBN-10: 0-8218-3596-3,  April,  2006.
###### Abstract

The notion of `depth’ has been used in statistics as a way to identify the center of the bivariate distribution given by the point set $P$ in $R^2$. We present a general framework for computing such statistical estimators, that makes extensive use of modern graphics architectures. As a result, we derive improved algorithms for a number of depth measures such location depth, simplicial depth, Oja depth, colored depth, and dynamic location depth. Our algorithms perform significantly better than currently known implementations, outperforming them by at least one order of magnitude and having a strictly better asymptotic growth rate.

• ### Journal

#### 2018

• A comparative study of fairness-enhancing interventions in machine learning. (Project Website)
By Sorelle A. Friedler,   Carlos Scheidegger,   Suresh Venkatasubramanian,   Sonam Choudhary,   Evan P. Hamilton,   Derek Roth,
Vol.abs/1802.04422, CoRR (CORR 2018),  2018.
###### Abstract
• Sublinear Algorithms for MAXCUT and Correlation Clustering. (Project Website)
Vol.abs/1802.06992, CoRR (CORR 2018),  2018.
• #### 2017

• Runaway Feedback Loops in Predictive Policing. (Project Website)
By Danielle Ensign,   Sorelle A. Friedler,   Scott Neville,   Carlos Eduardo Scheidegger,   Suresh Venkatasubramanian,
Vol.abs/1706.09847, CoRR (CORR 2017),  2017.
###### Abstract
• Fair Pipelines. (Project Website)
By Amanda Bower,   Sarah N. Kitchen,   Laura Niss,   Martin J. Strauss,   Alexander Vargas,   Suresh Venkatasubramanian,
Vol.abs/1707.00391, CoRR (CORR 2017),  2017.
• #### 2016

• On the (im)possibility of fairness. (Project Website)
By Sorelle A. Friedler,   Carlos Scheidegger,   Suresh Venkatasubramanian,
Vol.abs/1609.07236, CoRR (CORR 2016),  2016.
###### Abstract
• A Unified View of Localized Kernel Learning. (Project Website)
By John Moeller,   Sarathkrishna Swaminathan,   Suresh Venkatasubramanian,
Vol.abs/1603.01374, CoRR (CORR 2016),  2016.
###### Abstract
• Streaming Verification of Graph Properties. (Project Website)
By Amirali Abdullah,   Samira Daruki,   Chitradeep Dutta Roy,   Suresh Venkatasubramanian,
Vol.abs/1602.08162, CoRR (CORR 2016),  2016.
###### Abstract
• Auditing Black-box Models by Obscuring Features. (Project Website)
By Philip Adler,   Casey Falk,   Sorelle A. Friedler,   Gabriel Rybeck,   Carlos Scheidegger,   Brandon Smith,   Suresh Venkatasubramanian,
Vol.abs/1602.07043, CoRR (CORR 2016),  2016.
• #### 2015

• A Group Theoretic Perspective on Unsupervised Deep Learning. (Project Website)
By Arnab Paul,   Suresh Venkatasubramanian,
Vol.abs/1504.02462, CoRR (CORR 2015),  2015.
###### Abstract
• Streaming Verification in Data Analysis. (Project Website)
By Samira Daruki,   Justin Thaler,   Suresh Venkatasubramanian,
Vol.abs/1509.05514, CoRR (CORR 2015),  2015.
###### Abstract
• Sketching, Embedding, and Dimensionality Reduction for Information Spaces. (Project Website)
By Amirali Abdullah,   Ravi Kumar ,   Andrew McGregor ,   Sergei Vassilvitskii,   Suresh Venkatasubramanian,
Vol.abs/1503.05225, CoRR (CORR 2015),  2015.
• #### 2014

• Why does Deep Learning work? - A perspective from Group Theory. (Project Website)
By Arnab Paul,   Suresh Venkatasubramanian,
Vol.abs/1412.6621, CoRR (CORR 2014),  2014.
###### Abstract
• Multiple Target Tracking with RF Sensor Networks. (Project Website)
By Maurizio Bocca,   Ossi Kaltiokallio,   Neal Patwari,   Suresh Venkatasubramanian,
Vol.13, Pages 1787-1800, IEEE Trans. Mob. Comput. (TMC 2014),  2014.
• #### 2013

• Computational geometry column 55: new developments in nonnegative matrix factorization. (Project Website)
By Suresh Venkatasubramanian,
Vol.44, Pages 70-78, SIGACT News (SIGACT 2013),  2013.
###### Abstract
• Moving heaven and earth: distances between distributions. (Project Website)
By Suresh Venkatasubramanian,
Vol.44, Pages 56-68, SIGACT News (SIGACT 2013),  2013.
###### Abstract
• On minimizing budget and time in influence propagation over social networks. (Project Website)
By Amit Goyal ,   Francesco Bonchi,   Laks V. S. Lakshmanan,   Suresh Venkatasubramanian,
Vol.3, Pages 179-192, Social Netw. Analys. Mining (SNAM 2013),  2013.
###### Abstract
• Multiple Target Tracking with RF Sensor Networks (Project Website)
By Maurizio Bocca,   Ossi Kaltiokallio,   Neal Patwari,   Suresh Venkatasubramanian,
Vol.abs/1302.4720, CoRR (CORR 2013),  2013.
###### Abstract
• Power to the Points: Validating Data Memberships in Clusterings (Project Website)
By Parasaran Raman,   Suresh Venkatasubramanian,
Vol.abs/1305.4757, CoRR (CORR 2013),  2013.
###### Abstract
• On Interactivity in Arthur-Merlin Communication and Stream Computation. (Project Website)
By Amit Chakrabarti,   Graham Cormode,   Andrew McGregor ,   Justin Thaler,   Suresh Venkatasubramanian,
Vol.20, Pages 180, Electronic Colloquium on Computational Complexity (ECCC) (ECCC 2013),  2013.
###### Abstract
• The many stages of writing a paper, and how to close the deal. (Project Website)
By Suresh Venkatasubramanian,
###### Abstract
• Rethinking Abstractions for Big Data: Why, Where, How, and What. (Project Website)
By Mary W. Hall,   Robert M. Kirby,   Feifei Li ,   Miriah D. Meyer,   Valerio Pascucci,   Jeff M. Phillips,   Robert Ricci,   Jacobus E. van der Merwe,   Suresh Venkatasubramanian,
Vol.abs/1306.3295, CoRR (CORR 2013),  2013.
• #### 2012

• Fast Multiple Kernel Learning With Multiplicative Weight Updates (Project Website)
By John Moeller,   Parasaran Raman,   Avishek Saha,   Suresh Venkatasubramanian,
Vol.abs/1206.5580, CoRR (CORR 2012),  2012.
• #### 2011

• A Gentle Introduction to the Kernel Distance (Project Website)
By Jeff M. Phillips,   Suresh Venkatasubramanian,
Vol.abs/1103.1625, CoRR (CORR 2011),  2011.
• #### 2010

• Questions answered. in theory.: http://cstheory.stackexchange.com/. (Project Website)
By Dave Clarke,   David Eppstein,   Kaveh Ghasemloo,   Lev Reyzin,   András Z. Salamon,   Peter W. Shor,   Aaron Sterling,   Suresh Venkatasubramanian,
Vol.41, Pages 58-60, SIGACT News (SIGACT 2010),  2010.
###### Abstract
• Closeness: A New Privacy Measure for Data Publishing. (Project Website)
By Ninghui Li,   Tiancheng Li,   Suresh Venkatasubramanian,
Vol.22, Pages 943-956, IEEE Trans. Knowl. Data Eng. (TKDE 2010),  2010.
###### Abstract
• Matching Shapes Using the Current Distance (Project Website)
By Sarang C. Joshi,   Raj Varma Kommaraju,   Jeff M. Phillips,   Suresh Venkatasubramanian,
Vol.abs/1001.0591, CoRR (CORR 2010),  2010.
###### Abstract
• A Unified Algorithmic Framework for Multi-Dimensional Scaling (Project Website)
By Arvind Agarwal,   Jeff M. Phillips,   Suresh Venkatasubramanian,
Vol.abs/1003.0529, CoRR (CORR 2010),  2010.
• #### 2009

• Information Theory For Data Management. (Project Website)
By Divesh Srivastava,   Suresh Venkatasubramanian,
Vol.2, Pages 1662-1663, PVLDB (PVLDB 2009),  2009.
###### Abstract
• Sublinear estimation of entropy and information distances. (Project Website)
By Sudipto Guha,   Andrew McGregor ,   Suresh Venkatasubramanian,
Vol.5, Pages 35:1-35:16, ACM Trans. Algorithms (TALG 2009),  2009.
###### Abstract
• Computing Hulls And Centerpoints In Positive Definite Space (Project Website)
By P. Thomas Fletcher,   John Moeller,   Jeff M. Phillips,   Suresh Venkatasubramanian,
Vol.abs/0912.1580, CoRR (CORR 2009),  2009.
• #### 2008

• The Geometric Median on Riemannian Manifolds with Application to Robust Atlas Estimation
By Thomas Fletcher,    Suresh Venkatasubramanian,    Sarang Joshi
Vol.45, Pages S143-S152, Neuroimage,  November,  2008.
###### Abstract

One of the primary goals of computational anatomy is the statistical analysis of anatomical variability in large populations of images. The study of anatomical shape is inherently related to the construction of transformations of the underlying coordinate space, which map one anatomy to another. It is now well established that representing the geometry of shapes or images in Euclidean spaces undermines our ability to represent natural variability in populations. In our previous work we have extended classical statistical analysis techniques, such as averaging, principal components analysis, and regression, to Riemannian manifolds, which are more appropriate representations for describing anatomical variability. In this paper we extend the notion of robust estimation, a well established and powerful tool in traditional statistical analysis of Euclidian data, to manifold valued representations of anatomical variability. In particular, we extend the geometric median, a classic robust estimator of centrality for data in Euclidean spaces.We formulate the geometric median of data on a Riemannian manifold as the minimizer of the sum of geodesic distances to the data points.We prove existence and uniqueness of the geometric median on manifolds with non-positive sectional curvature and give sufficient conditions for uniqueness on positively curved manifolds. Generalizing the Weiszfeld procedure for finding the geometric median of Euclidean data, we present an algorithm for computing the geometric median on an arbitrary manifold. We show that this algorithm converges to the unique solution when it exists. In this paper we exemplify the robustness of the estimation technique by applying the procedure to various manifolds commonly used in the analysis of medical images. Using this approach, we also present a robust brain atlas estimation technique based on the geometric median in the space of deformable images.

• Rectangular layouts and contact graphs
By Adam L. Buchsbaum,    Emden R. Gansner,    Cecilia M. Procopiuc,    Suresh Venkatasubramanian
Vol.4, Pages 8:1-8:28, ACM Transactions on Algorithms (TALG),  March,  2008.
###### Abstract

Contact graphs of isothetic rectangles unify many concepts from applications including VLSI and architectural design, computational geometry, and GIS. Minimizing the area of their corresponding rectangular layouts is a key problem. We study the area-optimization problem and show that it is NP-hard to find a minimum-area rectangular layout of a given contact graph. We present O(n)-time algorithms that construct O(n^2)-area rectangular layouts for general contact graphs and O(n log n)-area rectangular layouts for trees. (For trees, this is an O(log n)-approximation algorithm.) We also present an infinite family of graphs (respectively, trees) that require Ω(n^2) (respectively, Ω(n log n))area. We derive these results by presenting a new characterization of graphs that admit rectangular layouts, using the related concept of rectangular duals. A corollary to our results relates the class of graphs that admit rectangular layouts to rectangle-of-influence drawings.

• #### 2007

• Curve Matching, Time Warping, and Light Fields
By Alon Efrat,    Quanfu Fan,    Suresh Venkatasubramanian
Vol.27, Pages 203-216, Journal of Mathematical Imaging and Vision (JMIV),  April,  2007.
###### Abstract

The problem of curve matching appears in many application domains, like time series analysis, shape matching, speech recognition, and signature verification, among others. Curve matching has been studied extensively by computational geometers, and many measures of similarity have been examined, among them being the Frechet distance (sometimes referred to as the “dog-man” distance).A measure that is very closely related to the Frechet distance but has never been studied in a geometric context is the Dynamic Time Warping measure (DTW), first used in the context of speech recognition. This measure is ubiquitous in different domains, already a surprising fact because notions of similarity usually vary significantly depending on the application. However, this measure suffers from a few obvious drawbacks, resulting from the fact that it is defined between sequences of points, rather than curves and the way in which a curves is sampled to yield such a sequence can dramatically affect the quality of the result. Some attempts have been made to generalize the DTW to continuous domains, but the resulting algorithms have exponential complexity.In this paper we propose similarity measures that attempt to capture the “spirit” of dynamic time warping while being defined over continuous domains, and present efficient algorithms for computing them. Our formulation leads to a very interesting connection with finding short paths in a combinatorial manifold defined on the input chains, and in a deeper sense relate to the way light travels in a medium of variable refractivity.

• Optimisation-on-a-manifold for global registration of multiple 3D point sets. (Project Website)
By Shankar Krishnan,   Pei Yean Lee,   John B. Moore,   Suresh Venkatasubramanian,
Vol.3, Pages 319-340, IJISTA (IJISTA 2007),  2007.
###### Abstract
• Curve Matching, Time Warping, and Light Fields: New Algorithms for Computing Similarity between Curves. (Project Website)
By Alon Efrat,   Quanfu Fan,   Suresh Venkatasubramanian,
Vol.27, Pages 203-216, Journal of Mathematical Imaging and Vision (JMIV 2007),  2007.
• #### 2006

• Dynamic simplification and visualization of large maps. (Project Website)
By Nabil H. Mustafa,   Shankar Krishnan,   Gokul Varadhan,   Suresh Venkatasubramanian,
Vol.20, Pages 273-302, International Journal of Geographical Information Science (GIS 2006),  2006.
• #### 2005

• Streaming and Sublinear Approximation of Entropy and Information Distances (Project Website)
By Sudipto Guha,   Andrew McGregor ,   Suresh Venkatasubramanian,
Vol.abs/cs/0508122, CoRR (CORR 2005),  2005.
###### Abstract
• vLOD: High-Fidelity Walkthrough of Large Virtual Environments. (Project Website)
By Jatin Chhugani,   Budirijanto Purnomo,   Shankar Krishnan,   Jonathan D. Cohen,   Suresh Venkatasubramanian,   David S. Johnson,   Subodh Kumar,
Vol.11, Pages 35-47, IEEE Trans. Vis. Comput. Graph. (TVCG 2005),  2005.
• #### 2004

• Pattern Matching for Sets of Segments. (Project Website)
By Alon Efrat,   Piotr Indyk,   Suresh Venkatasubramanian,
Vol.40, Pages 147-160, Algorithmica (ALGORITHMICA 2004),  2004.
• #### 2003

• Approximate congruence in nearly linear time. (Project Website)
By Piotr Indyk,   Suresh Venkatasubramanian,
Vol.24, Pages 115-128, Comput. Geom. (COMGEO 2003),  2003.
###### Abstract
• The Graphics Card as a Streaming Computer (Project Website)
By Suresh Venkatasubramanian,
Vol.cs.GR/0310002, CoRR (CORR 2003),  2003.
###### Abstract
• Combinatorial and Experimental Methods for Approximate Point Pattern Matching. (Project Website)
By Martin Gavrilov,   Piotr Indyk,   Rajeev Motwani,   Suresh Venkatasubramanian,
Vol.38, Pages 59-90, Algorithmica (ALGORITHMICA 2003),  2003.
• #### 2002

• Discrete mathematical problems with medical applications DIMACS volume 55. (Project Website)
By Suresh Venkatasubramanian,
Vol.33, Pages 9-11, SIGACT News (SIGACT 2002),  2002.
• #### 2000

• Pattern Matching for sets of segments (Project Website)
By Alon Efrat,   Piotr Indyk,   Suresh Venkatasubramanian,
Vol.cs.CG/0009013, CoRR (CORR 2000),  2000.
• #### 1999

• A theory repository on the Web: a proposal. (Project Website)
By Suresh Venkatasubramanian,
Vol.30, Pages 91-95, SIGACT News (SIGACT 1999),  1999.
• #### 1998

• RAPID: Randomized pharmacophore identification for drug design. (Project Website)
By Paul W. Finn,   Lydia E. Kavraki,   Jean-Claude Latombe,   Rajeev Motwani,   Christian R. Shelton,   Suresh Venkatasubramanian,   Andrew Chi-Chih Yao,
Vol.10, Pages 263-272, Comput. Geom. (COMGEO 1998),  1998.
###### Abstract
By Krishna Bharat,   Andrei Z. Broder,   Monika Rauch Henzinger,   Puneet Kumar,   Suresh Venkatasubramanian,
Vol.30, Pages 469-477, Computer Networks (CN 1998),  1998.
• #### 1996

• Efficient Indexing for Broadcast Based Wireless Systems.
By Narayanan Shivakumar,   Suresh Venkatasubramanian,
Vol.1, Pages 433-446, MONET (MONET 1996),  1996.
• #### 1989

• Algorithms for Weighted Graph Problems on The Modified Cellular Graph Automaton.
By Suresh Venkatasubramanian,   Kamala Krithivasan,   C. Pandu Rangan,
Vol.23, Pages 251-279, ITA (ITA 1989),  1989.
• ### Conference

#### 2018

• Decision making with limited feedback. (Project Website)
By Danielle Ensign,   Frielder A. Sorelle,   Scott Neville,   Carlos Eduardo Scheidegger,   Suresh Venkatasubramanian,
In Proceedings of ALT (ALT 2018),  pages 359-367,  2018.
###### Abstract
• Runaway Feedback Loops in Predictive Policing. (Project Website)
By Danielle Ensign,   Sorelle A. Friedler,   Scott Neville,   Carlos Scheidegger,   Suresh Venkatasubramanian,
In Proceedings of FAT (FAT 2018),  pages 160-171,  2018.
###### Abstract
• Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments, ALENEX 2018, New Orleans, LA, USA, January 7-8, 2018. (Project Website)
By Rasmus Pagh,   Suresh Venkatasubramanian,
In Proceedings of ALENEX (ALENEX 2018),  pages ??-??,  2018.
• #### 2017

• The Complexity of Explaining Neural Networks Through (group) Invariants. (Project Website)
By Danielle Ensign,   Scott Neville,   Arnab Paul,   Suresh Venkatasubramanian,
In Proceedings of ALT (ALT 2017),  pages 341-359,  2017.
###### Abstract
• Computational Philosophy: On Fairness in Automated Decision Making. (Project Website)
By Suresh Venkatasubramanian,
In Proceedings of ISAAC (ISAAC 2017),  pages 2:1-2:1,  2017.
• #### 2016

• Sketching, Embedding and Dimensionality Reduction in Information Theoretic Spaces. (Project Website)
By Amirali Abdullah,   Ravi Kumar ,   Andrew McGregor ,   Sergei Vassilvitskii,   Suresh Venkatasubramanian,
In Proceedings of AISTATS (AISTATS 2016),  pages 948-956,  2016.
###### Abstract
• Auditing Black-Box Models for Indirect Influence. (Project Website)
By Philip Adler,   Casey Falk,   Sorelle A. Friedler,   Gabriel Rybeck,   Carlos Scheidegger,   Brandon Smith,   Suresh Venkatasubramanian,
In Proceedings of ICDM (ICDM 2016),  pages 1-10,  2016.
###### Abstract
• Streaming Verification of Graph Properties. (Project Website)
By Amirali Abdullah,   Samira Daruki,   Chitradeep Dutta Roy,   Suresh Venkatasubramanian,
In Proceedings of ISAAC (ISAAC 2016),  pages 3:1-3:14,  2016.
###### Abstract
• Continuous Kernel Learning. (Project Website)
By John Moeller,   Vivek Srikumar,   Sarathkrishna Swaminathan,   Suresh Venkatasubramanian,   Dustin Webb,
In Proceedings of ECML/PKDD (2) (PKDD 2016),  pages 657-673,  2016.
###### Abstract
• A Unified View of Localized Kernel Learning. (Project Website)
By John Moeller,   Sarathkrishna Swaminathan,   Suresh Venkatasubramanian,
In Proceedings of SDM (SDM 2016),  pages 252-260,  2016.
• #### 2015

• Verifiable stream computation and Arthur-Merlin communication (Project Website)
By Amit Chakrabarti,    Graham Cormode,    Andrew McGregor,    Justin Thaler and Suresh Venkatasubramanian
In Proceedings of Computational Complexity Conference (ECCC),  pages ??-??,  July,  2015.
###### Abstract

In the setting of streaming interactive proofs (SIPs), a client (verifier) needs to compute a given function on a massive stream of data, arriving online, but is unable to store even a small fraction of the data. It outsources the processing to a third party service (prover), but is unwilling to blindly trust answers returned by this service. Thus, the service cannot simply supply the desired answer; it must convince the verifier of its correctness via a short interaction after the stream has been seen. In this work we study “barely interactive” SIPs. Specifically, we show that two or three rounds of interaction suffice to solve several query problems — including Index, Median, Nearest Neighbor Search, Pattern Matching, and Range Counting — with polylogarithmic space and communication costs. Such efficiency with O(1) rounds of interaction was thought to be impossible based on previous work. On the other hand, we initiate a formal study of the limitations of constant-round SIPs by introducing a new hierarchy of communication models called Online Interactive Proofs (OIPs). The online nature of these models is analogous to the streaming restriction placed upon the verifier in an SIP. We give upper and lower bounds that (1) characterize, up to quadratic blowups, every finite level of the OIP hierarchy in terms of other well-known communication complexity classes, (2) separate the first four levels of the hierarchy, and (3) reveal that the hierarchy collapses to the fourth level. Our study of OIPs reveals marked contrasts and some parallels with the classic Turing Machine theory of interactive proofs, establishes limits on the power of existing techniques for developing constant-round SIPs, and provides a new characterization of (non-online) Arthur–Merlin communication in terms of an online model.

• Streaming Verification in Data Analysis. (Project Website)
By Samira Daruki,   Justin Thaler,   Suresh Venkatasubramanian,
In Proceedings of ISAAC (ISAAC 2015),  pages 715-726,  2015.
###### Abstract
• Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, April 30 - May 2, 2015 (Project Website)
By Suresh Venkatasubramanian,   Jieping Ye,
In Proceedings of SDM (SDM 2015),  pages ??-??,  2015.
• #### 2014

• Why does unsupervised deep learning work ? A perspective from group theory (Project Website)
By Arnab Paul and Suresh Venkatasubramanian
In Proceedings of International Conference on Learning Representations (ICLR),  pages ??-??,  San Diego CA,  December,  2014.
###### Abstract

Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pretraining: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pretraining step, originally introduced as a mech- anism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.

• Certifying and removing disparate impact (Project Website)
By Sorelle Friedler,    Carlos Scheidegger and Suresh Venkatasubramanian
In Proceedings of 2015 ACM SIGKDD Conference on Knowledge Discovery and Data Mining,  pages ??-??,  December,  2014.
###### Abstract

What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process. When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses. We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.

• A Geometric Algorithm for Scalable Multiple Kernel Learning. (Project Website)
By John Moeller,   Parasaran Raman,   Suresh Venkatasubramanian,   Avishek Saha,
In Proceedings of AISTATS (AISTATS 2014),  pages 633-642,  2014.
• #### 2013

• Radio Tomographic Imaging and Tracking of Stationary and Moving People via Kernel Distance
By Yang Zhao,    Neal Patwari,    Jeff M. Phillips,    and Suresh Venkatasubramanian
(To Appear) In Proceedings of 12th ACM-IEEE Conference on Information Processing in Sensor Networks (IPSN),  pages ??-??,  April,  2013.
###### Abstract

Network radio frequency (RF) environment sensing (NRES) systems pinpoint and track people in buildings using changes in the signal strength measurements made by a wireless sensor network. It has been shown that such systems can locate people who do not participate in the system by wearing any radio device, even through walls, because of the changes that moving people cause to the static wireless sensor network. However, many such systems cannot locate stationary people. We present and evaluate a system which can locate stationary or moving people, without calibration, by using kernel distance to quantify the difference between two histograms of signal strength measurements. From five experiments, we show that our kernel distance-based radio tomographic localization system performs better than the state-of-the-art NRES systems in different non line-of-sight environments.

• Clustering With Center Constraints. (Project Website)
By Parinya Chalermsook,   Suresh Venkatasubramanian,
In Proceedings of FSTTCS (FSTTCS 2013),  pages 401-412,  2013.
###### Abstract
• Track estimation using link line crossing information in wireless networks. (Project Website)
By Peter Hillyard,   Samira Daruki,   Neal Patwari,   Suresh Venkatasubramanian,
In Proceedings of GlobalSIP (GLOBALSIP 2013),  pages 1037-1040,  2013.
###### Abstract
• Power to the Points: Validating Data Memberships in Clusterings. (Project Website)
By Parasaran Raman,   Suresh Venkatasubramanian,
In Proceedings of ICDM (ICDM 2013),  pages 617-626,  2013.
• #### 2012

• Efficient Protocols for Distributed Classification and Optimization
By Hal Daume III,    Jeff M. Phillips,    Avishek Saha,    and Suresh Venkatasubramanian
In Proceedings of 23rd International Conference on Algorithmic Learning Theory (ALT),  pages ??-??,  October,  2012.
###### Abstract

In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daum ́e III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d2 log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd2 log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.

• Approximate bregman near neighbors in sublinear time: beyond the triangle inequality
By Amirali Abdullah,    John Moeller,    Suresh Venkatasubramanian
In Proceedings of ACM Symposium on Computational Geometry (SOCG),  pages 31-40,  June,  2012.
###### Abstract
• Protocols for Learning Classifiers on Distributed Data
By Hal Daume III,    Jeff M. Phillips,    Avishek Saha,    Suresh Venkatasubramanian
In Proceedings of 15th Interntational Conference on Artificial Intelligence and Statistics (AISTATS 2012),  pages ??-??,  La Palma, Canary Islands,  April,  2012.
###### Abstract

We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

• #### 2011

By Avishek Saha,    Piyush Rai,    Hal Daumé III,    Suresh Venkatasubramanian,    Scott L. DuVall
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2011),  pages ??-??,  Athens, Greece,  September,  2011.
###### Abstract

In this paper, we harness the synergy between two important learning paradigms, namely, active learning and domain adaptation. We show how active learning in a target domain can leverage information from a different but related source domain. Our proposed framework, Active Learning Domain Adapted (Alda), uses source domain knowledge to transfer information that facilitates active learning in the target domain. We propose two variants of Alda: a batch B-Alda and an online O-Alda. Empirical comparisons with numerous baselines on real-world datasets establish the efficacy of the proposed methods.

• Horoball Hulls and Extents in Positive Definite Space
By P. Thomas Fletcher,    John Moeller,    Jeff M. Phillips,    Suresh Venkatasubramanian
In Proceedings of 12th Algorithms and Data Structure Symposium (WADS 2011),  pages 386-398,   New York City, New York, USA.,  August ,  2011.
###### Abstract

The space of positive definite matrices P(n) is a Riemannian manifold with variable nonpositive curvature. It includes Euclidean space and hyperbolicspace as submanifolds, and poses significant challenges for the design of algorithms for data analysis. In this paper, we develop foundational geometric structures and algorithms for analyzing collections of such matrices. A key technical contribution of this work is the use of horoballs, a natural generalization of halfspaces for non-positively curved Riemannian manifolds. We propose generalizations of the notion of a convex hull and a center point and approximations of these structures using horoballs and based on novel decompositions of P(n). This leads to an algorithm for approximate hulls using a generalization of extents.

• Comparing Distributions and Shapes Using the Kernel Distance
By Sarang Joshi,    Raj Varma Kommaraju,    Jeff M. Phillips,    Suresh Venkatasubramanian
In Proceedings of 27th Annual Symposium on Computational Geometry (SOCG 2011),  pages 47-56,  Paris, France,  June,  2011.
###### Abstract

Starting with a similarity function between objects, it is possible to define a distance metric (the kernel distance) on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis and geometric measure theory, and have a rich structure that includes an isometric embedding into a Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that runs in near-linear time in the size of P ? Q (an explicit calculation would take quadratic time). We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n poly(1/e, log n)). We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of points sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets.

• Spatially-Aware Comparison and Consensus for Clusterings
By Jeff M. Phillips,    Parasaran Raman,    and Suresh Venkatasubramanian
In Proceedings of 10th SIAM Intenational Conference on Data Mining (SDM 2011),  pages 307-318,  Mesa, Arizona, USA,   April,  2011.
###### Abstract

This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.

• Online Learning of Multiple Tasks and Their Relationships
By Hal Daume III,    Piyush Rai,    Avishek Saha,    Suresh Venkatasubramanian
In Proceedings of Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011),  pages 643-651,  Ft. Lauderdale, FL, USA,  April,  2011.
###### Abstract

• Evaluating Graph Colorings on the GPU (poster)
By A. V. Pascal Grosset,    Peihong Zhu,    Shusen Liu,    Suresh Venkatasubramanian,    Mary Hall
In Proceedings of 16th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPOPP 2011),  pages 297-298,  San Antonio, TX, USA,  February,  2011.
###### Abstract

This paper evaluates features of graph coloring algorithms implemented on graphics processing units (GPUs), comparing coloring heuristics and thread decompositions. As compared to prior work on graph coloring for other parallel architectures, we find that the large number of cores and relatively high global memory bandwidth of a GPU lead to different strategies for the paral lel implementation. Specifically, we find that a simple uniform block partitioning is very effective on GPUs and our parallel coloring heuristics lead to the same or fewer colors than prior approaches for distributed-memory cluster architecture. Our a lgorithm resolves many coloring conflicts across partitioned blocks on the GPU by iterating through the coloring process, befo re returning to the CPU to resolve remaining conflicts. With this approach we get as few color (if not fewer) than the best se quential graph coloring algorithm and performance is close to the fastest sequential graph coloring algorithms which have poor color quality.

• Evaluating graph coloring on GPUs. (Project Website)
By Andre Vincent Pascal Grosset,   Peihong Zhu,   Shusen Liu,   Suresh Venkatasubramanian,   Mary W. Hall,
In Proceedings of PPOPP (PPOPP 2011),  pages 297-298,  2011.
• #### 2010

• Johnson-Lindenstrauss Dimensionality Reduction on the Simplex
By Rasmus J. Kyng,    Jeff M. Phillips,    Suresh Venkatasubramanian
In Proceedings of the 20th Fall Workshop on Computational Geometry (FWCG 2010),  pages 1-4,  NY, USA,  October,  2010.
###### Abstract

We propose an algorithm for dimensionality reduction on the simplex, mapping a set of high-dimensional distributions to a space of lower-dimensional distributions, whilst approximately preserving pairwise Hellinger distance between distributions. By introducing a restriction on the input data to distributions that are in some sense quite smooth, we can map $n$ points on the $d$-simplex to the simplex of $O(eps^{-2}log n)$ dimensions with $eps$-distortion with high probability. The techniques used rely on a classical result by Johnson and Lindenstrauss on dimensionality reduction for Euclidean point sets and require the same number of random bits as non-sparse methods proposed by Achlioptas for database-friendly dimensionality reduction.

• Universal Multi-Dimensional Scaling
By Arvind Agarwal,    Jeff M. Phillips,    Suresh Venkatasubramanian
In Proceedings of 16th Annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD),  pages ??-??,  Washington, DC , USA,  August,  2010.
###### Abstract

In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods in comparable time. Moreover, they have a small memory footprint and scale effectively for large data sets. We expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a complement to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, by showing that projecting to a random O((1/epsilon) log n)- dimensional sphere causes only an "epsilon distortion in the geodesic distances.

• Information theory for data management. (Project Website)
By Divesh Srivastava,   Suresh Venkatasubramanian,
In Proceedings of SIGMOD Conference (SIGMOD 2010),  pages 1255-1256,  2010.
• #### 2009

• Streamed Learning: One-Pass SVMs
By Piyush Rai,    Hal Daume III,    Suresh Venkatasubramanian
In Proceedings of Twenty-First International Joint Conference on Artificial Intelligence (IJCAI 2009),  pages 1211-1216,  Pasadena, CA, USA,  July,  2009.
###### Abstract

We present a streaming model for large scale classification (in the context of $ell_2$-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The $ell_2$-SVM is known to have an equivalent formulation in terms of minimum enclosing balls (MEB) and an efficient algorithm based on the idea of core sets exists (CVM) (Tsang et al., 2005) which learns a (1 $epsilon$) approximate MEB for a set of points and yield an approximate solution to corresponding SVM instance. However CVM works in batch mode requiring multiple passes over the data. We present a single-pass SVM based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector using simple Perceptron-like update equations. Our algorithm performs polylogarithmic computation at each example, requires very small and constant storage, and finds simpler solutions (measured in terms of the number of support vectors). Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other state-of-the-art SVM solvers. We also discuss some open issues and possible extensions.

• Approximate Shape Matching And Symmetry Detection for 3D Shapes With Guaranteed Error Bounds
By Shankar Krishnan,    Suresh Venkatasubramanian
In Proceedings of IEEE International Conference on Shape Modeling and Applications (SMI 2009),  pages 44-51,  Beijing, China,  June,  2009.
###### Abstract

In this paper, we describe a system for approximate shape matching and symmetry (rotation and reﬂection) detection of geometric shapes represented as point clouds. Rather than using the leastsquares distance as a measure of similarity between shapes, we use the Hausdorff distance between point sets as the underlying shape metric. This allows us to exploit methods from geometric pattern matching to return symmetries and rigid transformation matches with guaranteed error bounds on the quality of our solution. The approximation is determined by intuitive user-speciﬁed input precision and distance threshold parameters. Another important feature of our method is that it leverages FFT-based techniques for string matching to compute all approximate symmetries simultaneously. Our algorithm is simple to implement and is efﬁcient; we present a detailed experimental study.

• Type-Based Categorization of Relational Attributes
In Proceedings of 12th International Conference on Extending Database Technology (EDBT 2009),  pages 84-95,  Saint-Petersburg, Russia,  March,  2009.
###### Abstract

In this work we concentrate on categorization of relational attributes based on their data type. Assuming that attribute type/characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying the attributes based on the semantic type of the data contained therein (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, like clustering and indexing based on data types. This application is useful in cases where very large data collections that are generated in a distributed, ungoverned fashion end up having unknown, incomplete, inconsistent or very complex schemata and schema level meta-data. We concentrate on heuristically generating type-based attribute signatures based on both local and global computation approaches. We show experimentally that by decomposing data into q-grams and then considering signatures based on q-gram distributions, we achieve very good classification accuracy under the assumption that a large sample of the data is available for building the signatures. Then, we turn our attention to cases where a very small sample of the data is available, and hence accurately capturing the q-gram distribution of a given data type is almost impossible. We propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.

• Metric Functional Dependencies
By Nick Koudas,    Avishek Saha,    Divesh Srivastava,    Suresh Venkatasubramanian
In Proceedings of 25th International Conference on Data Engineering (ICDE 2009),  pages 1275-1278,  Shanghai, China,  March,  2009.
###### Abstract

When merging data from various sources, it is often the case that small variations in data format and interpretation cause traditional functional dependencies (FDs) to be violated, without there being an intrinsic violation of semantics. Examples include differing address formats, or different reported latitude/longitudes for a given address. In such cases, we would like to specify a dependency structure on the merged data that is robust to such small differences.In this paper, we define metric functional dependencies, which strictly generalize traditional FDs by allowing small differences (controlled by a metric) in values of the consequent attribute of an FD. We show that this notion satisfies many of the standard properties of functional dependencies, and we present efficient algorithms for the verification problem: determining whether a given metric FD (MFD) holds for a given relation. We show that MFDs can be combined with approximate FDs, allowing tuples with identical antecedents to map to different consequents, some of which correspond to small (acceptable) variations, with others indicating more serious data quality issues. We experimentally demonstrate the validity and efficiency of our approach on various data sets that possess different underlying metrics, and lie in multidimensional spaces.

• Streaming for large scale NLP: Language Modelling
By Amit Goyal,    Hal Daume III,    Suresh Venkatasubramanian
In Proceedings of North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT),  pages 512-520,  Boulder, Colorado, USA,  2009.
###### Abstract

In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (4 GB RAM) desktop machine. Statistical machine translation experimental results corroborate that the resulting high-n approximate small language model is as effective as models obtained from other count pruning methods.

• Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams. (Project Website)
By Tamraparni Dasu,   Shankar Krishnan,   Dongyu Lin,   Suresh Venkatasubramanian,   Kevin Yi,
In Proceedings of IDA (IDA 2009),  pages 21-34,  2009.
• #### 2008

• Robust Statistics on Riemannian Manifolds via the Geometric Median
By Thomas Fletcher,    Sarang Joshi,    Suresh Venkatasubramanian
In Proceedings of Computer Vision and Pattern Recognition (CVPR 2008),  pages 1-8,  Anchorage, Alaska, USA,  August,  2008.
###### Abstract

The geometric median is a classic robust estimator of centrality for data in Euclidean spaces. In this paper we formulate the geometric median of data on a Riemannian manifold as the minimizer of the sum of geodesic distances to the data points. We prove existence and uniqueness of the geometric median on manifolds with non-positive sectional curvature and give sufficient conditions for uniqueness on positively curved manifolds. Generalizing the Weiszfeld procedure for finding the geometric median of Euclidean data, we present an algorithm for computing the geometric median on an arbitrary manifold. We show that this algorithm converges to the unique solution when it exists. This method produces a robust central point for data lying on a manifold, and should have use in a variety of vision applications involving manifolds. We give examples of the geometric median computation and demonstrate its robustness for three types of manifold data: the 3D rotation group, tensor manifolds, and shape spaces.

• Validating Multi-column Schema Matchings by Type
By Bing Tian Dai,    Nick Koudas,    Divesh Srivastava,    Anthony K. H. Tung,    Suresh Venkatasubramanian
In Proceedings of 24th International Conference on Data Engineering (ICDE 2008),  pages 120-129,  Cancun, Mexico,  April,  2008.
###### Abstract

Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have similar “semantic type”, and that this can be captured using distributions over values (“statistical types”), we develop a method for validating 1-1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for “splitting” the string values in a column to identify substrings that are likely to match with the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.

• #### 2007

• t-closeness: Privacy Beyond k-Anonymity and l-Diversity
By Ninghui Li,    Tiancheng Li,    Suresh Venkatasubramanian
In Proceedings of 23rd International Conference on Data Engineering (ICDE 2007),  pages 106-115,  Istanbul, Turkey,  April,  2007.
###### Abstract

The $k$-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain “identifying” attributes) contains at least $k$ records. Recently, several authors have recognized that $k$-anonymity cannot prevent attribute disclosure. The notion of $ell$-diversity has been proposed to address this; $ell$-diversity requires that each equivalence class has at least $ell$ well-represented values for each sensitive attribute.In this paper we show that $ell$-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called $t$-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold $t$). We choose to use the Earth Mover Distance measure for our $t$-closeness requirement. We discuss the rationale for $t$-closeness and illustrate its advantages through examples and experiments.

• Directed Graphs and Rectangular Layouts
By Adam Buchsbaum,    Emden Gansner,    Suresh Venkatasubramanian
In Proceedings of 2007 Asia-Pacific Symposium on Visualisation (APVIS 2007),  pages 61-64,  Sydney, NSW, Australia,  February,  2007.
###### Abstract

This paper deals with the problem, arising in practice, of drawing a directed graph as a collection of disjoint, isothetic rectangles, where the rectangles of the nodes of each edge must touch and where the placement of the rectangles respects the ordering of the edges. It provides characterizations for those graphs having the special type of rectangular layout known as a rectangular dual. It then characterizes the st-graphs having rectangular layouts in terms of the existence of certain planar embeddings and the non-existence of a particular subgraph.

• Restricted Strip Covering and the Sensor Cover Problem
By Adam Buchsbaum,    Alon Efrat,    Shaili Jain,    Suresh Venkatasubramanian,    Kevin Yi
In Proceedings of 18th ACM-SIAM Symposium on Discrete Algorithms (SODA 2007),  pages 1056-1063,  New Orleans, Louisiana, USA,  January ,  2007.
###### Abstract

Given a set of objects with durations (jobs) that cover a base region, can we schedule the jobs to maximize the duration the original region remains covered? We call this problem the sensor cover problem. This problem arises in the context of covering a region with sensors. For example, suppose you wish to monitor activity along a fence by sensors placed at various fixed locations. Each sensor has a range and limited battery life. The problem is to schedule when to turn on the sensors so that the fence is fully monitored for as long as possible. This one dimensional problem involves intervals on the real line. Associating a duration to each yields a set of rectangles in space and time, each specified by a pair of fixed horizontal endpoints and a height. The objective is to assign a position to each rectangle to maximize the height at which the spanning interval is fully covered. We call this one dimensional problem restricted strip covering. If we replace the covering constraint by a packing constraint, the problem is identical to dynamic storage allocation, a scheduling problem that is a restricted case of the strip packing problem. We show that the restricted strip covering problem is NP-hard and present an O(log log n)-approximation algorithm. We present better approximations or exact algorithms for some special cases. For the uniform-duration case of restricted strip covering we give a polynomial-time, exact algorithm but prove that the uniform-duration case for higher-dimensional regions is NP-hard. Finally, we consider regions that are arbitrary sets, and we present an O(log n)-approximation algorithm.

• #### 2006

• Rapid Identification of Column Heterogeneity
By Bing Tian Dai,    Nick Koudas,    Beng Chin Ooi,    Divesh Srivastava,    Suresh Venkatasubramanian
In Proceedings of IEEE International Conference on Data Mining (ICDM 2006),  pages 159-170,  Hong Kong, China,  December,  2006.
###### Abstract

Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, e.g., accuracy, freshness and completeness, to capture common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and describe our technique to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present detailed experimental results, using diverse data sets of different types, to demonstrate that our approach provides a robust mechanism for identifying and quantifying database column heterogeneity.

• Spatial Scan Statistics: Approximations and Performance Study
By Deepak Agarwal,    Andrew McGregor,    Jeff M. Phillips,    Suresh Venkatasubramanian,    Zhengyuan Zhu
In Proceedings of 12th Annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006),  pages 24-33,  Philadelphia, USA,  August,  2006.
###### Abstract

Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance.In recent years, there has been much effortinvested in designing efficient algorithms for finding such ?high discrepancy? regions, with methods ranging from fast heuristics for special cases, to general grid-based methods,and to efficient approximation algorithms with provable guarantees on performance and quality.In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describea simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancyfunctions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of priormethods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct adetailed experimental comparison of these methods with a number of known methods, demonstrating that our approximationalgorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off.All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of thesemethods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancymaximization problem must use space linear in the input.

• Spatial Scan Statistics: Approximations and Performance Study
By Deepak Agarwal,    Andrew McGregor,    Jeff M. Phillips,    Suresh Venkatasubramanian,    Zhengyuan Zhu
In Proceedings of The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006),  pages 24-33,  Philadelphia, USA,  August,  2006.
###### Abstract

Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such %u201Chigh discrepancy%u201D regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality. In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off. All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.

• The Hunting of the Bump: On Maximizing Statistical Discrepancy
By Deepak Agarwal,    Jeff M. Phillips,    Suresh Venkatasubramanian
In Proceedings of 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2006),  pages 1137-1146,   Miami, USA,  January ,  2006.
###### Abstract

Anomaly detection has important applications in bio-surveilance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely, finding clusters in the measured data may not actually represent true anomalies. These clusters may likely be the clusters of the baseline distribution. Hence, a discrep-ancy function is often used to examine how different measured data is to baseline data within a region. An anomalous region is thus defined to be one with high discrepancy.In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees,both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy;roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the sameset of shapes.We derive general discrepancy functions for data generated from a one- parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for datafrom a Poisson distribution. We present an algorithm running in O( 1/epsilon n^2 (logn)^2) that computes the maximum discrepancy rectangle to within additive error ., for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian,Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).

• Column Heterogeneity as a Measure of Data Quality. (Project Website)
By Bing Tian Dai,   Nick Koudas,   Beng Chin Ooi,   Divesh Srivastava,   Suresh Venkatasubramanian,
In Proceedings of CleanDB (CLEANDB 2006),  pages ??-??,  2006.
###### Abstract
• Streaming and sublinear approximation of entropy and information distances. (Project Website)
By Sudipto Guha,   Andrew McGregor ,   Suresh Venkatasubramanian,
In Proceedings of SODA (SODA 2006),  pages 733-742,  2006.
• #### 2005

• Hardware-Assisted Natural Neighbor Interpolation. (Project Website)
By Quanfu Fan,   Alon Efrat,   Vladlen Koltun,   Shankar Krishnan,   Suresh Venkatasubramanian,
In Proceedings of ALENEX/ANALCO (ALENEX 2005),  pages 111-120,  2005.
###### Abstract
• On stationarity in Internet measurements through an information-theoretic lens. (Project Website)
By Balachander Krishnamurthy,   Suresh Venkatasubramanian,   Harsha V. Madhyastha,
In Proceedings of ICDE Workshops (ICDE 2005),  pages 1185,  2005.
###### Abstract
• Global Registration of Multiple 3D Point Sets via Optimization-on-a-Manifold. (Project Website)
By Shankar Krishnan,   Pei Yean Lee,   John B. Moore,   Suresh Venkatasubramanian,
In Proceedings of Symposium on Geometry Processing (SGP 2005),  pages 187-196,  2005.
• #### 2004

• Compressing Large Boolean Matrices using Reordering Techniques. (Project Website)
By David S. Johnson,   Shankar Krishnan,   Jatin Chhugani,   Subodh Kumar,   Suresh Venkatasubramanian,
In Proceedings of VLDB (VLDB 2004),  pages 13-23,  2004.
• #### 2003

• Application of the two-sided depth test to CSG rendering. (Project Website)
By Sudipto Guha,   Shankar Krishnan,   Kamesh Munagala,   Suresh Venkatasubramanian,
In Proceedings of SI3D (SI3D 2003),  pages 177-180,  2003.
###### Abstract
• Streaming Geometric Optimization Using Graphics Hardware. (Project Website)
By Pankaj K. Agarwal,   Shankar Krishnan,   Nabil H. Mustafa,   Suresh Venkatasubramanian,
In Proceedings of ESA (ESA 2003),  pages 544-555,  2003.
• #### 2002

• Hardware-assisted computation of depth contours. (Project Website)
By Shankar Krishnan,   Nabil H. Mustafa,   Suresh Venkatasubramanian,
In Proceedings of SODA (SODA 2002),  pages 558-567,  2002.
• #### 2001

• Pattern matching for sets of segments. (Project Website)
By Alon Efrat,   Piotr Indyk,   Suresh Venkatasubramanian,
In Proceedings of SODA (SODA 2001),  pages 295-304,  2001.
###### Abstract
• Hardware-assisted view-dependent map simplification. (Project Website)
By Nabil H. Mustafa,   Eleftherios Koutsofios,   Shankar Krishnan,   Suresh Venkatasubramanian,
In Proceedings of Symposium on Computational Geometry (COMPGEOM 2001),  pages 50-59,  2001.
• #### 2000

• On the decidability of accessibility problems (extended abstract). (Project Website)
By Rajeev Motwani,   Rina Panigrahy,   Vijay A. Saraswat,   Suresh Venkatasubramanian,
In Proceedings of STOC (STOC 2000),  pages 306-315,  2000.
###### Abstract
• On external memory graph traversal. (Project Website)
By Adam L. Buchsbaum,   Michael H. Goldwasser,   Suresh Venkatasubramanian,   Jeffery Westbrook,
In Proceedings of SODA (SODA 2000),  pages 859-860,  2000.
###### Abstract
• Approximate congruence in nearly linear time. (Project Website)
By Piotr Indyk,   Suresh Venkatasubramanian,
In Proceedings of SODA (SODA 2000),  pages 354-360,  2000.
• #### 1999

• Geometric Matching Under Noise: Combinatorial Bounds and Algorithms. (Project Website)
By Piotr Indyk,   Rajeev Motwani,   Suresh Venkatasubramanian,
In Proceedings of SODA (SODA 1999),  pages 457-465,  1999.
###### Abstract
• Geometric Pattern Matching: A Performance Study. (Project Website)
By Martin Gavrilov,   Piotr Indyk,   Rajeev Motwani,   Suresh Venkatasubramanian,
In Proceedings of Symposium on Computational Geometry (COMPGEOM 1999),  pages 79-85,  1999.
• #### 1998

• Proximity Search in Databases. (Project Website)
By Roy Goldman,   Narayanan Shivakumar,   Suresh Venkatasubramanian,   Hector Garcia-Molina,
In Proceedings of VLDB (VLDB 1998),  pages 26-37,  1998.
• #### 1997

• Storage Management for Evolving Databases. (Project Website)
By Jon M. Kleinberg,   Rajeev Motwani,   Prabhakar Raghavan,   Suresh Venkatasubramanian,
In Proceedings of FOCS (FOCS 1997),  pages 353-362,  1997.
###### Abstract
• RAPID: Randomized Pharmacophore Identification for Drug Design. (Project Website)
By Paul W. Finn,   Lydia E. Kavraki,   Jean-Claude Latombe,   Rajeev Motwani,   Christian R. Shelton,   Suresh Venkatasubramanian,   Andrew Chi-Chih Yao,
In Proceedings of Symposium on Computational Geometry (COMPGEOM 1997),  pages 324-333,  1997.
• #### 1996

• Geometric Manipulation of Flexible Ligands. (Project Website)
By Paul W. Finn,   Dan Halperin,   Lydia E. Kavraki,   Jean-Claude Latombe,   Rajeev Motwani,   Christian R. Shelton,   Suresh Venkatasubramanian,
In Proceedings of WACG (WACG 1996),  pages 67-78,  1996.
• ### Workshop

#### 2012

• Sensor Network Localization for Moving Sensors
by Arvind Agarwal,    Hal Daume III,    Jeff M. Phillips,    and Suresh Venkatasubramanian
In Proceedings of the 2nd IEEE ICDM International Workshop on Data Mining in Networks, in conjunction with the ICDM (DaMNet), pages , ,  December,  2012.  .
###### Abstract

Sensor network localization (SNL) is the problem of determining the locations of the sensors given sparse and usually noisy inter-communication distances among them. In this work we propose an iterative algorithm named PLACEMENT to solve the SNL problem. This iterative algorithm requires an initial estimation of the locations and in each iteration, is guaranteed to reduce the cost function. The proposed algorithm is able to take advantage of the good initial estimation of sensor locations making it suitable for localizing moving sensors, and also suitable for the refinement of the results produced by other algorithms. Our algorithm is very scalable. We have experimented with a variety of sensor networks and have shown that the proposed algorithm outperforms existing algorithms both in terms of speed and accuracy in almost all experiments. Our algorithm can embed 120,000 sensors in less than 20 minutes.

• #### 2011

• Generating a Diverse Set of High-Quality Clusterings
by Jeff M. Phillips,    Parasaran Raman,    Suresh Venkatasubramanian
In Proceedings of the the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, in conjunction with the ECML/PKDD 2011 (MultiClust 2011), pages ??-??, Athens, Greece,  September,  2011.  .
###### Abstract

We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.

• The Johnson-Lindenstrauss Transform: An Empirical Study
by Suresh Venkatasubramanian,    Qiushi Wang
In Proceedings of the Workshop on Algorithms Engineering and Experimentation, in conjunction with the SODA 2011 (ALENEX11), pages 164-173, San Francisco, California, USA,  January ,  2011.  SIAM.
###### Abstract

The Johnson-Lindenstrauss Lemma states that a set of $n$ points may be embedded in a space of dimension $O(\log n/\eps^2)$ while preserving all pairwise distances within a factor of $(1+\epsilon)$ with high probability. It has inspired a number of proofs that extend the result, simplify it, and improve the efficiency of computing the resulting embedding. The lemma is a critical tool in the realm of dimensionality reduction and high dimensional approximate computational geometry. It is also employed for data mining in domains that analyze intrinsically high dimensional objects such as images and text. However, while algorithms for performing the dimensionality reduction have become increasingly sophisticated, there is little understanding of the behavior of these embeddings in practice. In this paper, we present the first comprehensive study of the empirical behavior of algorithms for dimensionality reduction based on the JL Lemma. Our study answers a number of important questions about the quality of the embeddings and the performance of algorithms used to compute them. Among our key results: Determining a likely range for the big-Oh constant in practice for the dimension of the target space, and demonstrating the accuracy of the predicted bounds. Finding "best in class" algorithms over wide ranges of data size and source dimensionality, and showing that these depend heavily on parameters of the data as well its sparsity. Developing the best implementation for each method, making use of non-standard optimized codes for key subroutines. Identifying critical computational bottleneck that can spur further theoretical study of efficient algorithms.

• #### 2010

• Domain Adaptation Meets Active Learning
by Piyush Rai,    Avishek Saha,    Hal Daume III,    Suresh Venkatasubramanian
In Proceedings of the Workshop on Active Learning For NLP, in conjunction with the NAACL-HLT (ALNLP 2010), pages 27-32, Los Angeles, CA, USA,  June,  2010.  ACL.
###### Abstract

In this work, we show how active learning in some (target) domain can leverage information from a different but related source) domain. We present an algorithm that harnesses the source domain data to learn the best possible initializer hypothesis for doing active learning in the target domain, resulting in improved label complexity. We also present a variant of this algorithm which additionally uses the domain divergence information to selectively query the most informative points in the target domain, leading to further reductions in label complexity. Experimental results on a variety of datasets establish the efficacy of the proposed methods.

• Sketching Techniques for Large-Scale NLP
by Amit Goyal,    Jagadeesh Jagarlamudi,    Hal Daume III,    Suresh Venkatasubramanian
In Proceedings of the 6th Web as Corpus Workshop, in conjunction with the NAACL-HLT (WAC 2010), pages 17-25, Los Angeles, CA, USA,  June,  2010.  ACL.
###### Abstract

In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in context of streaming data. We explore sketch techniques, especially Count-Min Sketch, which approximates the frequency of a word-pair in the corpus without explicitly storing the word-pairs themselves. We further use the idea of a conservative update with Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all words and word-pairs counts computed from 37 GB of web data in just 2 billion counters (8 GB main memory). The number of these counters is upto 30 times less than the stream size which is really a big memory and space gain. In Semantic Orientation experiments, the PMI scores computed from 2 billion counters are as effective as exact PMI scores.

• Incremental Multi-Dimensional Scaling
by Arvind Agarwal,    Jeff M. Phillips,    Hal Daume III,    Suresh Venkatasubramanian
In Proceedings of the The Learning Workshop at Snowbird, in conjunction with the (), pages , ,  April ,  2010.  .
###### Abstract

Multi-Dimensional Scaling (MDS) is a widely used method for embedding a given distance matrix into a low dimensional space, used both as a preprocessing step for many machine learning problems, as well as a visualization tool in its own right. In this paper, we present an incremental version of MDS (iMDS). In iMDS, d-dimensional data points are presented in a stream, and the task is to embed the current d-dimensional data point into a k-dimensional space for k < d such that distances from the current point to the previous points are preserved. Let {x1,..., xt-1} ? R k be the data points at time step t that have already been embedded into a k-dimensional space, {r1,..., rt-1} be the given distances computed in R d, then objective of iMDS is to find a point xt: xt = arg min p?R k ?t-1 i=1

• ### Tech Report

#### 2014

• A directed isoperimetric inequality with application to Bregman near neighbor lower bounds (Project Website)
By Amirali Abdullah,    Suresh Venkatasubramanian
,  April,  2014.
###### Abstract

Bregman divergences Dϕ are a class of divergences parametrized by a convex function ϕ and include well known distance functions like ℓ22 and the Kullback-Leibler divergence. There has been extensive research on algorithms for problems like clustering and near neighbor search with respect to Bregman divergences; in all cases, the algorithms depend not just on the data size n and dimensionality d, but also on a structure constant μ≥1 that depends solely on ϕ and can grow without bound independently. In this paper, we provide the first evidence that this dependence on μ might be intrinsic. We focus on the problem of approximate near neighbor search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for c-approximate near-neighbor search that admits r probes must use space Ω(n1+μcr). In contrast, for LSH under ℓ1 the best bound is Ω(n1+1cr). Our new tool is a directed variant of the standard boolean noise operator. We show that a generalization of the Bonami-Beckner hypercontractivity inequality exists “in expectation” or upon restriction to certain subsets of the Hamming cube, and that this is sufficient to prove the desired isoperimetric inequality that we use in our data structure lower bound. We also present a structural result reducing the Hamming cube to a Bregman cube. This structure allows us to obtain lower bounds for problems under Bregman divergences from their ℓ1 analog. In particular, we get a (weaker) lower bound for approximate near neighbor search of the form Ω(n1+1cr) for an r-query non-adaptive data structure, and new cell probe lower bounds for a number of other near neighbor questions in Bregman space.