Dr. Fayyad will give a historical summary of AI and ML and discuss what has worked, what hasn't, and why. He will proceed to emphasize and explain how data is the real enabler of practical applications of AI, and will pose the question of how to build data assets that enable successful use of AI and ML. Dr. Fayyad will draw on several case studies from multiple fields, including online advertising, financial services, and healthcare. He will discuss how to manage data in an organization and how to turn it into an asset that contributes to solving real problems and opening new revenue opportunities.

Usama Fayyad Bio: Usama Fayyad is co-founder and chief technology officer (CTO) at OODA Health. OODA Health was founded in 2017 and is focused on removing waste and inefficiency from the healthcare administrative system by using an innovative approach to change the paradigm between payers, patients, and providers. As leader of the technology team, Usama is responsible for designing and building the OODA platform to leverage AI/ML in service of delivering a retail payment experience for providers and members. Usama is also chairman at Open Insights, a technology and strategic consulting firm he founded in 2008 that helps enterprises deploy data-driven solutions to grow revenue from data assets. Prior to OODA Health, Usama served as global chief data officer and group managing director at Barclays in London, after he launched the largest tech startup accelerator in MENA as Executive Chairman of Oasis500 in Jordan. His background includes chairman and CEO roles at several startups, including Blue Kangaroo Corp, DMX Group, and digiMine Inc (Audience Science). Usama was the first person to hold the chief data officer title, when Yahoo! acquired his second startup in 2004. At Yahoo! he built the Strategic Data Solutions group and founded Yahoo! Research Labs, where early work on big data led to open source contributions and the launch of Hadoop. He has held leadership roles at Microsoft and founded the Machine Learning Systems group at NASA's Jet Propulsion Laboratory, where his work on machine learning resulted in the top Excellence in Research award from Caltech and a US Government medal from NASA. Usama has published over 100 technical articles on data mining, data science, AI/ML, and databases. He holds over 30 patents and is a fellow of both the Association for the Advancement of Artificial Intelligence (AAAI) and the Association for Computing Machinery (ACM). Usama earned his PhD in Engineering in AI and Machine Learning from the University of Michigan, Ann Arbor. He has edited two books on data mining and served as editor-in-chief of two key industry journals. He has also served on the boards or advisory boards of several private and public companies, including Criteo, InvenSense, RapidMiner, Stella.AI, Virsec, Silniva, Abe.AI, NetSeer, ChoiceStream, and Medio. On the academic front, he is on the advisory boards of the Data Science Institute at Imperial College, AAI at UTS, and the University of Michigan College of Engineering national advisory board.
Markov Chain Monte Carlo originated with the classic paper of Metropolis et al. (1953). In this talk, we will briefly review MCMC and then dive into Hamiltonian Monte Carlo. Beginning with Hamiltonian dynamics and the reasoning behind why the algorithm works, we will cover its advantages and disadvantages relative to traditional MCMC, implementation details, and other related topics.
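As background for the talk, here is a minimal sketch of the leapfrog integrator and Metropolis correction at the heart of Hamiltonian Monte Carlo, assuming a standard Gaussian momentum and an illustrative Gaussian target; names and tuning values are illustrative, not from the talk.

```python
import numpy as np

def leapfrog(q, p, grad_log_p, step_size, n_steps):
    """One leapfrog trajectory for Hamiltonian Monte Carlo."""
    q, p = q.copy(), p.copy()
    p += 0.5 * step_size * grad_log_p(q)           # half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                          # full step for position
        p += step_size * grad_log_p(q)              # full step for momentum
    q += step_size * p
    p += 0.5 * step_size * grad_log_p(q)            # final half step
    return q, -p                                    # negate momentum for reversibility

def hmc_sample(log_p, grad_log_p, q0, n_samples, step_size=0.1, n_steps=20):
    """Draw samples with a Metropolis accept/reject step on the Hamiltonian."""
    rng = np.random.default_rng(0)
    q, samples = np.asarray(q0, dtype=float), []
    for _ in range(n_samples):
        p = rng.standard_normal(q.shape)
        q_new, p_new = leapfrog(q, p, grad_log_p, step_size, n_steps)
        h_old = -log_p(q) + 0.5 * p @ p             # Hamiltonian = potential + kinetic
        h_new = -log_p(q_new) + 0.5 * p_new @ p_new
        if rng.random() < np.exp(h_old - h_new):    # accept with prob min(1, e^{-dH})
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Example: sample from a 2D standard Gaussian.
samples = hmc_sample(lambda q: -0.5 * q @ q, lambda q: -q,
                     q0=np.zeros(2), n_samples=1000)
```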
Recent advances in computational tools and technologies have enabled major strides in the fields of artificial intelligence and computer vision. Although they achieve excellent performance on the tasks they were designed for, state-of-the-art artificial systems lack flexibility, autonomy, and adaptability when they encounter a new environment, are given a different task, or operate in the presence of noise. These deficits are especially critical for functioning robustly and efficiently in complex, real-world settings. The brain, by contrast, demonstrates enormous generality and flexibility, to the extent that it hides the complexity of the task or environment by processing them effortlessly, accurately, and in real time. Despite significant progress in neuroscience, however, we still know little about how such robust processing and perception arise from brain circuits. Against this backdrop, computer science and neuroscience have long looked to each other for tools and ideas, for inspiration and innovation. In this talk, we review the interplay between neuroscience and computational science, focusing on connections between visual neuroscience and artificial vision and imaging systems, with examples from active projects in the Vision Computation Lab (directed by Dr. Nategh) that integrate advances in both computational science and neuroscience to explore innovative solutions to similar problems in both fields.
Inspired by the works of Forman on discrete Morse theory, which is a combinatorial adaptation to cell complexes of classical Morse theory on manifolds, we introduce a discrete analogue of the stratified Morse theory of Goresky and MacPherson (1988). We describe the basics of this theory and prove fundamental theorems relating the topology of a general simplicial complex with the critical simplices of a discrete stratified Morse function on the complex. We also provide an algorithm that constructs a discrete stratified Morse function out of an arbitrary function defined on a finite simplicial complex; this is different from simply constructing a discrete Morse function on such a complex. We borrow Forman's idea of a "user's guide," where we give simple examples to convey the utility of our theory in data analysis. This is joint work with Kevin Knudson at the University of Florida, https://arxiv.org/abs/1801.03183.
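For background, Forman's definition that the stratified theory builds on: a function $f\colon K\to\mathbb{R}$ on a simplicial complex $K$ is a discrete Morse function if, for every $p$-simplex $\sigma$,
\[
\#\{\tau^{(p+1)} > \sigma : f(\tau) \le f(\sigma)\} \le 1
\qquad\text{and}\qquad
\#\{\nu^{(p-1)} < \sigma : f(\nu) \ge f(\sigma)\} \le 1,
\]
and $\sigma$ is critical when both sets are empty. An arbitrary function on a complex generally violates these conditions, which is part of why constructing a discrete stratified Morse function from an arbitrary function differs from constructing a discrete Morse function.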
Graphs are commonly used to encode relationships among entities, yet their abstractness makes them difficult to analyze visually. At the same time, topological data analysis is an emerging area in exploratory data analysis, and one of its main tools, persistent homology, has become a popular technique for studying the structure of complex, high-dimensional data. This talk will focus on two projects that use persistent homology to support visual analysis of graphs. In the first project, we use persistent homology to quantify structural changes in time-varying graphs. More specifically, we transform time-varying graph instances into a metric space, extract topological features using persistent homology, and compare those features over time, providing a visualization that helps to identify patterns of behavior. In the second project, we address the graph drawing challenge by leveraging topological features of a graph as derived information for interactive graph drawing. This approach enables studying the substructures of a force-directed graph layout by adding forces to the layout through an interactive persistence barcode that is connected to topological features. In addition to these projects, I'll discuss some future directions for the work. This work is in collaboration with Mustafa Hajij (USF), Ashley Suh (USF), Bei Wang (Utah), and Carlos Scheidegger (Arizona).
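As a minimal illustration of the kind of topological feature extraction involved (the projects above compute richer persistent homology than this), here is 0-dimensional persistence of a weighted graph filtration via union-find: every vertex is born at time zero and a component dies when the edge merging it into another component appears.

```python
def zeroth_persistence(n_vertices, weighted_edges):
    """0-dimensional persistence of a graph filtered by edge weight.

    weighted_edges: list of (weight, u, v). Returns finite (birth, death) bars
    and the number of components that never die (infinite bars).
    """
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    pairs = []                              # finite bars
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge: one component dies at weight w
            pairs.append((0.0, w))
    n_infinite = len({find(x) for x in range(n_vertices)})
    return pairs, n_infinite

# Example: a 4-cycle filtered by its edge weights.
bars, essential = zeroth_persistence(4, [(0.3, 0, 1), (0.5, 1, 2),
                                         (0.7, 2, 3), (0.9, 3, 0)])
```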
Damien will talk about the company and their technology. He plans to bring some engineers to do more in-depth demos as well. As I understand it, they receive live feeds of the entirety of large social media sites (maybe I can't say publicly which ones), all traffic cams, Waze, and many other sources. They geo-locate all signals, and in real time run classifiers on them to detect events (e.g., shootings, fires) and quickly broadcast them. They apparently have strong results already, but there are many more challenges in computer vision, NLP, and other areas of ML and data processing. They recently (3 months ago) moved to Park City, and are looking to hire substantially and build new connections.
High-throughput imaging assays generate thousands to millions of images of cells under a large variety of genetic and chemical perturbations. Despite following best practices, the statistics of these images drift from experiment to experiment and even plate to plate within an experiment due to many factors and require correction before data from different experiments can be compared. In this talk, I will introduce the work being done at Recursion, and then show how this correction can be accomplished via an adversarial training task using control perturbations on each plate.
Ever since the work of Minsky and Papert, it has been thought that neural networks derive their effectiveness by finding representations of the data that are invariant with respect to the task. In other words, the representations eliminate components of the data that vary in a way that is irrelevant. These invariants are naturally expressed with respect to group operations, and thus understanding of these groups is key to explaining the effectiveness of the neural network. Moreover, a line of work in deep learning has shown that explicit knowledge of group invariants can lead to more effective training results. In this paper, we investigate the difficulty of discovering anything about these implicit invariants. Unfortunately, our main results are negative: we show that a variety of questions around investigating invariant representations are NP-hard, even in approximate settings. Moreover, these results do not depend on the kind of architecture used: in fact, our results follow as soon as the network architecture is powerful enough to be universal. The key idea behind our results is that if we can find the symmetries of a problem then we can solve it. This is joint work with Suresh Venkatasubramanian, Danielle Ensign, and Arnab Paul. Also to be presented at ALT 2017.
The Department of Workforce Services (DWS) of the state of Utah has a long history of data science and of leveraging research results to return value to the taxpayers and citizens of the state. We will explore the most recent implementations of data science for informing business decisions, the federal government, and the public. The 2017 Unemployment Insurance profiling model was created to improve on the existing profiling model, with the goals of providing a better customer service experience and decreasing operational costs. Two recent projects focused on leveraging data science to provide evidence for or against systemic discrimination in the services provided. Lastly, the department has placed considerable resources into creating data visualizations that inform the public of the economic status of the state and provide in-depth analysis of specific research topics. These are several examples, not an exhaustive list, of data science being utilized at DWS, and they show a commitment by the department to leverage data science for the betterment of DWS and the services it provides.
Finding anomalous regions within spatial data sets is a central task for biosurveillance, homeland security, policy making, and many other important areas. These communities have mainly settled on spatial scan statistics as a rigorous way to discover regions where a measured quantity (e.g., crime) is statistically significant in its difference from a baseline population. However, most common approaches are inefficient and thus can only be run with very modest data sizes (a few thousand data points), or they make assumptions on the geographic distributions of the data. Moreover, many of these approaches are limited to point-based data sets, but in reality the data might be aggregated to regions, defined by the path of a trajectory, or associated with some other object. In addition to these challenges, the data might be noisy, or the regional anomaly could be distributed in a smooth fashion over the region of interest. The massive data sets and diverse applications suggest a wealth of different yet related approaches based around a common framework of sampling and approximation techniques.
Mobile and sensing devices have become ubiquitous and have made tracking moving objects an easy task. As a result, mobile applications like Uber and many IoT projects have generated massive amounts of trajectory data that can no longer be processed efficiently by a single machine. Among the typical query operations over trajectories, similarity search is a common yet expensive operator for querying trajectory data. It is useful for applications in different domains such as traffic and transportation optimization, weather forecasting and modeling, and sports analytics. It is also a fundamental operator for many important mining operations such as clustering and classification of trajectories. In this paper, we propose a distributed query framework to process trajectory similarity search over a large set of trajectories. We have implemented the proposed framework in Spark, a popular distributed data processing engine, by carefully considering different design choices. Our query framework supports both the Hausdorff distance and the Fréchet distance. Extensive experiments have demonstrated the excellent scalability and query efficiency achieved by our design, compared to other methods and design alternatives.
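For readers unfamiliar with the distance functions involved, here is a brute-force sketch of the symmetric Hausdorff distance between two point-sampled trajectories; the paper's contribution is distributing such computations in Spark, not the distance itself, and the function name and array shapes here are illustrative.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two trajectories,
    each an (n, 2) array of sampled points. Brute-force O(|A||B|)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    d_ab = D.min(axis=1).max()   # directed distance A -> B
    d_ba = D.min(axis=0).max()   # directed distance B -> A
    return max(d_ab, d_ba)

A = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0]])
B = np.array([[0.0, 0.5], [1.0, 0.6], [2.0, 0.4]])
print(hausdorff(A, B))
```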
Extensive DNA sequence data have made it possible to reconstruct human evolutionary history in unprecedented detail. We introduce a method to study the past several hundred thousand years. Our results imply that (1) the Neanderthal population was large and subdivided, (2) Neanderthals and Denisovans separated early in the Middle Pleistocene, and (3) their ancestors suffered a bottleneck in population size after separating from the modern lineage. They also (4) support previous estimates of gene flow from Neanderthals into modern Eurasians. These results suggest an archaic human diaspora early in the Middle Pleistocene.
As a generalization of the use of graphs to describe pairwise interactions, simplicial complexes can be used to model higher-order interactions between three or more objects in complex systems. There has been a recent surge in activity in the development of data analysis methods applicable to simplicial complexes, including techniques based on computational topology, higher-order random processes, generalized Cheeger inequalities, isoperimetric inequalities, and spectral methods. In particular, spectral learning methods (e.g., label propagation and clustering) that directly operate on simplicial complexes represent a new direction emerging from the confluence of computational topology and machine learning. Similar to the challenges faced by massive graphs, computational methods that operate on simplicial complexes are severely limited by the computational costs associated with massive datasets. To apply spectral learning methods to massive datasets modeled as simplicial complexes, we work towards the sparsification of simplicial complexes based on preserving the spectrum of the associated Laplacian operators. We show that the theory of Spielman and Srivastava for the sparsification of graphs extends to the generality of simplicial complexes via the up Laplacian. In particular, we introduce a generalized effective resistance for simplices, provide an algorithm for sparsifying simplicial complexes at a fixed dimension, and give a specific version of the generalized Cheeger inequalities for weighted simplicial complexes in the sparsified setting. In addition, we demonstrate via experiments the preservation of the up Laplacian during sparsification, as well as the utility of sparsification with respect to spectral clustering. This is joint work with Braxton Osting and Sourabh Palande.
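For orientation, a sketch of the graph special case of the Spielman–Srivastava scheme that this work generalizes: edges are sampled with probability proportional to weight times effective resistance and reweighted so the sampled Laplacian approximates the original. The simplicial-complex version replaces the graph Laplacian with the up Laplacian and edges with simplices; the dense pseudoinverse here is for illustration only.

```python
import numpy as np

def spectral_sparsify(n, edges, weights, n_samples, seed=0):
    """Graph sparsification by effective-resistance sampling (illustrative sketch).

    edges: list of (u, v) pairs; weights: positive edge weights.
    Returns a list of (edge, new_weight) for the sampled, reweighted edges.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    m = len(edges)
    B = np.zeros((m, n))                       # signed edge-vertex incidence matrix
    for i, (u, v) in enumerate(edges):
        B[i, u], B[i, v] = 1.0, -1.0
    L = B.T @ np.diag(weights) @ B             # graph Laplacian
    Lpinv = np.linalg.pinv(L)
    R = np.einsum('ij,jk,ik->i', B, Lpinv, B)  # effective resistance b_e^T L^+ b_e
    p = weights * R
    p = p / p.sum()                            # sampling probabilities ∝ w_e * R_e
    idx = rng.choice(m, size=n_samples, p=p)
    new_w = np.zeros(m)
    for i in idx:
        new_w[i] += weights[i] / (n_samples * p[i])   # importance reweighting
    return [(edges[i], new_w[i]) for i in range(m) if new_w[i] > 0]
```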
Weather-related research requires synthesizing vast amounts of data that need archival solutions that are both economical and viable. For example, environmental observations from tens of thousands of locations in the United States have been retrieved, archived, and disseminated continuously over the past 20 years as part of the MesoWest project at the University of Utah (http://mesowest.utah.edu). Using Amazon cloud services, this information is now being applied to the protection of life and property and to educational, public service, and commercial applications. While this data stream is not particularly large in terms of data volume, the hundreds of variables, diverse reporting practices, and varying data quality require improved solutions to manage and analyze the information. Since early 2015, thousands of two-dimensional gridded fields (each one containing over 1.9 million values over the contiguous United States) have been obtained from the High-Resolution Rapid Refresh (HRRR) data assimilation and forecast modeling system. The growing archive (now exceeding 40 TB) of efficiently compressed grids is stored on the CHPC Pando object data store for retrospective analyses of meteorological conditions during high-impact weather events such as wildfires and for assessing the accuracy of the HRRR forecasts. The CHPC storage system is proving to be a scalable, reliable, extensible, affordable, and usable archive solution for our research. The Open Science Grid platform is being used to efficiently parallelize the processing of the HRRR output to obtain unique perspectives on the weather analyzed at the millions of grid points nationwide.
Over six years ago the White House announced the Materials Genome Initiative (MGI), asking computational materials scientists and experimentalists to find ways to "discover, develop, manufacture, and deploy materials twice as fast at a fraction of the cost." High-throughput computation and experiments have made some progress, but we are still far from the MGI goal. However, the emerging field of materials informatics offers a completely new and underutilized approach via machine learning for materials problems. In this talk, I'll present some of the new data-driven tools we are developing with Citrine Informatics, such as the "Materials Recommendation Engine," to reduce the risk associated with exploring chemical white space for new, interesting materials. Using this new tool we explore novel thermoelectric compositions and present validation for the model using leave-one-out cross-validation on a large database of known materials. We also synthesize and characterize an entirely new material, RE12Co5Bi (RE = Gd, Er), to validate the model. Finally, an extension of materials informatics into adjacent technologies is described.
Sublinear algorithms for graph problems typically use space (or time) sublinear in the number of edges, but not in the number of vertices. They are based on either edge sampling or linear projections that effectively sparsify the graph. Lower bounds for many problems (based on considering very sparse graphs) suggest that these results cannot be improved. However, if we consider dense instances of graph problems, there are algorithms that run in time sublinear (and in fact almost constant) in the number of vertices, based on sampling. Can we bridge the gap between the (hard) sparse instances and the (easy) dense instances from the perspective of sublinear computation? We show that for two such problems (MAX CUT and correlation clustering) we can produce truly sublinear (o(n)-sized) core sets for graphs that have edge density at least $n^{\delta}$, for any $\delta > 0$. The basis for our result is the analysis of random induced subprograms of linear programming formulations of these problems, and for any such graph, we obtain a core set of size $n^{1-\delta}$. These results also lead to 2-pass streaming algorithms for $(1+\epsilon)$-approximations for these problems. Joint work with Aditya Bhaskara and Samira Daruki.
We develop a new data-dependent distance for regression problems to compare two regressors (hyperplanes that fit or divide a data set). Most distances between objects attempt to capture the intrinsic geometry of these objects and measure how similar that geometry is. However, we argue that existing measures are inappropriate for regressors to a data set. We introduce a family of new distances that measure how similarly two regressors interact with the data they attempt to fit. For the variant we advocate, we show that it is a metric (under mild assumptions), induces metric balls with bounded VC-dimension, is robust to changes in the corresponding data, and can be approximated quickly. We also describe several algorithmic applications of this distance in quantifying uncertainty, detecting multi-modality, and in fine-grained evaluation of coresets. Moreover, in order to develop efficient approximation algorithms for this distance, we formalize the relationship between sensitivity and leverage scores, which may be of independent interest. This is joint work with Jeff Phillips.
Submodularity plays a key role in a wide variety of applications in machine learning, economics, game theory, computer vision, and many others. General submodular minimization has high complexity, while many computer vision and machine learning problems are defined over special subclasses of submodular functions that can be seen as the sum of many submodular cost functions defined over cliques containing a few variables. We give efficient algorithms for the minimization of this useful subclass of higher-order submodular functions by transforming them to quadratic functions. The underlying idea is to use auxiliary variables to model the higher-order terms, and the transformation is achieved using a linear program. In particular, we model the auxiliary variables as monotonic Boolean functions, allowing us to obtain a compact transformation using as few auxiliary variables as possible.
Smart city is an urban development vision that spans multiple disciplines, and transportation is a key component of the smart city concept. A smart transportation system uses data, applications, and technology to help people and goods move faster, cheaper, and more efficiently. This talk will use several applications in transportation to showcase how to use data to study the design of transportation systems, people's travel behavior, and policies that make our communities smarter and more connected. First, we present an agent-based model to estimate casual carpooling demand and determine the efficient allocation of casual carpooling facilities. We will then introduce dynamic public transit connectivity modeling to identify the areas with poor transit accessibility and elucidate the causes, informing transit investments. Last, we will demonstrate a data-driven spatiotemporal approach for non-recurrent congestion analysis. These applications demonstrate the importance of data-driven approaches for improving the efficiency of existing traffic management systems and transportation analytics. Dr. Liu is an assistant professor in the Civil & Environmental Engineering Department at the University of Utah. She has a Ph.D. in transportation engineering from the University of Washington. She serves as a member of the Transportation Research Board (TRB) Highway Capacity and Quality of Service (HCQS) Committee, the Managed Lane Committee, and the Transit Capacity and Quality of Service Committee. She is also a board member of the Utah Model Advisory Committee, and served on the Salt Lake City Transportation Advisory Board (2013-2016). She is the director of the Utah Traffic Lab at the University of Utah. Dr. Liu is a licensed professional engineer in the State of Utah.
The Reeb space, which generalizes the notion of a Reeb graph, is one of the few tools in topological data analysis and visualization suitable for the study of multivariate scientific datasets. First introduced by Edelsbrunner et al., it compresses the components of the level sets of a multivariate mapping and obtains a summary representation of their relationships. A related construction called mapper, and a special case of the mapper construction called the Joint Contour Net (JCN), have been shown to be effective in visual analytics. Mapper and the JCN are intuitively regarded as discrete approximations of the Reeb space, however without formal proofs or approximation guarantees. Dey et al. posed the open question of whether the mapper construction converges to the Reeb space in the limit. In this work, we are interested in developing the theoretical understanding of the relationship between the Reeb space and its discrete approximations to support their use in practical data analysis. Using tools from category theory, we formally prove the convergence between the Reeb space and mapper in terms of an interleaving distance between their categorical representations. Given a sequence of refined discretizations, we prove that these approximations converge to the Reeb space in the interleaving distance; this also helps to quantify the approximation quality of the discretization at a fixed resolution. Joint work with Elizabeth Munch, University at Albany – SUNY.
In this paper we present a novel method for analyzing the relationship between functional brain networks and behavioral phenotypes. Drawing from topological data analysis, we first extract topological features using persistent homology from functional brain networks that are derived from correlations in resting-state fMRI. Rather than fixing a discrete network topology by thresholding the connectivity matrix, these topological features capture the network organization across all continuous threshold values. We then propose to use a kernel partial least squares (kPLS) regression to statistically quantify the relationship between these topological features and behavior measures. The kPLS also provides an elegant way to combine multiple image features by using linear combinations of multiple kernels. In our experiments we test the ability of our proposed brain network analysis to predict autism severity from rs-fMRI. We show that combining correlations with topological features gives better prediction of autism severity than using correlations alone.
In their seminal work on homological sensor networks, de Silva and Ghrist showed the surprising fact that it's possible to certify the coverage of a coordinate-free sensor network even with very minimal knowledge of the space to be covered. Here, coverage means that every point in the domain (except possibly those very near the boundary) has a nearby sensor. More generally, their algorithm takes a pair of nested neighborhood graphs along with a labeling of vertices as either boundary or interior and computes the relative homology of a simplicial complex induced by the graphs. This approach, called the Topological Coverage Criterion (TCC), requires some assumptions about the underlying geometric domain as well as some assumptions about the relationship of the input graphs to the domain. The goal of this paper is to generalize these assumptions and show how the TCC can be applied to both much more general domains as well as very weak assumptions on the input. We give a new, simpler proof of the de Silva-Ghrist Topological Coverage Criterion that eliminates any assumptions about the smoothness of the boundary of the underlying space, allowing the results to be applied to much more general problems. The new proof factors the geometric, topological, and combinatorial aspects, allowing us to provide a coverage condition that supports thick boundaries, k-coverage, and weighted coverage, in which sensors have varying radii.
A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work (Chakraborty et al. 2013 and Canonne et al. 2014) has shown that having access to such a conditional sampling oracle requires only a polylogarithmic or even constant number of samples to solve distribution testing problems like identity and uniformity testing. This significantly improves over the standard sampling model, where polynomially many samples are necessary. Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high-dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well-studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space, and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points. Finally, we comment on the applicability of the model and compare it with existing models such as streaming, parallel, and distributed computation.
DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim: that algorithm actually requires O(n^2) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Finding a fix for dimensionality d ≥ 3 is currently an important open problem. In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. This (i) explains why the community's search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.
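To make the quadratic cost concrete, here is a textbook brute-force DBSCAN sketch (not the paper's algorithm): each point issues an exact range query over all other points. The ρ-approximate variant, roughly speaking, relaxes the neighborhood radius by a (1+ρ) factor and answers those queries approximately, which is what enables the linear expected time.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Textbook DBSCAN with brute-force O(n^2) neighborhood queries.
    Returns one label per point; -1 marks noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                              # already assigned, or not a core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:  # expand only through core points
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels
```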
Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e., rejecting a null hypothesis in error. The familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The FWER grows rapidly with the number of hypothesis tests performed, all but guaranteeing that an error will be committed if the number of tests is large enough and the risk is not managed; this is known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be predetermined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
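As a concrete illustration (a standard textbook fact, stated here as background): for $m$ independent tests each performed at level $\alpha$, with all null hypotheses true,
\[
\mathrm{FWER} \;=\; \Pr(\text{at least one false rejection}) \;=\; 1-(1-\alpha)^{m} \;\le\; m\alpha,
\]
which rapidly approaches 1 as $m$ grows. Classical corrections such as Bonferroni (testing each hypothesis at level $\alpha/m$) control this quantity but require $m$ to be fixed in advance, which is precisely the limitation described above.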
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction for applications in which there are repeated pools of null hypotheses, from each of which a single null hypothesis is to be rejected, and in which neither the specific hypotheses nor their number are known until the final rejection decision is made.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.
In this talk I will introduce the basic energy data that is of interest to building designers, engineers, managers and operators. I will also briefly cover the rapid increase in available data streams from modern buildings, the drivers for that, and potential implications for building operation. Building analysts often address trade-offs between energy efficiency, occupant comfort, and environmental/sustainability concerns, and building energy modeling helps with this decision making process. Next, I'll provide a broad overview of traditional, deterministic building energy modeling and contrast that with how researchers are using machine learning algorithms to predict building energy consumption. Advantages include a need for fewer input parameters and potential reductions in engineering and computational time; disadvantages include the need for historical training data and the lack of physical representation within the neural networks. From a broader perspective, the US electrical grid is experiencing a shift from centralized electrical energy conversion and delivery to a system where consumers are more interactive and may generate their own electricity as well. I'll show how researchers in my group are using GIS and other modeling tools to assess the solar potential for photovoltaic power generation, and point out where this can create data management challenges. Finally, I will present a future vision for integrated building operation using sensing (IoT), user engagement, and smart building systems. I will also discuss challenges we have faced in my research group and would be happy to receive feedback from the students or have a discussion with the group afterward.
Performing inference over large uncertain data sets has become a central problem in database management, with applications in related fields such as machine learning, AI, and statistics. With recent probabilistic knowledge bases (such as Yago, NELL, or DeepDive) containing millions to billions of uncertain tuples, there is a pressing need for general-purpose and scalable solutions. Since general reasoning under uncertainty is highly intractable, many state-of-the-art systems today perform approximate inference with sampling. This talk presents an alternative approach that uses only basic operators of relational database management systems (i.e., no sampling required). The first part develops optimal oblivious bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new probabilities. The second part then leverages these bounds to reduce the problem of approximate inference over probabilistic databases to a standard multi-query evaluation problem. We give experimental evidence on synthetic TPC-H data that this approach can be orders of magnitude faster, and also more accurate, than sampling-based approaches for ranking top query answers.
Data integration is an essential element of data-intensive science and modern analytics. Users often need to combine data from different sources to gain new scientific knowledge, obtain accurate insights, and create new services. However, today's upsurge in the number and heterogeneity—in terms of format and reliability—of data sources limits the ability of users to reason about the value of data. This raises fundamental questions: what makes a data source useful to end users, how can we integrate unreliable data, and which sources do we need to combine to maximize the user's utility?
Theodoros (Theo) Rekatsinas is a Moore Data Postdoctoral Fellow at Stanford working with Christopher Ré. He earned his Ph.D. in Computer Science from the University of Maryland, where he was advised by Amol Deshpande and Lise Getoor. His research interests are in data management, with a focus on data integration, data cleaning, and uncertain data. Theo's work on using quality-aware data integration techniques to forecast the emergence and progression of disease outbreaks received the Best Paper Award at SDM 2015. Theo was awarded the Larry S. Davis Doctoral Dissertation Award in 2015.
A reproducing kernel can define an embedding of a data point into an infinite-dimensional reproducing kernel Hilbert space (RKHS). The norm in this space describes a distance, which we call the kernel distance. The random Fourier features (of Rahimi and Recht) describe an oblivious approximate mapping into a finite-dimensional Euclidean space that behaves similarly to the RKHS. We show in this work that for the Gaussian kernel, the Euclidean distance between these mapped features has (1+eps)-relative error with respect to the kernel distance. When there are n data points, we show that O((1/eps^2) log(n)) dimensions of the approximate feature space are sufficient and necessary. Without a bound on n, but when the original points lie in R^d and have diameter bounded by M, we show that O((d/eps^2) log(M)) dimensions are sufficient, and that this many are required, up to log(1/eps) factors.
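For reference, a minimal sketch of the Rahimi–Recht random Fourier feature map for the Gaussian kernel (the function name and bandwidth parameterization are illustrative, not from the paper):

```python
import numpy as np

def gaussian_rff(X, n_features, sigma, seed=0):
    """Random Fourier features z(x) with E[z(x).z(y)] = exp(-||x-y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features)) / sigma    # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0, 2 * np.pi, n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# The Euclidean distance between mapped points approximates the kernel distance:
# ||z(x) - z(y)||^2  ≈  2 - 2 k(x, y)  =  d_K(x, y)^2.
X = np.random.default_rng(1).standard_normal((5, 3))
Z = gaussian_rff(X, n_features=2000, sigma=1.0)
approx = np.linalg.norm(Z[0] - Z[1])
exact = np.sqrt(2 - 2 * np.exp(-np.linalg.norm(X[0] - X[1])**2 / 2))
```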
In this talk, I'll describe the SimSQL system, which is a platform for writing and executing data- and compute-intensive codes over large data sets, with a particular emphasis on very large-scale statistical computations. At its heart, SimSQL is really a relational database system, and like other relational systems, SimSQL is designed to support data independence. That is, a single declarative code for a particular statistical inference problem can be used regardless of data set size, compute hardware, and physical data storage and distribution across machines. One concern is that a platform supporting data independence will not perform well. But we've done extensive experimentation, and have found that SimSQL performs as well as other competitive platforms that support writing and executing machine learning codes for large data sets.
Chris Jermaine is a professor of computer science at Rice University. He is the recipient of an Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and an ACM SIGMOD Best Paper Award. He is the general chair for SIGMOD 2018 in Houston. In his spare time, Chris enjoys outdoor activities such as hiking, climbing, and whitewater boating. In one particular exploit, Chris and his wife floated a whitewater raft (home-made from scratch using a sewing machine, glue, and plastic) over 100 miles down the Nizina River (and beyond) in Alaska.
Machine learning has taken over our world, in more ways than we realize. You might get book recommendations, or an efficient route to your destination, or even a winning strategy for a game of Go. But you might also be admitted to college, granted a loan, or hired for a job based on algorithmically enhanced decision-making. We believe machines are neutral arbiters: cold, calculating entities that always make the right decision, that can see patterns that our human minds can’t or won’t. But are they? Or is decision-making-by-algorithm a way to amplify, extend and make inscrutable the biases and discrimination that is prevalent in society? To answer these questions, we need to go back — all the way to the original ideas of justice and fairness in society. We also need to go forward — towards a mathematical framework for talking about justice and fairness in machine learning. I will talk about the growing landscape of research in algorithmic fairness: how we can reason systematically about biases in algorithms, and how we can make our algorithms fair(er).
We present an algorithm for coupling inexpensive low-fidelity model simulations with high-fidelity simulation data of parameterized differential equations. The goal is to grab a "free lunch": simulation accuracy of the high-fidelity model with the algorithmic complexity of only a simplified low-fidelity model. The procedure forms an approximation with sparsely available high-fidelity simulations, and is a simple linear-algebraic construction with connections to kernel learning and column skeletonization of matrices. We discuss theoretical results that establish accuracy guarantees, and introduce recent analysis providing bounds on the error committed by the algorithm.
This is joint work with Alireza Doostan, Hillary Fairbanks, and Jerrad Hampton (UC Boulder); Claude Gittelson (ETH Zurich); Dongbin Xiu (Ohio State); and Xueyu Zhu (U Iowa).
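One plausible instantiation of this kind of bi-fidelity construction, consistent with the column-skeletonization connection mentioned above but not necessarily the authors' exact algorithm (the function and argument names are assumptions for illustration): select a few informative parameter values by a pivoted decomposition of cheap low-fidelity snapshots, run the high-fidelity model only there, and reuse the low-fidelity interpolation coefficients to combine those runs.

```python
import numpy as np
from scipy.linalg import qr, lstsq

def bifidelity_surrogate(low_snaps, run_high, r):
    """Illustrative bi-fidelity approximation sketch.

    low_snaps: (n_low, n_params) matrix; column j is the cheap low-fidelity
               solution at parameter j.
    run_high:  callable mapping a parameter index to its expensive
               high-fidelity solution vector.
    r:         number of high-fidelity runs allowed.
    """
    # 1) column-pivoted QR picks r parameter values that "span" the low-fidelity data
    _, _, piv = qr(low_snaps, pivoting=True, mode='economic')
    selected = piv[:r]
    # 2) the expensive model is evaluated only at the selected parameters
    high_basis = np.column_stack([run_high(j) for j in selected])
    low_basis = low_snaps[:, selected]
    # 3) low-fidelity interpolation coefficients are reused for the high-fidelity model
    coeffs, *_ = lstsq(low_basis, low_snaps)          # shape (r, n_params)
    return high_basis @ coeffs                        # approximate high-fidelity everywhere
```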
System event logs have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs, and then apply data analysis on the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. We propose an online streaming method, Spell, which utilizes a longest common subsequence based approach to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in streaming fashion. Evaluation results on large real system logs demonstrate that even compared with the offline alternatives, Spell shows its superiority in terms of both efficiency and effectiveness.
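A toy illustration of the LCS idea (thresholds, the merge rule, and data structures here are simplifications; the actual Spell method adds streaming optimizations not shown): an incoming log line is matched to the stored message type with which it shares the longest common token subsequence, and non-matching positions become parameter wildcards.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

class StreamingLogParser:
    """Maintain a set of message types; match each new line by LCS (toy version)."""
    def __init__(self, threshold=0.5):
        self.templates = []          # each template is a list of tokens; '*' marks a parameter
        self.threshold = threshold

    def parse(self, line):
        tokens = line.split()
        best, best_len = None, 0
        for t in self.templates:
            l = lcs_len(tokens, t)
            if l > best_len:
                best, best_len = t, l
        if best is not None and best_len >= self.threshold * len(tokens):
            # crude merge: keep tokens shared with the template, wildcard the rest
            merged = [x if x in best else '*' for x in tokens]
            self.templates[self.templates.index(best)] = merged
            return merged
        self.templates.append(tokens)    # new message type discovered
        return tokens

p = StreamingLogParser()
p.parse("Took 10 seconds to build instance")
print(p.parse("Took 25 seconds to build instance"))   # -> ['Took', '*', 'seconds', ...]
```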
Streaming interactive proofs (SIPs) are a framework for outsourced computation. A computationally limited streaming client (the verifier) hands over a large data set to an untrusted server (the prover) in the cloud, and the two parties run a protocol to confirm the correctness of the result with high probability. SIPs are particularly interesting for problems that are hard to solve (or even approximate) well in a streaming setting. The most notable of these problems is finding maximum matchings, which has received intense interest in recent years but has strong lower bounds even for constant-factor approximations. In this paper, we present efficient streaming interactive proofs that can verify maximum matchings exactly. Our results cover all flavors of matchings (bipartite/non-bipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP. In particular, these are the first efficient results for weighted matchings and for metric TSP in any streaming verification model.
Last passage percolation is a well-studied model in probability theory that is simple to state but notoriously difficult to analyze. In recent years it has been shown to be related to many seemingly unrelated things: longest increasing subsequences in random permutations, eigenvalues of random matrices, and long-time asymptotics of solutions to stochastic partial differential equations. Much of the previous analysis of the last passage model has been made possible through connections with representation theory of the symmetric group that comes about for certain exact choices of the random input into the last passage model. This has the disadvantage that if the random inputs are modified even slightly then the analysis falls apart. In an attempt to generalize beyond exact analysis, recently my collaborator Eric Cator (Radboud University, Nijmegen) and I have started using tools of tropical geometry to analyze the last passage model. The tools we use to this point are purely geometric, but have the potential advantage that they can be used for very general choices of random inputs. I will describe the very pretty geometry of the last passage model, our work in progress to use it to produce probabilistic information, and how algorithms from computational geometry are very useful in our analysis.
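Concretely, in one common formulation with i.i.d. weights $w_{ij}$ attached to the sites of $\mathbb{Z}^2$, the last passage time to $(m,n)$ is
\[
G(m,n) \;=\; \max_{\pi \in \Pi(m,n)} \sum_{(i,j)\in\pi} w_{ij},
\qquad\text{satisfying}\qquad
G(m,n) \;=\; w_{mn} + \max\{G(m-1,n),\, G(m,n-1)\},
\]
where $\Pi(m,n)$ is the set of up/right lattice paths from $(1,1)$ to $(m,n)$; the exactly solvable cases alluded to above correspond to special weight distributions such as exponential or geometric $w_{ij}$.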
In this talk, we will survey recent work on using random linear projections, a.k.a. sketches, to analyze massive graphs. Sketches are useful in a variety of computational models, including the dynamic graph stream model, where the input is defined by a stream of edge insertions and deletions that need to be processed in small space. A large number of problems have now been studied in this model, including edge and vertex connectivity, spectral sparsification, triangle counting, densest subgraph, correlation clustering, vertex cover, and matching.
Finding anomalous regions within spatial data sets is a central task for biosurveillance, homeland security, policy making, and many other important areas. These communities have mainly settled on spatial scan statistics as a rigorous way to discover regions where a measured quantity (e.g., crime) is statistically significant in its difference from a baseline population. However, most common approaches are inefficient and thus can only be run with very modest data sizes (a few thousand data points), or they make assumptions on the geographic distributions of the data. We address these challenges by designing, exploring, and analyzing sample-then-scan algorithms. These algorithms randomly sample data at two scales, one to define regions and the other to approximate the counts in these regions. Our experiments demonstrate that these algorithms are efficient and accurate independent of the size of the original data set, and our analysis explains why this is the case. For the first time, these sample-then-scan algorithms allow spatial scan statistics to run on a million or more data points without making assumptions on the spatial distribution of the data. Moreover, our experiments and analysis give insight into when it is appropriate to trust the various types of spatial anomalies when the data is modeled as a random sample from a larger but unknown data set. This is joint work with Liang Zhang, Kaiqiang Wang, Raghvendra Singh, and Jeff Phillips, to appear in ACM SIGSPATIAL.
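For reference, one common (Poisson-style) form of the discrepancy function that the scan maximizes, with $m_R$ and $b_R$ the fractions of the measured and baseline populations falling in a candidate region $R$, is
\[
\phi(m_R, b_R) \;=\; m_R \log\frac{m_R}{b_R} \;+\; (1-m_R)\log\frac{1-m_R}{1-b_R},
\]
maximized over a family of regions such as disks, rectangles, or halfplanes. Roughly speaking, the sample-then-scan algorithms estimate $m_R$ and $b_R$ from the two samples rather than from the full data.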
In recent years, several combinatorial problems were introduced in the area of access control. Typically, such problems deal with an authorization policy, seen as a relation $UR \subseteq Users \times Resources$, where $(u, r) \in UR$ means that user $u$ is authorized to access resource $r$.
Li, Tripunitara and Wang (2009) introduced the Resiliency Checking Problem (RCP), in which we are given an authorization policy, a subset of resources $P$, as well as integers $s \ge 0$, $d \ge 1$ and $t \geq 1$.
It asks whether upon removal of any set of at most $s$ users, there still exist $d$ pairwise disjoint sets of at most $t$ users such that each set has collectively access to all resources in $P$.
This problem possesses several parameters which appear to take small values in practice. We thus analyze the parameterized complexity of RCP with respect to these parameters, by considering all possible combinations of $|P|, s, d, t$. In all cases, we are able to settle whether the problem is in FPT, XP, W[2]-hard, para-NP-hard, or para-coNP-hard. The main theorem used to prove the FPT results can also be used for other applications.
Short Bio: Gregory Gutin studied for his PhD under Professor Noga Alon at Tel Aviv University and received his PhD (with distinction) in 1993. Since 2000 he has been a professor of Computer Science at Royal Holloway, University of London. His research interests include graphs and combinatorics (theory, algorithms, and applications), parameterized algorithmics, access control, and combinatorial optimization. He has published more than 200 papers and co-authored two editions of a monograph on directed graphs (Digraphs: Theory, Algorithms and Applications), both translated into Chinese. In 2014 he received a Royal Society Wolfson Research Merit Award. Together with Jason Crampton, Gregory received best paper awards at SACMAT 2015 and SACMAT 2016.
I will talk about recent work at the intersection of two areas related to geometric data analysis: coresets and uncertain data. I will overview what is known about coresets (which are concise representations of large data sets with approximation guarantees, and can be used as a proxy for the full data sets) on uncertain data (with a focus on how geometers tend to think about this). I will also dive into some details of this paper (http://arxiv.org/abs/1411.0194, appeared in ESA 2016), whose full abstract is below, and present some remaining open problems and thoughts on future directions in this research area.

ε-Kernel Coresets for Stochastic Points
Lingxiao Huang, Jian Li, Jeff M. Phillips, Haitao Wang

With the dramatic growth in the number of application domains that generate probabilistic, noisy, and uncertain data, there has been an increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing ε-kernel coresets for uncertain points. We consider uncertainty in the existential model, where each point's location is fixed but the point only occurs with a certain probability, and the locational model, where each point has a probability distribution describing its location. An ε-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an exp-kernel), as well as the probability distribution on the width (a quant-kernel) for any direction. We show that there exists a set of O(1/ε^{(d−1)/2}) deterministic points which approximate the expected width under the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, that it is not always possible to find a subset of the original uncertain points which provides such an approximation. However, if the existential probability of each point is lower bounded by a constant, an exp-kernel (or an fpow-kernel) is still possible. We also construct a quant-kernel coreset in linear time. Finally, combining with known techniques, we show a few applications: approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points, and some shape fitting problems under uncertainty.
We present an interesting modification to the traditional leverage score sampling approach by augmenting the scores with information from the data variance, which improves empirical performance on the column subsample selection problem (CSSP). We further present, to our knowledge, the first deterministic bounds for this augmented leverage score, and discuss how it compares to the traditional leverage score. We present some experimental results demonstrating how the augmented leverage score performs better than traditional leverage score sampling on CSSP in both a deterministic and probabilistic setting. This is joint work with Ross Whitaker.
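For context, here is the standard (unaugmented) rank-k leverage score computation for column subset selection; the talk's augmentation with data-variance information is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def column_leverage_scores(A, k):
    """Rank-k column leverage scores: squared row norms of the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0)          # one score per column; scores sum to k

def sample_columns(A, k, c, seed=0):
    """Pick c columns at random with probability proportional to their leverage scores."""
    rng = np.random.default_rng(seed)
    scores = column_leverage_scores(A, k)
    p = scores / scores.sum()
    return rng.choice(A.shape[1], size=c, replace=False, p=p)

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))   # low-rank-ish matrix
cols = sample_columns(A, k=10, c=15)
```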
Kernel learning is the problem of determining the best kernel (either from a dictionary of fixed kernels, or from a smooth space of kernel representations) for a given task. In this paper, we describe a new approach to kernel learning that establishes connections between the Fourier-analytic representation of kernels arising out of Bochner's theorem and a specific kind of feed-forward network using cosine activations. We analyze the complexity of this space of hypotheses and demonstrate empirically that our approach provides scalable kernel learning superior in quality to prior approaches. Joint work with Vivek Srikumar, Sarathkumar Swaminathan, Suresh Venkatasubramanian and Dustin Webb. To be published in ECML/PKDD 2016.
Online learning (or sequential prediction) has been a very active research area in machine learning. I will give a very brief introduction to online learning. Then, I will focus on a very popular model in online learning, the stochastic multi-armed bandit model, and present some of our recent theoretical results on combinatorial bandits and the pure-exploration multi-armed bandit problems. If time permits, I will also talk about some practical projects our group has done on online and sequential prediction in traffic-related applications, including a deep-learning-based prediction algorithm which won 2nd place (among 1,650 teams from all over the world) and the most-potential award in the Didi (China's version of Uber) taxi supply-demand prediction competition.
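For readers new to the model, here is a minimal sketch of the classic UCB1 strategy for the stochastic multi-armed bandit (standard background, not the speaker's algorithms; names and the example arms are illustrative):

```python
import math, random

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then the arm maximizing mean + sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                              # initialization round
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean update
        total_reward += r
    return total_reward, means

# Example: three Bernoulli arms with unknown success probabilities.
probs = [0.2, 0.5, 0.7]
reward, est = ucb1(lambda i: 1.0 if random.random() < probs[i] else 0.0,
                   n_arms=3, horizon=5000)
```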
Jian Li is currently an assistant professor at the Institute for Interdisciplinary Information Sciences (IIIS, previously ITCS), Tsinghua University, headed by Prof. Andrew Yao. He got his BSc degree from Sun Yat-sen (Zhongshan) University, China, his MSc degree in computer science from Fudan University, China, and his PhD degree from the University of Maryland, USA. His major research interests lie in the broad areas of theoretical computer science, machine learning, and databases. He has co-authored several research papers published in major computer science conferences and journals. He received best paper awards at VLDB 2009 and ESA 2010. He is also a recipient of the "221 Basic Research Plan for Young Faculties" at Tsinghua University and the "New Century Excellent Talents Award" from the Ministry of Education of China.
Large spatial data has become ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics systems.
The massive adoption of smart phones and other mobile devices has generated humongous amounts of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic; hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real datasets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.
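As a simplified, non-spatial illustration of online sampling (the indexing structures above additionally restrict samples to points satisfying a spatial query), here is classic reservoir sampling, which maintains a uniform random sample of a stream incrementally:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Maintain a uniform random sample of size k over a stream of items."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)        # keep new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```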
Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, offering users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even requires unrealistic assumptions (e.g., that tuples in a table are stored in random order). We present a new approach, the wander join algorithm, to the online aggregation problem over joins, by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Compared with ripple join, wander join is particularly efficient for equality joins involving multiple tables, but it also supports θ-joins. Selection predicates and group-by clauses can be handled as well. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join. We have integrated and tested wander join in PostgreSQL, demonstrating its practicality in a full-fledged database system.
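A toy two-table version of the random-walk idea (a sketch under simplifying assumptions, not the system's implementation): sample a tuple from R, walk to a uniformly random joining tuple in S, and correct the contribution by the inverse of the walk's sampling probability (a Horvitz–Thompson estimate).

```python
import random
from collections import defaultdict

def wander_join_sum(R, S, key_R, key_S, value, n_walks, seed=0):
    """Estimate SUM(value(r, s)) over the equi-join of R and S by random walks.

    Each walk picks r uniformly from R, then a uniformly random matching s.
    The correction factor |R| * deg(r) makes each walk an unbiased estimate.
    """
    rng = random.Random(seed)
    index = defaultdict(list)                 # hash index on the join key of S
    for s in S:
        index[key_S(s)].append(s)
    total = 0.0
    for _ in range(n_walks):
        r = rng.choice(R)
        matches = index[key_R(r)]
        if matches:                           # failed walks contribute 0
            s = rng.choice(matches)
            total += value(r, s) * len(R) * len(matches)
    return total / n_walks

# Toy example: R(id, x), S(rid, y); estimate SUM(x * y) over the join.
R = [(i, i % 5) for i in range(1000)]
S = [(i % 100, 2.0) for i in range(3000)]
est = wander_join_sum(R, S, key_R=lambda r: r[0] % 100, key_S=lambda s: s[0],
                      value=lambda r, s: r[1] * s[1], n_walks=2000)
```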
Fast iterative methods from Convex Optimization play a crucial role in a number of recent breakthroughs in the design of nearly-linear-time algorithms for fundamental combinatorial problems, such as maximum flow and graph partitioning. Multiplicative Weight Updates, Gradient Descent, and Nesterov's Method have become a mainstay in the construction and analysis of fast algorithms.
The power of these approaches raises a number of questions: What is the exact relation between Multiplicative Weight Updates and Gradient Descent? Why do Multiplicative Weight Updates show up in so many settings? What is the intuition behind Nesterov's iterative algorithm that achieves the asymptotically optimal iteration count for smooth convex functions?
In this survey talk, I will provide answers by presenting a unified framework that reveals the power and limitations of each method and provides much needed geometric intuition. Among other insights, I will explain how Multiplicative Weight Updates are a particular instance of a dual iterative method, known as Mirror Descent, and how Nesterov's algorithm can be naturally derived as a combination of Mirror and Gradient Descent.
As an example of the application of these insights, I will also discuss their crucial role in recent progress in the nearly-linear-time solution of packing linear programs, both in parallel and sequentially.
The material in this talk is joint work with Zeyuan Allen-Zhu (Princeton).
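As a small illustration of the Mirror Descent view of Multiplicative Weight Updates discussed in this talk, the Python sketch below runs the multiplicative update on the probability simplex; the exponentiated-gradient step followed by renormalization is exactly the mirror step for the negative-entropy mirror map. The loss vectors and step size are made up for the example.

# Multiplicative Weight Updates as Mirror Descent with the negative-entropy
# mirror map on the simplex: w_i <- w_i * exp(-eta * g_i), then renormalize.
import numpy as np

def mwu_step(w: np.ndarray, grad: np.ndarray, eta: float) -> np.ndarray:
    w = w * np.exp(-eta * grad)    # mirror (exponentiated-gradient) step
    return w / w.sum()             # Bregman projection back onto the simplex

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, T, eta = 5, 500, 0.1
    w = np.full(n, 1.0 / n)                    # uniform start
    best_coord = 2                             # expert 2 has the smallest expected loss
    cum_loss, cum_best = 0.0, 0.0
    for _ in range(T):
        loss = rng.uniform(0.0, 1.0, size=n)   # made-up loss vector for the example
        loss[best_coord] *= 0.5
        cum_loss += float(w @ loss)
        cum_best += float(loss[best_coord])
        w = mwu_step(w, loss, eta)
    print(f"average regret: {(cum_loss - cum_best) / T:.4f}")   # small, as the theory predicts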
Designing distributed and scalable algorithms to improve network connectivity is a central topic in peer-to-peer networks. In this paper we focus on the following well-known problem: given an n-node d-regular network for d = Ω(log n), we want to design a decentralized, local algorithm that transforms the graph into one that has good connectivity properties (low diameter, expansion, etc.) without affecting the sparsity of the graph. To this end, Mahlmann and Schindelhauer introduced the random “flip” transformation, where in each time step, a random pair of adjacent vertices decides to ‘swap a neighbor’. They conjectured that performing O(nd) such flips at random would convert any connected d-regular graph into a d-regular expander graph, with high probability. However, the best known upper bound for the number of steps is roughly O(n^{17}d^{23}), obtained via a delicate Markov chain comparison argument.
Our main result is to prove that a natural instantiation of the random flip produces an expander in at most O(n^2 d^2 sqrt(log n)) steps, with high probability. Our argument uses a potential-function analysis based on the matrix exponential, together with the recent beautiful results on the higher-order Cheeger inequality of graphs. We also show that our technique can be used to analyze another well-studied random process known as the ‘random switch’, and show that it produces an expander in O(nd) steps with high probability.
Semi-supervised learning (SSL) is an important aspect of machine learning technology, especially with the growing demand in data analysis applications in which the amount of unlabeled data is increasing at rates that far exceed our ability to produce commensurate amounts of associated training data. Semi-supervised clustering (SSC), as a subclass of SSL, makes use of user input in the form of relationships between data points (e.g., pairs of data points belonging to the same class or different classes) to improve the performance of clustering algorithms by incorporating domain-specific knowledge. Existing SSC methods typically adopt existing clustering algorithms and incorporate user input as either hard constraints or soft penalties in ways that are suboptimal and not easily generalizable. In this research, we propose a generative, probabilistic model that reflects the joint distribution given by the user-defined relationships between data points.
Often used as importance sampling probabilities, leverage scores have become indispensable in randomized algorithms for linear algebra, optimization, graph theory, and machine learning. A major body of work seeks to adapt these scores to low-rank approximation problems. However, existing "low-rank leverage scores" can be difficult to compute, often work for just a single application, and are sensitive to matrix perturbations. We show how to avoid these issues by exploiting connections between low-rank approximation and regularization. Specifically, we employ ridge leverage scores, which are simply standard leverage scores computed with respect to an ℓ2 regularized input. Importance sampling by these scores gives the first unified solution to two of the most important low-rank sampling problems: (1+ϵ) error column subset selection and (1+ϵ) error projection-cost preservation. Moreover, ridge leverage scores satisfy a key monotonicity property that does not hold for any prior low-rank leverage scores. Their resulting robustness leads to two sought-after results in randomized linear algebra. 1) We give the first input-sparsity time low-rank approximation algorithm based on iterative column sampling, resolving an open question posed in [LMP13], [CLM+15], and [AM15]. 2) We give the first single-pass streaming column subset selection algorithm whose real-number space complexity has no dependence on stream length.
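For concreteness, here is a small numpy sketch of ridge leverage scores for the rows of a matrix A (column scores are obtained the same way from A^T), using one common choice of regularization, lambda = ||A - A_k||_F^2 / k. The sampling results above are about using probabilities proportional to these scores; the matrix and parameters below are illustrative.

# Sketch: ridge leverage scores of the rows of A, i.e. standard leverage scores
# computed with respect to the l2-regularized covariance A^T A + lam * I.
import numpy as np

def ridge_leverage_scores(A: np.ndarray, k: int) -> np.ndarray:
    # One common choice of regularization: lam = ||A - A_k||_F^2 / k, where A_k
    # is the best rank-k approximation of A (guarded against lam = 0).
    s = np.linalg.svd(A, compute_uv=False)
    lam = max(np.sum(s[k:] ** 2) / k, 1e-12)
    G_inv = np.linalg.inv(A.T @ A + lam * np.eye(A.shape[1]))
    return np.einsum("ij,jk,ik->i", A, G_inv, A)   # tau_i = a_i^T (A^T A + lam I)^{-1} a_i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 30))
    tau = ridge_leverage_scores(A, k=5)
    probs = tau / tau.sum()                        # importance-sampling probabilities
    print("sum of scores:", round(float(tau.sum()), 3), "max prob:", round(float(probs.max()), 4))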
Since being analyzed by Rokhlin, Szlam, and Tygert and popularized by Halko, Martinsson, and Tropp, randomized Simultaneous Power Iteration has become the method of choice for approximate singular value decomposition. It is more accurate than simpler sketching algorithms, yet still converges quickly for any matrix, independently of singular value gaps. After ~O(1/eps) iterations, it gives a low-rank approximation within (1 + eps) of optimal for spectral norm error. We give the first provable runtime improvement on Simultaneous Iteration: a simple randomized block Krylov method, closely related to the classic Block Lanczos algorithm, gives the same guarantees in just ~O(1/sqrt{eps}) iterations and performs substantially better experimentally. Despite their long history, our analysis is the first of a Krylov subspace method that does not depend on singular value gaps, which are unreliable in practice. Furthermore, while it is a simple accuracy benchmark, even (1 + eps) error for spectral norm low-rank approximation does not imply that an algorithm returns high quality principal components, a major issue for data applications. We address this problem for the first time by showing that both Block Krylov Iteration and a minor modification of Simultaneous Iteration give nearly optimal PCA for any matrix. This result further justifies their strength over non-iterative sketching methods. Finally, we give insight beyond the worst case, justifying why both algorithms can run much faster in practice than predicted. We clarify how simple techniques can take advantage of common matrix properties to significantly improve runtime.
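Below is a short numpy sketch of the randomized Block Krylov method for rank-k approximation described above: build the Krylov block [A Pi, (A A^T) A Pi, ..., (A A^T)^q A Pi] from a random starting block Pi, orthonormalize it, and take the best rank-k approximation inside that subspace. The iteration count q and the test matrix are illustrative, not tuned.

# Sketch of randomized Block Krylov Iteration for near-optimal rank-k approximation.
import numpy as np

def block_krylov_lowrank(A: np.ndarray, k: int, q: int = 5) -> np.ndarray:
    n, d = A.shape
    rng = np.random.default_rng(0)
    Pi = rng.standard_normal((d, k))               # random Gaussian start block
    blocks = [A @ Pi]
    for _ in range(q):
        blocks.append(A @ (A.T @ blocks[-1]))      # next block: (A A^T)^i A Pi
    K = np.hstack(blocks)
    Q, _ = np.linalg.qr(K)                         # orthonormal basis of the Krylov subspace
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    Z = Q @ U[:, :k]                               # top-k left singular directions within span(Q)
    return Z @ (Z.T @ A)                           # rank-k approximation of A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 80))
    Ak = block_krylov_lowrank(A, k=10)
    opt = np.linalg.svd(A, compute_uv=False)[10]   # optimal spectral error is sigma_(k+1)
    err = np.linalg.norm(A - Ak, 2)
    print(f"achieved spectral error: {err:.3f}, optimal sigma_(k+1): {opt:.3f}")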
Abstract for FaRM:
We describe the design and implementation of FaRM, a new main memory
distributed computing platform that exploits RDMA to improve both
latency and throughput by an order of magnitude relative to
state-of-the-art main memory systems that use TCP/IP. FaRM exposes the
memory of machines in the cluster as a shared address space.
Applications can use transactions to allocate, read, write, and free
objects in the address space with location transparency. We expect
this simple programming model to be sufficient for most application
code. FaRM provides two mechanisms to improve performance where
required: lock-free reads over RDMA, and support for collocating
objects and function shipping to enable the use of efficient single
machine transactions. FaRM uses RDMA both to directly access data in
the shared address space and for fast messaging, and is carefully
tuned for the best RDMA performance. We used FaRM to build a key-value
store and a graph store similar to Facebook's. They both perform well;
for example, a 20-machine cluster can perform 167 million key-value
lookups per second with a latency of 31 μs.
Abstract for RAMCloud:
RAMCloud is a storage system that provides low-latency access to
large-scale datasets. To achieve low latency, RAMCloud stores all data
in DRAM at all times. To support large capacities (1PB or more), it
aggregates the memories of thousands of servers into a single coherent
key-value store. RAMCloud ensures the durability of DRAM-based data by
keeping backup copies on secondary storage. It uses a uniform
log-structured mechanism to manage both DRAM and secondary storage,
which results in high performance and efficient memory usage. RAMCloud
uses a polling-based approach to communication, bypassing the kernel
to communicate directly with NICs; with this approach, client
applications can read small objects from any RAMCloud storage server
in less than 5 μs; durable writes of small objects take about 13.5 μs.
RAMCloud does not keep multiple copies of data online; instead, it
provides high availability by recovering from crashes very quickly (1
to 2 seconds). RAMCloud's crash recovery mechanism harnesses the
resources of the entire cluster working concurrently so that recovery
performance scales with cluster size.
Laplacian matrices of graphs arise in large-scale computational applications such as machine learning; spectral clustering of images, genetic data and web pages; transportation network flows; electrical resistor circuits; and elliptic partial differential equations discretized on unstructured grids with finite elements. A Lean Algebraic Multigrid (LAMG) solver of the linear system Ax=b is presented, where A is a graph Laplacian. LAMG's run time and storage are linear in the number of graph edges. LAMG consists of a setup phase, in which a sequence of increasingly-coarser Laplacian systems is constructed, and an iterative solve phase using multigrid cycles. General graphs pose algorithmic challenges not encountered in traditional applications of algebraic multigrid. LAMG combines a lean piecewise-constant interpolation, judicious node aggregation based on a new node proximity definition, and an energy correction of the coarse-level systems. This results in fast convergence and substantial overhead and memory savings. A serial LAMG implementation scaled linearly for a diverse set of 1666 real-world graphs with up to six million edges. This multilevel methodology can be fully parallelized and extended to eigenvalue problems and other graph computations.
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
This is based on the following paper: Ralaivola, Liva, Sanjay J. Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural Networks 18, no. 8 (2005): 1093-1110. Availability of large repositories of chemical compounds has created new challenges and opportunities for the application of machine learning methods to problems in computational chemistry and chemical informatics. Because chemical compounds are often represented by the graph of their covalent bonds, or simply planar labeled graphs, machine learning methods in this domain must be capable of processing graphical structures with variable size. Positive definite kernels that work on labeled graphs can be used to harness the power of established kernel methods in machine learning. In this talk, I will review several such kernels. Further, as a possible future direction, I will explore how the machinery used to compute kernels could be employed to compute depth statistics for labeled graph ensembles.
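As a hedged illustration of the style of kernel surveyed in the paper (not a reimplementation of any specific kernel from it), the sketch below enumerates labeled paths up to a fixed depth from two tiny molecular graphs and compares the resulting path sets with a Tanimoto (Jaccard) score, which gives a positive definite similarity on labeled graphs. The graphs, labels, and depth are made up.

# Sketch: a path-based similarity between labeled graphs. Labeled simple paths
# up to a given length are enumerated by DFS, and the resulting path sets are
# compared with a Tanimoto (Jaccard) score.
def labeled_paths(adj, labels, max_len=3):
    paths = set()
    def dfs(v, path_labels, visited, depth):
        paths.add(tuple(path_labels))
        if depth == max_len:
            return
        for u in adj[v]:
            if u not in visited:
                dfs(u, path_labels + [labels[u]], visited | {u}, depth + 1)
    for v in adj:
        dfs(v, [labels[v]], {v}, 0)
    return paths

def tanimoto(p1, p2):
    inter = len(p1 & p2)
    return inter / (len(p1) + len(p2) - inter)

if __name__ == "__main__":
    # two tiny made-up "molecules": node labels are atom types, edges are bonds
    g1 = {0: [1], 1: [0, 2], 2: [1]}
    l1 = {0: "C", 1: "C", 2: "O"}
    g2 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    l2 = {0: "C", 1: "C", 2: "C", 3: "O"}
    print(round(tanimoto(labeled_paths(g1, l1), labeled_paths(g2, l2)), 3))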
The problem of maximizing a concave function f(x) in the unit simplex Δ can be solved approximately by a simple greedy algorithm. For given k, the algorithm can find a point x(k) on a k-dimensional face of Δ, such that f(x(k)) ≥ f(x*) − O(1/k). Here f(x*) is the maximum value of f in Δ, and the constant factor depends on f. This algorithm and analysis were known before, and are related to problems in statistics and machine learning, such as boosting, regression, and density mixture estimation. In other work, coming from computational geometry, the existence of ε-coresets was shown for the minimum enclosing ball problem by means of a simple greedy algorithm. Similar greedy algorithms, which are special cases of the Frank-Wolfe algorithm, were described for other enclosure problems. Here these results are tied together, stronger convergence results are reviewed, and several coreset bounds are generalized or strengthened. Following recent work of Clarkson, we translate the coreset framework to the problems of finding the point closest to the origin inside a polytope, finding the shortest distance between two polytopes, Perceptrons, and soft- as well as hard-margin Support Vector Machines (SVM). We prove asymptotically matching upper and lower bounds on the size of coresets, stating that ε-coresets of size ⌈(1 + o(1))E*/ε⌉ do always exist as ε → 0, and that this is best possible. The crucial quantity E* is what we call the excentricity of a polytope, or a pair of polytopes. Additionally, we prove linear convergence speed of Gilbert's algorithm, one of the earliest known approximation algorithms for polytope distance, and generalize both the algorithm and the proof to the two-polytope case. Interestingly, our coreset bounds also imply that we can for the first time prove matching upper and lower bounds for the sparsity of Perceptron and SVM solutions.
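The greedy algorithm mentioned at the start of the abstract above is easiest to see in code. Here is a minimal Frank-Wolfe sketch for maximizing the concave function f(x) = -0.5*||x - c||^2 over the unit simplex: each step moves toward the vertex with the largest gradient coordinate, so the iterate after k steps is supported on a face of dimension at most k, in line with the f(x(k)) ≥ f(x*) − O(1/k) guarantee. The target vector c and iteration count are made up.

# Minimal Frank-Wolfe sketch over the unit simplex for f(x) = -0.5 * ||x - c||^2.
# Each iteration adds at most one new vertex to the support of the iterate.
import numpy as np

def frank_wolfe_simplex(c: np.ndarray, iters: int) -> np.ndarray:
    n = len(c)
    x = np.zeros(n)
    x[0] = 1.0                                  # start at a vertex
    for t in range(iters):
        grad = c - x                            # gradient of f at x
        i = int(np.argmax(grad))                # best simplex vertex (the linear oracle)
        gamma = 2.0 / (t + 2.0)                 # standard step size
        x = (1 - gamma) * x
        x[i] += gamma                           # move toward vertex e_i
    return x

if __name__ == "__main__":
    c = np.array([0.2, -0.1, 0.5, 0.4])
    x = frank_wolfe_simplex(c, iters=200)
    print("iterate:", np.round(x, 3), "support size:", int((x > 1e-6).sum()))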
This talk is mainly based on Erin Elizabeth Chapman's Master's thesis, "A
Survey and Analysis of Solutions to the Oblivious Memory Access
Problem". It focuses on introducing early works on ORAMs, such
as "Towards a Theory of Software Protection and Simulation by
Oblivious RAMs" (STOC 1987) and "Software Protection and Simulation
on Oblivious RAMs" (Journal of the ACM, 1996).
Despite the use of strong encryption schemes, one can still learn
information about encrypted data using side channel attacks. Watching
what physical memory is being accessed can be such a side channel. One
can hide this information by using oblivious simulation -- hiding the
true access pattern of a program. This talk will focus on the model
behind oblivious simulation and attempt to formalize the problem. We will
review the major solutions proposed so far: the square-root and
hierarchical solutions.
Many tools exist today for managing increasingly large collections of data. An important challenge, however, is that Big Data management tools must be both fast and easy-to-use. In this talk, we present the Myria system and service that we have developed in the database group at the University of Washington (in collaboration with the eScience Institute). The service is designed to meet the needs of modern data scientists, focusing on performance and productivity. We present an overview of the design of the tool and details on some of its query execution and cloud operation features.
This talk is mainly based on the Petuum project (Eric Xing, CMU) and its series of papers on distributed ML over the past three years: https://petuum.github.io/research.html. It presents three distributed ML challenges and possible solutions with theoretical support. First, data parallelism and the consistency model on the parameter server should benefit from the error-tolerance properties of ML. Second, dependency-aware dynamic scheduling can be used to reschedule model partitions in model parallelism. Third, non-uniform parameter convergence can be handled with priority scheduling.
Spectral clustering is a well-known way to partition a graph or network into clusters or communities with provable guarantees on the quality of the clusters. This guarantee is known as the Cheeger inequality and it holds for undirected graphs. We'll discuss a new generalization of the Cheeger inequality to higher-order structures in networks including network motifs. This is easy to implement and seamlessly generalizes spectral clustering to directed, signed, and many other types of complex networks. In particular, our generalization allows us to re-use the large history of existing ideas in spectral clustering including local methods, overlapping methods, and relationships with kernel k-means. We'll illustrate the types of clusters or communities found by our new method in biological, neuroscience, ecological, transportation, and social networks. This is joint work with Austin Benson and Jure Leskovec at Stanford.
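For the triangle motif on an undirected graph, the higher-order construction reduces to ordinary spectral clustering on a motif-weighted adjacency matrix W = (A·A)∘A, whose (i,j) entry counts the triangles containing edge (i,j). A minimal numpy sketch on a made-up toy graph (two triangles joined by a bridge edge) is below; it is a didactic simplification of the method, not the authors' code.

# Sketch: spectral clustering with the triangle motif. W = (A @ A) * A counts,
# for each edge, the triangles it participates in; a spectral cut of W gives
# the motif-based partition. Toy graph: two triangles joined by one bridge edge.
import numpy as np

edges = [(0, 1), (1, 2), (0, 2),      # triangle 1
         (3, 4), (4, 5), (3, 5),      # triangle 2
         (2, 3)]                      # bridge edge (in no triangle)
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

W = (A @ A) * A                       # motif (triangle) weighted adjacency
L = np.diag(W.sum(axis=1)) - W        # motif Laplacian
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]                  # second eigenvector; its sign pattern gives the cut
cluster = (fiedler > 0).astype(int)
print("motif weight on the bridge edge:", W[2, 3])   # 0: the bridge is in no triangle
print("clusters:", cluster)                           # the two triangles separate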
Bio: David Gleich's research is on high-performance and large-scale matrix computations with a focus on analyzing large data from information networks, social networks, protein networks, and scientific simulations. He received an NSF CAREER award to discover how to scale matrix methods to the enormous sizes of modern data with only modest computational resources. Prior to joining Purdue, he was the John von Neumann post-doctoral fellow at Sandia National Labs.
Twitter has become one of the largest platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modeling and analysis in Twitter, it remains a huge challenge to detect bursty topics in real time. As existing methods can hardly scale to handle the task with the tweet stream in real time, we propose in this paper TopicSketch, a novel sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experiment results show both the efficiency and effectiveness of our approach. In particular, it is also demonstrated that TopicSketch can potentially handle hundreds of millions of tweets per day, which is close to the total number of daily tweets in Twitter, and can present bursty events at a finer granularity.
Computational topology bridges research from algebraic topology, computational geometry, machine learning, and statistics in studying data-centric problems. In recent years, the field has produced a large collection of innovative techniques for studying the shapes of data via analysis and visualization. Two fundamental tasks in the field are reconstruction (i.e., how to assemble discrete samples into global structures) and inference (i.e., how to infer high-dimensional structure from low-dimensional representations).
I will first give a brief introduction to the field of topological data analysis (TDA) and visualization, with a focus on fundamental concepts and open problems. Then I will give some examples of my current research and ongoing projects. I am looking for a talented PhD student who is interested in TDA and visualization to work on NSF-funded projects involving large-scale network analysis and visualization.
Talk Abstract:
Making beautiful, easy-to-use, and fast software is hard and poses
deep technical challenges. This talk demonstrates what Tableau
provides, some of the tough problems we have solved, and how we
improved performance in our 9.0 product through parallel query
operations and dependency analysis. Our research and development teams
regularly present at academic conferences such as SIGMOD, VLDB,
SIGCHI, SIGGRAPH, et cetera; this talk was adapted from the SIGMOD
2015 paper.
Speaker Bio:
Dana Groff is Tableau’s Platform Data manager, which means that he
manages all development from the abstract query through the physical
query in Tableau. This includes query optimization, the internal
columnar database system, all connectivity to outside databases, and
the machine learning efforts we use to discover and clean data. He is
an industry veteran who, prior to Tableau, helped develop parallel
programming libraries for C++ at Microsoft; while there he also put
transactions into the OS kernel, the file system, and the language
(software transactional memory). He has worked for Sybase (now SAP),
IBM, Filetek (now SGI), and DEC (now HP and Oracle) developing OLTP
and relational database systems.
Background of Tableau:
Tableau (NYSE: DATA) is on a mission to help our customers see and
understand their data. We are the leader in business intelligence and
analytics platforms, according to Gartner three years in a row. We are
a freakishly friendly, cool place to work and keep being voted one of
the best places to work in Washington state.
Tableau is free for academic use. (www.tableau.com/academic)
Artificial neural networks form a very rich hypothesis class with favorable sample complexity, but they are hard to train in terms of computational efficiency. In practice, however, neural networks with depth less than 20 are successfully trained using backpropagation with some tricks. In this talk I will briefly introduce the neural network as a hypothesis class. Some of the known theoretical results about its expressivity, sample complexity, and computational efficiency are reviewed. Then I will talk about my ongoing research project on the decision boundaries of two-layer networks with threshold activations and rectifier activations. I will show that a rectifier network can express a much more complicated decision boundary than a threshold network. In particular, we need exponentially more units to represent the same boundary using threshold functions than we need using rectifiers.
This paper is from SOSP’09.
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software’s internals.
And two additional papers briefly:
ICDM’09: Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis
UsenixATC’10: Mining Invariants from Console Logs for System Problem Detection
Compressed inverted indices in use today are based on the idea of gap compression: document pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using fewer bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. We propose to represent an index using a radically different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.
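To make the quasi-succinct representation concrete, here is a didactic Python sketch of Elias-Fano encoding of a monotone sequence of document pointers: each value is split into low bits, stored verbatim, and high bits, stored as a unary-coded bitvector. Real implementations answer the select query in constant time with auxiliary structures; the linear scan below is only for illustration, and the example sequence and universe size are made up.

# Didactic sketch of Elias-Fano encoding of a monotone sequence: values are
# split into l low bits (stored verbatim) and high bits (a bitvector in which
# bit h_i + i is set for the i-th value).
import math

def elias_fano_encode(values, universe):
    n = len(values)
    l = max(0, int(math.floor(math.log2(universe / n))))
    lows = [v & ((1 << l) - 1) for v in values]
    highs = [v >> l for v in values]
    upper = [0] * (n + highs[-1] + 1)
    for i, h in enumerate(highs):
        upper[h + i] = 1                       # exactly n ones; the gaps code the high parts
    return l, lows, upper

def elias_fano_access(l, lows, upper, i):
    # the position of the (i+1)-th one in `upper` recovers the high bits of value i
    ones = -1
    for pos, bit in enumerate(upper):
        ones += bit
        if ones == i:
            return ((pos - i) << l) | lows[i]
    raise IndexError(i)

if __name__ == "__main__":
    docs = [3, 4, 7, 13, 14, 15, 21, 43]       # monotone document pointers
    l, lows, upper = elias_fano_encode(docs, universe=64)
    print([elias_fano_access(l, lows, upper, i) for i in range(len(docs))])  # == docs
    print("low bits per element:", l, "upper bitvector length:", len(upper))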
*********BIO***************
Sebastiano Vigna's research focuses on the interaction between theory and practice. He has worked on highly theoretical topics such as computability on the reals, distributed computability, self-stabilization, minimal perfect hashing, succinct data structures, query recommendation, algorithms for large graphs, theoretical/experimental analysis of spectral rankings such as PageRank, and axiomatization of centrality measures, but he is also (co)author of several widely used software tools ranging from high-performance Java libraries to a model-driven software generator, a search engine, a crawler, a text editor and a graph compression framework. In 2011 he collaborated on the computation of the distance distribution of the whole Facebook graph, from which it was possible to evince that on Facebook there are just 3.74 degrees of separation. Recently, he participated in the analysis of the largest available public web crawl (Common Crawl 2012), which led to the publication of the first open ranking of web sites (http://wwwranking.webdatacommons.org/). His work on Elias-Fano coding and quasi-succinct indices is at the basis of the code of Facebook's folly library (https://github.com/facebook/folly/blob/master/folly/experimental/EliasFanoCoding.h). He also collaborated on the first open ranking of Wikipedia pages (http://wikirank.di.unimi.it/), which is based on his body of work on centrality in networks.
Sebastiano Vigna obtained his PhD in Computer Science from the Universita degli Studi di Milano, where he is currently an Associate Professor.
A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.
Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear in the number of updates. In many of today's big data applications, in particular for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which use only sublinear space by sacrificing slightly on accuracy.
All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The sketch is updated upon the arrival of every element in the stream, thus is ephemeral, meaning that it can only answer queries about the current status of the stream. In this talk, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
*********BIO**************
Zhewei Wei is an associate professor at School of Information, Renmin University of China. He received his BSc degree at School of Mathematical Sciences, Peking University in June 2008, and his PhD degree at Department of Computer Science and Engineering, The Hong Kong University of Science and Technology in March 2012. After that, he worked as a Postdoc at MADALGO (Center for Massive Data Algorithmics), Aarhus University, from September 2012 to August 2014. In 2014 he joined Renmin University of China in Beijing. His research interests lie in the area of algorithms for massive data and database management.
A sketch of a matrix A is another matrix B which is significantly smaller than A, but still approximates it well. That is, either B^TB ~ A^TA or A~B. Producing such sketches efficiently is an important building block in modern data mining and machine learning. In this talk, we will review two recent results.
In the first, we adapt a well-known streaming algorithm for approximating item frequencies to the matrix sketching setting. The algorithm receives the rows of the input matrix A one after the other and generates B such that B^TB ~ A^TA. The presented algorithm achieves the optimal ratio between space and sketch accuracy. It also stands out in that it is deterministic, simple to implement, and elementary to prove.
The second algorithm receives the entries of A in an arbitrary order and generates B such that B~A. It is a sampling scheme that is provably close to optimal. The algorithm requires only O(1) operations per non-zero in the matrix and is shown to outperform prior art.
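The first algorithm described above is Frequent Directions; the short numpy sketch below processes the rows of A one at a time and, whenever the small sketch B fills up, shrinks its singular values so that at least half of its rows become zero again. The sketch size and test matrix are made up; the printed quantity 2*||A||_F^2 / ell is the standard guarantee on ||A^T A - B^T B||_2.

# Sketch of Frequent Directions: stream the rows of A into a small sketch B
# with ell rows; whenever B fills up, shrink its singular values so that at
# least half the rows become zero again. Guarantees B^T B ~ A^T A.
import numpy as np

def frequent_directions(rows, d, ell):
    B = np.zeros((ell, d))
    for a in rows:
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))   # shrink; >= half the rows zero out
            B = np.diag(s) @ Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = a                                 # insert the new row
    return B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 50))
    B = frequent_directions(A, d=50, ell=20)
    err = np.linalg.norm(A.T @ A - B.T @ B, 2)
    bound = 2 * np.linalg.norm(A, "fro") ** 2 / 20          # 2 * ||A||_F^2 / ell
    print(f"covariance error: {err:.1f}  (FD bound: {bound:.1f})")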
BIO
Edo received his B.Sc in Physics and Computer Science from Tel Aviv University and his Ph.D in Computer Science from Yale University, under the supervision of Steven Zucker. During his PhD he spent time at both UCLA and Google as an engineer and a researcher. After that, he joined the Program in Applied Mathematics at Yale as a Post-Doctoral fellow. In 2009 he joined Yahoo! Labs in Israel. He recently moved to New York to lead the machine learning group, which focuses on the theory and practice of (very) large scale data mining and machine learning, in particular the theoretical foundations of machine learning, optimization, scalable scientific computing, and machine learning systems and platforms.
Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop's main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
The world is full of uncertainty and very often decisions have to be
made way before the uncertainty is resolved. Stochastic optimization
studies optimization problems with uncertain inputs or parameters, in
which the uncertainty is modeled using probability theory. The area
was initiated by Dantzig in the 1950s and has been the subject of extensive
research in many disciplines including computer science, mathematics,
operations research, economics, management science, and social science.
In this talk, I will talk about some of my recent efforts on
stochastic combinatorial optimization. In particular, I will talk
about some new results on the stochastic models for several
fundamental combinatorial problems, such as minimum spanning tree,
maximum matching, shortest path and knapsack.
Jian Li is an assistant professor at Institute for Interdisciplinary
Information Sciences, Tsinghua University. He received his BSc degree
from Sun Yat-sen (Zhongshan) University, China, his MSc degree in
computer science from Fudan University, China, and his PhD degree from
the University of Maryland, USA. His research interests lie in the areas of
algorithms and databases. He co-authored several research papers that
have been published in major computer science conferences and
journals. He received the best paper awards at VLDB 2009 and ESA 2010.
Motivated by cloud computing, a number of recent works have studied annotated data streams and variants thereof. In this setting, a computationally weak verifier (cloud user), lacking the resources to store and manipulate his massive input locally, accesses a powerful but untrusted prover (cloud service). The verifier must work within the restrictive data streaming paradigm. The prover, who can annotate the data stream as it is read, must not just supply the answer but also convince the verifier of its correctness. Ideally, both the amount of annotation and the space used by the verifier should be sublinear in the relevant input size parameters. A rich theory of such algorithms -- which we call schemes -- has emerged. Prior work has shown how to leverage the prover's power to efficiently solve problems that have no non-trivial standard data stream algorithms. However, while optimal schemes are now known for several basic problems, such optimality holds only for streams whose length is commensurate with the size of the data universe. In contrast, many real-world datasets are relatively sparse, including graphs that contain only o(n^2) edges, and IP traffic streams that contain much fewer than the total number of possible IP addresses, 2^128 in IPv6. We design the first schemes that allow both the annotation and the space usage to be sublinear in the total number of stream updates rather than the size of the data universe. We solve significant problems, including variations of INDEX, SET-DISJOINTNESS, and FREQUENCY-MOMENTS, plus several natural problems on graphs. On the other hand, we give a new lower bound that, for the first time, rules out smooth tradeoffs between annotation and space usage for a specific problem. Our technique brings out new nuances in Merlin-Arthur communication complexity models, and provides a separation between online versions of the MA and AMA models.
We present a series of (nearly tight) approximation ratio/running time tradeoff results for three seemingly unrelated fundamental problems: Maximum Independent Set, Maximum Induced Matching, and Hypergraph Pricing. These problems are important in various subareas of computer science such as combinatorial optimization, parameterized complexity, and algorithmic pricing.
Informally speaking, our main results are the following.
* Any r-approximation algorithm for Maximum Independent Set/Induced Matching must run in time at least 2^{n/r}.
* The k-hypergraph pricing problem is hard to approximate to within a factor of roughly min(k, sqrt{n}), assuming the Exponential Time Hypothesis (ETH).
Our results put the pricing problem in a rare problem class in which quasi-polynomial time algorithms can perform much better than polynomial time algorithms. The proofs of these results rely on unexpectedly tight connections between these problems and illustrate a rare use of the Exponential Time Hypothesis to rule out polynomial time approximation algorithms.
The volume of available structured data is increasing, particularly in the form of Linked Data, where relationships between individual pieces of data are encoded by a graph-like structure. The data is usually represented using the RDF model. Querying graph-like data brings many challenges. The standard RDF query language is SPARQL. We show that RDF and SPARQL have certain limitations when modelling or reasoning about the data. This motivated us to design and develop a new platform that can truly handle knowledge modelling and machine reasoning. We also present a new query language that is easier to use and has higher expressiveness than SPARQL.
In the data streaming model, where data elements arrive one at a time, can be processed only once, and storage is severely limited, solving even simple data mining problems is challenging. Often one computes a sketch of the input matrix, which is significantly smaller but approximates it well. Finding such sketches efficiently is an important building block in modern algorithms. We consider processing an n*d matrix A in a stream with row-wise updates according to a recent algorithm called Frequent Directions (Liberty, KDD '13) and show it satisfies (1+epsilon) relative error bounds.
SciDB is a new open-source data management system intended primarily for use
in application domains that involve very large (petabyte-scale) array data. We
will describe a set of motivating examples and use them to explain the
features of SciDB. We then provide a snapshot of the novel storage manager,
ArrayStore, for complex, parallel array processing.
ArrayStore builds on prior work on multidimensional storage but considers
the new problem of supporting a parallel and more varied workload comprising
not only range queries, but also binary operations such as joins and complex
user-defined functions.
References:
1) Overview of SciDB: http://www.scidb.org/Documents/sigmod691-brown.pdf
2) ArrayStore: A Storage Manager for Complex Parallel Array Processing
http://www.scidb.org/Documents/SciDB-sigmod362-soroush.pdf
3) http://scidb.org/
If infinite amounts of labeled data are provided, many machine learning algorithms become perfect. With finite amounts of data, regularization or priors have to be used to introduce bias into a classifier. We propose a third option: learning with marginalized corrupted features. We corrupt existing data as a means to generate infinitely many additional training samples from a slightly different data distribution -- explicitly in a way that the corruption can be marginalized out in closed form. This leads to machine learning algorithms that are fast, effective and naturally scale to very large data sets. We showcase this technology as regularization for general risk minimization and for marginalized deep learning for document representations.
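To give a flavor of the closed-form marginalization, here is a small numpy sketch for least-squares learning with "blankout" (dropout-style) feature corruption: rather than sampling corrupted copies, the expectations E[x~] and E[x~ x~^T] under the corruption distribution are plugged into the expected quadratic loss, which turns the corruption into a feature-adaptive ridge-like penalty. This is a simplified illustration under stated assumptions, not the paper's full framework; the data and corruption rate q are made up.

# Sketch: least-squares learning with marginalized blankout corruption. Each
# feature is dropped with probability q (rescaled by 1/(1-q) when kept), and
# the expected quadratic loss under this corruption is minimized in closed
# form -- no corrupted copies are ever sampled.
import numpy as np

def mcf_least_squares(X, y, q):
    # E[x~] = x; E[x~ x~^T] equals x x^T off the diagonal and x_d^2/(1-q) on it,
    # so the marginalized objective is ridge-like with a data-dependent penalty.
    S = X.T @ X
    S = S + (q / (1.0 - q)) * np.diag(np.diag(S))
    return np.linalg.solve(S, X.T @ y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))
    w_true = rng.standard_normal(20)
    y = X @ w_true + 0.5 * rng.standard_normal(500)
    w_plain = np.linalg.solve(X.T @ X, X.T @ y)     # ordinary least squares
    w_mcf = mcf_least_squares(X, y, q=0.3)          # implicit, feature-adaptive regularization
    print("OLS error:", round(float(np.linalg.norm(w_plain - w_true)), 3),
          "MCF error:", round(float(np.linalg.norm(w_mcf - w_true)), 3))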
Bio: Kilian Q. Weinberger is an Assistant Professor in the Department of Computer Science & Engineering at Washington University in St. Louis. He received his Ph.D. from the University of Pennsylvania in Machine Learning under the supervision of Lawrence Saul. Prior to this, he obtained his undergraduate degree in Mathematics and Computer Science at the University of Oxford. During his career he has won several best paper awards at ICML, CVPR and AISTATS. In 2011 he was awarded the AAAI senior program chair award and in 2012 he received the NSF CAREER award. Kilian Weinberger's research is in Machine Learning and its applications. In particular, he focuses on high dimensional data analysis, metric learning, machine learned web-search ranking, transfer- and multi-task learning as well as bio medical applications.
There are many Big Data Applications that require users to coordinate and communicate. Friends want to coordinate travel plans, students want to jointly enroll in the same set of courses, and busy professionals want to coordinate their schedules. These tasks are difficult to program using existing abstractions provided by database systems since they all require some type of coordination between users. However, this type of information flow is fundamentally incompatible with classical isolation in database transactions. I will argue that it is time to look beyond isolation towards principled abstractions for Big Data Social. This is joint work with Alan Demers, Nitin Gupta, Christoph Koch, Lucja Kot, Milos Nikolic, and Sudip Roy.
Bio:
Johannes Gehrke is the Tisch University Professor in the Department of Computer Science at Cornell University and a Partner Scientist at Microsoft. Johannes' research interests are in the areas of database systems, data science, and data privacy. Johannes has received an NSF Career Award, an Arthur P. Sloan Fellowship, an IBM Faculty Award, the Cornell College of Engineering James and Mary Tien Excellence in Teaching Award, the Cornell University Provost's Award for Distinguished Scholarship, a Humboldt Research Award from the Alexander von Humboldt Foundation, the 2011 IEEE Computer Society Technical Achievement Award, and the 2011 Blavatnik Award for Young Scientists from the New York Academy of Sciences. He co-authored the undergraduate textbook Database Management Systems (McGraw-Hill, 2002; currently in its third edition), used at universities all over the world. Johannes is also an Adjunct Professor at the University of Tromsø in Norway. Johannes was Program co-Chair of KDD 2004, VLDB 2007, and ICDE 2012. From 2007 to 2008, he was Chief Scientist at FAST, a Microsoft subsidiary.
Reachability querying is a basic graph operation with numerous important applications in databases, network analysis, computational biology, software engineering, etc. Although many indexes have been proposed to answer reachability queries, most of them are only efficient for handling relatively small graphs. We propose TF-label, an efficient and scalable labeling scheme for processing reachability queries. TF-label is constructed based on a novel topological folding (TF) that recursively folds an input graph into half so as to reduce the label size, thus improving query efficiency. We show that TF-label is efficient to construct and propose efficient algorithms and optimization schemes. Our experiments verify that TF-label is significantly more scalable and efficient than the state-of-the-art methods in both index construction and query processing.
We study the problem of computing centerpoints for an uncertain point set P in R^2. We model each of n uncertain points as having k possible locations and provide an algorithm that is polynomial in n and k to find a point with the highest probability of being a centerpoint of the uncertain point set. We also design a t-product range query data structure on uncertain points, that we believe to be of broader interest. Given a query range, it returns the number of possible uncertain point sets that have at least t points in the range—this value is proportional to the probability that at least t points fall within this range.
Log data continues to be collected at increasing rates. These logs
are used for many tasks, including identifying suspicious behavior, IT
troubleshooting, and user behavior analysis. Storing this ever
increasing data is costly; however, the benefit of log data encourages
system administrators to continually archive the data without deleting
old records.
In this talk we will consider several techniques used to manage
massive log data. RasterZip is a compression algorithm for network
monitoring data. Utilizing simple column-based compression methods
means that the algorithm can compress data quickly on the fly while
maintaining good compression ratios. LogKV is a log processing
system which utilizes key-value stores to improve log processing
throughput and queries over stored data. A novel method, currently
being studied, of distributing log entries in order to improve
similarity between adjacent records will also be discussed.
A major strength of Monte Carlo (MC) particle transport methods is the ability to use Constructive Solid Geometry (CSG) to model a complex domain with high fidelity. Recent efforts to include feedback effects (e.g., depletion, thermal feedback, xenon, etc.) require MC solvers to calculate volumes of spatial regions quickly and accurately.
In this talk, I describe a framework for calculating region volumes given their equivalent constructive solid geometry (CSG) definition. The framework relies on domain decomposition to recursively subdivide regions until they can be computed analytically or (when needed) stochastically. While the framework is general enough to handle any valid region, it was specifically optimized for models used in criticality safety and reactor analysis. Numerical experiments show that the framework is over 500x faster and two orders of magnitude more accurate than the standard stochastic volume calculation algorithms currently in use.
This work is in collaboration with David P. Griesheimer, Brian R. Nease, and Jack Snoeyink.
Dave Millman is a Senior Engineer in the Reactor Technology Department at Bettis Atomic Power Laboratory. He received his PhD from UNC in December 2012 and his MS from NYU in 2007. His research interests are in computational geometry, especially high-precision calculation of solid volumes.
Radio Tomographic Imaging (RTI) is an emerging technology that locates moving objects in areas surrounded by simple and inexpensive radios. RTI is useful in emergencies, rescue operations, and security breaches, since the objects being tracked need not carry an electronic device. Tracking humans moving through a building, for example, could help firefighters save lives by locating victims quickly. RTI works by placing small inexpensive radios in a region of interest. The radios can send and receive wireless signals, and form a network of links that cover the region of interest. When a person walks through the region, they interfere with the links, creating a "shadow" of broken links that can be used to infer presence and track individuals. This yields the following problem: given a collection of radios and a set of "visible" links, infer the trajectory of a person moving through the region. This inference must be robust under link errors and occlusion, as well as be performed in real time. In addition, there is an associated planning problem of where to place the radios in order to make the tracking algorithm as effective as possible. In this paper, we present geometric algorithms for these questions. The key technical developments include a generalization of stabbing line and transversal problems, as well as a novel generalization of traditional art gallery problems.
We look at stability of the clusterings at three different granularities: the clustering level, the cluster level, and the point level. There has been a fair amount of work on stability at the clustering level. In what follows, we explain why we need finer granularities in defining stability notions and why we need to look at point-wise stability notions. We will then provide a new definition for local point-wise stability based on Natural Neighbor Interpolation and explain how to compute these scores exactly in low-d and estimate them in high-d.
Increasingly, linked data has become the de-facto standard in publishing data and representing knowledge on the web. The W3C's recommendation for representing linked data is to use the Resource Description Framework (RDF). To date, more than 32 billion facts on the web have been represented as linked data (i.e., RDF graphs). We have conducted research on designing novel query processing techniques to support scalable data analytics in large-scale linked data. In this talk, we will present our findings on answering queries over linked data views -- a problem that commonly arises in integrating knowledge from different web sources, and the related problem of performing multi-query optimization in RDF graphs. We will also envision the immediate and future challenges associated with supporting efficient and scalable data analytics in large linked data, and their generalizations to general graph data.
Given the recent explosion of data we have witnessed within the past few years, it has become apparent that many algorithms designed for these applications simply will not scale to the huge amounts of data being generated. Due to the massive nature of modern data, it is often infeasible for computers to efficiently manage and query it exactly. An attractive path for the future is data summarization techniques, which can be used to make data analytic tasks orders of magnitude more efficient while still allowing approximation guarantees on results. We argue that, in order to be useful, a summary must be efficient to construct and should be highly parallelizable. We present two such techniques which efficiently construct summaries over massive data; the first constructs wavelet histograms and the second constructs k-nearest neighbor graphs. Our techniques are demonstrated and evaluated in MapReduce, but are applicable to any parallel and distributed compute platform. In addition to being able to efficiently construct summaries, we also motivate a vision for interactive and mergeable summaries as a direction for future research.
In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a cloud service provider and asks for an encrypted data point in E(D) (the encrypted database) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods, even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field, and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models, our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide a customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption scheme.
Increasingly, data is being bought and sold online, and Web-based marketplace services have emerged to facilitate selling and buying data. But current pricing mechanisms are very simple. In this talk I will discuss a framework for pricing data that allows the seller to set explicit prices for a set of views of her choice, yet allows the buyer to buy any query; the price of the query is derived automatically from the explicit prices set by the seller. We call this framework "query-based pricing". A pricing function must satisfy two important properties: it has to be arbitrage-free and discount-free. I will show that these properties are intimately related to a fundamental database concept called "query-view determinacy", which has been studied by Nash, Segoufin, and Vianu. I will discuss several aspects of arbitrage-free pricing functions: the case when the seller sells a service in addition to the data; the distinction between arbitrage-freeness and query containment (fewer answers are sometimes more expensive than more answers!); and the connection to data privacy (data perturbed by random noise should be cheaper than the accurate data).
Bio: Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs and joined the University of Washington in 2000. Suciu is conducting research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, data security. He is a co-author of two books Data on the Web: from Relations to Semistructured Data and XML, 1999, and Probabilistic Databases, 2011. He is a Fellow of the ACM, holds twelve US patents, received the ACM SIGMOD Best Paper Award in 2000, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, and is a recipient of the NSF Career Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, and is an associate editor for the VLDB Journal, ACM TOIS, ACM TWEB, and Information Systems and is a past associate editor for ACM TODS. Suciu's PhD students Gerome Miklau and Christopher Re received the ACM SIGMOD Best Dissertation Award in 2006 and 2010 respectively, and Nilesh Dalvi was a runner up in 2008.
We will give an introduction to MongoDB and how it overcomes the limitations of an SQL database, and elaborate on the motivation for the NoSQL movement. We will briefly explain its important features, namely the document-oriented approach, flexible schemas, replication and sharding, MapReduce, GridFS, and geospatial indexing. Indexing in MongoDB will be explained in more detail. We will conclude with the reasons MongoDB is preferred over other databases in industry.
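For readers who want to try the geospatial indexing feature mentioned above, a minimal pymongo sketch is shown below. It assumes a MongoDB server running on localhost; the "demo" database, the "places" collection, and the documents themselves are made up for the example.

# Minimal pymongo sketch of document insertion and a geospatial query.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
places = client["demo"]["places"]

# flexible, document-oriented inserts (no schema declared up front)
places.insert_many([
    {"name": "Library", "loc": {"type": "Point", "coordinates": [-111.81, 41.74]}},
    {"name": "Coffee",  "loc": {"type": "Point", "coordinates": [-111.83, 41.73]}},
])

# a 2dsphere index enables $near queries on GeoJSON points
places.create_index([("loc", GEOSPHERE)])

nearby = places.find({
    "loc": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [-111.82, 41.74]},
        "$maxDistance": 5000,            # meters
    }}
})
for doc in nearby:
    print(doc["name"])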
A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source is not by following static hyperlinks. Instead, data are obtained by querying the interface and reading the result page dynamically generated. This, along with other factors such as the fact that the interface may answer a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal, i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm the proposed techniques work very well on all the real datasets examined.
While users today have access to many tools that assist in performing large scale data analysis tasks, understanding the performance characteristics of their parallel computations, such as MapReduce jobs, remains difficult. We present PerfXplain, a system that enables users to ask questions about the relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain provides a new query language for articulating performance queries and an algorithm for generating explanations from a log of past MapReduce job executions. We formally define the notion of an explanation together with three metrics, relevance, precision, and generality, that measure explanation quality. We present the explanation-generation algorithm based on techniques related to decision-tree building. We evaluate the approach on a log of past executions on Amazon EC2, and show that our approach can generate quality explanations, outperforming two naive explanation-generation methods.
In the analysis of algorithms, the notion of efficiency is almost always assumed to mean polynomial time, where the gold standard of achievement is linear time. Indeed, it is hard to imagine doing much better than that, since for any nontrivial problem, it would seem that an algorithm must consider all of the input in order to make a decision. However, as extremely large data sets grow more prevalent in a wide variety of settings, it is natural to wonder what one can do in sublinear time. In fact, there has been a lot of recent interest in this direction. Sublinear time is a daunting goal since it allows one to read only a minuscule fraction of the input. There are problems for which deterministic exact sublinear time algorithms are known. However, for most natural problems the algorithm must use randomization and must give an answer which is in some sense approximate. There is a growing body of work aimed at finding sublinear time algorithms for various problems. In addition, property testing, an alternative notion of approximation for decision problems, has been applied to give sublinear algorithms for a wide variety of problems.
Many examples of problems that can be solved in sublinear time have been found, and several useful techniques have emerged for designing sublinear algorithms. Still, the study of sublinear time algorithms is very new, and much remains to be understood about their scope, both in finding sublinear time algorithms for new problems and in finding more efficient algorithms for problems that already have sublinear time algorithms. In this talk we will give an introduction, background, and motivation for sublinear algorithms and property testing, and present an overview of current techniques and related context. We will also show some examples of property testing on massive graphs to give a flavor of how these simple algorithms work.
Surveys used in the presentation come from here.
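As a toy illustration of the property-testing style described in this talk, here is a sketch (my own, assuming distinct array values) of the classic sortedness tester: pick a random index, binary-search for its value, and reject if the search does not end where it started. It reads only O((1/ε) log n) entries, always accepts a sorted array, and rejects an array that is ε-far from sorted with constant probability.

    import random

    def probe_ok(a, i):
        """Binary search for a[i]; succeed only if the search terminates at index i.
        Assumes distinct values (the classic simplification)."""
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if mid == i:
                return True
            if a[mid] < a[i]:
                lo = mid + 1
            else:
                hi = mid - 1
        return False

    def sortedness_tester(a, eps=0.1):
        """Reads O((1/eps) log n) entries of a, rather than all of them."""
        for _ in range(int(2 / eps) + 1):
            if not probe_ok(a, random.randrange(len(a))):
                return False
        return True

    print(sortedness_tester(list(range(1000))))          # sorted: True
    print(sortedness_tester(list(range(1000, 0, -1))))   # far from sorted: almost surely False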
TITLE: Using XMorph for XML Data Transformations
ABSTRACT: XMorph is a new data transformation language for XML. I will give an overview of XMorph and describe its use as a guard to correctly evaluate XQuery queries on data in differently shaped data collections.
SPEAKER: Curtis Dyreson
BIO: I'm an assistant professor in the Department of Computer Science at Utah State University. I'm also the ACM SIGMOD DiSC Editor, the Information Director for ACM Transactions on Database Systems and the Information Director for ACM SIGMOD, and I serve on the SIGMOD Executive Committee. I research and teach in the area of database systems. My interests include temporal databases, XML databases, data cubes, and providing support for proscriptive metadata. Prior to Utah State University, I was a professor at Washington State University, James Cook University, Aalborg University, and Bond University.
URL: http://cs.usu.edu/~cdyreson/
TITLE: Vehicle Traffic Understanding for Video Stream Data
ABSTRACT: The approximate counting of sets of vehicle activities helps us to predict the significance of scenes in a video stream. Intuitively, the more transactions we take, the more precise the frequency approximations we obtain; however, more memory is consumed as well. In this talk, we show an approach that automatically discovers frequent interactions with approximate support and a guaranteed error value, while dynamically adjusting the number of scenes needed to effectively evaluate frequency values. Our method reduces space complexity by adapting the algorithm to the amount of memory available before performing any process to update frequency values for vehicle interactions. Load balancing, IO-intensive workloads, and multicore algorithms are considered over the MapReduce programming model. (A toy sketch of this style of bounded-memory approximate counting appears after this announcement.)
SPEAKER: Omar U. Florez
BIO: I am a Ph.D. student in Computer Science at Utah State University. I was born in Peru and came to the US to work as a Research Assistant on projects in Scalable Data Mining, Parallel Machine Learning, Pattern Recognition, and Computer Vision. Specifically, I implemented algorithms and produced 20+ publications on MapReduce Topic Modeling, Scalable Recommender Systems, Traffic Understanding from Video Sequences, and Clustering and Hash-based Indexing of Unstructured Data (videos, music, and human motion time series).
URL: http://omarflorez.info/index.php?id=homepage
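The bounded-memory, guaranteed-error counting described in the abstract above is in the spirit of classic approximate frequency counting. Below is a hedged sketch of plain lossy counting over a stream of interaction items; it is not the speaker's algorithm and ignores the MapReduce and load-balancing aspects.

    def lossy_counting(stream, eps=0.01):
        """Lossy counting sketch: after seeing n items, every reported count
        undercounts the true frequency by at most eps*n, while keeping only
        O((1/eps) log(eps*n)) counters in memory."""
        width = int(1 / eps)              # bucket width
        counts, deltas = {}, {}
        bucket = 1
        for n, item in enumerate(stream, 1):
            if item in counts:
                counts[item] += 1
            else:
                counts[item] = 1
                deltas[item] = bucket - 1
            if n % width == 0:            # end of a bucket: prune infrequent items
                for x in [x for x in counts if counts[x] + deltas[x] <= bucket]:
                    del counts[x], deltas[x]
                bucket += 1
        return counts                     # counts[x] >= true_freq(x) - eps*n

    print(lossy_counting(["stop", "turn", "stop", "pass", "stop", "turn"] * 1000, eps=0.05))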
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms are able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH’s output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.
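For background, here is a hedged sketch of the plain MinHash estimator that this kind of LSH-based similarity estimation builds on; BayesLSH's own contribution, deciding adaptively via Bayesian inference how many hash comparisons are needed and when to prune a candidate, is not shown.

    import random

    def make_hashes(num, prime=(1 << 61) - 1, seed=42):
        rng = random.Random(seed)
        return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(num)]

    def minhash_signature(tokens, hashes, prime=(1 << 61) - 1):
        """One min-hash value per (a, b) hash function; two sets agree on a given
        signature position with probability equal to their Jaccard similarity."""
        return [min((a * hash(t) + b) % prime for t in tokens) for a, b in hashes]

    def estimate_jaccard(sig1, sig2):
        return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

    hashes = make_hashes(128)
    s1 = minhash_signature({"a", "b", "c", "d"}, hashes)
    s2 = minhash_signature({"a", "b", "c", "e"}, hashes)
    print(estimate_jaccard(s1, s2))   # true Jaccard similarity is 3/5 = 0.6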
Abstract: Network-based services are becoming a ubiquitous part of our lives, to the point where individuals and businesses have often come to critically rely on them. Building and maintaining such reliable, high performance network and service infrastructures requires the ability to rapidly investigate and resolve complex service and performance impacting issues. To achieve this, it is important to collect, correlate and analyze massive amounts of data, such as network element fault logs, configuration and topology information, performance data and traffic measurements, from a diverse collection of data sources in real time.
We have designed and implemented a variety of data systems at AT&T Labs Research to build highly scalable databases that support real time data collection, correlation and analysis, including (a) the Daytona database management system, (b) the GS data stream management system, (c) the Data Depot stream warehouse generator and (d) the Bistro data feed manager. Together, these data systems have enabled the creation and maintenance of a database and data analysis infrastructure for troubleshooting complex issues in the network. This talk describes these data systems, presents their key research contributions, and identifies technical challenges that are currently being addressed.
Bio: Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech from the Indian Institute of Technology, Bombay. He is a Fellow of the ACM, on the board of trustees of the VLDB Endowment, and an associate editor of the ACM Transactions on Database Systems. He has served as the associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering, and the program committee co-chair of many conferences, including VLDB 2007. He has presented keynote talks at several conferences, including VLDB 2010. His research interests span a variety of topics in data management.
An overview of Divesh's talk is in:
Divesh Srivastava, Lukasz Golab, Rick Greer, Theodore Johnson, Joseph Seidel, Vladislav Shkapenyuk, Oliver Spatscheck, Jennifer Yates:
Enabling Real Time Data Analysis.
PVLDB 3(1): 1-2 (2010)
PDF Link
Other related papers to skim:
Theodore Johnson, S. Muthukrishnan, Oliver Spatscheck, Divesh Srivastava:
Streams, Security and Scalability.
DBSec 2005: 1-15
PDF Link
Rick Greer:
Daytona And The Fourth-Generation Language Cymbal.
SIGMOD Conference 1999: 525-526
PDF Link
Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, Vladislav Shkapenyuk:
Gigascope: A Stream Database for Network Applications.
SIGMOD Conference 2003: 647-651
PDF Link
We describe DataDepot, a tool for generating warehouses from streaming data feeds, such as network-traffic traces, router alerts, financial tickers, transaction logs, and so on. DataDepot is a streaming data warehouse designed to automate the ingestion of streaming data from a wide variety of sources and to maintain complex materialized views over these sources. As a streaming warehouse, DataDepot is similar to Data Stream Management Systems (DSMSs) with its emphasis on temporal data, best-effort consistency, and real-time response. However, as a data warehouse, DataDepot is designed to store tens to hundreds of terabytes of historical data, allow time windows measured in years or decades, and allow both real-time queries on recent data and deep analyses on historical data. In this paper we discuss the DataDepot architecture, with an emphasis on several of its novel and critical features. DataDepot is currently being used for five very large warehousing projects within AT&T; one of these warehouses ingests 500 Mbytes per minute (and is growing). We use these installations to illustrate streaming warehouse use and behavior, and design choices made in developing DataDepot. We conclude with a discussion of DataDepot applications and the efficacy of some optimizations.
Unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. One approach to unsupervised learning is clustering. K-means is the "hello world" of clustering algorithms: its simplicity comes in part from several built-in assumptions, and it can be thought of as a special case of fitting a constrained Gaussian mixture model (GMM) to data. Alternatively, expectation-maximization (EM) can be used to fit a more general GMM. Both K-means and EM-GMM require a pre-selection of K, i.e., the number of Gaussians to fit, which usually leads to cross-validation for selecting K. To avoid cross-validation, K can be included in the optimization problem itself. Rephrasing the problem as a Bayesian inference problem, with K as one of the parameters, is one way of doing that. Variational Bayes (VB), a.k.a. ensemble learning, is one approach to estimating the solution to such a (normally intractable) inference problem. This presentation will consist of a simple introduction to VB-GMM, viewing it as an "extension" of EM-GMM.
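A minimal sketch of the contrast, using scikit-learn's implementations rather than the presenter's own code: EM-GMM needs K fixed in advance, while the variational-Bayes mixture can be given a generous upper bound on K and will shrink the weights of unneeded components toward zero.

    import numpy as np
    from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

    rng = np.random.RandomState(0)
    # toy data: three well-separated 2-D Gaussian blobs
    X = np.vstack([rng.randn(200, 2) + c for c in ([0, 0], [6, 0], [0, 6])])

    # EM-GMM: K must be chosen up front (typically via cross-validation)
    em = GaussianMixture(n_components=3, random_state=0).fit(X)

    # VB-GMM: over-provision K and let the variational posterior prune it
    vb = BayesianGaussianMixture(n_components=10,
                                 weight_concentration_prior_type="dirichlet_process",
                                 random_state=0).fit(X)
    print(np.round(vb.weights_, 3))   # only ~3 components get non-negligible weight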
References:
Bishop, Christopher M. (2006). "Pattern Recognition and Machine Learning". Springer. ISBN 0-387-31073-8 (sections 10.2 and 9.4)
http://research.microsoft.com/~cmbishop/downloads/Bishop-AIStats01.pdf
http://www.goldenmetallic.com/research/uai99.pdf
https://en.wikipedia.org/wiki/Variational_Bayesian_methods
Bandit problems are among the most important problems in the field of exploration learning. In the words of P. Whittle, "Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions (e.g. acquiring information or preparing the ground) whose benefit will come only later." This is better known as the exploration vs. exploitation dilemma. This presentation introduces the stochastic multi-armed bandit problem, analyzes a few well-known algorithms, and works through a few examples.
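As a concrete illustration of the exploration/exploitation trade-off, here is a hedged sketch of the classic UCB1 algorithm on simulated Bernoulli arms; the true arm means below are made up and are used only to generate rewards.

    import math, random

    def ucb1(arm_means, horizon=10000):
        """UCB1 sketch: pull the arm with the highest empirical mean plus an
        exploration bonus that shrinks as the arm is pulled more often."""
        k = len(arm_means)
        counts = [0] * k
        sums = [0.0] * k
        for t in range(1, horizon + 1):
            if t <= k:
                i = t - 1                  # play each arm once first
            else:
                i = max(range(k),
                        key=lambda j: sums[j] / counts[j]
                        + math.sqrt(2 * math.log(t) / counts[j]))
            reward = 1.0 if random.random() < arm_means[i] else 0.0
            counts[i] += 1
            sums[i] += reward
        return counts

    print(ucb1([0.2, 0.5, 0.7]))   # most pulls should go to the 0.7 arm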
References:
http://www.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/Auer+al-UCB.pdf
http://en.wikipedia.org/wiki/Multi-armed_bandit
http://www.cs.huji.ac.il/~shais/Lecture1.pdf
http://www.cs.mcgill.ca/~vkules/bandits.pdf
Abstract: As two important operations in data cleaning, similarity join and similarity search have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework: they select a prefix of each object and prune object pairs whose prefixes have no overlap. We observe that prefix length has a significant effect on performance: different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, in this paper we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object, and devise effective indexes to select prefixes efficiently. We extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.
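For context, here is a hedged sketch of the fixed-prefix baseline the paper improves on: standard prefix filtering for a Jaccard-threshold self-join. The paper's contribution, adaptively choosing a per-object prefix via a cost model, is not shown.

    from collections import defaultdict

    def jaccard(r, s):
        r, s = set(r), set(s)
        return len(r & s) / len(r | s)

    def prefix_filter_join(records, t):
        """Self-join: report pairs with Jaccard similarity >= t using fixed-length
        prefixes. records is a list of token lists; tokens are globally ordered by
        frequency so prefixes contain the rarest tokens."""
        freq = defaultdict(int)
        for r in records:
            for tok in set(r):
                freq[tok] += 1
        recs = [sorted(set(r), key=lambda x: (freq[x], x)) for r in records]

        index = defaultdict(list)                  # prefix token -> earlier record ids
        results = []
        for i, r in enumerate(recs):
            plen = len(r) - int(t * len(r)) + 1    # (slightly conservative) prefix length
            cands = {j for tok in r[:plen] for j in index[tok]}
            for j in cands:
                if jaccard(r, recs[j]) >= t:
                    results.append((j, i))
            for tok in r[:plen]:
                index[tok].append(i)
        return results

    data = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
    print(prefix_filter_join(data, t=0.5))   # the first two records match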
Short bio: Guoliang Li has been an Assistant Professor in the Department of Computer Science, Tsinghua University, Beijing, China, since July 2009. He received his PhD degree in Computer Science from Tsinghua University in 2009, and his Bachelor's degree in Computer Science from Harbin Institute of Technology in 2004. His research interests include data cleaning and integration, spatial databases, and crowdsourcing. He has published more than 20 papers in premier conferences and journals, such as SIGMOD, VLDB, ICDE, IEEE TKDE, and the VLDB Journal. His papers have been cited more than 700 times, with two of them receiving more than 100 citations each.
Consider a model where k players each receive a sequence of elements over time, and they communicate with the goal of continuously tracking some function of all the elements received so far. This distributed tracking model naturally generalizes both the multi-party communication model and the data stream model, and abstracts many tracking problems in distributed systems. Though extensively studied in the database and networking communities, this model has attracted theoreticians only recently. This talk will present a brief overview of the results in this area, with a few selected problems explained in more detail: count-tracking, frequent items, and random sampling.
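A toy sketch of the count-tracking flavor (my own simplified simulation, not a result from the talk): each site re-reports only after its local count has drifted by roughly an eps/k share of the last announced total, which keeps the coordinator's estimate within about a (1 - eps) factor of the true count while sending few messages.

    import random

    def count_tracking(streams, eps=0.05):
        """streams: one list of unit increments per site. Returns the coordinator's
        estimate, the true total, and how many messages were sent."""
        k = len(streams)
        local = [0] * k          # true local counts
        reported = [0] * k       # last value each site sent to the coordinator
        estimate, messages = 0, 0
        for t in range(max(len(s) for s in streams)):
            for i, s in enumerate(streams):
                if t < len(s):
                    local[i] += s[t]
                    # report only when the local slack exceeds an eps/k share
                    if local[i] - reported[i] > eps * max(estimate, k) / k:
                        reported[i] = local[i]
                        estimate = sum(reported)
                        messages += 1
        return estimate, sum(local), messages

    streams = [[1] * random.randint(500, 1500) for _ in range(10)]   # 10 sites
    print(count_tracking(streams))   # estimate close to the true total, few messages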
Bio: Ke Yi received his B.E. from Tsinghua University and Ph.D. from Duke University, in 2001 and 2006 respectively, both in computer science. He was a researcher in the database department at AT&T Labs during 2006-2007, and he is now an Assistant Professor in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology. His research interests include algorithms, data structures, and databases.
Tremendous amounts of personal data about individuals are being collected and shared online. Legal requirements and an increase in public awareness due to egregious breaches of individual privacy have made data privacy an important field of research. Recent research, culminating in the development of a powerful notion called differential privacy, has transformed this field from a black art into a rigorous mathematical discipline. This talk critically analyzes the trade-off between accuracy and privacy in the context of social advertising: recommending people, products or services to users based on their social neighborhood. I will present a theoretical upper bound on the accuracy of performing recommendations that are solely based on a user's social network, for a given level of (differential) privacy of sensitive links in the social graph. I will show, using real networks, that good private social recommendations are feasible only for a small subset of the users in the social network or for a lenient setting of privacy parameters. I will also describe some exciting future research directions in privacy, security and big data management in building privacy-aware cloud-based systems where individuals reclaim ownership of their personal information.
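As background on the privacy notion itself (a standard textbook sketch, not the talk's social-recommendation mechanism): the Laplace mechanism answers a count query with noise calibrated to the query's sensitivity, which is what epsilon-differential privacy requires.

    import numpy as np

    def private_count(true_count, epsilon, rng=np.random.default_rng(0)):
        """eps-differentially-private count: a count query has sensitivity 1,
        so Laplace noise with scale 1/eps suffices."""
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    # stricter privacy (smaller epsilon) means noisier answers
    for eps in (1.0, 0.1, 0.01):
        print(eps, private_count(1000, eps))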
Bio: Ashwin Machanavajjhala is a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in data privacy, with a specific focus on formally reasoning about privacy under probabilistic adversary models. He is also interested in big-data management and statistical methods for information integration. Ashwin graduated with a Ph.D. from the Department of Computer Science, Cornell University. His thesis work on defining and enforcing privacy was awarded the 2008 ACM SIGMOD Jim Gray Dissertation Award Honorable Mention. He has also received an M.S. from Cornell University and a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras.
We will have a joint presentation by Samira Daruki and Chenxu Ding on two papers about clustering in MapReduce. The plan is to have a short overview of the MapReduce model, after which the algorithms in both papers will be presented. Finally, we plan to leave some time for discussion. My impression is that both papers make nice contributions, but in no way do they close the problem of large-scale clustering. I think it will be great to have a group brainstorm about what is known and what is not. To best serve this, please try to at least peruse the papers, and bring them to the meeting so we can have an in-depth discussion.
In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce [6]. Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.
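For intuition, here is a hedged, single-process toy simulation of the exact block-nested-loop idea expressed as one "MapReduce" round: R and S are hash-partitioned into buckets, every (R-bucket, S-bucket) pair becomes a reduce task computing local kNN, and results are merged per R point. This is the kind of baseline such partitioning-based algorithms improve on; it is not the authors' Hadoop code.

    from collections import defaultdict
    import heapq

    def knn_join_mapreduce(R, S, k, buckets=4):
        part = lambda j: j % buckets                  # hypothetical partitioner

        # "map": replicate each r to all S-buckets and each s to all R-buckets,
        # so every (r, s) pair meets in exactly one task
        tasks = defaultdict(lambda: ([], []))
        for i, r in enumerate(R):
            for bs in range(buckets):
                tasks[(part(i), bs)][0].append((i, r))
        for j, s in enumerate(S):
            for br in range(buckets):
                tasks[(br, part(j))][1].append(s)

        # "reduce": local kNN per task, merged into a global top-k per R point
        best = defaultdict(list)                      # r id -> heap of (-dist^2, s)
        for rs, ss in tasks.values():
            for i, r in rs:
                for s in ss:
                    d = sum((a - b) ** 2 for a, b in zip(r, s))
                    heapq.heappush(best[i], (-d, s))
                    if len(best[i]) > k:
                        heapq.heappop(best[i])        # drop the farthest so far
        return {i: [s for _, s in sorted(h, reverse=True)] for i, h in best.items()}

    R = [[0.0, 0.0], [5.0, 5.0]]
    S = [[0.1, 0.0], [1.0, 1.0], [5.0, 4.9], [9.0, 9.0]]
    print(knn_join_mapreduce(R, S, k=2))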
This paper revisits the classical problem of multiquery optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to be extended to account for RDF data model and graph query patterns expressed in SPARQL. In light of the NP-hardness of the multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.
We will review this paper. **YOU ARE EXPECTED TO AT LEAST READ THE INTRO and PRELIMINARIES** We will have a joint discussion where we will try to restate the problem, making sure everyone understands the different variants. We will also try to discuss some of the common techniques in this area. This will hopefully give you a nice introduction to the area of "matching", but it will generally only be successful if you take the time to attempt to read the paper.
I'll be speaking in the data group tomorrow (John is out sick). The plan is to discuss the area of 'interactive streaming algorithms'. There are a few papers on the topic now, so I'd recommend starting with:
Graham Cormode, Justin Thaler, and Ke Yi: Verifying Computations with Streaming Interactive Proofs. VLDB 2012. http://www.cs.ust.hk/~yike/vldb12-interactive.pdf
This paper addresses some of the practical issues in implementing these protocols:
G. Cormode, M. Mitzenmacher, and J. Thaler: Practical Verified Computation with Streaming Interactive Proofs. ICS 2012. http://dimacs.rutgers.edu/~graham/pubs/papers/sipitcs.pdf
The keyword-based Google Images search engine has achieved great success in exploiting the associated text terms for automatic indexing of large-scale online image collections. Unfortunately, Google Images search engine is still unsatisfactory because of the relatively low precision rate and the appearance of large amounts of junk images. Therefore, there is an urgent need to develop new algorithms for filtering out the junk images from keyword-based Google Images search results. Based on this observation, the authors have developed an interactive approach to filter out the junk images from keyword-based Google Images search results. In this paper, they have developed an incremental kernel learning algorithm to filter out large amounts of junk images from keyword-based Google Images search results. To achieve more accurate partition and visualization of the returned images, multiple kernels were seamlessly integrated for diverse image similarity characterization. A hyperbolic image visualization approach was incorporated to allow users to assess the relevance between the returned images and their real query intentions interactively and express their query intentions more precisely. To reduce the computational cost for junk image filtering, an incremental kernel learning algorithm was developed for SVM image classifier training by translating the users’ query intentions to determine more accurate combination of these three basic image kernels, achieve more accurate characterization of diverse visual similarity contexts between the returned images, generate more accurate partition of the returned images, and create more precise visualization of the returned images for users to make better hypotheses for junk image filtering.
I will discuss some preliminary notions of stability of clusterings at three levels: local (point level), intermediate (cluster level), and global (partition level). The talk will also outline initial ideas on how to compute or estimate these measures. At the point level, we analyze points in each cluster across different partitions to determine whether they transition too much between different clusters, in which case we label them as unstable. This will allow us to compute cluster "core sets" that are vital to identifying a cluster. Initial motivations for this problem include active clustering, where the goal is to cluster with as few points as possible. Additionally, I will outline a new notion of cluster-level stability under perturbations that looks at the number of points reassigned to another cluster after a perturbation to the center is applied. This notion will tell us which clusters to believe and label and which clusters not to. I will also define a perturbation-based approach to computing the global (partition-level) stability using a generic quality measure. The number of partitions that are ε-close in quality cost to the optimal partition will provide rich information about the nature of peaks and valleys in the quality landscape of the partitions of the data. By building this hierarchical structure of stability notions for clusterings, we hope to make better metaclustering decisions.
I will talk about some of the interesting results I heard about at SODA last week. I will also request registered students to sign up for lecture slots (including next week). http://www.cs.utah.edu/~jeffp/teaching/cs7941.html Come with ideas about topics you are working on and what to present or about recent cool papers you want to talk about. If you need a topic, I can help you find a good paper to present.
MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvement achieved by our new algorithms.
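As a reminder of what the synopsis itself looks like, here is a hedged, single-machine sketch of a plain Haar wavelet histogram; the paper's contribution, computing this efficiently inside MapReduce, is not shown, and proper coefficient selection usually works with normalized coefficients.

    import numpy as np

    def haar_coefficients(freq):
        """Un-normalized Haar decomposition of a frequency vector whose length
        is a power of two: detail coefficients (finest to coarsest), then the
        overall average."""
        coeffs = []
        a = np.asarray(freq, dtype=float)
        while len(a) > 1:
            coeffs.extend((a[0::2] - a[1::2]) / 2.0)   # detail coefficients
            a = (a[0::2] + a[1::2]) / 2.0              # averages for the next level
        coeffs.append(a[0])
        return np.array(coeffs)

    def wavelet_histogram(freq, b):
        """Keep only the b largest-magnitude coefficients as the synopsis."""
        c = haar_coefficients(freq)
        keep = np.argsort(-np.abs(c))[:b]
        synopsis = np.zeros_like(c)
        synopsis[keep] = c[keep]
        return synopsis

    freq = np.array([2., 2., 0., 2., 3., 5., 4., 4.])
    print(wavelet_histogram(freq, b=3))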
We propose a systematic study of various metaclustering problems where the goal is to organize different approaches to clustering data for robust data analysis. We make use of a geometric representation for clusterings that lets us discuss many metaclustering questions in a common framework. We propose to introduce efficient algorithms to solve various metaclustering problems. Our goal is to make the metaclustering solutions work on massive data sets while being easy to implement. Using this geometric framework, we propose to study the landscape of all possible clusterings of data. We hope that this will help us to obtain robust information about the data along with increased user interaction during the data mining process. We extend the techniques developed in our previous metaclustering research on combining clusterings and generating clusterings in this proposal. The insights we obtained from our previous two bodies of work, along with the tools we have for comparing, combining and generating clusterings, lead us towards other metaclustering questions.
Record Matching is a key element of data cleaning technology. Error-Tolerant Record Matching reconciles multiple representations of the same entity in the presence of errors such as spelling mistakes and abbreviations. In this talk, we describe some of the key scenarios and the underlying technology for error-tolerant record matching that we have developed as part of our Data Cleaning project at Microsoft Research.
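A hedged, minimal illustration of the error tolerance in question (a plain normalized-edit-distance matcher, not Microsoft's production technology):

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance with a rolling row."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,        # delete from a
                                         dp[j - 1] + 1,    # insert into a
                                         prev + (ca != cb))  # substitute
        return dp[-1]

    def similar(a, b, threshold=0.8):
        """Error-tolerant match: normalized edit similarity above a threshold."""
        return 1 - edit_distance(a, b) / max(len(a), len(b), 1) >= threshold

    print(similar("Jon Smith", "John Smith"))       # True: a single edit apart
    print(similar("IBM Corp", "IBM Corporation"))   # False: plain edit similarity
                                                    # struggles with abbreviations,
                                                    # one motivation for richer methods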
We are working on expanding human context awareness in computing systems by estimating the location of all people in a home or building, whether or not they participate by wearing an active badge. However, the location and track of a person can sometimes be ambiguous or incorrect. In this work, we use radio tomography, i.e., the use of changes in radio signal strength (RSS) measurements on many static links over time, to estimate where people and moving objects are located.
To learn a supervised classifier, traditional machine learning assumes plenty of labeled data. But what if labels are scarce or difficult to obtain? Scarcity of labeled data is a recurring problem in web-scale learning tasks where labeling billions of data points is tedious, time-consuming and costly. Imagine manually labeling one-by-one a billion images downloaded from the net. Modern machine learning proposes to circumvent this labeling bottleneck in a variety of ways. Two common approaches, among others, are Active Learning and Transfer Learning. In this talk, I will introduce the notions of active and transfer learning and focus on one sub-problem that aims to merge the two.
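To make "active learning" concrete, here is a hedged sketch of pool-based uncertainty sampling with scikit-learn on toy data; it assumes the initial random sample happens to contain both classes, and the array y simply stands in for the human labeler.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    rng = np.random.RandomState(0)
    labeled = list(rng.choice(len(X), 10, replace=False))   # small seed set of labels
    for _ in range(20):                                      # budget of 20 label queries
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X)[:, 1]
        uncertainty = -np.abs(proba - 0.5)                   # closest to 0.5 = most uncertain
        uncertainty[labeled] = -np.inf                       # never re-query a labeled point
        labeled.append(int(np.argmax(uncertainty)))          # ask the "oracle" for this label
    print("labels used:", len(labeled), "accuracy:", clf.score(X, y))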
GIS data usually consist of both spatial and textual information, where the spatial component represents the location of the object and the textual element contains a set of strings describing the object in that location. For GIS data situated on a road network, shortest path search is a basic operation. In practice, however, users are often interested in routing when certain constraints on the textual information have also been incorporated. This work complements the standard shortest path search with multiple keywords and an approximate string similarity function, where the goal is to find the shortest path that passes through at least one matching object per keyword; we dub this problem the multi-approximate-keyword routing (makr) query. We present both exact and approximate solutions. When the number κ of query keywords is small (e.g., κ ≤ 6), the exact solution works efficiently. However, when κ increases, it becomes increasingly expensive (especially on large GIS data). In this case, our approximate methods achieve superb query efficiency, excellent scalability, and high approximation quality, as indicated in our extensive experiments on large, real datasets (up to 2 million points on road networks with hundreds of thousands of nodes and edges). We also prove that one approximate method has a κ-approximation in the worst case.
In this talk I will prove lower bounds on randomized multiparty communication complexity, both in the blackboard model (where each message is written on a blackboard for all players to see) and (mainly) in the message-passing model, where messages are sent player-to-player. I will introduce a new technique for proving such bounds, called symmetrization, which is natural, intuitive, and often easy to use. For example, for the problem where each of k players gets a bit-vector of length n, and the goal is to compute the coordinate-wise OR of these vectors, I will prove a tight lower bound of Omega(nk) in the blackboard model. For the same problem with AND instead of OR, our technique can prove lower bounds of roughly Omega(nk) in the message-passing model and Omega(n log k) in the blackboard model. I may also prove lower bounds for bit-wise majority, for a graph-connectivity problem, and for other problems; the technique seems applicable to a wide range of other problems as well. These approaches are strongly motivated by trying (and failing) to solve real problems in the distributed streaming model. The obtained communication lower bounds imply new lower bounds in that model for eps-kernels, heavy hitters, and distinct elements. All of our lower bounds allow randomized communication protocols with two-sided error.
In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset P, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce [6]. Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop, an open source MapReduce framework. Extensive experiments using large real datasets, at least tens of millions of records, have convincingly demonstrated the efficiency, effectiveness, and the scalability of our methods.
My talk will focus on using random sampling of points to run a streaming variant of MDS that works for large datasets. A quick introduction to MDS and its applications will be provided, but for the most part I will talk about my work on streaming and random sampling, with a small note on multiprocessing work.
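A hedged sketch of the sampling idea: classical MDS run on a random "landmark" subset of a precomputed distance matrix. The streaming placement of the remaining points, which is the actual subject of the talk, is not shown.

    import numpy as np

    def mds_on_sample(D, m, dim=2, seed=0):
        """Classical MDS on m randomly sampled landmark points of an n x n
        distance matrix D."""
        rng = np.random.RandomState(seed)
        idx = rng.choice(D.shape[0], m, replace=False)
        D2 = D[np.ix_(idx, idx)] ** 2
        J = np.eye(m) - np.ones((m, m)) / m      # centering matrix
        B = -0.5 * J @ D2 @ J                    # double-centered Gram matrix
        w, V = np.linalg.eigh(B)
        top = np.argsort(w)[::-1][:dim]          # largest eigenvalues first
        return idx, V[:, top] * np.sqrt(np.maximum(w[top], 0))

    # toy usage: pairwise distances of random 3-D points, landmarks embedded into 2-D
    pts = np.random.RandomState(1).randn(500, 3)
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    idx, coords = mds_on_sample(D, m=100)
    print(coords.shape)   # (100, 2)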
The problem of answering SPARQL queries over virtual SPARQL views is commonly encountered in a number of settings, including while enforcing security policies to access RDF data, or when integrating RDF data from disparate sources. We approach this problem by rewriting SPARQL queries over the views to equivalent queries over the underlying RDF data, thus avoiding the costs entailed by view materialization and maintenance. We show that SPARQL query rewriting combines the most challenging aspects of rewriting for the relational and XML cases: like the relational case, SPARQL query rewriting requires synthesizing multiple views; like the XML case, the size of the rewritten query is exponential in the size of the query and the views. In this paper, we present the first native query rewriting algorithm for SPARQL. For an input SPARQL query over a set of virtual SPARQL views, the rewritten query resembles a union of conjunctive queries and can be of exponential size. We propose optimizations over the basic rewriting algorithm to (i) minimize each conjunctive query in the union; (ii) eliminate conjunctive queries with empty results from evaluation; and (iii) efficiently prune out big portions of the search space of empty rewritings. The experiments, performed on two RDF stores, show that our algorithms are scalable and independent of the underlying RDF stores. Furthermore, our optimizations yield order-of-magnitude improvements over the basic rewriting algorithm in both rewriting size and evaluation time.
There are many ways to partition a dataset, as demonstrated by the plethora of clustering algorithms available. Each clustering method identifies different kinds of structure in data, reflecting users' desires. This talk will introduce a new framework for systematically generating multiple good-quality clusterings of a single data set. Our approach decomposes this problem into two components: (a) generating many high-quality partitions, and then (b) grouping these partitions to obtain k representative partitions. The decomposition makes the approach extremely modular and allows us to optimize the quality and non-redundancy that control the choice of alternative partitions.
In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data, which emerge in many applications where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more. When dealing with probabilistic data, there are two thresholds involved, the score threshold and the probability threshold, and one must monitor both simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms significantly outperform the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several real, large datasets.
We present an algorithm for coupling inexpensive low-fidelity model simulations with high-fidelity simulation data of parameterized differential equations. The goal is to grab a "free lunch": simulation accuracy of the high-fidelity model with algorithmic complexity of only a simplified low-fidelity model. The procedure forms an approximation with sparsely available high-fidelity simulations, and is a simple linear algebraic construction with connections to kernel learning and column skeletonization of matrices. We discuss theoretical results that establish accuracy guarantees, and introduce recent analysis providing bounds on the error committed by the algorithm. This is joint work with Alireza Doostan, Hillary Fairbanks, and Jerrad Hampton (UC Boulder); Claude Gittelson (ETH Zurich); Dongbin Xiu (Ohio State); and Xueyu Zhu (U Iowa).