Mina Ghashami


By Mina Ghashami, Amey Desai, and Jeff M. Phillips

Vol.28, Pages 1678--1690,

Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic, iSVD, with no guarantees, tends to outperform all known approaches in terms of the size/error trade-off. We modify the best performing method with guarantees under the size/error trade-off, Frequent Directions, to match the performance of iSVD while retaining its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. In comparing techniques in the time/error trade-off, techniques based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain its error guarantee while obtaining dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings.
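For reference, the iSVD heuristic mentioned above can be sketched in a few lines. This is a hedged, minimal rendering of the standard incremental-SVD idea (append each new row to the current sketch, take an SVD, and truncate back to the top l singular directions with no shrinking); that truncation step is what loses the worst-case guarantee:

```python
import numpy as np

def isvd_sketch(rows, l):
    """iSVD heuristic: incrementally maintain a rank-l sketch by
    appending each row and truncating to the top l singular directions.
    Often strong in practice, but with no worst-case error guarantee."""
    d = len(rows[0])
    B = np.zeros((l, d))
    for row in rows:
        stacked = np.vstack([B, row])
        U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
        B = s[:l, None] * Vt[:l]   # truncate: keep top l, no shrinking
    return B
```

Since truncation only discards mass, the sketch's Frobenius norm never exceeds that of the input stream.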

By Mina Ghashami, Edo Liberty, Jeff M. Phillips and David P. Woodruff


We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented with an arbitrary input matrix A in R^{n x d} one row at a time. It performs O(dl) operations per row and maintains a sketch matrix B in R^{l x d} such that for any k < l: ||A^T A - B^T B||_2 <= ||A - A_k||_F^2 / (l - k) and ||A - pi_{B_k}(A)||_F^2 <= (1 + k/(l - k)) ||A - A_k||_F^2. Here, A_k stands for the minimizer of ||A - A_k||_F over all rank-k matrices (similarly B_k), and pi_{B_k}(A) is the rank-k matrix resulting from projecting A onto the row span of B_k. We show both of these bounds are the best possible for the space allowed. The summary is mergeable, and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations of existing streaming algorithms in the space-error tradeoff.
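As a concrete illustration, here is a minimal Python sketch of the Frequent Directions update described above: each incoming row is placed in the zeroed last row of B, and all squared singular values are shrunk by the smallest one. This per-row-SVD variant is the simplest form; the paper's version batches updates to achieve the stated O(dl) amortized cost per row.

```python
import numpy as np

def frequent_directions(rows, l):
    """Minimal Frequent Directions sketch (per-row SVD variant).

    Streams rows of A (each a length-d vector) and maintains an l x d
    sketch B satisfying the covariance error bound from the abstract.
    """
    d = len(rows[0])
    B = np.zeros((l, d))
    for row in rows:
        B[-1] = row                # the last row of B is always zero here
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        # shrink every squared singular value by the smallest one
        s = np.sqrt(np.maximum(s**2 - s[-1]**2, 0.0))
        B = s[:, None] * Vt        # last row becomes zero again
    return B
```

With k = 0, the first bound above specializes to ||A^T A - B^T B||_2 <= ||A||_F^2 / l, which is easy to verify numerically for the returned sketch.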

By Mina Ghashami and Jeff M. Phillips

Vol.abs/1307.7454,

By Mina Ghashami, Edo Liberty, and Jeff M. Phillips

In Proceedings of ACM Conference on Knowledge Discovery and Data Mining

This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A in R^{n x d} one by one in the streaming setting and compute a small sketch B in R^{l x d}. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in O(ndl) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in O~(nnz(A) l + n l^2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.

By Mina Ghashami, Daniel Perry, and Jeff M. Phillips

In Proceedings of International Conference on Artificial Intelligence and Statistics

Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full n x n kernel matrix is constructed over n data points, but this requires too much space and time for large values of n. Techniques such as the Nystrom method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and which also improves the dependence on the error parameter. Our technique combines random feature maps with recent advances in matrix sketching; it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.
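One ingredient named above, random feature maps, can be illustrated in isolation. The sketch below (an assumed minimal form of Rahimi-Recht random Fourier features for the Gaussian kernel; D and gamma are illustrative parameters) maps the data to a low-dimensional feature space whose inner products approximate kernel values. The paper's full method additionally sketches the resulting feature matrix in a stream, which is not shown here:

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Map data X (n x d) to D random Fourier features so that
    Z @ Z.T approximates the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # frequencies sampled from the kernel's Fourier transform: N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

The n x D feature matrix Z replaces the full n x n kernel matrix, so downstream linear methods (including the matrix sketches above) can run in the feature space.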

By Mina Ghashami, Jeff Phillips, and Feifei Li

In Proceedings of 40th International Conference on Very Large Data Bases

Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of "tracking approximations to a matrix" in the distributed streaming model. In this model, there are m distributed sites, each observing a distinct stream of data (where each element is a row of a distributed matrix) and having a communication channel with a coordinator, and the goal is to track an ε-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: ||Ax||^2 - ||Bx||^2 <= ε||A||_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/ε) log(βN)) communication, where N is the size of the stream (at the time of the query) and β is an upper bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.
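The directional guarantee above is equivalent to a spectral-norm bound on the covariance difference: the worst case of ||Ax||^2 - ||Bx||^2 over unit vectors x is the top eigenvalue of A^T A - B^T B. The small illustration below checks this numerically (it is not the paper's protocol; a rank-k truncated SVD is used as a stand-in sketch, for which the error is exactly the (k+1)-th squared singular value):

```python
import numpy as np

def directional_error(A, B):
    """max over unit vectors x of ||Ax||^2 - ||Bx||^2,
    i.e. the top eigenvalue of the symmetric matrix A^T A - B^T B."""
    return np.linalg.eigvalsh(A.T @ A - B.T @ B).max()
```

Dividing this quantity by ||A||_F^2 gives the achieved ε for a candidate sketch B.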

By Mina Ghashami and Jeff M. Phillips

In Proceedings of SODA