Mina Ghashami
PhD student

[Book Chapter | Journal | Conference | Workshop | Tech Report]



  • Improved Practical Matrix Sketching with Guarantees
    By Mina Ghashami,    Amey Desai,    and Jeff M. Phillips
    Vol. 28, pages 1678--1690, IEEE Transactions on Knowledge and Data Engineering (TKDE),  January 2015.

    Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic, iSVD, with no guarantees, tends to outperform all known approaches in terms of size/error trade-off. We modify the best performing method with guarantees, FrequentDirections, under the size/error trade-off to match the performance of iSVD and retain its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. In comparing techniques in the time/error trade-off, techniques based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain the error guarantee but obtain dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings.
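    The iSVD heuristic discussed above can be rendered in a few lines of numpy. This is an illustrative sketch of the general recipe (append each incoming row to the current rank-l factorization, recompute the SVD, truncate back to rank l), not the paper's exact implementation; function and parameter names are mine.

    ```python
    import numpy as np

    def isvd_sketch(rows, l, d):
        """Heuristic iSVD: maintain a rank-l sketch by append/SVD/truncate.

        Carries no error guarantee -- this is the heuristic the paper
        shows can fail badly on adversarial inputs.
        """
        B = np.zeros((0, d))
        for a in rows:
            stacked = np.vstack([B, a])           # append the new row
            U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
            B = np.diag(s[:l]) @ Vt[:l]           # truncate back to rank l
        return B

    # usage: sketch a random 100 x 20 matrix with l = 5
    A = np.random.randn(100, 20)
    B = isvd_sketch(A, l=5, d=20)
    ```

    Each step costs one SVD of an (l+1) x d matrix, which is why iSVD is competitive on size/error but not on time.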

  • Frequent Directions: Simple and Deterministic Matrix Sketching.
    By Mina Ghashami,    Edo Liberty,    Jeff M. Phillips and David P. Woodruff
    SIAM Journal on Computing (SICOMP),  January 2015.

    We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented an arbitrary input matrix A ∈ R^{n×d} one row at a time. It performs O(dl) operations per row and maintains a sketch matrix B ∈ R^{l×d} such that for any k < l, ||A^T A − B^T B||_2 ≤ ||A − A_k||_F^2 / (l − k) and ||A − π_{B_k}(A)||_F^2 ≤ (1 + k/(l − k)) ||A − A_k||_F^2. Here, A_k stands for the rank-k matrix minimizing ||A − A_k||_F over all rank-k matrices (similarly B_k), and π_{B_k}(A) is the rank-k matrix resulting from projecting A onto the row span of B_k. We show both of these bounds are the best possible for the space allowed. The summary is mergeable, and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations of existing streaming algorithms in the space-error tradeoff.
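    A compact numpy rendering of the algorithm described above, in its common presentation with a 2l-row buffer that is shrunk by the l-th squared singular value whenever it fills; this is a sketch under the assumption d ≥ 2l, with parameter names of my choosing.

    ```python
    import numpy as np

    def frequent_directions(rows, l, d):
        """Frequent Directions with a 2l-row buffer (assumes d >= 2l).

        Guarantees ||A^T A - B^T B||_2 <= ||A||_F^2 / l.
        """
        B = np.zeros((2 * l, d))
        for a in rows:
            zero_rows = np.flatnonzero(~B.any(axis=1))
            if len(zero_rows) == 0:
                # buffer full: shrink by the l-th squared singular value,
                # zeroing out at least half the rows of B
                U, s, Vt = np.linalg.svd(B, full_matrices=False)
                s = np.sqrt(np.maximum(s**2 - s[l - 1]**2, 0.0))
                B = np.diag(s) @ Vt
                zero_rows = np.flatnonzero(~B.any(axis=1))
            B[zero_rows[0]] = a
        return B
    ```

    Each shrink reduces the squared Frobenius norm of B by at least l times the shrink value, which is what drives the ||A||_F^2 / l spectral-norm guarantee.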

  • 2013

  • Relative Errors for Deterministic Low-Rank Matrix Approximations. (Project Website)
    By Mina Ghashami and Jeff M. Phillips
    CoRR abs/1307.7454, 2013.
  • Conference


  • Efficient Frequent Directions Algorithm for Sparse Matrices
    By Mina Ghashami,    Edo Liberty,    and Jeff M. Phillips
    In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD 2016),  pages ??-??,  February,  2016.

    This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ R^{n×d} one by one in the streaming setting and compute a small sketch B ∈ R^{l×d}. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in O(ndl) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in Õ(nnz(A)·l + n·l^2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.
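    A much-simplified sketch of the batching idea behind the abstract above (my own illustration, not the paper's algorithm): buffer incoming sparse rows until a nonzero budget is hit, then fold the whole buffer into the sketch with one shrink step. The paper additionally replaces the exact dense SVD used below with a fast approximate SVD, which is what yields the Õ(nnz(A)·l + n·l^2) runtime.

    ```python
    import numpy as np

    def sparse_fd(rows, l, d, nnz_budget):
        """Simplified Sparse-FD-style sketch: buffer rows, merge in batches.

        Uses an exact dense SVD where the paper uses a fast approximate
        SVD; keeps the batching structure, not the asymptotic runtime.
        Assumes d > l.
        """
        B = np.zeros((l, d))

        def merge(B, batch):
            stacked = np.vstack([B] + batch)
            U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
            # shrink by the (l+1)-th squared singular value, leaving
            # at most l nonzero rows
            delta = s[l] ** 2 if len(s) > l else 0.0
            s = np.sqrt(np.maximum(s**2 - delta, 0.0))
            return (np.diag(s) @ Vt)[:l]

        buffer, nnz = [], 0
        for a in rows:
            buffer.append(a)
            nnz += np.count_nonzero(a)
            if nnz >= nnz_budget:
                B = merge(B, buffer)
                buffer, nnz = [], 0
        if buffer:
            B = merge(B, buffer)
        return B
    ```

    Because each merge reduces the squared Frobenius norm by at least (l+1) times its shrink value, the accumulated spectral error stays below ||A||_F^2 / (l+1), mirroring the usual Frequent Directions analysis.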

  • 2015

  • Streaming Kernel Principal Component Analysis
    By Mina Ghashami,    Daniel Perry,    and Jeff M. Phillips
    In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS),  pages ??-??,  December,  2015.

    Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full n×n kernel matrix is constructed over n data points, but this requires too much space and time for large values of n. Techniques such as the Nyström method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and also improves the dependence on the error parameter. Our technique combines random feature maps with recent advances in matrix sketching; it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.
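    The first of the two ingredients named above, a random feature map, can be sketched as follows: each point is lifted through random Fourier features approximating the Gaussian kernel, and the lifted rows can then be handed to any row-streaming matrix sketcher (such as Frequent Directions). This is my own illustration with made-up parameter names, not the paper's construction.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def make_rff(d, m, sigma=1.0):
        """Random Fourier feature map approximating the Gaussian kernel
        k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
        W = rng.normal(scale=1.0 / sigma, size=(m, d))
        b = rng.uniform(0, 2 * np.pi, size=m)
        return lambda x: np.sqrt(2.0 / m) * np.cos(W @ x + b)

    # lift two points and compare the feature inner product to the kernel
    phi = make_rff(d=5, m=2000)
    x, y = rng.normal(size=5), rng.normal(size=5)
    approx = phi(x) @ phi(y)                      # ~ k(x, y)
    exact = np.exp(-np.sum((x - y) ** 2) / 2)
    ```

    The inner product of two feature vectors is an unbiased estimate of the kernel value, with variance shrinking as 1/m; streaming the phi(x) rows into a matrix sketch is what keeps the space small.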

  • 2014

  • Continuous Matrix Approximation on Distributed Data (Project Website), Talk
    By Mina Ghashami,    Jeff M. Phillips,    and Feifei Li
    In Proceedings of the 40th International Conference on Very Large Data Bases (VLDB 2014),  pages 809-820,  Hangzhou, China,  2014.

    Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of “tracking approximations to a matrix” in the distributed streaming model. In this model, there are m distributed sites, each observing a distinct stream of data (where each element is a row of a distributed matrix) and holding a communication channel with a coordinator, and the goal is to track an ε-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: ||Ax||^2 − ||Bx||^2 ≤ ε||A||_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/ε) log(βN)) communication, where N is the size of the stream (at the time of the query) and β is an upper bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.
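    A toy simulation of the general site/coordinator pattern (class names and the trigger rule are mine, not the paper's protocol): each site buffers rows and ships its batch to the coordinator only when the buffered squared Frobenius mass exceeds an ε/m fraction of the total norm it last heard from the coordinator, so the combined unsent mass across all m sites stays below an ε fraction of the total.

    ```python
    import numpy as np

    class Coordinator:
        """Toy coordinator: accumulates shipped rows and tracks ||A||_F^2."""
        def __init__(self):
            self.rows, self.frob_sq = [], 0.0

        def collect(self, batch):
            self.rows.extend(batch)
            self.frob_sq += sum(r @ r for r in batch)

    class Site:
        """Toy distributed-tracking site: buffer rows, flush on a norm trigger."""
        def __init__(self, eps, m):
            self.eps, self.m = eps, m
            self.buffer, self.mass = [], 0.0
            self.total_seen = 0.0     # coordinator's norm at our last flush

        def receive(self, row, coordinator):
            self.buffer.append(row)
            self.mass += row @ row
            # flush once local unsent mass reaches an eps/m share of the
            # last known global squared norm
            if self.mass >= (self.eps / self.m) * max(self.total_seen, 1.0):
                coordinator.collect(self.buffer)
                self.total_seen = coordinator.frob_sq
                self.buffer, self.mass = [], 0.0
    ```

    By construction each site's unsent mass stays below (ε/m)·max(||A||_F^2, 1), so summing over sites bounds the coordinator's total deficit by ε·||A||_F^2, the shape of the guarantee in the abstract; the real protocol achieves this with far less communication via geometrically spaced thresholds.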

  • Relative Errors for Deterministic Low-Rank Matrix Approximations. (Project Website)
    By Mina Ghashami and Jeff M. Phillips
    In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2014),  pages 707-717,  2014.