Avishek Saha
PhD student. First employment: Yahoo Labs.

Book Chapter  Journal  Conference  Workshop  Tech Report]



  • NDAV: N-dimensional data analysis and visualization analysis for the National Ignition Campaign. (Project Website)
    By Peer-Timo Bremer,   Dan Maljovec,   Avishek Saha,   Bei Wang,   Jim Gaffney,   Brian K. Spears,   Valerio Pascucci,   
    Vol.17, Pages 1-18, Computat. and Visualiz. in Science (COMVIS 2015),  2015.
  • 2012

  • Fast Multiple Kernel Learning With Multiplicative Weight Updates (Project Website)
    By John Moeller,   Parasaran Raman,   Avishek Saha,   Suresh Venkatasubramanian,   
    Vol.abs/1206.5580, CoRR (CORR 2012),  2012.
  • Conference


  • A Geometric Algorithm for Scalable Multiple Kernel Learning. (Project Website)
    By John Moeller,   Parasaran Raman,   Suresh Venkatasubramanian,   Avishek Saha,   
    In Proceedings of AISTATS (AISTATS 2014),  pages 633-642,  2014.
  • 2012

  • Efficient Protocols for Distributed Classification and Optimization
    By Hal Daume III,    Jeff M. Phillips,    Avishek Saha,    and Suresh Venkatasubramanian
    In Proceedings of 23rd International Conference on Algorithmic Learning Theory (ALT),  pages ??-??,  October,  2012.

    In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daum ́e III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d2 log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd2 log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.

  • Protocols for Learning Classifiers on Distributed Data
    By Hal Daume III,    Jeff M. Phillips,    Avishek Saha,    Suresh Venkatasubramanian
    In Proceedings of 15th Interntational Conference on Artificial Intelligence and Statistics (AISTATS 2012),  pages ??-??,  La Palma, Canary Islands,  April,  2012.

    We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

  • 2011

  • Active Supervised Domain Adaptation
    By Avishek Saha,    Piyush Rai,    Hal Daumé III,    Suresh Venkatasubramanian,    Scott L. DuVall
    In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2011),  pages ??-??,  Athens, Greece,  September,  2011.

    In this paper, we harness the synergy between two important learning paradigms, namely, active learning and domain adaptation. We show how active learning in a target domain can leverage information from a different but related source domain. Our proposed framework, Active Learning Domain Adapted (Alda), uses source domain knowledge to transfer information that facilitates active learning in the target domain. We propose two variants of Alda: a batch B-Alda and an online O-Alda. Empirical comparisons with numerous baselines on real-world datasets establish the efficacy of the proposed methods.

  • Online Learning of Multiple Tasks and Their Relationships
    By Hal Daume III,    Piyush Rai,    Avishek Saha,    Suresh Venkatasubramanian
    In Proceedings of Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011),  pages 643-651,  Ft. Lauderdale, FL, USA,  April,  2011.

    In this paper, we propose an online multitask learning framework where the weight vectors are updated in an adaptive fashion based on inter-task relatedness. Our work is in contrast with earlier work on online multitask learning where the authors use a fixed interaction matrix of tasks to derive (fixed) update rules for all the tasks. In this work, we propose to update this interaction matrix itself in an adaptive fashion so that the weight vector updates are no longer fixed but are instead adaptive. Our framework can be extended to an active learning setting where the informativeness of an incoming instance across all the tasks can be evaluated using this adaptive interaction matrix. Empirical results on standardized datasets show improved performance in terms of accuracy, label complexity and number of mistakes made.

  • 2009

  • Metric Functional Dependencies
    By Nick Koudas,    Avishek Saha,    Divesh Srivastava,    Suresh Venkatasubramanian
    In Proceedings of 25th International Conference on Data Engineering (ICDE 2009),  pages 1275-1278,  Shanghai, China,  March,  2009.

    When merging data from various sources, it is often the case that small variations in data format and interpretation cause traditional functional dependencies (FDs) to be violated, without there being an intrinsic violation of semantics. Examples include differing address formats, or different reported latitude/longitudes for a given address. In such cases, we would like to specify a dependency structure on the merged data that is robust to such small differences.In this paper, we define metric functional dependencies, which strictly generalize traditional FDs by allowing small differences (controlled by a metric) in values of the consequent attribute of an FD. We show that this notion satisfies many of the standard properties of functional dependencies, and we present efficient algorithms for the verification problem: determining whether a given metric FD (MFD) holds for a given relation. We show that MFDs can be combined with approximate FDs, allowing tuples with identical antecedents to map to different consequents, some of which correspond to small (acceptable) variations, with others indicating more serious data quality issues. We experimentally demonstrate the validity and efficiency of our approach on various data sets that possess different underlying metrics, and lie in multidimensional spaces.

  • Workshop


  • Domain Adaptation Meets Active Learning
    by Piyush Rai,    Avishek Saha,    Hal Daume III,    Suresh Venkatasubramanian
    In Proceedings of the Workshop on Active Learning For NLP, in conjunction with the NAACL-HLT (ALNLP 2010), pages 27-32, Los Angeles, CA, USA,  June,  2010.  ACL.

    In this work, we show how active learning in some (target) domain can leverage information from a different but related source) domain. We present an algorithm that harnesses the source domain data to learn the best possible initializer hypothesis for doing active learning in the target domain, resulting in improved label complexity. We also present a variant of this algorithm which additionally uses the domain divergence information to selectively query the most informative points in the target domain, leading to further reductions in label complexity. Experimental results on a variety of datasets establish the efficacy of the proposed methods.