In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daum ́e III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d2 log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd2 log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.
We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.
In this paper, we harness the synergy between two important learning paradigms, namely, active learning and domain adaptation. We show how active learning in a target domain can leverage information from a different but related source domain. Our proposed framework, Active Learning Domain Adapted (Alda), uses source domain knowledge to transfer information that facilitates active learning in the target domain. We propose two variants of Alda: a batch B-Alda and an online O-Alda. Empirical comparisons with numerous baselines on real-world datasets establish the efficacy of the proposed methods.
In this paper, we propose an online multitask learning framework where the weight vectors are updated in an adaptive fashion based on inter-task relatedness. Our work is in contrast with earlier work on online multitask learning where the authors use a fixed interaction matrix of tasks to derive (fixed) update rules for all the tasks. In this work, we propose to update this interaction matrix itself in an adaptive fashion so that the weight vector updates are no longer fixed but are instead adaptive. Our framework can be extended to an active learning setting where the informativeness of an incoming instance across all the tasks can be evaluated using this adaptive interaction matrix. Empirical results on standardized datasets show improved performance in terms of accuracy, label complexity and number of mistakes made.
When merging data from various sources, it is often the case that small variations in data format and interpretation cause traditional functional dependencies (FDs) to be violated, without there being an intrinsic violation of semantics. Examples include differing address formats, or different reported latitude/longitudes for a given address. In such cases, we would like to specify a dependency structure on the merged data that is robust to such small differences.In this paper, we define metric functional dependencies, which strictly generalize traditional FDs by allowing small differences (controlled by a metric) in values of the consequent attribute of an FD. We show that this notion satisfies many of the standard properties of functional dependencies, and we present efficient algorithms for the verification problem: determining whether a given metric FD (MFD) holds for a given relation. We show that MFDs can be combined with approximate FDs, allowing tuples with identical antecedents to map to different consequents, some of which correspond to small (acceptable) variations, with others indicating more serious data quality issues. We experimentally demonstrate the validity and efficiency of our approach on various data sets that possess different underlying metrics, and lie in multidimensional spaces.
In this work, we show how active learning in some (target) domain can leverage information from a different but related source) domain. We present an algorithm that harnesses the source domain data to learn the best possible initializer hypothesis for doing active learning in the target domain, resulting in improved label complexity. We also present a variant of this algorithm which additionally uses the domain divergence information to selectively query the most informative points in the target domain, leading to further reductions in label complexity. Experimental results on a variety of datasets establish the efficacy of the proposed methods.