Yan Zheng | |
[ Book Chapter Journal Conference Workshop Tech Report]
This project developed new data summaries for kernel regression ā these have not been formally studied before. These approaches are related to kernel density estimates, for where there are several effective summaries, but only model density of the data, not an associated weight. We produce both sample complexity results as well as deterministic approaches which adapt to the structure of the data. We demonstrate these approaches on a variety of 1-dimensional (e.g., time series), 2-dimensional (e.g., spatial) and high-dimensional data sets. It can summarize sets of enormous size down to a few megabytes with minimal loss of accuracy.
Kernel density estimates are important for a broad variety of applications including media databases, pattern recognition, computer vision, data mining, and the sciences. Their con- struction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more ef- ficient than previous algorithms. Our algorithms do not re- quire knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to imple- ment our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scala- bility of our techniques.