Conference paper (in proceedings)

Histosketch: fast similarity-preserving sketching of streaming histograms with concept drift

Show more…
Published in:
  • 2017 IEEE International Conference on Data Mining (ICDM). - 2017, p. 545–554
English Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is a challenging task for streaming data, where the elements of a histogram are observed in a streaming manner. First, the ever-growing cardinality of histogram elements makes any similarity computation inefficient. Second, the concept-drift issue in the data streams also impairs the accurate assessment of the similarity. In this paper, we propose to overcome the above challenges with HistoSketch, a fast similarity-preserving sketching method for streaming histograms with concept drift. Specifically, HistoSketch is designed to incrementally maintain a set of compact and fixed-size sketches of streaming histograms to approximate similarity between the histograms, with the special consideration of gradually forgetting the outdated histogram elements. We evaluate HistoSketch on multiple classification tasks using both synthetic and real-world datasets. The results show that our method is able to efficiently approximate similarity for streaming histograms and quickly adapt to concept drift. Compared to full streaming histograms gradually forgetting the outdated histogram elements, HistoSketch is able to dramatically reduce the classification time (with a 7500x speedup) with only a modest loss in accuracy (about 3.5%).
Faculté des sciences et de médecine
Département d'Informatique
  • English
Computer science and technology
License undefined
Persistent URL

Document views: 32 File downloads:
  • cud_hsf.pdf: 136