HistoSketch: fast similarity-preserving sketching of streaming histograms with concept drift

Yang, Dingqi; Li, Bin; Rettig, Laura; Cudré-Mauroux, Philippe

doi:10.1109/ICDM.2017.64

Conference paper (in proceedings)

Yang, Dingqi (University of Fribourg)
Li, Bin (Fudan University, Shanghai)
Rettig, Laura (University of Fribourg)
Cudré-Mauroux, Philippe (University of Fribourg)

18.12.2017

Published in:

2017 IEEE International Conference on Data Mining (ICDM). - 2017, p. 545–554

Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is a challenging task for streaming data, where the elements of a histogram are observed in a streaming manner. First, the ever-growing cardinality of histogram elements makes any similarity computation inefficient. Second, the concept-drift issue in the data streams also impairs the accurate assessment of the similarity. In this paper, we propose to overcome the above challenges with HistoSketch, a fast similarity-preserving sketching method for streaming histograms with concept drift. Specifically, HistoSketch is designed to incrementally maintain a set of compact and fixed-size sketches of streaming histograms to approximate similarity between the histograms, with the special consideration of gradually forgetting the outdated histogram elements. We evaluate HistoSketch on multiple classification tasks using both synthetic and real-world datasets. The results show that our method is able to efficiently approximate similarity for streaming histograms and quickly adapt to concept drift. Compared to full streaming histograms gradually forgetting the outdated histogram elements, HistoSketch is able to dramatically reduce the classification time (with a 7500x speedup) with only a modest loss in accuracy (about 3.5%).
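
To make the idea of a fixed-size, incrementally maintained sketch with gradual forgetting more concrete, here is a minimal Python illustration. It is not the authors' HistoSketch algorithm: it swaps in a simple exponential-race weighted sampler driven by seeded hashes, and it still keeps a dictionary of decayed counts alongside the sketch. The names DecayedSketch, _unit_hash, size and decay are hypothetical and chosen only for this example.

```python
import hashlib
import math


def _unit_hash(element, slot):
    """Deterministically map (element, slot) to a float in (0, 1]."""
    digest = hashlib.sha256(f"{slot}:{element}".encode()).digest()
    n = int.from_bytes(digest[:8], "big")
    return (n + 1) / (2**64 + 1)


class DecayedSketch:
    """Fixed-size sketch of a streaming histogram with exponential forgetting.

    Illustrative assumption: each slot keeps the element that minimises
    -log(u) / weight, where u is a hash of (element, slot).
    """

    def __init__(self, size=64, decay=0.99):
        self.size = size
        self.decay = decay
        self.weights = {}                # decayed count per element
        self.elements = [None] * size    # sampled element per slot
        self.values = [math.inf] * size  # winning race value per slot

    def add(self, element):
        # Forgetting: every weight shrinks by the same factor, so every
        # race value grows by 1/decay and the current winners stay winners.
        for key in self.weights:
            self.weights[key] *= self.decay
        self.values = [v / self.decay for v in self.values]

        # Only the arriving element gains weight, so only it can take a slot.
        weight = self.weights.get(element, 0.0) + 1.0
        self.weights[element] = weight
        for j in range(self.size):
            value = -math.log(_unit_hash(element, j)) / weight
            if value < self.values[j]:
                self.values[j] = value
                self.elements[j] = element

    def similarity(self, other):
        """Fraction of slots on which two equally sized sketches agree."""
        matches = sum(a == b for a, b in zip(self.elements, other.elements))
        return matches / self.size


a, b = DecayedSketch(), DecayedSketch()
for item in "aabac":
    a.add(item)
    b.add(item)
print(a.similarity(b))  # identical streams give 1.0
```

Comparing two such sketches costs a number of operations proportional to the sketch size rather than to the number of distinct histogram elements, and a smaller decay value makes older elements lose their slots sooner after the stream changes.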

Faculty

Faculté des sciences et de médecine

Department

Département d'Informatique

Language

English

Classification

Computer science and technology

License

License undefined

Identifiers

RERO DOC 309000
DOI 10.1109/ICDM.2017.64

Persistent URL

https://folia.unifr.ch/unifr/documents/306533
