Distinct Sampling on Streaming Data with Near-Duplicates

29 Oct 2018 · Jiecao Chen, Qin Zhang ·

In this paper we study how to perform distinct sampling in the streaming model where data contain near-duplicates. The goal of distinct sampling is to return a distinct element uniformly at random from the universe of elements, given that all the near-duplicates are treated as the same element. We also extend the result to the sliding window cases in which we are only interested in the most recent items. We present algorithms with provable theoretical guarantees for datasets in the Euclidean space, and also verify their effectiveness via an extensive set of experiments.

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Datasets

Add Datasets introduced or used in this paper

Edit Social Preview

Distinct Sampling on Streaming Data with Near-Duplicates

Code Edit Add Remove Mark official

Categories

Datasets Edit

Code

Add Remove Mark official

Datasets