What is the distribution of the number of unique original items in a bootstrap sample?

18 Feb 2016  ·  Alex F. Mendelson, Maria A. Zuluaga, Brian F. Hutton, Sébastien Ourselin ·

Sampling with replacement occurs in many settings in machine learning, notably in the bagging ensemble technique and the .632+ validation scheme. The number of unique original items in a bootstrap sample can have an important role in the behaviour of prediction models learned on it. Indeed, there are uncontrived examples where duplicate items have no effect. The purpose of this report is to present the distribution of the number of unique original items in a bootstrap sample clearly and concisely, with a view to enabling other machine learning researchers to understand and control this quantity in existing and future resampling techniques. We describe the key characteristics of this distribution along with the generalisation for the case where items come from distinct categories, as in classification. In both cases we discuss the normal limit, and conduct an empirical investigation to derive a heuristic for when a normal approximation is permissible.

PDF Abstract
No code implementations yet. Submit your code now

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here