Towards General Robustness to Bad Training Data

29 Sep 2021 · Tianhao Wang, Yi Zeng, Ming Jin, Ruoxi Jia

In this paper, we focus on the problem of identifying bad training data when the underlying cause is unknown in advance. Our key insight is that, regardless of how bad data are generated, they tend to contribute little to training a model with good prediction performance or, more generally, to some utility function of the data analyst. We formulate the problem of good/bad data selection as utility optimization. We propose a theoretical framework for evaluating the worst-case performance of data selection heuristics. Remarkably, our results show that the popular heuristic based on the Shapley value may choose the worst data subset in certain practical scenarios, which sheds light on its large performance variation observed empirically in past work. We then develop an algorithmic framework, DataSifter, to detect a variety of data issues, including previously unknown ones---a step towards general robustness to bad training data. DataSifter is guided by the theoretically optimal solution to data selection and is made practical by the data utility learning technique. Our evaluation shows that DataSifter matches, and most often significantly improves upon, state-of-the-art performance over a wide range of tasks, including backdoor, poison, and noisy/mislabeled data detection, data summarization, and data debiasing.
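To make the utility-optimization framing concrete, the following is a minimal, hypothetical sketch contrasting the two selection strategies discussed in the abstract: the popular Shapley-value heuristic (estimated here by Monte Carlo permutation sampling) and direct utility maximization via greedy forward selection. The utility function (validation accuracy of a logistic regression model), the dataset, and all function names are illustrative assumptions, not the paper's implementation of DataSifter.

```python
# Illustrative sketch only: Shapley-value data valuation vs. greedy
# utility-maximizing selection. Not the authors' DataSifter code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assumed toy setup: a synthetic binary classification task.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)


def utility(idx):
    """Utility of a training subset: validation accuracy of a model trained on it."""
    idx = list(idx)
    if len(idx) == 0 or len(np.unique(y_train[idx])) < 2:
        return 0.0  # cannot fit a classifier on an empty or single-class subset
    model = LogisticRegression(max_iter=200).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)


def shapley_values(n, n_perms=20):
    """Monte Carlo permutation estimate of each training point's Shapley value."""
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev_u, subset = 0.0, []
        for i in perm:
            subset.append(int(i))
            u = utility(subset)
            values[i] += u - prev_u  # marginal contribution of point i
            prev_u = u
    return values / n_perms


def greedy_utility_selection(n, budget):
    """Greedy forward selection that directly maximizes the utility function."""
    selected = []
    for _ in range(budget):
        gains = [(utility(selected + [i]), i) for i in range(n) if i not in selected]
        best_u, best_i = max(gains)
        selected.append(best_i)
    return selected


n = len(X_train)
sv = shapley_values(n)
top_by_shapley = np.argsort(sv)[-20:]          # keep the 20 highest-valued points
top_by_greedy = greedy_utility_selection(n, budget=20)
print("utility of Shapley-selected subset:", utility(top_by_shapley))
print("utility of greedily selected subset:", utility(top_by_greedy))
```

As the abstract notes, the Shapley heuristic scores points individually and can select a poor subset in the worst case, whereas selection that optimizes the utility of the subset as a whole is the theoretically grounded target that DataSifter approximates via data utility learning.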
