Understanding Bias in Anomaly Detection: A Semi-Supervised View with PAC Guarantees

1 Jan 2021 · Ziyu Ye, Yuxin Chen, Haitao Zheng

Anomaly detection presents a unique challenge in machine learning due to the scarcity of labeled anomaly data. Existing work attempts to mitigate this problem via semi-supervised learning, i.e., augmenting unsupervised anomaly detection models with additional labeled anomaly samples. However, the labeled data often does not align with the target distribution and therefore introduces harmful bias into the trained model. In this paper, we aim to understand the effect of a biased anomaly set on anomaly detection. In particular, we formally state the anomaly detection problem as a semi-supervised learning task. We focus on the anomaly detector's recall at a given false positive rate as the main performance metric. Given two different anomaly score functions, we formally define their difference in performance as the relative scoring bias of the anomaly detectors. We establish the first finite-sample rates for estimating the relative scoring bias in semi-supervised anomaly detection. We then empirically validate our theoretical results on both synthetic and real-world datasets. Furthermore, we provide an extensive empirical study of how a biased training anomaly set affects the anomaly score function and, in turn, the resulting detection performance. Our case study demonstrates scenarios in which a biased anomaly set can be useful, and provides a solid benchmark for future research.
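
To make the abstract's two key quantities concrete, here is a minimal sketch of recall at a fixed false positive rate and of the relative scoring bias between two score functions. This is an illustration under stated assumptions, not the paper's exact estimator: it assumes labels are 0 (normal) / 1 (anomaly), that higher scores mean more anomalous, and that the detection threshold is set empirically as a quantile of the normal scores; the names `recall_at_fpr` and `relative_scoring_bias` are ours, not the paper's notation.

```python
import numpy as np

def recall_at_fpr(scores, labels, target_fpr):
    """Recall of an anomaly scorer at the threshold whose empirical
    false positive rate on normal samples equals target_fpr.

    Assumes labels are 0 (normal) / 1 (anomaly) and that higher
    scores indicate more anomalous points.
    """
    normal_scores = scores[labels == 0]
    # Threshold at the (1 - target_fpr) quantile of the normal scores,
    # so a fraction target_fpr of normal samples is flagged as anomalous.
    threshold = np.quantile(normal_scores, 1.0 - target_fpr)
    anomaly_scores = scores[labels == 1]
    return float(np.mean(anomaly_scores > threshold))

def relative_scoring_bias(scores_f, scores_g, labels, target_fpr=0.05):
    """Difference in recall@FPR between two anomaly score functions,
    following the abstract's definition of relative scoring bias as a
    difference in performance (sketch, not the paper's formal estimator)."""
    return (recall_at_fpr(scores_f, labels, target_fpr)
            - recall_at_fpr(scores_g, labels, target_fpr))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: 1000 normals, 100 anomalies.
    labels = np.r_[np.zeros(1000, dtype=int), np.ones(100, dtype=int)]
    # Scorer f separates anomalies from normals better than scorer g.
    scores_f = np.r_[rng.normal(0, 1, 1000), rng.normal(3, 1, 100)]
    scores_g = np.r_[rng.normal(0, 1, 1000), rng.normal(1, 1, 100)]
    print(relative_scoring_bias(scores_f, scores_g, labels, target_fpr=0.05))
```

On held-out data, a positive value indicates that scorer f achieves higher recall than scorer g at the chosen false positive rate; the paper's finite-sample rates concern how accurately this quantity can be estimated from samples.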
