An Efficient and Reliable Tolerance-Based Algorithm for Principal Component Analysis

29 Sep 2021  ·  Michael Yeh, Ming Gu

Principal component analysis (PCA) is an important method for dimensionality reduction in data science and machine learning. However, it is expensive for large matrices when only a few principal components are needed. Existing fast PCA algorithms typically assume the user will supply the number of components, but in practice this number may not be known beforehand. It is therefore important to have fast PCA algorithms that depend instead on an accuracy tolerance. For $m\times n$ matrices in which a few principal components explain most of the variance in the data, we develop one such algorithm that runs in $O(mnl)$ time, where $l\ll \min(m,n)$ is a small multiple of the number of principal components. We provide approximation error bounds that are within a constant factor of optimal and demonstrate the algorithm's utility on data from a variety of applications.
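To make the tolerance-driven idea concrete, here is a minimal sketch of one standard way such an algorithm can be structured: an adaptive randomized range finder that grows the sketch size $l$ in blocks until the relative residual falls below a user-supplied tolerance, then recovers the principal components from an SVD of the small $l\times n$ projection. This is an illustrative approximation in the same spirit, not the paper's actual algorithm; the function name, block size, and stopping rule are assumptions, and mean-centering of the data is assumed to have been done beforehand.

```python
import numpy as np

def tolerance_pca(A, tol=0.1, block=8, max_rank=None, seed=0):
    """Approximate PCA of A (m x n) to a relative Frobenius-norm
    tolerance, without knowing the number of components in advance.
    Illustrative sketch only; assumes A is already mean-centered."""
    m, n = A.shape
    max_rank = max_rank or min(m, n)
    rng = np.random.default_rng(seed)
    norm_A = np.linalg.norm(A, "fro")
    Q = np.empty((m, 0))
    while Q.shape[1] < max_rank:
        # Sample a block of random directions through A and remove
        # the part already captured by the current basis Q.
        Y = A @ rng.standard_normal((n, block))
        Y -= Q @ (Q.T @ Y)
        Qb, _ = np.linalg.qr(Y)
        Q = np.hstack([Q, Qb])
        # Stop once Q captures A to within the requested tolerance.
        resid = np.linalg.norm(A - Q @ (Q.T @ A), "fro")
        if resid <= tol * norm_A:
            break
    # An SVD of the small projected matrix B = Q^T A (l x n, cheap
    # since l << min(m, n)) yields the approximate components.
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, S, Vt
```

Each iteration costs $O(mnl)$ work dominated by the matrix products with $A$, matching the complexity quoted in the abstract when the loop terminates after a small multiple of the true number of components.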
