Efficient Second-Order Optimization for Deep Learning with Kernel Machines

29 Sep 2021 · Yawen Chen, Zeyi Wen, Yile Chen, Jian Chen, Jin Huang

Second-order optimization has recently been explored in neural network training. However, recomputing the Hessian matrix during second-order optimization imposes a heavy extra computation and memory burden on training. There have been attempts to address this issue by approximating the Hessian matrix, which unfortunately degrades the performance of the neural models. To address the issue, we propose Kernel Stochastic Gradient Descent (Kernel SGD), which projects the optimization problem into a transformed space with the Hessian matrix of kernel machines. Kernel SGD eliminates the recomputation of the Hessian matrix and requires much less memory, which can be controlled via the mini-batch size. A further advantage of Kernel SGD is that, according to our theoretical analysis, it is guaranteed to converge and tends to converge to better solutions. Experimental results on tabular, image and text data show that Kernel SGD converges up to 30 times faster than existing second-order optimization techniques and also generalizes remarkably well.
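The paper's exact update rule is not reproduced on this page, but the core idea can be illustrated with a toy sketch: for a kernel machine f(x) = Σ_j α_j k(x_j, x) with a squared loss, the curvature matrix on a mini-batch depends only on the data (the batch Gram matrix), so it never needs to be recomputed as the parameters change, and its size is controlled by the mini-batch size. The NumPy code below is a minimal, hypothetical illustration under those assumptions (RBF kernel, squared loss, a ridge term, and function names of our own choosing); it is not the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gram matrix k(a, b) = exp(-gamma * ||a - b||^2) between rows of A and rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_sgd_sketch(X, y, epochs=20, batch=32, lr=1.0, lam=1e-2, gamma=0.5, seed=0):
    """Mini-batch, curvature-preconditioned updates for f(x) = sum_j alpha_j k(x_j, x).

    Illustrative sketch only: the curvature matrix used here (the batch Gram matrix
    plus a ridge term) depends only on the data, not on alpha, so it is never
    recomputed as alpha changes, and its size (batch x batch) is set by the
    mini-batch size.
    """
    n = len(X)
    alpha = np.zeros(n)                                   # dual coefficients of the kernel machine
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            K_bn = rbf_kernel(X[idx], X, gamma)           # batch rows of the kernel matrix
            resid = K_bn @ alpha - y[idx]                 # squared-loss residual on the batch
            K_bb = K_bn[:, idx]                           # batch Gram matrix (curvature proxy)
            step = np.linalg.solve(K_bb + lam * np.eye(len(idx)), resid)
            alpha[idx] -= lr * step                       # curvature-preconditioned update
    return alpha

# Toy usage: fit a noisy sine curve and report the training MSE.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(256, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(256)
    alpha = kernel_sgd_sketch(X, y)
    pred = rbf_kernel(X, X) @ alpha
    print("train MSE:", np.mean((pred - y) ** 2))
```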

