Efficient Second-Order Optimization for Deep Learning with Kernel Machines

29 Sep 2021 · Yawen Chen, Zeyi Wen, Yile Chen, Jian Chen, Jin Huang

Second-order optimization has recently been explored in neural network training. However, recomputing the Hessian matrix during second-order optimization imposes a heavy extra computation and memory burden on training. There have been attempts to address this issue by approximating the Hessian matrix, which unfortunately degrades the performance of the neural models. To address the issue, we propose Kernel Stochastic Gradient Descent (Kernel SGD), which projects the optimization problem into a transformed space with the Hessian matrix of kernel machines. Kernel SGD eliminates the recomputation of the Hessian matrix and requires much less memory, which can be controlled via the mini-batch size. A further advantage of Kernel SGD is that, according to our theoretical analysis, it is guaranteed to converge and tends to converge to better solutions. Experimental results on tabular, image and text data show that Kernel SGD converges up to 30 times faster than existing second-order optimization techniques and also generalizes remarkably well.
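The paper's exact update rule is not reproduced on this page, but the core idea can be illustrated with a toy sketch: for a kernel machine f(x) = Σ_j α_j k(x_j, x) with a squared loss, the curvature matrix on a mini-batch depends only on the data (the batch Gram matrix), so it never needs to be recomputed as the parameters change, and its size is controlled by the mini-batch size. The NumPy code below is a minimal, hypothetical illustration under those assumptions (RBF kernel, squared loss, a ridge term, and function names of our own choosing); it is not the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gram matrix k(a, b) = exp(-gamma * ||a - b||^2) between rows of A and rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_sgd_sketch(X, y, epochs=20, batch=32, lr=1.0, lam=1e-2, gamma=0.5, seed=0):
    """Mini-batch, curvature-preconditioned updates for f(x) = sum_j alpha_j k(x_j, x).

    Illustrative sketch only: the curvature matrix used here (the batch Gram matrix
    plus a ridge term) depends only on the data, not on alpha, so it is never
    recomputed as alpha changes, and its size (batch x batch) is set by the
    mini-batch size.
    """
    n = len(X)
    alpha = np.zeros(n)                                   # dual coefficients of the kernel machine
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            K_bn = rbf_kernel(X[idx], X, gamma)           # batch rows of the kernel matrix
            resid = K_bn @ alpha - y[idx]                 # squared-loss residual on the batch
            K_bb = K_bn[:, idx]                           # batch Gram matrix (curvature proxy)
            step = np.linalg.solve(K_bb + lam * np.eye(len(idx)), resid)
            alpha[idx] -= lr * step                       # curvature-preconditioned update
    return alpha

# Toy usage: fit a noisy sine curve and report the training MSE.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(256, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(256)
    alpha = kernel_sgd_sketch(X, y)
    pred = rbf_kernel(X, X) @ alpha
    print("train MSE:", np.mean((pred - y) ** 2))
```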

