2 code implementations • 8 May 2023 • Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, Torsten Hoefler
Gradient preconditioning is a key technique for integrating second-order information into gradients, improving and extending gradient-based learning algorithms.
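For orientation, a minimal NumPy sketch of the general idea of gradient preconditioning, here using a diagonal empirical Fisher estimate as the preconditioner; the toy problem, constants, and variable names are illustrative and not the interface proposed in the paper.

```python
# Illustrative only: precondition the gradient with a diagonal curvature
# estimate (diagonal empirical Fisher) instead of using the raw gradient.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 10)), rng.normal(size=128)
w = np.zeros(10)

def per_example_grads(w):
    # Gradient of the squared error for each example: (x.w - y) * x
    resid = X @ w - y
    return resid[:, None] * X                      # shape (n, d)

lr, damping = 0.1, 1e-3
for _ in range(100):
    g_all = per_example_grads(w)
    grad = g_all.mean(axis=0)
    fisher_diag = (g_all ** 2).mean(axis=0)        # diagonal second-order estimate
    w -= lr * grad / (fisher_diag + damping)       # preconditioned update
```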
1 code implementation • 25 Nov 2022 • Kazuki Osawa, Shigang Li, Torsten Hoefler
Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters.
no code implementations • 6 Oct 2022 • Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa
In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of gradient regularization (GR).
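As a hedged illustration of that idea (the toy quadratic loss, step sizes, and symbols below are my own choices, not taken from the paper): the GR penalty (gamma/2)*||grad L||^2 contributes a Hessian-vector product to the update, which can be estimated by comparing the gradient at the current point with the gradient after a small ascent step along it.

```python
# Minimal sketch of gradient regularization (GR) via finite differences:
# H @ g is approximated from one extra gradient evaluation at an ascended point.
import numpy as np

A = np.diag([1.0, 10.0])                 # toy quadratic loss L(x) = 0.5 * x^T A x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
lr, gamma, eps = 0.05, 0.1, 1e-3         # learning rate, GR strength, FD step size

for _ in range(200):
    g = grad(x)                           # descent gradient at x
    g_ascent = grad(x + eps * g)          # gradient after a small ascent step along g
    hvp = (g_ascent - g) / eps            # finite-difference estimate of H @ g
    x -= lr * (g + gamma * hvp)           # update for L(x) + (gamma/2)*||grad L(x)||^2
```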
no code implementations • 20 Sep 2022 • Maciej Besta, Patrick Iff, Florian Scheidl, Kazuki Osawa, Nikoli Dryden, Michal Podstawski, Tiancheng Chen, Torsten Hoefler
In general, LPG2vec combines the predictive power of the most powerful GNNs with the full scope of information encoded in the LPG model, paving the way for neural graph databases: a class of systems in which the complexity of the maintained data can benefit from modern and future graph machine learning methods.
1 code implementation • 14 Sep 2022 • Shigang Li, Kazuki Osawa, Torsten Hoefler
We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores.
1 code implementation • NeurIPS 2020 • Ryo Karakida, Kazuki Osawa
In this work, we reveal that, under specific conditions, natural gradient descent (NGD) with approximate Fisher information achieves the same fast convergence to global minima as exact NGD.
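For reference only (the least-squares toy problem and damping constant below are my own assumptions, not the paper's analysis setting), the generic NGD update replaces the plain gradient with F^{-1} grad, where F may be an approximate Fisher matrix:

```python
# Illustrative natural gradient descent (NGD) on a toy least-squares problem,
# using an empirical Fisher built from per-example gradients plus damping.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
w = np.zeros(5)
lr, damping = 1.0, 1e-2

for _ in range(50):
    resid = X @ w - y
    g_all = resid[:, None] * X                        # per-example gradients, (n, d)
    grad = g_all.mean(axis=0)
    fisher = g_all.T @ g_all / len(y)                 # empirical Fisher approximation
    nat_grad = np.linalg.solve(fisher + damping * np.eye(5), grad)
    w -= lr * nat_grad                                # NGD step: w <- w - lr * F^{-1} grad
```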
1 code implementation • 13 Feb 2020 • Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Chuan-Sheng Foo, Rio Yokota
Large-scale distributed training of deep neural networks produces models with worse generalization performance, a consequence of the increase in the effective mini-batch size.
1 code implementation • NeurIPS 2019 • Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota, Mohammad Emtiyaz Khan
Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted.
3 code implementations • CVPR 2019 • Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, Satoshi Matsuoka
Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size.