no code implementations • 6 Jan 2023 • Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Torsten Hoefler
While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research.
no code implementations • 21 Oct 2021 • Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo, Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça, Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St. John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi Yamazaki, Junqi Yin
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights.
no code implementations • 26 Aug 2020 • Mohamed Wahib, Haoyu Zhang, Truong Thao Nguyen, Aleksandr Drozd, Jens Domke, Lingqi Zhang, Ryousei Takano, Satoshi Matsuoka
An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism.
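The entry above mentions out-of-core methods as an alternative or complement to data parallelism. As a minimal, generic sketch of the out-of-core principle (stream fixed-size blocks of data that do not fit in fast memory through a smaller working set), here is a numpy memory-map example; the array sizes, block size, and file name are illustrative assumptions, and this is not the paper's GPU-memory swapping scheme.

```python
import numpy as np

ROWS, COLS, BLOCK = 10_000, 256, 1_000   # assumed sizes for the sketch

# A matrix kept on disk that we treat as too large to load at once.
a = np.memmap("big_matrix.dat", dtype=np.float32, mode="w+", shape=(ROWS, COLS))
for start in range(0, ROWS, BLOCK):      # fill it block by block as well
    a[start:start + BLOCK] = np.random.rand(BLOCK, COLS)
a.flush()

w = np.random.rand(COLS, 64).astype(np.float32)   # small operand kept in memory
out = np.zeros((ROWS, 64), dtype=np.float32)

# Stream row blocks through memory: only one BLOCK x COLS slice is resident
# at a time, so the peak working-set size is independent of ROWS.
for start in range(0, ROWS, BLOCK):
    block = np.asarray(a[start:start + BLOCK])    # load one block from disk
    out[start:start + BLOCK] = block @ w          # compute on the resident block
```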
1 code implementation • 25 Jul 2020 • Yosuke Oyama, Naoya Maruyama, Nikoli Dryden, Erin McCarthy, Peter Harrington, Jan Balewski, Satoshi Matsuoka, Peter Nugent, Brian Van Essen
We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks.
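One ingredient of hybrid parallelism for 3D CNNs is spatial (domain) decomposition: each worker holds a sub-volume plus a halo copied from its neighbour so convolutions can be evaluated locally at the cut. The single-process numpy/scipy sketch below only illustrates that halo-exchange idea on one axis; it is not the paper's multi-GPU implementation, and the volume and kernel sizes are assumptions.

```python
import numpy as np
from scipy import ndimage as ndi

vol = np.random.rand(16, 16, 16)
kern = np.ones((3, 3, 3)) / 27.0
ref = ndi.convolve(vol, kern, mode="constant")    # full-volume reference

# Split along axis 0 into two "ranks"; each keeps a one-voxel halo copied
# from its neighbour so the 3x3x3 stencil is exact at the cut plane.
top, bottom = vol[:8], vol[8:]
top_ext = np.concatenate([top, bottom[:1]], axis=0)       # halo from below
bottom_ext = np.concatenate([top[-1:], bottom], axis=0)   # halo from above

out_top = ndi.convolve(top_ext, kern, mode="constant")[:8]
out_bottom = ndi.convolve(bottom_ext, kern, mode="constant")[1:]

# The stitched result matches the convolution of the whole volume.
assert np.allclose(np.concatenate([out_top, out_bottom], axis=0), ref)
```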
no code implementations • 9 Apr 2020 • Artur Podobas, Kentaro Sano, Satoshi Matsuoka
With the end of both Dennard scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy.
Hardware Architecture
1 code implementation • 14 Feb 2020 • Hamid Reza Zohouri, Artur Podobas, Satoshi Matsuoka
We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective.
Distributed, Parallel, and Cluster Computing
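The stencil entry above relies on temporal blocking: performing several time steps on a tile (with a halo widened by one cell per step) before writing it back, so the data makes fewer trips through slow memory. The 1D CPU sketch below illustrates only that idea for a simple first-order stencil, not the paper's combined spatial/temporal blocking for high-order stencils on FPGAs; the tile size and step count are assumptions.

```python
import numpy as np

def step(u):
    """One Jacobi sweep of a 3-point stencil; the end points stay fixed."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def naive(u, T):
    for _ in range(T):          # T full passes over the whole array
        u = step(u)
    return u

def temporally_blocked(u, T, tile=64):
    n, out = len(u), u.copy()
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        lo, hi = max(start - T, 0), min(stop + T, n)   # halo of width T
        local = u[lo:hi].copy()
        for _ in range(T):      # all T sweeps on the local window
            local = step(local)
        # Cells within T of a non-boundary window edge are stale, but the
        # central tile is always valid, so only that part is written back.
        out[start:stop] = local[start - lo:stop - lo]
    return out

u0 = np.random.default_rng(0).standard_normal(512)
assert np.allclose(naive(u0, 8), temporally_blocked(u0, 8, tile=64))
```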
1 code implementation • 27 Mar 2019 • Yusuke Nagasaka, Akira Nukada, Ryosuke Kojima, Satoshi Matsuoka
We evaluated the performance of the GCN application on TSUBAME3.0, which is equipped with NVIDIA Tesla P100 GPUs, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.
Distributed, Parallel, and Cluster Computing
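A common way to batch GCN computation over many small graphs, as in the entry above, is to stack their adjacency matrices into one block-diagonal sparse matrix and concatenate their node features, so a single sparse-dense product handles the whole mini-batch. The scipy sketch below shows that batching idea on the CPU with made-up toy graphs and layer sizes; the paper's contribution is efficient batched sparse kernels on the GPU, which this does not reproduce.

```python
import numpy as np
import scipy.sparse as sp

def normalize(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by standard GCNs."""
    a = adj + sp.eye(adj.shape[0], format="csr")
    d = np.asarray(a.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    return d_inv_sqrt @ a @ d_inv_sqrt

# Two toy graphs with 3 and 4 nodes and 5 input features per node.
adj1 = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float))
adj2 = sp.csr_matrix(np.array([[0, 1, 1, 0], [1, 0, 0, 1],
                               [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float))
rng = np.random.default_rng(0)
feats = [rng.standard_normal((a.shape[0], 5)) for a in (adj1, adj2)]

# Batch the graphs: block-diagonal adjacency + stacked node features.
a_batch = sp.block_diag([normalize(a) for a in (adj1, adj2)], format="csr")
h_batch = np.vstack(feats)

w = rng.standard_normal((5, 8))                    # layer weights (assumed size)
h_next = np.maximum(a_batch @ h_batch @ w, 0.0)    # one batched GCN layer
```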
1 code implementation • 21 Dec 2018 • Hiroki Kanezashi, Toyotaro Suzumura, Dario Garcia-Gasulla, Min-hwan Oh, Satoshi Matsuoka
We propose an incremental graph pattern matching algorithm for time-evolving graph data, together with an adaptive optimization system based on reinforcement learning that recomputes vertices in the incremental process more efficiently.
Databases
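The core of incremental pattern matching in the entry above is to re-examine only the part of the graph affected by an update instead of re-running matching from scratch. The toy sketch below does this for one fixed pattern (triangles): on edge insertion, only the common neighbours of the edge's endpoints are checked. The paper handles general patterns and adds an RL-based recomputation policy, neither of which is reproduced here.

```python
from collections import defaultdict

class IncrementalTriangles:
    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = set()

    def add_edge(self, u, v):
        # Incremental step: any new triangle must contain the new edge (u, v),
        # so it suffices to inspect the common neighbours of u and v.
        for w in self.adj[u] & self.adj[v]:
            self.triangles.add(frozenset((u, v, w)))
        self.adj[u].add(v)
        self.adj[v].add(u)

g = IncrementalTriangles()
for edge in [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]:
    g.add_edge(*edge)
print(g.triangles)   # {frozenset({0, 1, 2}), frozenset({0, 2, 3})}
```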
3 code implementations • CVPR 2019 • Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, Satoshi Matsuoka
Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size.
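As a back-of-the-envelope illustration of why the effective mini-batch grows at scale under data parallelism (the cause of the generalization gap mentioned above), consider the arithmetic below. The numbers are illustrative assumptions, and the linear learning-rate scaling shown is a commonly used heuristic, not the method this paper itself proposes.

```python
per_gpu_batch = 32
num_gpus = 1024
effective_batch = per_gpu_batch * num_gpus           # 32,768 samples per update

base_lr, base_batch = 0.1, 256
scaled_lr = base_lr * effective_batch / base_batch   # linear scaling heuristic
print(effective_batch, scaled_lr)                    # 32768 12.8
```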
1 code implementation • 13 Apr 2018 • Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka
NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning.
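The entry above concerns benchmarking the GPU kernels that cuDNN provides. A minimal, framework-level way to see that cuDNN offers multiple candidate kernels per layer is PyTorch's autotuning flag, which asks cuDNN to benchmark algorithms for the observed layer shape; the sketch below assumes PyTorch and a CUDA-capable GPU and is not the paper's benchmarking methodology, and the layer shape is an arbitrary choice.

```python
import time
import torch

x = torch.randn(64, 64, 56, 56, device="cuda")
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()

for autotune in (False, True):
    torch.backends.cudnn.benchmark = autotune   # let cuDNN pick the fastest kernel
    for _ in range(3):                          # warm-up (includes autotuning cost)
        conv(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(50):
        conv(x)
    torch.cuda.synchronize()
    print(f"cudnn.benchmark={autotune}: {(time.time() - t0) / 50 * 1e3:.3f} ms/iter")
```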
1 code implementation • 5 Apr 2018 • Yusuke Nagasaka, Satoshi Matsuoka, Ariful Azad, Aydın Buluç
Our hash-table- and heap-based algorithms show significant speedups over existing libraries in the majority of cases, while other algorithms dominate the remaining scenarios depending on matrix size, sparsity, compression factor, and operation type.
Distributed, Parallel, and Cluster Computing
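The hash-table approach in the SpGEMM entry above accumulates each output row in a hash table while merging the rows of B selected by the non-zeros of A (Gustavson's row-by-row scheme). The sequential scipy sketch below only illustrates that accumulation scheme on the CPU; the paper's algorithms are parallel GPU/many-core kernels, which this does not attempt to reproduce, and the test matrix sizes are arbitrary.

```python
import numpy as np
import scipy.sparse as sp

def spgemm_hash(a: sp.csr_matrix, b: sp.csr_matrix) -> sp.csr_matrix:
    rows, cols, vals = [], [], []
    for i in range(a.shape[0]):
        acc = {}                                  # hash-table accumulator for row i
        for jj in range(a.indptr[i], a.indptr[i + 1]):
            j, a_ij = a.indices[jj], a.data[jj]
            for kk in range(b.indptr[j], b.indptr[j + 1]):
                k = b.indices[kk]
                acc[k] = acc.get(k, 0.0) + a_ij * b.data[kk]
        rows.extend([i] * len(acc))
        cols.extend(acc.keys())
        vals.extend(acc.values())
    return sp.csr_matrix((vals, (rows, cols)), shape=(a.shape[0], b.shape[1]))

a = sp.random(50, 40, density=0.1, format="csr", random_state=0)
b = sp.random(40, 60, density=0.1, format="csr", random_state=1)
assert np.allclose(spgemm_hash(a, b).toarray(), (a @ b).toarray())
```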
1 code implementation • 1 Feb 2018 • Hamid Reza Zohouri, Artur Podobas, Satoshi Matsuoka
Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
Distributed, Parallel, and Cluster Computing Hardware Architecture
no code implementations • COLING 2016 • Aleksandr Drozd, Anna Gladkova, Satoshi Matsuoka
Solving word analogies became one of the most popular benchmarks for word embeddings on the assumption that linear relations between word pairs (such as king:man :: woman:queen) are indicative of the quality of the embedding.
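The assumption stated above is what the standard vector-offset ("3CosAdd") evaluation tests: answer an analogy a:a* :: b:? by picking the vocabulary word closest, under cosine similarity, to a* - a + b. The toy numpy sketch below uses made-up 2-dimensional vectors purely to show the mechanics; it is the baseline the abstract refers to, not the method the paper itself proposes.

```python
import numpy as np

emb = {                       # illustrative toy embeddings, not trained vectors
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "queen": np.array([0.1, 0.8]),
}

def analogy(a, a_star, b, emb):
    target = emb[a_star] - emb[a] + emb[b]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the three query words, as is standard in this benchmark.
    return max((w for w in emb if w not in {a, a_star, b}),
               key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman", emb))   # expected: "queen"
```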