no code implementations • 5 Apr 2021 • Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans, Stephen W. Keckler
As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities.
no code implementations • 8 Dec 2020 • Ahmet Inci, Evgeny Bolotin, Yaosheng Fu, Gal Dalal, Shie Mannor, David Nellans, Diana Marculescu
With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems.
no code implementations • 30 Jul 2019 • Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, Puneet Gupta
This work explores hybrid parallelization, where each data-parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using model parallelism (MP).
no code implementations • 6 Mar 2019 • Esha Choukse, Michael Sullivan, Mike O'Connor, Mattan Erez, Jeff Pool, David Nellans, Steve Keckler
However, GPU device memory tends to be relatively small, and the memory capacity cannot be increased by the user.
Hardware Architecture