1 code implementation • 24 Jun 2023 • Daniel Zou, Xinchen Jin, Xueyang Yu, Hao Zhang, James Demmel
In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster.
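A minimal sketch of the memory-swapping idea (not Computron's actual API; the model names and sizes are placeholders): keep each model's parameters resident in host memory and move only the requested model onto the GPU for the duration of a request.

```python
# Sketch of multi-model serving via memory swapping (requires a CUDA device).
import torch
import torch.nn as nn

# Stand-ins for large models, kept in host memory; names are hypothetical.
models = {
    "model_a": nn.Linear(1024, 1024),
    "model_b": nn.Linear(768, 768),
}

def serve(name: str, x: torch.Tensor) -> torch.Tensor:
    model = models[name].to("cuda")   # swap parameters onto the GPU
    with torch.no_grad():
        y = model(x.to("cuda"))
    model.to("cpu")                   # evict back to host memory
    return y.cpu()

# One GPU serves both models by swapping them in on demand.
out = serve("model_a", torch.randn(4, 1024))
```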
1 code implementation • 15 Mar 2022 • Vivek Bharadwaj, Aydın Buluç, James Demmel
We also give two communication-eliding strategies that further reduce costs for FusedMM kernels: reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels.
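For reference, a single-node sketch of the FusedMM pattern these kernels target: an SDDMM that samples a dense-dense product only at a sparse matrix's nonzeros, followed by an SpMM with the same dense input (the matrix whose replication the first strategy reuses). The distributed replication and kernel fusion themselves are not shown, and the sizes are illustrative.

```python
# Single-node SDDMM + SpMM sketch with SciPy sparse matrices.
import numpy as np
import scipy.sparse as sp

n, d = 1000, 64
A = sp.random(n, n, density=0.01, format="csr")   # sparsity pattern
H = np.random.rand(n, d)                          # dense input, used twice

# SDDMM: evaluate the dense product H @ H.T only at A's nonzeros.
rows, cols = A.nonzero()
vals = np.einsum("ij,ij->i", H[rows], H[cols])    # rowwise dot products
S = sp.csr_matrix((vals, (rows, cols)), shape=A.shape)

# SpMM: multiply the sampled sparse matrix by the same dense matrix H.
Y = S @ H
```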
no code implementations • 5 May 2021 • Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao
Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and flexible interconnect.
no code implementations • 16 Nov 2020 • Aditya Devarakonda, James Demmel
Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems.
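For concreteness, the basic SGD iteration underlying the paper, written here for a least-squares loss; the function name, step size, and loss are illustrative, not the paper's setup.

```python
# Plain SGD on a least-squares objective: w <- w - eta * grad_i(w).
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of one sample's loss
            w -= lr * grad
    return w
```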
no code implementations • 30 Oct 2020 • Arissa Wongpanich, Hieu Pham, James Demmel, Mingxing Tan, Quoc Le, Yang You, Sameer Kumar
EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks.
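The "efficient scaling" refers to EfficientNet's compound-scaling rule, which grows depth, width, and input resolution together from a single coefficient; the constants below are the ones reported in the original EfficientNet paper.

```python
# EfficientNet compound scaling: one coefficient phi scales depth,
# width, and resolution jointly (alpha, beta, gamma from the paper).
def compound_scale(phi: int, alpha: float = 1.2, beta: float = 1.1,
                   gamma: float = 1.15):
    depth_mult = alpha ** phi        # multiplier on baseline layer count
    width_mult = beta ** phi         # multiplier on baseline channel count
    res_mult = gamma ** phi          # multiplier on baseline image size
    return depth_mult, width_mult, res_mult

print(compound_scale(3))             # roughly the B3 configuration
```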
no code implementations • 15 Jun 2020 • Yang You, Yuhui Wang, Huan Zhang, Zhao Zhang, James Demmel, Cho-Jui Hsieh
For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous work, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting.
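As a point of reference for such large-batch studies, the widely used linear-scaling-with-warmup heuristic (Goyal et al.) looks like the sketch below; the baseline values are illustrative, and the paper evaluates considerably more sophisticated schemes.

```python
# Linear scaling rule with linear warmup for large-batch training.
def scaled_lr(batch_size, step, base_lr=0.1, base_batch=256,
              warmup_steps=500):
    target = base_lr * batch_size / base_batch   # grow LR with batch size
    if step < warmup_steps:                      # ramp up linearly at first
        return target * (step + 1) / warmup_steps
    return target
```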
1 code implementation • 20 Nov 2019 • Ruobing Han, James Demmel, Yang You
Our experimental results show that for many applications, APS can train state-of-the-art models with 8-bit gradients at no, or only a tiny (<0.05%), accuracy loss.
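A generic 8-bit gradient quantizer conveys the flavor; this is a simple max-scaling scheme, not necessarily APS's exact encoding, and the auto-precision selection logic is omitted.

```python
# Quantize a gradient tensor to int8 with a per-tensor scale,
# dequantize after communication.
import numpy as np

def quantize_int8(g: np.ndarray):
    scale = float(np.abs(g).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                   # all-zero gradient: any scale works
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale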
24 code implementations • ICLR 2020 • Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches; a simplified sketch of the resulting update appears below.
Ranked #11 on Question Answering on SQuAD1.1 dev (F1 metric)
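A simplified NumPy sketch of the layerwise trust-ratio step that this strategy (LAMB) arrives at, applied on top of an Adam-style update; details such as bias correction differ in the full algorithm, and the hyperparameter values are illustrative.

```python
# One LAMB-style step for a single layer's weights w.
import numpy as np

def lamb_step(w, m, v, g, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-6, weight_decay=0.01):
    m = b1 * m + (1 - b1) * g                    # first moment
    v = b2 * v + (1 - b2) * g * g                # second moment
    update = m / (np.sqrt(v) + eps) + weight_decay * w
    trust = np.linalg.norm(w) / (np.linalg.norm(update) + eps)
    w = w - lr * trust * update                  # layerwise adaptive step
    return w, m, v
```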
1 code implementation • 24 Jan 2019 • Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
LEGW makes the Sqrt Scaling scheme useful in practice, and as a result we achieve much better results than with the Linear Scaling learning rate scheme.
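The two rules combine as follows: scaling the batch size by k scales the learning rate by sqrt(k) (Sqrt Scaling) and the warmup length linearly with k (LEGW). The baseline numbers below are placeholders.

```python
# Sqrt Scaling with Linear-Epoch Gradual Warmup (LEGW).
import math

def legw(batch_size, base_batch=256, base_lr=0.1, base_warmup_epochs=1.0):
    k = batch_size / base_batch
    lr = base_lr * math.sqrt(k)          # Sqrt Scaling rule
    warmup_epochs = base_warmup_epochs * k   # warmup grows linearly with k
    return lr, warmup_epochs
```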
no code implementations • 17 Dec 2017 • Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney
Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML).
no code implementations • 24 Oct 2017 • Saeed Soori, Aditya Devarakonda, James Demmel, Mert Gurbuzbalaban, Maryam Mehri Dehnavi
We formulate the algorithm for two different optimization methods on the Lasso problem and show that the latency cost is reduced by a factor of k while bandwidth and floating-point operation costs remain the same.
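The generic pattern behind a factor-of-k latency reduction is to unroll k iterations so that one communication round serves k updates. The sketch below illustrates this for plain least-squares gradient steps with a hypothetical allreduce callback; the paper derives the analogous reformulation for its Lasso solvers.

```python
# k local updates per communication round: one allreduce of Gram-like
# quantities replaces k per-iteration allreduces.
import numpy as np

def k_step_round(X_local, y_local, w, k, lr, allreduce):
    G = allreduce(X_local.T @ X_local)   # shared across all k updates
    b = allreduce(X_local.T @ y_local)
    for _ in range(k):                   # k steps, no further messages
        w = w - lr * (G @ w - b)         # gradient of 0.5 * ||Xw - y||^2
    return w
```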
1 code implementation • 14 Sep 2017 • Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, Kurt Keutzer
If we can make full use of the supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in one minute.
no code implementations • NeurIPS 2016 • Yang You, Xiangru Lian, Ji Liu, Hsiang-Fu Yu, Inderjit S. Dhillon, James Demmel, Cho-Jui Hsieh
In this paper, we propose and study an Asynchronous parallel Greedy Coordinate Descent (Asy-GCD) algorithm for minimizing a smooth function with bounded constraints.
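For reference, sequential greedy coordinate descent on an unconstrained quadratic looks like the sketch below; Asy-GCD runs such updates asynchronously across threads and handles the bound constraints, both of which this sketch omits.

```python
# Greedy coordinate descent: minimize 0.5 * x^T Q x - b^T x by updating
# the coordinate with the largest gradient magnitude at each step.
import numpy as np

def greedy_cd(Q, b, iters=100):
    x = np.zeros(len(b))
    for _ in range(iters):
        grad = Q @ x - b
        j = np.argmax(np.abs(grad))      # greedy coordinate choice
        x[j] -= grad[j] / Q[j, j]        # exact minimizer along coordinate j
    return x
```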
1 code implementation • 5 Jul 2016 • Alex Gittens, Aditya Devarakonda, Evan Racah, Michael Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat
We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms.
Distributed, Parallel, and Cluster Computing • ACM classes: G.1.3; C.2.4
1 code implementation • 14 Feb 2012 • Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz
We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication; a reference Strassen recursion is sketched below.
Data Structures and Algorithms • Computational Complexity • Distributed, Parallel, and Cluster Computing • Numerical Analysis • Combinatorics • MSC classes: 68W40, 68W10 • ACM class: F.2.1
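For reference, the classic sequential Strassen recursion (assuming square matrices with power-of-two dimensions); the paper's contribution is a communication-optimal parallel schedule of the seven recursive products, which this sketch does not attempt.

```python
# Strassen's matrix multiplication: 7 recursive products instead of 8.
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:                     # fall back to classical multiply
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)   # sanity check vs. NumPy
```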