Search Results for author: James Demmel

Found 15 papers, 8 papers with code

Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

1 code implementation • 24 Jun 2023 • Daniel Zou, Xinchen Jin, Xueyang Yu, Hao Zhang, James Demmel

In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster.
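
Computron swaps model-parallel shards across a GPU cluster; as a rough single-GPU illustration of the underlying idea only (not Computron's actual interface), device memory can be time-shared among several models by moving their parameters between host and device. The model names and sizes below are placeholders.

```python
# Minimal single-GPU sketch of memory swapping between host and device.
# Model names and sizes are placeholders, not Computron's API.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
models = {name: nn.Linear(4096, 4096) for name in ("model_a", "model_b")}

def serve(name: str, batch: torch.Tensor) -> torch.Tensor:
    """Swap the requested model onto the device, run it, then swap it back out."""
    model = models[name].to(device)        # swap in (host -> device copy)
    with torch.no_grad():
        out = model(batch.to(device))
    models[name] = model.to("cpu")         # swap out to free device memory
    return out.cpu()

print(serve("model_a", torch.randn(8, 4096)).shape)  # torch.Size([8, 4096])
```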

Distributed-Memory Sparse Kernels for Machine Learning

1 code implementation • 15 Mar 2022 • Vivek Bharadwaj, Aydın Buluç, James Demmel

We also give two communication-eliding strategies that further reduce costs for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels.

BIG-bench Machine Learning • Collaborative Filtering • +1
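
A single-node SciPy sketch of the SDDMM-then-SpMM sequence behind the FusedMM pattern described above; the shapes and density are illustrative, and the paper's actual contribution is the distributed-memory version of these kernels (replication reuse and local fusion).

```python
# SDDMM followed by SpMM on one node; a toy sketch, not the distributed kernels.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
S = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)  # sparsity pattern
A = rng.standard_normal((1000, 64))
B = rng.standard_normal((1000, 64))

# SDDMM: evaluate (A @ B.T) only at the nonzero positions of S.
rows, cols = S.nonzero()
vals = np.einsum("ij,ij->i", A[rows], B[cols])
M = sp.csr_matrix((vals, (rows, cols)), shape=S.shape)

# SpMM: multiply the sampled sparse result by a dense matrix, reusing B
# (in the fused setting, the replicated dense input serves both steps).
Y = M @ B
print(Y.shape)  # (1000, 64)
```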

CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

no code implementations • 5 May 2021 • Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao

Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and flexible interconnect.

Navigate • Scheduling

Avoiding Communication in Logistic Regression

no code implementations • 16 Nov 2020 • Aditya Devarakonda, James Demmel

Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems.

regression
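
For context, a minimal NumPy sketch of plain mini-batch SGD for binary logistic regression, i.e. the kind of iteration whose per-step communication cost the paper targets; hyperparameters and data here are illustrative only.

```python
# Baseline mini-batch SGD for binary logistic regression (toy, single-node).
import numpy as np

def sgd_logreg(X, y, lr=0.1, epochs=5, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            p = 1.0 / (1.0 + np.exp(-X[b] @ w))       # predicted probabilities
            w -= lr * X[b].T @ (p - y[b]) / len(b)    # gradient step on the minibatch
    return w

X = np.random.default_rng(1).standard_normal((500, 10))
y = (X @ np.ones(10) > 0).astype(float)
print(sgd_logreg(X, y)[:3])
```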

Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour

no code implementations • 30 Oct 2020 • Arissa Wongpanich, Hieu Pham, James Demmel, Mingxing Tan, Quoc Le, Yang You, Sameer Kumar

EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks.

Image Classification • Playing the Game of 2048

The Limit of the Batch Size

no code implementations • 15 Jun 2020 • Yang You, Yuhui Wang, Huan Zhang, Zhao Zhang, James Demmel, Cho-Jui Hsieh

For the first time we scale the batch size on ImageNet to at least an order of magnitude larger than all previous work, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting.

Auto-Precision Scaling for Distributed Deep Learning

1 code implementation • 20 Nov 2019 • Ruobing Han, James Demmel, Yang You

Our experimental results show that for many applications, APS can train state-of-the-art models with 8-bit gradients with no or only a tiny accuracy loss (<0.05%).

Image Classification
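
As a loose illustration of what training with 8-bit gradients involves, here is a per-tensor scale-and-quantize round trip in NumPy; this is only a toy sketch, and APS itself chooses its scaling automatically per layer during training, which this example does not attempt.

```python
# Toy per-tensor int8 gradient quantization round trip (not the APS algorithm).
import numpy as np

def quantize_int8(grad):
    """Scale a gradient tensor into int8 range and return (q, scale)."""
    scale = float(np.max(np.abs(grad))) / 127.0
    scale = scale if scale > 0 else 1.0          # avoid dividing by zero on all-zero grads
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

g = np.random.default_rng(0).standard_normal(1000).astype(np.float32) * 1e-3
q, s = quantize_int8(g)
print(float(np.max(np.abs(dequantize(q, s) - g))))  # quantization error stays tiny
```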

Large-Batch Training for LSTM and Beyond

1 code implementation • 24 Jan 2019 • Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

LEGW makes the Sqrt Scaling scheme useful in practice, and as a result we achieve much better results than with the Linear Scaling learning rate scheme.
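
A worked comparison of the two learning-rate scaling rules mentioned above, assuming an illustrative base learning rate of 0.1 tuned at batch size 256; the numbers are not taken from the paper.

```python
# Linear vs. Sqrt learning-rate scaling with the batch size (illustrative values).
base_lr, base_batch = 0.1, 256

def linear_scaling(batch):  # learning rate grows proportionally to the batch size
    return base_lr * batch / base_batch

def sqrt_scaling(batch):    # learning rate grows with the square root of the batch size
    return base_lr * (batch / base_batch) ** 0.5

for b in (256, 1024, 8192, 32768):
    print(b, round(linear_scaling(b), 3), round(sqrt_scaling(b), 3))
```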

Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

no code implementations • 17 Dec 2017 • Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney

Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML).

Avoiding Communication in Proximal Methods for Convex Optimization Problems

no code implementations • 24 Oct 2017 • Saeed Soori, Aditya Devarakonda, James Demmel, Mert Gurbuzbalaban, Maryam Mehri Dehnavi

We formulate the algorithm for two different optimization methods on the Lasso problem and show that the latency cost is reduced by a factor of k while bandwidth and floating-point operation costs remain the same.

ImageNet Training in Minutes

1 code implementation • 14 Sep 2017 • Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, Kurt Keutzer

If we can make full use of the supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in one minute.

16k • Playing the Game of 2048

Asynchronous Parallel Greedy Coordinate Descent

no code implementations • NeurIPS 2016 • Yang You, Xiangru Lian, Ji Liu, Hsiang-Fu Yu, Inderjit S. Dhillon, James Demmel, Cho-Jui Hsieh

In this paper, we propose and study an Asynchronous parallel Greedy Coordinate Descent (Asy-GCD) algorithm for minimizing a smooth function with bounded constraints.
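
A toy sequential sketch of greedy coordinate descent on a box-constrained quadratic, to make the greedy selection rule concrete; the paper's contribution is running such updates asynchronously in parallel, which this single-threaded NumPy sketch does not do.

```python
# Sequential greedy coordinate descent on min 0.5*x'Qx - b'x s.t. lo <= x <= hi.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
Q = A @ A.T + np.eye(20)          # positive definite Hessian
b = rng.standard_normal(20)
lo, hi = -1.0, 1.0                # box constraints

x = np.zeros(20)
g = Q @ x - b                     # gradient of the quadratic objective
for _ in range(200):
    pg = np.clip(x - g, lo, hi) - x               # projected gradient per coordinate
    i = int(np.argmax(np.abs(pg)))                # greedy coordinate choice
    x_new = np.clip(x[i] - g[i] / Q[i, i], lo, hi)  # exact 1-D step, projected to the box
    g += Q[:, i] * (x_new - x[i])                 # cheap rank-1 gradient update
    x[i] = x_new
print(0.5 * x @ Q @ x - b @ x)    # objective value after the greedy sweeps
```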

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

1 code implementation • 5 Jul 2016 • Alex Gittens, Aditya Devarakonda, Evan Racah, Michael Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms.

Distributed, Parallel, and Cluster Computing • G.1.3; C.2.4

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

1 code implementation • 14 Feb 2012 • Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication.

Data Structures and Algorithms • Computational Complexity • Distributed, Parallel, and Cluster Computing • Numerical Analysis • Combinatorics • 68W40, 68W10 • F.2.1
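
For reference, a plain recursive NumPy implementation of Strassen's algorithm for power-of-two sizes; this is only the classical sequential recursion, not the communication-optimal parallel algorithm the paper introduces.

```python
# Classical recursive Strassen multiplication (sequential, power-of-two sizes).
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:                 # fall back to ordinary multiplication for small blocks
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)   # seven recursive products
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

X, Y = np.random.rand(128, 128), np.random.rand(128, 128)
print(np.allclose(strassen(X, Y), X @ Y))  # True
```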
