Locally Random Alloy Codes with Channel Coding Theorems for Distributed Matrix Multiplication

7 Feb 2022 · Pedro Soto, Haibin Guan, Jun Li

Matrix multiplication is a fundamental operation in machine learning and is commonly distributed into multiple parallel tasks for large datasets. Stragglers and other failures can severely impact the overall completion time. Recent works in coded computing provide a novel strategy to mitigate stragglers with coded tasks, with the objective of minimizing the number of tasks needed to recover the overall result, known as the recovery threshold. However, we demonstrate that this combinatorial definition does not directly optimize the probability of failure. In this paper, we introduce a novel analytical metric, which focuses on the most likely event and measures the optimality of a coding scheme by its probability of decoding. Our general framework encompasses many other computational schemes and metrics as special cases. Far from being a purely theoretical construction, these definitions lead us to a practical construction of random codes for matrix multiplication, i.e., locally random alloy codes, which are optimal with respect to these measures. We present experimental results on Amazon EC2 which empirically demonstrate the improvement in running time and numerical stability relative to well-established benchmarks.
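To make the coded-computing setting concrete, the sketch below shows a generic randomly coded distributed matrix multiplication: `A` is split into `k` row blocks, `n > k` workers each receive a random linear combination of the blocks, and the product `A @ B` is recovered from any `k` surviving worker results. This is only an illustrative random linear code under assumed parameters, not the paper's locally random alloy code construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Compute C = A @ B with n workers, tolerating n - k stragglers.
# (Hypothetical small sizes chosen for illustration.)
m, p, q = 6, 4, 5
k, n = 3, 5  # A is split into k row blocks; n coded tasks are issued

A = rng.standard_normal((m, p))
B = rng.standard_normal((p, q))

# Split A into k row blocks of shape (m // k, p).
blocks = np.split(A, k, axis=0)

# Encode: worker i receives a random linear combination of the blocks
# (a generic random linear code, shown only for illustration).
G = rng.standard_normal((n, k))  # random generator matrix
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker computes its coded block times B.
results = [Ci @ B for Ci in coded]

# Suppose workers 1 and 3 straggle; decode from any k = 3 responses.
alive = [0, 2, 4]
Gs = G[alive]  # k x k submatrix, invertible with high probability
stacked = np.stack([results[i] for i in alive]).reshape(k, -1)
# Solve Gs @ X = stacked to recover the uncoded products blocks[j] @ B.
decoded = np.linalg.solve(Gs, stacked).reshape(k, m // k, q)
C = np.vstack(decoded)

print(np.allclose(C, A @ B))  # decoding recovers the full product
```

Random generator matrices like `G` are nonsingular on any `k` rows with high probability, which is the key property that lets the master decode from the fastest `k` workers instead of waiting for all `n`.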
