Search Results for author: Juan Gómez-Luna

Found 13 papers, 5 papers with code

Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System

no code implementations • 10 Apr 2024 • Steve Rhyner, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, Jiawei Jiang, Ataberk Olgun, Harshita Gupta, Ce Zhang, Onur Mutlu

Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck, i.e., due to repeatedly accessing the training dataset.

Distributed Optimization

TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems

1 code implementation • 3 Apr 2023 • Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira, Mohammad Sadrosadati, Onur Mutlu

In order to provide support for transcendental (and other hard-to-calculate) functions in general-purpose PIM systems, we present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc.

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

no code implementations • 19 Sep 2022 • Geraldo F. Oliveira, Juan Gómez-Luna, Saugata Ghose, Amirali Boroumand, Onur Mutlu

Our analysis reveals that PIM greatly benefits memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when the GPU requires memory oversubscription for a general matrix-vector multiplication kernel; (2) Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3) SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs.

LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning

no code implementations • 22 Aug 2022 • Gagandeep Singh, Dionysios Diamantopoulos, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu

The key idea of LEAPER is to transfer an ML-based performance and resource usage model trained for a low-end edge environment to a new, high-end cloud environment to provide fast and accurate predictions for accelerator implementation.

Design Synthesis Transfer Learning

An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

1 code implementation • 16 Jul 2022 • Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

Our K-Means clustering on PIM is 2.8x and 3.2x faster than state-of-the-art CPU and GPU versions, respectively.

Clustering regression

Machine Learning Training on a Real Processing-in-Memory System

no code implementations • 13 Jun 2022 • Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training.

BIG-bench Machine Learning regression

Heterogeneous Data-Centric Architectures for Modern Data-Intensive Applications: Case Studies in Machine Learning and Databases

no code implementations • 29 May 2022 • Geraldo F. Oliveira, Amirali Boroumand, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

One promising execution paradigm that alleviates the data movement bottleneck in modern and emerging applications is processing-in-memory (PIM), where the cost of data movement to/from main memory is reduced by placing computation capabilities close to memory.

Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

1 code implementation • 15 May 2022 • Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu

We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems.

SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM

no code implementations • 22 Dec 2020 • Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

Compared to a CPU and a high-end GPU, SIMDRAM is 257x and 31x more energy-efficient, while providing 93x and 6x higher operation throughput, respectively.

Hardware Architecture Distributed, Parallel, and Cluster Computing Emerging Technologies

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

1 code implementation • 29 Sep 2020 • Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

The key idea of our spGEMM algorithm, tSparse, is to multiply sparse rectangular blocks using the mixed precision mode of TCUs.

Mathematical Software Distributed, Parallel, and Cluster Computing Performance

Accelerating B-spline Interpolation on GPUs: Application to Medical Image Registration

1 code implementation • 13 Apr 2020 • Orestis Zachariadis, Andrea Teatini, Nitin Satpute, Juan Gómez-Luna, Onur Mutlu, Ole Jakob Elle, Joaquín Olivares

In this paper, we introduce a novel GPU implementation of BSI to accelerate the calculation of the deformation field in non-rigid image registration algorithms.

Distributed, Parallel, and Cluster Computing Image and Video Processing

A Workload and Programming Ease Driven Perspective of Processing-in-Memory

no code implementations • 26 Jul 2019 • Saugata Ghose, Amirali Boroumand, Jeremie S. Kim, Juan Gómez-Luna, Onur Mutlu

First, we describe our work on systematically identifying opportunities for PIM in real applications, and quantify potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis).

Distributed, Parallel, and Cluster Computing Hardware Architecture

Processing Data Where It Makes Sense: Enabling In-Memory Computation

no code implementations • 10 Mar 2019 • Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun

This design choice goes directly against at least three key trends in systems that cause performance, scalability and energy bottlenecks: (1) data access from memory is already a key bottleneck as applications become more data-intensive and memory bandwidth and energy do not scale well, (2) energy consumption is a key constraint, especially in mobile and server systems, (3) data movement is very expensive in terms of bandwidth, energy and latency, much more so than computation.

Hardware Architecture
