Search Results for author: Juan Gómez-Luna

Found 14 papers, 6 papers with code

Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System

no code implementations · 10 Apr 2024 · Steve Rhyner, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, Jiawei Jiang, Ataberk Olgun, Harshita Gupta, Ce Zhang, Onur Mutlu

Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck, i.e., due to repeatedly accessing the training dataset.

Distributed Optimization

TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems

1 code implementation · 3 Apr 2023 · Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira, Mohammad Sadrosadati, Onur Mutlu

In order to provide support for transcendental (and other hard-to-calculate) functions in general-purpose PIM systems, we present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc.

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

no code implementations · 19 Sep 2022 · Geraldo F. Oliveira, Juan Gómez-Luna, Saugata Ghose, Amirali Boroumand, Onur Mutlu

Our analysis reveals that PIM greatly benefits memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when the GPU requires memory oversubscription for a general matrix-vector multiplication kernel; (2) Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3) SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs.

LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning

no code implementations · 22 Aug 2022 · Gagandeep Singh, Dionysios Diamantopoulos, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu

The key idea of LEAPER is to transfer an ML-based performance and resource usage model trained for a low-end edge environment to a new, high-end cloud environment to provide fast and accurate predictions for accelerator implementation.

Design Synthesis · Transfer Learning

Heterogeneous Data-Centric Architectures for Modern Data-Intensive Applications: Case Studies in Machine Learning and Databases

no code implementations · 29 May 2022 · Geraldo F. Oliveira, Amirali Boroumand, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

One promising execution paradigm that alleviates the data movement bottleneck in modern and emerging applications is processing-in-memory (PIM), where the cost of data movement to/from main memory is reduced by placing computation capabilities close to memory.

SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM

no code implementations · 22 Dec 2020 · Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

Compared to a CPU and a high-end GPU, SIMDRAM is 257x and 31x more energy-efficient, while providing 93x and 6x higher operation throughput, respectively.

Hardware Architecture · Distributed, Parallel, and Cluster Computing · Emerging Technologies

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

1 code implementation · 29 Sep 2020 · Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

The key idea of our spGEMM algorithm, tSparse, is to multiply sparse rectangular blocks using the mixed precision mode of TCUs.

Mathematical Software · Distributed, Parallel, and Cluster Computing · Performance

Accelerating B-spline Interpolation on GPUs: Application to Medical Image Registration

1 code implementation · 13 Apr 2020 · Orestis Zachariadis, Andrea Teatini, Nitin Satpute, Juan Gómez-Luna, Onur Mutlu, Ole Jakob Elle, Joaquín Olivares

In this paper, we introduce a novel GPU implementation of BSI to accelerate the calculation of the deformation field in non-rigid image registration algorithms.

Distributed, Parallel, and Cluster Computing · Image and Video Processing

A Workload and Programming Ease Driven Perspective of Processing-in-Memory

no code implementations · 26 Jul 2019 · Saugata Ghose, Amirali Boroumand, Jeremie S. Kim, Juan Gómez-Luna, Onur Mutlu

First, we describe our work on systematically identifying opportunities for PIM in real applications, and quantify potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis).

Distributed, Parallel, and Cluster Computing · Hardware Architecture

Processing Data Where It Makes Sense: Enabling In-Memory Computation

no code implementations · 10 Mar 2019 · Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun

This design choice goes directly against at least three key trends in systems that cause performance, scalability and energy bottlenecks: (1) data access from memory is already a key bottleneck as applications become more data-intensive and memory bandwidth and energy do not scale well, (2) energy consumption is a key constraint in especially mobile and server systems, (3) data movement is very expensive in terms of bandwidth, energy and latency, much more so than computation.

Hardware Architecture
