1 code implementation • 26 Jan 2024 • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources.

no code implementations • 25 Jan 2024 • Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Onur Mutlu, Torsten Hoefler

Among these, prompt engineering coupled with structures has emerged as a promising paradigm, with designs such as Chain-of-Thought, Tree of Thoughts, or Graph of Thoughts, in which the overall LLM reasoning is guided by a structure such as a graph.

no code implementations • 17 Jan 2024 • Daniele De Sensi, Tommaso Bonato, David Saam, Torsten Hoefler

The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems.

no code implementations • 11 Jan 2024 • Langwen Huang, Lukas Gianinazzi, Yuejiang Yu, Peter D. Dueben, Torsten Hoefler

As a byproduct, this strategy also enables the post-processing of predictions into the future, for which no observations are available. Through experiments based on a reanalysis dataset, we have verified that our method can produce assimilated global atmospheric data consistent with observations at 0.25-degree resolution.

no code implementations • 21 Dec 2023 • Eldar Kurtic, Torsten Hoefler, Dan Alistarh

Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task.

no code implementations • 30 Nov 2023 • Maciej Besta, Afonso Claudino Catarino, Lukas Gianinazzi, Nils Blach, Piotr Nyczyk, Hubert Niewiadomski, Torsten Hoefler

A fundamental workload in this setting is dynamic link prediction: using a history of graph updates to predict whether a given pair of vertices will become connected.

no code implementations • 15 Oct 2023 • Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, Gustavo Alonso

The heterogeneity ensures efficient acceleration of both LM inference and retrieval, while the accelerator disaggregation enables the system to independently scale both types of accelerators to fulfill diverse RALM requirements.

1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.

1 code implementation • 3 Oct 2023 • Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler

We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs.

no code implementations • 16 Sep 2023 • Torsten Hoefler, Bjorn Stevens, Andreas F. Prein, Johanna Baehr, Thomas Schulthess, Thomas F. Stocker, John Taylor, Daniel Klocke, Pekka Manninen, Piers M. Forster, Tobias Kölling, Nicolas Gruber, Hartwig Anzt, Claudia Frauen, Florian Ziemen, Milan Klöwer, Karthik Kashinath, Christoph Schär, Oliver Fuhrer, Bryan N. Lawrence

Participants of the Berlin Summit on Earth Virtualization Engines (EVEs) discussed ideas and concepts to improve our ability to cope with climate change.

no code implementations • 23 Aug 2023 • Julia Bazinska, Andrei Ivanov, Tal Ben-Nun, Nikoli Dryden, Maciej Besta, Siyuan Shen, Torsten Hoefler

Graph Neural Networks (GNNs) are a powerful tool for handling structured graph data and addressing tasks such as node classification, graph classification, and clustering.

1 code implementation • 18 Aug 2023 • Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler

We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT).

no code implementations • ICCV 2023 • Yunqiang Li, Jan C. van Gemert, Torsten Hoefler, Bert Moons, Evangelos Eleftheriou, Bram-Ernst Verhoef

Deep learning algorithms are increasingly employed at the edge.

1 code implementation • 19 Jun 2023 • Wenqi Jiang, Shigang Li, Yu Zhu, Johannes De Fine Licht, Zhenhao He, Runbin Shi, Cedric Renggli, Shuai Zhang, Theodoros Rekatsinas, Torsten Hoefler, Gustavo Alonso

Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents.

1 code implementation • 5 Jun 2023 • Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.

2 code implementations • 8 May 2023 • Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, Torsten Hoefler

Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms.

no code implementations • 15 Apr 2023 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Saleh Ashkboos, Torsten Hoefler

As deep learning models grow, sparsity is becoming an increasingly critical component of deep neural networks, enabling improved performance and reduced storage.

no code implementations • 14 Mar 2023 • Lukas Trümper, Tal Ben-Nun, Philipp Schaad, Alexandru Calotoiu, Torsten Hoefler

Performance optimization is an increasingly challenging but often repetitive task.

no code implementations • 6 Jan 2023 • Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Torsten Hoefler

While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research.

no code implementations • 3 Jan 2023 • Niels Gleinig, Tal Ben-Nun, Torsten Hoefler

We can only process data that is stored in fast memory, which incurs data movement (input/output-operations, or I/Os) between the two units.

1 code implementation • 25 Nov 2022 • Kazuki Osawa, Shigang Li, Torsten Hoefler

Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters.

1 code implementation • 24 Nov 2022 • Nikoli Dryden, Torsten Hoefler

Many data have an underlying dependence on spatial location; it may be weather on the Earth, a simulation on a mesh, or a registered image.

11 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.

1 code implementation • 22 Oct 2022 • Langwen Huang, Torsten Hoefler

We propose a new method of compressing this multidimensional weather and climate data: a coordinate-based neural network is trained to overfit the data, and the resulting parameters are taken as a compact representation of the original grid-based data.

no code implementations • 20 Sep 2022 • Maciej Besta, Patrick Iff, Florian Scheidl, Kazuki Osawa, Nikoli Dryden, Michal Podstawski, Tiancheng Chen, Torsten Hoefler

In general, LPG2vec enables combining predictive power of the most powerful GNNs with the full scope of information encoded in the LPG model, paving the way for neural graph databases, a class of systems where the vast complexity of maintained data will benefit from modern and future graph machine learning methods.

1 code implementation • 14 Sep 2022 • Shigang Li, Kazuki Osawa, Torsten Hoefler

We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores.

no code implementations • 3 Sep 2022 • Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott

Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution.

1 code implementation • 29 Jun 2022 • Saleh Ashkboos, Langwen Huang, Nikoli Dryden, Tal Ben-Nun, Peter Dueben, Lukas Gianinazzi, Luca Kummer, Torsten Hoefler

We propose the ENS-10 prediction correction task for improving the forecast quality at a 48-hour lead time through ensemble post-processing.

no code implementations • 19 May 2022 • Maciej Besta, Torsten Hoefler

To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining.

1 code implementation • 19 Jan 2022 • Shigang Li, Torsten Hoefler

However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead.

1 code implementation • 20 Oct 2021 • Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler

Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute.

1 code implementation • 14 Jul 2021 • Shigang Li, Torsten Hoefler

For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.

no code implementations • 7 Jun 2021 • Lukas Gianinazzi, Maximilian Fries, Nikoli Dryden, Tal Ben-Nun, Maciej Besta, Torsten Hoefler

We present a novel neural architecture to solve graph optimization problems where the solution consists of arbitrary node labels, allowing us to solve hard problems like graph coloring.

no code implementations • 26 May 2021 • Maciej Besta, Raphael Grob, Cesare Miglioli, Nicola Bernold, Grzegorz Kwasniewski, Gabriel Gjini, Raghavendra Kanakagiri, Saleh Ashkboos, Lukas Gianinazzi, Nikoli Dryden, Torsten Hoefler

We also successfully apply our architecture for predicting more arbitrary clusters and communities, illustrating its potential for graph mining beyond motif analysis.

no code implementations • 5 Mar 2021 • Maciej Besta, Zur Vonarburg-Shmaria, Yannick Schaffner, Leonardo Schwarz, Grzegorz Kwasniewski, Lukas Gianinazzi, Jakub Beranek, Kacper Janda, Tobias Holenstein, Sebastian Leisinger, Peter Tatkowski, Esref Ozdemir, Adrian Balla, Marcin Copik, Philipp Lindenberger, Pavel Kalvoda, Marek Konieczny, Onur Mutlu, Torsten Hoefler

We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms.

no code implementations • 31 Jan 2021 • Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components.

no code implementations • 21 Jan 2021 • Nikoli Dryden, Roman Böhringer, Tal Ben-Nun, Torsten Hoefler

I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments.

2 code implementations • 1 Jan 2021 • Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry

Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps.

2 code implementations • 31 Dec 2020 • Marcin Copik, Alexandru Calotoiu, Tobias Grosser, Nicolas Wicki, Felix Wolf, Torsten Hoefler

Performance models are well-known instruments to understand the scaling behavior of parallel applications.

Distributed, Parallel, and Cluster Computing • Performance

no code implementations • 21 Nov 2020 • Chris Cummins, Hugh Leather, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle

Compiler architects increasingly look to machine learning when building heuristics for compiler optimization.

no code implementations • 29 Oct 2020 • Maciej Besta, Dimitri Stanojevic, Tijana Zivic, Jagpreet Singh, Maurice Hoerold, Torsten Hoefler

Our high-performance Log(Graph) implementation based on modern bitwise operations and state-of-the-art succinct data structures achieves high compression ratios as well as performance.

1 code implementation • 30 Jun 2020 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler

Transformers are one of the most important machine learning workloads today.

1 code implementation • ICLR 2022 • Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, Kate Saenko

We introduce Neural Parameter Allocation Search (NPAS), a novel task where the goal is to train a neural network given an arbitrary, fixed parameter budget.

1 code implementation • 18 May 2020 • Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, Torsten Hoefler

Applied to global data, our mixed models achieve a relative improvement in ensemble forecast skill (CRPS) of over 14%.

no code implementations • 30 Apr 2020 • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler

For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.

2 code implementations • 23 Mar 2020 • Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather

We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs.

no code implementations • 29 Dec 2019 • Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, Torsten Hoefler

We also crystallize the meaning of different concepts associated with streaming graph processing, such as dynamic, temporal, online, and time-evolving graphs, edge-centric processing, models for the maintenance of updates, and graph databases.

Distributed, Parallel, and Cluster Computing • Databases • Data Structures and Algorithms • Performance

no code implementations • 2 Nov 2019 • Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler

Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations.

2 code implementations • 10 Oct 2019 • Johannes de Fine Licht, Torsten Hoefler

High-level synthesis (HLS) tools have brought FPGA development into the mainstream, by allowing programmers to design architectures using familiar languages such as C, C++, and OpenCL.

Hardware Architecture • Distributed, Parallel, and Cluster Computing • Software Engineering

1 code implementation • 26 Aug 2019 • Grzegorz Kwasniewski, Marko Kabić, Maciej Besta, Joost VandeVondele, Raffaele Solcà, Torsten Hoefler

The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10 MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality.

Computational Complexity • Distributed, Parallel, and Cluster Computing • Performance

no code implementations • 12 Aug 2019 • Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.

2 code implementations • 12 Aug 2019 • Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry

Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps.

1 code implementation • 18 Jul 2019 • Tiziano De Matteis, Johannes De Fine Licht, Torsten Hoefler

Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures.

Distributed, Parallel, and Cluster Computing

3 code implementations • 27 Feb 2019 • Tal Ben-Nun, Johannes De Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler

With the ubiquity of accelerators, such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill-set of the average scientist in domains outside of computer science.

Programming Languages • Distributed, Parallel, and Cluster Computing • Performance

no code implementations • 25 Feb 2019 • Maciej Besta, Dimitri Stanojevic, Johannes De Fine Licht, Tal Ben-Nun, Torsten Hoefler

To facilitate understanding of this emerging domain, we present the first survey and taxonomy on graph computations on FPGAs.

Distributed, Parallel, and Cluster Computing • Hardware Architecture

1 code implementation • 29 Jan 2019 • Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, Torsten Hoefler

We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques.

1 code implementation • 27 Jan 2019 • Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry

We analyze the effect of batch augmentation on gradient variance and show that it empirically improves convergence for a wide variety of deep neural networks and datasets.

no code implementations • NeurIPS 2018 • Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, Cédric Renggli

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.

1 code implementation • NeurIPS 2018 • Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks.

2 code implementations • 21 May 2018 • Johannes de Fine Licht, Simon Meierhans, Torsten Hoefler

Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems.

Distributed, Parallel, and Cluster Computing • Programming Languages • I.1.3; C.1.4; D.1.3

1 code implementation • 13 Apr 2018 • Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning.

1 code implementation • 26 Feb 2018 • Tal Ben-Nun, Torsten Hoefler

We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning.

no code implementations • 22 Feb 2018 • Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler

This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads.

2 code implementations • 22 Sep 2016 • Edgar Solomonik, Maciej Besta, Flavio Vella, Torsten Hoefler

Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it.

Distributed, Parallel, and Cluster Computing • Discrete Mathematics • Mathematical Software • G.1.0; G.2.2

1 code implementation • 20 Aug 2016 • Marius Poke, Torsten Hoefler, Colin W. Glass

In this work, we propose AllConcur, a distributed system that provides agreement through a leaderless concurrent atomic broadcast algorithm, thus, not suffering from the bottleneck of a central coordinator.

Distributed, Parallel, and Cluster Computing

2 code implementations • 30 Nov 2015 • Edgar Solomonik, Torsten Hoefler

Dense and sparse tensors allow the representation of most bulk data structures in computational science applications.

Mathematical Software

Papers With Code is a free resource with all data licensed under CC-BY-SA.