TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK	REMOVE
Sentence Completion	HellaSwag	LLM in a Flash (OPT-6.7B with Predictor)	Accuracy	49.8	# 64
Sentence Completion	HellaSwag	OPT-6.7B	Accuracy	50.3	# 63

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-in-a-flash-efficient-large-language-model/sentence-completion-on-hellaswag)](https://paperswithcode.com/sota/sentence-completion-on-hellaswag?p=llm-in-a-flash-efficient-large-language-model)`

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

12 Dec 2023 · Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar ·

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Language Modelling

Large Language Model

Sentence Completion

Datasets

HellaSwag

Results from the Paper

Add Remove

Ranked #63 on Sentence Completion on HellaSwag

Get a GitHub badge

Task	Dataset	Model	Metric Name	Metric Value	Global Rank	Benchmark
Sentence Completion	HellaSwag	LLM in a Flash (OPT-6.7B with Predictor)	Accuracy	49.8	# 64	Compare
Sentence Completion	HellaSwag	OPT-6.7B	Accuracy	50.3	# 63	Compare

Methods

Add Remove

SPEED

Edit Social Preview

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit Add Remove

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Add Remove

Methods

Add Remove