Big Model Inference: CPU/Disk Offloading for Transformers Using from_pretrained

Hello, I’m exploring ways to handle CUDA out-of-memory (OOM) errors when running inference with 70-billion-parameter models, without resorting to quantization. Specifically, I’d like to use CPU/disk offloading. Does the Accelerate library offer a solution for this?

I’ve read the current documentation, but it seems to focus on custom NLP models rather than on loading pretrained large models such as Llama 2 70B. Here’s the guide I found: Hugging Face’s Accelerate Big Model Inference guide.

Additionally, I’ve looked into FlexGen, but it seems the project is no longer active: FlexGen GitHub Repository.

Yes. Just set device_map="auto" in your from_pretrained call and Accelerate will place weights on the GPU first, then spill to CPU RAM and (if you provide an offload folder) to disk automatically.
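For reference, a minimal sketch of what that looks like with Transformers + Accelerate; the checkpoint name, offload directory, and dtype below are just example choices, not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; swap in whichever 70B model you have access to.
model_id = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # Accelerate fills GPU(s), then CPU, then disk
    torch_dtype=torch.float16,  # half-precision weights, no quantization
    offload_folder="offload",   # directory for weights that spill to disk
    offload_state_dict=True,    # reduces CPU RAM peak while loading
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If you want finer control over how much each device takes before spilling to the next, you can also pass a max_memory dict to from_pretrained, e.g. max_memory={0: "20GiB", "cpu": "60GiB"}. Expect generation to be slow once layers land on CPU or disk, since weights are streamed in per forward pass.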

