Inference with CPU offload

volsk · August 10, 2023, 3:43pm

Hi,
I want to run inference with the Falcon-40B model on a GPU with CPU offload.
I pass device_map="auto" to AutoModelForCausalLM.from_pretrained().
I expected the model to fill the available GPU memory first and only then offload the remainder to the CPU.
But when I checked memory consumption, only 414 MB of the 40 GB of VRAM (1 A100) was in use, while RAM was almost 100% full. So it seems the model is almost completely offloaded to the CPU.

How can I make the GPU the primary device, so that layers are offloaded to the CPU only once the GPU has no free space left?
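
One thing that may be worth trying is to pass an explicit per-device budget through the `max_memory` argument alongside `device_map="auto"`. The sketch below is only an illustration under assumptions: the helper names `build_max_memory` and `load_with_offload` are hypothetical, and the 36 GiB / 120 GiB budgets and the bfloat16 dtype are placeholder choices, not values from this thread. It has not been verified against the setup described above.

```python
def build_max_memory(gpu_budgets_gib, cpu_budget_gib):
    """Build a `max_memory` mapping for `from_pretrained`.

    Keys are GPU indices (ints) plus the string "cpu";
    values are size strings such as "36GiB".
    (Hypothetical helper, introduced here for illustration.)
    """
    max_memory = {idx: f"{gib}GiB" for idx, gib in enumerate(gpu_budgets_gib)}
    max_memory["cpu"] = f"{cpu_budget_gib}GiB"
    return max_memory


def load_with_offload(model_id, max_memory):
    """Load a causal LM, placing layers on the GPU up to its budget,
    with the remainder assigned to the CPU.
    (Hypothetical helper, introduced here for illustration.)
    """
    # Imported lazily so build_max_memory works without these packages.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        max_memory=max_memory,
        # Assumption: half precision, to halve the footprint versus float32.
        torch_dtype=torch.bfloat16,
    )


# Example usage (assumed budgets; leave some GPU headroom for activations):
# model = load_with_offload("tiiuae/falcon-40b", build_max_memory([36], 120))
# print(model.hf_device_map)  # shows which device each module was placed on
```

Printing `model.hf_device_map` after loading shows where each block actually ended up, which is a more direct check than watching overall memory usage while the weights are still being loaded.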
