Why am I out of GPU memory despite using device_map="auto"?

I have an NVIDIA RTX 2080 Ti and I am trying to load the Zephyr model:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta",
                                             torch_dtype=torch.float16,
                                             device_map="auto")

This tutorial, Handling big models for inference (huggingface.co), says that using device_map="auto" will split the large model into smaller chunks, store them in CPU RAM, and then move them onto the GPU sequentially as each input passes through each stage of the model. This is especially emphasised in the linked YouTube video.
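
A quick way to see what actually happened is to print the placement map that transformers fills in when device_map is used (hf_device_map); this is just a sketch continuing from the loading code above:

# After loading with device_map="auto", hf_device_map records where each
# submodule landed; anything mapped to "cpu" (or "disk") was offloaded
# because it did not fit in the 2080 Ti's memory.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name}: {device}")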

Therefore I should not easily run out of GPU memory. So why am I getting this warning:

WARNING:root:Some parameters are on the meta device device because they were 
offloaded to the cpu.

If I am not wrong, this is the program essentially telling me: “I tried to put the entire model on the GPU, but oh no, it doesn’t fit, so I will leave some parts on the CPU.”
It is only natural, then, that I also get this error:

WARNING:accelerate.big_modeling:You shouldn't move a model when it is dispatched on multiple devices.
Traceback (most recent call last):
  File "/home/user/main/self-contained/diffusion.py", line 39, in <module>
    model.to(distributed_state)
  File "/home/user/miniconda3/envs/311/lib/python3.11/site-packages/accelerate/big_modeling.py", line 447, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

Could this be a bug? I suspect so, because this does not resemble the behaviour shown in the YouTube video at all.
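
For reference, a guarded version of that .to() call would look something like this (just a sketch continuing from the snippets above; hf_device_map is the placement dict transformers attaches when device_map is used):

# Only move the model manually if Accelerate did not offload anything;
# also note that .to() needs a torch.device, so pass distributed_state.device
# rather than the Accelerator/PartialState object itself.
placement = getattr(model, "hf_device_map", {})
offloaded = any(d in ("cpu", "disk") for d in placement.values())

if not offloaded:
    model.to(distributed_state.device)
else:
    print("Some modules are offloaded to CPU/disk; leaving placement to Accelerate")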

The issue would be here. It’s trying to load the entire model onto the GPU.

What’s the code you’re using?

The idea is that my machine has 4 2080 Tis. I originally wanted to give each GPU its own process so that the dataset is split into 4 and inference runs in parallel, finishing more quickly. To check that the model isn’t automatically using more than one GPU to fit itself, I ran the program with accelerate configured to use only one GPU. distributed_state is an Accelerator; I have also tried setting it to a PartialState, with no success.
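
Roughly, the per-process pattern I was aiming for is the one below (a sketch only; prompts is a stand-in for my real dataset, and it assumes a full fp16 copy actually fits on a single card, which is exactly what fails here):

import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

distributed_state = PartialState()

# One full copy of the model per process, moved onto that process's GPU
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.float16
).to(distributed_state.device)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]  # stand-in for the real dataset

# Each of the 4 processes receives its own quarter of the prompts
with distributed_state.split_between_processes(prompts) as subset:
    for prompt in subset:
        inputs = tokenizer(prompt, return_tensors="pt").to(distributed_state.device)
        outputs = model.generate(**inputs, max_new_tokens=50)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Launched with accelerate launch --num_processes 4, each process only ever sees its own quarter of the data.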

You should use the pipeline-parallelism inference API; what you’re currently doing is loading the entire model onto each GPU. See Distributed Inference with 🤗 Accelerate.
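
If the pipeline-parallelism API on that page is more than you need, a simpler first step (a sketch, and plain model sharding rather than true pipeline parallelism) is to run a single process and let device_map="auto" spread one copy of the model over all four cards, with no .to() call at all:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# With all four 2080 Tis visible, "auto" shards the layers across the GPUs
# instead of spilling to CPU, and Accelerate's hooks move activations
# between cards during the forward pass; do not call model.to() afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Inputs go to the device holding the first layers (usually cuda:0)
inputs = tokenizer("What is pipeline parallelism?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))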

Hope this works. If not and you need more computing power, the company I work for provides a free matching tool that connects you with available GPUs that meet your specific needs: you answer a short survey about your requirements and we email you with affordable GPU options from around the world. Let me know what you think if you try it out.