Why am I out of GPU memory despite using device_map="auto"?

I have an NVIDIA RTX 2080 Ti and I am trying to load the Zephyr model:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta",
                                             torch_dtype=torch.float16,
                                             device_map="auto")

This tutorial, Handling big models for inference (huggingface.co), says that using device_map="auto" will split the large model into smaller chunks, store them in CPU RAM, and then move them onto the GPU sequentially as each input passes through each stage of the model. This is especially emphasised in the linked YouTube video.
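
A quick way to see what actually happened is to print the placement map that transformers fills in when device_map is used (hf_device_map); this is just a sketch continuing from the loading code above:

# After loading with device_map="auto", hf_device_map records where each
# submodule landed; anything mapped to "cpu" (or "disk") was offloaded
# because it did not fit in the 2080 Ti's memory.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name}: {device}")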

Therefore I should not easily run out of GPU memory. So why am I getting this warning:

WARNING:root:Some parameters are on the meta device device because they were 
offloaded to the cpu.

If I am not wrong, this is the program essentially telling me: “I tried to put the entire model on the GPU, but oh no, it doesn’t fit, so I will leave some parts on the CPU.”
It is only natural, then, that I also get this error:

WARNING:accelerate.big_modeling:You shouldn't move a model when it is dispatched on multiple devices.
Traceback (most recent call last):
  File "/home/user/main/self-contained/diffusion.py", line 39, in <module>
    model.to(distributed_state)
  File "/home/user/miniconda3/envs/311/lib/python3.11/site-packages/accelerate/big_modeling.py", line 447, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

Could this be a bug? I suspect so, because this does not resemble the behaviour shown in the YouTube video at all.
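
For reference, a guarded version of that .to() call would look something like this (just a sketch continuing from the snippets above; hf_device_map is the placement dict transformers attaches when device_map is used):

# Only move the model manually if Accelerate did not offload anything;
# also note that .to() needs a torch.device, so pass distributed_state.device
# rather than the Accelerator/PartialState object itself.
placement = getattr(model, "hf_device_map", {})
offloaded = any(d in ("cpu", "disk") for d in placement.values())

if not offloaded:
    model.to(distributed_state.device)
else:
    print("Some modules are offloaded to CPU/disk; leaving placement to Accelerate")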

The issue would be here. It’s trying to load the entire model onto the GPU.

What’s the code you’re using?

The idea is that my machine has 4 2080 Tis. I originally wanted to give each GPU its own process so that the dataset is split into 4 and inference runs in parallel, finishing more quickly. To check that the model isn’t automatically using more than one GPU to fit itself, I ran the program with accelerate configured to use only one GPU. distributed_state is an Accelerator; I have also tried setting it to a PartialState, with no success.
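
Roughly, the per-process pattern I was aiming for is the one below (a sketch only; prompts is a stand-in for my real dataset, and it assumes a full fp16 copy actually fits on a single card, which is exactly what fails here):

import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

distributed_state = PartialState()

# One full copy of the model per process, moved onto that process's GPU
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.float16
).to(distributed_state.device)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]  # stand-in for the real dataset

# Each of the 4 processes receives its own quarter of the prompts
with distributed_state.split_between_processes(prompts) as subset:
    for prompt in subset:
        inputs = tokenizer(prompt, return_tensors="pt").to(distributed_state.device)
        outputs = model.generate(**inputs, max_new_tokens=50)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Launched with accelerate launch --num_processes 4, each process only ever sees its own quarter of the data.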

You should use the pipeline-parallelism inference API; what you’re currently doing is loading the entire model onto each GPU. See Distributed Inference with 🤗 Accelerate.
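
If the pipeline-parallelism API on that page is more than you need, a simpler first step (a sketch, and plain model sharding rather than true pipeline parallelism) is to run a single process and let device_map="auto" spread one copy of the model over all four cards, with no .to() call at all:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# With all four 2080 Tis visible, "auto" shards the layers across the GPUs
# instead of spilling to CPU, and Accelerate's hooks move activations
# between cards during the forward pass; do not call model.to() afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Inputs go to the device holding the first layers (usually cuda:0)
inputs = tokenizer("What is pipeline parallelism?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))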

Hope this works. If not and you need more computing power, the company I work for provides a free matching tool that connects you with available GPUs that meet your specific needs: you answer a short survey about your requirements and we email you with affordable GPU options from around the world. Let me know what you think if you try it out.