I have an NVIDIA RTX 2080 Ti and I am trying to load the Zephyr model:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta",
                                             torch_dtype=torch.float16,
                                             device_map="auto")
This tutorial, Handling big models for inference (huggingface.co), says that using device_map="auto"
will split the large model into smaller chunks, keep them in CPU RAM, and move them onto the GPU sequentially as each input passes through the stages of the model. The linked YouTube video especially emphasises this.
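For reference, my understanding is that I can check where each submodule actually ended up through the hf_device_map attribute that gets attached to models loaded with a device_map (the example output below is just what I would expect to see, not something from the tutorial):

# Shows which device each top-level module was dispatched to.
print(model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.layers.31': 'cpu', 'lm_head': 'cpu'}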
Therefore I should not easily run out of GPU memory. So why am I getting this warning:
WARNING:root:Some parameters are on the meta device device because they were
offloaded to the cpu.
If I am not mistaken, this is the program essentially telling me: "I tried to put the entire model on the GPU, but it doesn't fit, so I will leave some parts on the CPU."
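If my reading is right, I assume I could at least make the offloading explicit by capping the GPU budget with the max_memory argument that from_pretrained accepts (a sketch; the "9GiB" figure is just my guess at a safe budget for an 11 GB card):

# Cap GPU 0 explicitly and let the remaining weights spill to CPU RAM,
# rather than letting device_map="auto" decide the split on its own.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta",
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             max_memory={0: "9GiB", "cpu": "30GiB"})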
Since some modules really are offloaded, it is only natural that I then get this error when I try to move the model:
WARNING:accelerate.big_modeling:You shouldn't move a model when it is dispatched on multiple devices.
Traceback (most recent call last):
File "/home/user/main/self-contained/diffusion.py", line 39, in <module>
model.to(distributed_state)
File "/home/user/miniconda3/envs/311/lib/python3.11/site-packages/accelerate/big_modeling.py", line 447, in wrapper
raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
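From what I can tell, the dispatched model has to stay where accelerate placed it, and only the inputs should be moved to the GPU that holds the first layers (a sketch assuming the matching tokenizer; this is my workaround attempt, not something from the tutorial):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Leave the model alone (no .to() call) and move only the inputs
# to the GPU that holds the embedding layer.
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))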
Still, could this be a bug? I suspect it might be, because this behaviour does not resemble what is shown in the YouTube video at all.