Facebook/opt-30b model inferencing


I am trying to use the newly released Facebook OPT model, opt-30b (facebook/opt-30b · Hugging Face), for inference on a GCP cloud VM, but I am getting a CUDA out-of-memory error: "CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 39.59 GiB total capacity; 38.99 GiB already allocated ...".

Hardware used:
Machine type: a2-highgpu-1g
GPUs: 2 x NVIDIA Tesla A100

Can the OPT model be loaded onto multiple GPUs with a model-parallelism technique? Any suggestions would be really helpful. Thanks!


I am having the exact same issue. Did you find a solution?
So far, I have also tried half precision (torch.float16) with torch.no_grad(), but the model still does not fit on an A100 40GB GPU. Any idea / code snippet / etc. for how to make the model fit? Or do we need to shard the model across GPUs as a workaround, as shown in Meta's article "Fully Sharded Data Parallel: faster AI training with fewer GPUs" (Engineering at Meta)?
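For what it's worth, a rough back-of-the-envelope calculation (weights only, ignoring activations and runtime overhead) shows why fp16 alone is not enough for opt-30b on a single 40 GB card:

```python
# Rough estimate of opt-30b weight memory (weights only; activations,
# KV cache, and CUDA overhead add more on top of this).
params = 30e9               # ~30 billion parameters
bytes_fp32 = params * 4     # float32: 4 bytes per parameter
bytes_fp16 = params * 2     # float16: 2 bytes per parameter

gib = 1024 ** 3
print(f"fp32 weights: {bytes_fp32 / gib:.0f} GiB")  # ~112 GiB
print(f"fp16 weights: {bytes_fp16 / gib:.0f} GiB")  # ~56 GiB
```

So even in fp16, the weights alone (~56 GiB) exceed one 40 GB A100, which is why the OOM persists; they would, however, fit when split across two 40 GB A100s.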
Thank you very much in advance.

Try loading it like this:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16, device_map="auto")

device_map="auto" will automatically assign the model’s parameters across all GPUs and, if needed, CPU, to make it fit.
