I am trying to use Facebook's newly released OPT model, opt-30b (facebook/opt-30b · Hugging Face), for inference on a GCP cloud VM, but I am getting a CUDA out-of-memory error: `CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 39.59 GiB total capacity; 38.99 GiB already allocated)`.
Machine type: a2-highgpu-1g
GPUs: 2 x NVIDIA Tesla A100
Can the OPT model be loaded across multiple GPUs using model parallelism? Any suggestions would be really helpful. Thanks!
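In case it helps frame the question, here is a minimal sketch of the direction I was considering: sharding the model across both GPUs with `device_map="auto"` from the Accelerate integration in transformers. This assumes recent versions of transformers and accelerate are installed; the memory budgets below (`35GiB` per GPU, `60GiB` of CPU RAM for spill-over) are guesses, not tested values, and fp16 weights for opt-30b are roughly 60 GB, so some CPU offload may still be needed on 2 x 40 GB A100s.

```python
def build_max_memory(n_gpus: int, per_gpu: str = "35GiB", cpu: str = "60GiB") -> dict:
    """Per-device memory budget to pass as from_pretrained's max_memory argument.

    Caps usage below each A100's 40 GiB to leave headroom for activations;
    layers that don't fit spill over to CPU RAM. Budget values are assumptions.
    """
    budget = {i: per_gpu for i in range(n_gpus)}
    budget["cpu"] = cpu
    return budget


def load_opt_30b():
    # Heavy download/load kept inside a function so the sketch can be
    # inspected without pulling ~60 GB of weights.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-30b",
        torch_dtype=torch.float16,       # fp16 halves memory vs fp32
        device_map="auto",               # shard layers across visible GPUs
        max_memory=build_max_memory(2),  # cap per-device usage
    )
    return tokenizer, model
```

Is this roughly the right approach for this machine type, or is a different parallelism setup (e.g. DeepSpeed inference) better suited here?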