Failed to Initialize Bloom-7B Due to Lack of CUDA Memory


I am new to Inference Endpoints and have recently received an error when trying to initialize an endpoint for Bloom-7B1:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.76 GiB total capacity; 14.08 GiB already allocated; 187.75 MiB free; 14.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Initially, I thought that the issue might be the size of the AWS instance I am running it on. However, I am using GPU-Large, which has 4x NVIDIA T4 GPUs, totaling 64 GB of GPU memory. As far as I can tell, the PyTorch model itself is only ~14 GB (below), so there should be plenty of space.
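To sanity-check that ~14 GB figure, here is my back-of-the-envelope arithmetic (my own numbers, not from the endpoint docs): Bloom-7B1 has roughly 7.1 billion parameters, and at 2 bytes per parameter in fp16 the weights alone sit right at the 14.76 GiB capacity of a single T4, while a four-way split would be tiny per GPU:

```python
# Rough estimate of Bloom-7B1's weight footprint in half precision.
params = 7.1e9              # ~7.1 billion parameters
bytes_per_param_fp16 = 2    # fp16 = 2 bytes per parameter

weights_gb = params * bytes_per_param_fp16 / 1e9  # ~14.2 GB total
per_gpu_gb = weights_gb / 4                       # ~3.5 GB per T4 if sharded evenly

print(round(weights_gb, 1), "GB total,", round(per_gpu_gb, 1), "GB per GPU")
```

So the weights alone nearly fill one T4 (leaving no headroom for activations or the CUDA context), but split across all four GPUs they would use only a fraction of each card.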

Based on this, it would seem that the model is not being distributed among the GPUs, and only one GPU is being used to load it. Is this the intended behavior, and if so, is there any way that I can address it?