Failed to Initialize Bloom-7B Due to Lack of CUDA memory


I am new to Inference Endpoints and have recently received an error when trying to initialize an endpoint for Bloom-7B1:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.76 GiB total capacity; 14.08 GiB already allocated; 187.75 MiB free; 14.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Initially, I thought the issue might be the size of the AWS instance I am running it on. However, I am using GPU-Large, which has 4x NVIDIA T4 GPUs, totaling 64GB of GPU memory. As far as I can tell, the PyTorch model itself is only ~14GB, so there should be plenty of space.

Based on this, it would seem that the model is not being distributed amongst the GPUs and only one GPU is being used to load the model. Is this the intended behavior, and is there any way that I can address that?
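A quick back-of-the-envelope check supports this reading. The sketch below assumes roughly 7.07 billion parameters for Bloom-7B1 (an approximation, not a figure from this thread) and takes the 14.76 GiB per-GPU capacity from the error message above. It only counts the weights; the CUDA context, activations and generation cache need extra room on top.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count for Bloom-7B1.
N_PARAMS = 7.07e9
# From the error message: usable capacity of a single T4.
T4_CAPACITY_GIB = 14.76
N_GPUS = 4


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


fp16_gib = weights_gib(N_PARAMS, 2)  # half precision, 2 bytes per parameter
fp32_gib = weights_gib(N_PARAMS, 4)  # full precision, 4 bytes per parameter
per_gpu_gib = fp16_gib / N_GPUS      # if the weights were split evenly over 4 GPUs

print(f"fp16 weights: {fp16_gib:.2f} GiB (one T4 holds {T4_CAPACITY_GIB} GiB)")
print(f"fp32 weights: {fp32_gib:.2f} GiB")
print(f"fp16 weights per GPU when sharded over {N_GPUS} GPUs: {per_gpu_gib:.2f} GiB")
```

Under these assumptions the half-precision weights nearly fill a single T4 on their own, which matches an out-of-memory error on GPU 0 while the other three GPUs sit unused.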


Hello @Nathan-Kowalski,

We do not yet have an automatic way to do model parallelism (it is coming in the next 2 weeks). This means you either need to use an A100 or create a custom handler for model parallelism with accelerate.
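For readers wondering what such a custom handler could look like, here is a minimal sketch. It assumes the handler.py / EndpointHandler convention used for custom handlers on Inference Endpoints, and it relies on device_map="auto" (backed by accelerate) to spread the layers over all visible GPUs. The fp16 dtype and the request format ("inputs" plus optional "parameters") are assumptions, and this has not been tested against Bloom-7B1.

```python
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Heavy imports are kept inside __init__ so the module itself stays cheap to import.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # device_map="auto" asks accelerate to place layers across all available GPUs
        # instead of loading the whole model onto GPU 0.
        self.model = AutoModelForCausalLM.from_pretrained(
            path,
            device_map="auto",
            torch_dtype=torch.float16,  # assumption: half precision to fit on T4s
        )

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, str]]:
        prompt = data["inputs"]
        params = data.get("parameters", {})  # e.g. {"max_new_tokens": 50}

        # Inputs go to the device that holds the first layers of the model.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, **params)
        text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return [{"generated_text": text}]
```

The repository would also need a requirements.txt listing accelerate for this to work; whether the result fits on 4x T4 depends on the model and the generation settings.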

Thank you for the response @philschmid!

For now I will plan to go ahead with the A100. I just tried to request an instance of that size, but I now get the error that I do not have the quota to request that size.

I see that I should send an email to increase the quota. Is there anything specific I should include in this email?


I was looking at that model again today, and I noticed that a new message appeared:

Your endpoint can be deployed on an optimized Text Generation container.

When I tried it again with the 4xT4 GPUs, it worked this time! Looks like the automatic parallelism has now been released?

Hello @Nathan-Kowalski,

Yes, we added the feature today. Would love to hear feedback on it.

@philschmid Sure thing!

I did a bunch of testing today with a number of different models. Several of them worked well with the new feature (e.g. EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b).

On the other hand, a number of models failed to initialize, in two distinct ways. I wasn’t able to make much of the stack traces for these, so hopefully the feedback is helpful!

Method Prefill Error / Tensor Device Errors

google/flan-t5-xxl, google/flan-ul2, and google/ul2 all became stuck during initialization (on GPU-Large) and kept repeating errors until I deleted the endpoint.
The first set of errors was:

"fields":{"message":"Method Prefill encountered an error.\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <etc>...

Next, it would emit four errors similar to the one below, each with a unique CUDA device number (0 through 3).

"message":"Server error: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:3! (when checking argument for argument index in method wrapper_CUDA__index_select)",
{"http.client_ip":"","http.flavor":"1.1","":"","http.method":"GET","http.route":"/health","http.scheme":"HTTP","":"/health","http.user_agent":"kube-probe/1.22+","otel.kind":"server","":"GET /health","trace_id":"b1ebcdf576f2eaf5a144d43ebc23ad12","name":"HTTP request"},

After this, it would keep repeating the Method Prefill error followed by the Server error.
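In case it helps with triage, here is a small sketch for checking which devices show up in the device-mismatch errors of a downloaded log. The regular expression is built from the message quoted above; the helper name and the sample lines are made up for illustration, and the real log format may differ.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the device pair in messages like:
# "... but found at least two devices, cpu and cuda:3! ..."
DEVICE_PAIR = re.compile(
    r"found at least two devices, (cpu|cuda:\d+) and (cpu|cuda:\d+)"
)


def cuda_devices_in_errors(log_lines: Iterable[str]) -> Counter:
    """Count how often each CUDA device appears in device-mismatch errors."""
    counts: Counter = Counter()
    for line in log_lines:
        match = DEVICE_PAIR.search(line)
        if match:
            counts.update(d for d in match.groups() if d.startswith("cuda"))
    return counts


# Hypothetical sample lines, modeled on the error quoted in this post.
template = (
    '"message":"Server error: Expected all tensors to be on the same device, '
    'but found at least two devices, cpu and cuda:{n}!"'
)
sample_log = [template.format(n=n) for n in range(4)] + ['"message":"unrelated"']

counts = cuda_devices_in_errors(sample_log)
print(sorted(counts))
```

If every shard (cuda:0 through cuda:3) reports the same mismatch against cpu, that suggests some tensor is left on the CPU in every shard rather than one shard being misconfigured.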

Error: ShardCannotStart

When running the models cerebras/Cerebras-GPT-6.7B and cerebras/Cerebras-GPT-13B, I received a ShardCannotStart Error. In the log, this was preceded by an error being raised:

ValueError: sharded is not supported for AutoModel

If it is of any help, I also downloaded the full initialization logs. I wasn’t quite sure how to attach them here!