Failed to Initialize Bloom-7B Due to Lack of CUDA memory

Hello!

I am new to Inference Endpoints and have recently received an error when trying to initialize an endpoint for Bloom-7B1:


torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.76 GiB total capacity; 14.08 GiB already allocated; 187.75 MiB free; 14.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
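
(For reference, the allocator hint at the end of that message is just an environment variable. A sketch of how it would be set is below, with an arbitrary 128 MiB split size; it would not have helped in my case anyway, since the card is genuinely full rather than fragmented.)

    import os

    # Must be set before the first CUDA allocation; 128 MiB is only an example value.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported afterwards so the allocator picks the setting up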

Initially, I thought that the issue might be the size of the AWS instance I am running it on. However, I am using GPU-Large, which has 4x NVIDIA T4 GPUs, totaling 64GB of GPU memory. As far as I can tell, the PyTorch model itself is only ~14GB (below), so there should be plenty of space.

Based on this, it would seem that the model is not being distributed amongst the GPUs, and only one GPU is being used to load the model. Is this the intended behavior, and is there any way I can address it?
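
For what it's worth, a quick way to see whether the weights end up on one device or spread out is something like the sketch below, run from wherever the model gets loaded (this is just my own check, not part of the endpoint image):

    import torch

    # Print how much memory each visible GPU is using, to see whether the ~14GB
    # of weights sit on cuda:0 alone or are spread across all four T4s.
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        used_gib = (total - free) / 1024**3
        print(f"cuda:{i}: {used_gib:.1f} GiB used of {total / 1024**3:.1f} GiB")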

Thanks!

Hello @Nathan-Kowalski,

We don't have an automatic way to do model parallelism yet (coming in the next 2 weeks). That means you either need to use an A100 or create a custom handler for model parallelism with accelerate.
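
A minimal custom handler along those lines could look like the sketch below. It assumes the usual handler.py / EndpointHandler convention for custom handlers and uses accelerate's device_map="auto" to spread the weights over all visible GPUs; the generation parameters are placeholders, not a recommendation.

    # handler.py -- sketch of a custom handler that shards a causal LM across
    # all visible GPUs via accelerate (device_map="auto") instead of loading
    # everything onto cuda:0.
    from typing import Any, Dict, List

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer


    class EndpointHandler:
        def __init__(self, path: str = ""):
            self.tokenizer = AutoTokenizer.from_pretrained(path)
            self.model = AutoModelForCausalLM.from_pretrained(
                path,
                device_map="auto",       # let accelerate place layers on the available GPUs
                torch_dtype=torch.float16,
            )

        def __call__(self, data: Dict[str, Any]) -> List[Dict[str, str]]:
            prompt = data.get("inputs", "")
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            output = self.model.generate(**inputs, max_new_tokens=64)
            text = self.tokenizer.decode(output[0], skip_special_tokens=True)
            return [{"generated_text": text}]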

Thank you for the response @philschmid!

For now I will plan to go ahead with the A100. I just tried to request an instance of that size, but I now get an error saying that I do not have the quota to request it.

I see that I should send an email to increase the quota. Is there anything specific I should include in this email?

Thanks!

I was looking at that model again today, and I noticed that a new message appeared:

Your endpoint can be deployed on an optimized Text Generation container.

When I tried it again with the 4x T4 GPUs, it worked this time! Looks like the automatic parallelism is now released?

Hello @Nathan-Kowalski,

Yes, we added the feature today. Would love to hear feedback on it.

@philschmid Sure thing!

I did a bunch of testing today with a number of different models. Several of them worked well with the new feature (e.g. eleuther/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b).

On the other hand, a number of models failed to initialize, in two distinct ways. I wasn't able to make much of the stack traces for these, so hopefully the feedback is helpful!

Method Prefill Error / Tensor Device Errors

google/flan-t5-xxl, google/flan-ul2, and google/ul2 all became stuck during initialization (on GPU-large) and repeated errors until the endpoint was deleted.
The first set of errors was:

{"timestamp":"2023-05-30T18:56:14.642641Z",
"level":"ERROR",
"fields":{"message":"Method Prefill encountered an error.\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <etc>...

Next, it would emit 4 errors similar to the one below, each with a different CUDA device number (0 through 3).

{"timestamp":"2023-05-30T18:56:14.643039Z",
"level":"ERROR",
"message":"Server error: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:3! (when checking argument for argument index in method wrapper_CUDA__index_select)",
"target":"text_generation_client",
"filename":"router/client/src/lib.rs",
"line_number":33,
"span":{"id":18446744073709551615,
"size":1,"name":"prefill"},
"spans":[
{"http.client_ip":"","http.flavor":"1.1","http.host":"10.41.30.199:80","http.method":"GET","http.route":"/health","http.scheme":"HTTP","http.target":"/health","http.user_agent":"kube-probe/1.22+","otel.kind":"server","otel.name":"GET /health","trace_id":"b1ebcdf576f2eaf5a144d43ebc23ad12","name":"HTTP request"},
{"name":"health"},
{"id":18446744073709551615,"size":1,"name":"prefill"},
{"id":18446744073709551615,"size":1,"name":"prefill"}
]}

After this it would continue to repeat the Method Prefill error and then the Server error.

Error: ShardCannotStart

When running the models cerebras/Cerebras-GPT-6.7B and cerebras/Cerebras-GPT-13B, I received a ShardCannotStart error. In the log, this was preceded by an error being raised:

ValueError: sharded is not supported for AutoModel

If it is of any help, I also downloaded the full log from the initialization. I wasn't quite sure how to attach it here!