Hello everyone,
I have 4 A100 GPUs and I'm running Mixtral in bfloat16 for a text generation task. I'm aware that by using `device_map="balanced_low_0"` I can distribute the model across GPUs 1, 2, and 3 while leaving GPU 0 mostly free for the `model.generate()` call, as detailed in the documentation here: [Handling big models for inference].
The Mixtral model in bfloat16 occupies approximately 90 GB, which gets loaded onto GPUs 1 and 2 and partially onto GPU 3. As a result, GPU 0 is entirely free and GPU 3 is only partially occupied, leaving GPU resources for other tasks.
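That ~90 GB figure matches a quick back-of-the-envelope check (a rough sketch; the ~46.7B parameter count for Mixtral-8x7B is an assumption from the model card, not something stated above):

```python
def bf16_footprint_gib(n_params: float) -> float:
    # bfloat16 stores each parameter in 2 bytes
    return n_params * 2 / 1024**3

# Assuming roughly 46.7 billion parameters for Mixtral-8x7B,
# the weights alone need about 87 GiB:
print(f"{bf16_footprint_gib(46.7e9):.0f} GiB")  # -> 87 GiB
```
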
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # inputs go to GPU 0; the model itself is placed by device_map

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    device_map="balanced_low_0",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

prompt = "My favourite condiment is"
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
# Note: no model.to(device) here -- accelerate has already dispatched the
# shards across GPUs, and calling .to() on a dispatched model raises an error.

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
```
The above code works fine when the input prompt is small. However, when the prompt exceeds 30,000 tokens, `model.generate()` crashes with a CUDA out-of-memory error, indicating that the 40 GB on GPU 0 is insufficient to process the input prompt alone.
Could anyone advise me on how to use the partially filled GPU 3 for `model.generate()` in addition to GPU 0?
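One direction I've been considering (a sketch, not something I've verified): instead of `"balanced_low_0"`, pass an explicit `max_memory` dict with `device_map="auto"` so that both GPU 0 and GPU 3 keep headroom for generation. The helper and the specific GiB numbers below are my own assumptions, not values from the docs:

```python
def make_max_memory(n_gpus: int, cap_gib: int, reserve_gib: dict) -> dict:
    """Build a per-GPU memory budget for device_map="auto".

    reserve_gib maps a GPU index to the number of GiB to keep free
    on that GPU for generation-time activations and the KV cache.
    """
    return {i: f"{cap_gib - reserve_gib.get(i, 0)}GiB" for i in range(n_gpus)}

# Hypothetical budget for 4x 40 GB A100s: keep 30 GiB free on GPU 0
# and 20 GiB free on GPU 3 so a long prompt has room on both.
max_memory = make_max_memory(4, 40, {0: 30, 3: 20})
# -> {0: "10GiB", 1: "40GiB", 2: "40GiB", 3: "20GiB"}
```

The resulting dict would then be passed at load time, e.g. `from_pretrained(..., device_map="auto", max_memory=max_memory, torch_dtype=torch.bfloat16)`. `max_memory` is a documented parameter of the big-model-inference API, but whether this actually stops `generate()` from overflowing GPU 0 for very long prompts is exactly what I'm unsure about.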