CUDA out-of-memory error in model.generate() with AutoModelForCausalLM

Hello everyone,

I have four A100 GPUs (40 GB each) and I’m running Mixtral in bfloat16 on them for a text generation task. I’m aware that with device_map="balanced_low_0" I can distribute the model across GPUs 1, 2, and 3 while leaving GPU 0 available for model.generate(), as detailed in the documentation here: [Handling big models for inference].

In bfloat16, the Mixtral model occupies approximately 90 GB, which is loaded onto GPUs 1 and 2 and partially onto GPU 3. As a result, GPU 0 is entirely free and GPU 3 is only partially occupied, so it still has spare memory.
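
As a rough sanity check on that figure, here is a back-of-the-envelope sketch. It assumes about 46.7 billion total parameters for Mixtral-8x7B (a number not stated in this post) and fills GPUs 1, 2, and 3 in order, which only mirrors the placement described above and is not necessarily how the library actually assigns layers. It also ignores any overhead beyond the weights.

```python
# Assumption: ~46.7 billion total parameters for Mixtral-8x7B (not stated in the post).
NUM_PARAMS = 46.7e9
BYTES_PER_PARAM = 2      # bfloat16 stores each parameter in 2 bytes
GPU_CAPACITY_GB = 40     # per-GPU memory, as quoted for GPU 0

# Total size of the weights in GB (10^9 bytes)
weights_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9

# Fill GPUs 1, 2, 3 in order, leaving GPU 0 empty
remaining = weights_gb
placement = {}
for gpu in (1, 2, 3):
    used = min(remaining, GPU_CAPACITY_GB)
    placement[gpu] = used
    remaining -= used

free_on_gpu3 = GPU_CAPACITY_GB - placement[3]

print(f"weights: {weights_gb:.1f} GB")
for gpu, used in placement.items():
    print(f"GPU {gpu}: {used:.1f} GB of weights")
print(f"free on GPU 3: {free_on_gpu3:.1f} GB")
```

Under these assumptions the weights come to a little over 90 GB, and GPU 3 would still have well over half of its memory unused.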

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", device_map="balanced_low_0", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

prompt = "My favourite condiment is"

model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)

The above code works fine when the input prompt is short. However, when the prompt exceeds 30,000 tokens, model.generate() crashes with a CUDA out-of-memory error, which suggests that the 40 GB on GPU 0 is not enough to process the input prompt on its own.

Could anyone advise me on how to utilize the partially filled GPU 3 for model.generate() in addition to GPU 0?
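
One direction that might be worth trying is capping the weights per GPU with the max_memory argument, so that every card (including GPU 3) keeps some headroom for activations during generation. The sketch below is untested on this setup: the helper build_max_memory is hypothetical, and the 15 GiB headroom is an arbitrary assumption that would need tuning.

```python
def build_max_memory(num_gpus, gpu_total_gib, headroom_gib):
    """Hypothetical helper: cap the weights on each GPU so that
    headroom_gib stays free for activations during generation."""
    if headroom_gib >= gpu_total_gib:
        raise ValueError("headroom must be smaller than the GPU capacity")
    cap = gpu_total_gib - headroom_gib
    # Keys are GPU indices; values are size strings such as "25GiB"
    return {gpu: f"{cap}GiB" for gpu in range(num_gpus)}


# Assumed values: 4 GPUs with 40 GiB each, 15 GiB kept free per GPU
max_memory = build_max_memory(num_gpus=4, gpu_total_gib=40, headroom_gib=15)
print(max_memory)

# Possible usage (not executed here, since it requires the GPUs and the model):
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mixtral-8x7B-v0.1",
#     device_map="auto",
#     max_memory=max_memory,
#     dtype=torch.bfloat16,
# )
```

With these numbers the four caps add up to 100 GiB, which should be enough to hold the roughly 90 GB of weights while leaving free memory on every GPU. Whether that headroom is sufficient for a 30,000-token prompt is something I have not verified.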