Hello everyone,
I have 4 A100 GPUs and I'm running Mixtral in bfloat16 for a text generation task. I'm aware that by using `device_map="balanced_low_0"` I can distribute the model across GPUs 1, 2, and 3 while leaving GPU 0 mostly free for the `model.generate()` call, as detailed in the documentation here: [Handling big models for inference].
The Mixtral model in bfloat16 occupies approximately 90 GB, which gets loaded onto GPUs 1 and 2 and partially onto GPU 3. As a result, GPU 0 is entirely free and GPU 3 is only partially occupied, leaving GPU resources for other tasks.
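That ~90 GB figure matches a quick back-of-the-envelope check (a rough sketch; the ~46.7B parameter count for Mixtral-8x7B is an assumption from the model card, not something stated above):

```python
def bf16_footprint_gib(n_params: float) -> float:
    # bfloat16 stores each parameter in 2 bytes
    return n_params * 2 / 1024**3

# Assuming roughly 46.7 billion parameters for Mixtral-8x7B,
# the weights alone need about 87 GiB:
print(f"{bf16_footprint_gib(46.7e9):.0f} GiB")  # -> 87 GiB
```
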
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # inputs go to GPU 0; the model itself is placed by device_map

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    device_map="balanced_low_0",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

prompt = "My favourite condiment is"
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
# Note: no model.to(device) here -- accelerate has already dispatched the
# shards across GPUs, and calling .to() on a dispatched model raises an error.

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
```
The above code works fine when the input prompt is small. However, when the prompt exceeds 30,000 tokens, `model.generate()` crashes with a CUDA out-of-memory error, indicating that the 40 GB on GPU 0 is insufficient to process the input prompt alone.
Could anyone advise me on how to use the partially filled GPU 3 for `model.generate()` in addition to GPU 0?
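One direction I've been considering (a sketch, not something I've verified): instead of `"balanced_low_0"`, pass an explicit `max_memory` dict with `device_map="auto"` so that both GPU 0 and GPU 3 keep headroom for generation. The helper and the specific GiB numbers below are my own assumptions, not values from the docs:

```python
def make_max_memory(n_gpus: int, cap_gib: int, reserve_gib: dict) -> dict:
    """Build a per-GPU memory budget for device_map="auto".

    reserve_gib maps a GPU index to the number of GiB to keep free
    on that GPU for generation-time activations and the KV cache.
    """
    return {i: f"{cap_gib - reserve_gib.get(i, 0)}GiB" for i in range(n_gpus)}

# Hypothetical budget for 4x 40 GB A100s: keep 30 GiB free on GPU 0
# and 20 GiB free on GPU 3 so a long prompt has room on both.
max_memory = make_max_memory(4, 40, {0: 30, 3: 20})
# -> {0: "10GiB", 1: "40GiB", 2: "40GiB", 3: "20GiB"}
```

The resulting dict would then be passed at load time, e.g. `from_pretrained(..., device_map="auto", max_memory=max_memory, torch_dtype=torch.bfloat16)`. `max_memory` is a documented parameter of the big-model-inference API, but whether this actually stops `generate()` from overflowing GPU 0 for very long prompts is exactly what I'm unsure about.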