I was successfully able to load a 34B model across 4 GPUs (NVIDIA L4) using the code below.
import torch
import transformers
from transformers import AutoTokenizer

model_name = "abacusai/Smaug-34B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = transformers.pipeline(
    "text-generation",  # task
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",  # shard the model across all visible GPUs
    temperature=0.0001,
    repetition_penalty=1.1,
    # device=0,  # cannot be combined with device_map="auto"
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
)
input_prompt = " <--my input prompt-->"
output = pipe(input_prompt)
But because my prompt is a little long, I am getting a CUDA out-of-memory exception during inference.
The interesting thing is that my 4th GPU has enough free space (around 5.5 GB of VRAM) to hold the inputs, but because they are being placed on GPU 1, I get this exception.
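For reference, this is roughly how I am checking the free memory on each GPU (a small sketch using torch.cuda.mem_get_info, which returns free and total bytes for a device):

import torch

# Print free / total VRAM for every visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")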
Is there any way to specify the target GPU for the inputs during inference? If not, how else should I tackle this issue of not using the available resources fully?
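To make the question concrete, below is a hypothetical sketch of what I mean by "specifying the target GPU for the inputs". I do not know whether cuda:3 is even a valid choice when the model is sharded with device_map="auto"; the device string and the max_new_tokens value are only placeholders to illustrate the question.

# Hypothetical: tokenize the prompt myself and choose where the input tensors go.
inputs = tokenizer(input_prompt, return_tensors="pt").to("cuda:3")  # the GPU that still has free VRAM
output_ids = pipe.model.generate(
    **inputs,
    max_new_tokens=256,  # placeholder value
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))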
Exception:
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.31 GiB. GPU 1 has a total capacty of 21.96 GiB of which 2.88 MiB is free. Including non-PyTorch memory, this process has 21.95 GiB memory in use. Of the allocated memory 20.35 GiB is allocated by PyTorch, and 1.37 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
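Regarding the max_split_size_mb hint at the end of the message: my understanding is that it is set through the PYTORCH_CUDA_ALLOC_CONF environment variable, roughly like the sketch below (the 512 value is only an example), though I am not sure fragmentation is the real issue here.

import os

# My understanding: this must be set before any CUDA memory is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # example value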